thr3ads.net - R help - [R] string splitting and testing for enrichment [Jun 2009]

Iain Gallagher

2009-Jun-20 14:28 UTC

[R] string splitting and testing for enrichment

Hi List

I have data in the following form:

Gene       TFBS
NUDC       PPARA(1) HNF4(20) HNF4(96) AHRARNT(104) CACBINDINGPROTEIN(149) T3R(167) HLF(191)
RPA2       STAT4(57) HEB(251)
TAF12      PAX3(53) YY1(92) BRCA(99) GLI(101)
EIF3I      NERF(10) P300(10)
TRAPPC3    HIC1(3) PAX5(17) PAX5(110) NRF1(119) HIC1(122)
TRAPPC3    EGR(26) ZNF219(27) SP3(32) EGR(32) NFKAPPAB65(89) NFKAPPAB(89) RFX(121) ZTA(168)
NDUFS5     WHN(14) ATF(57) EGR3(59) PAX5(99) SF1(108) NRSE(146)
TIE1       NRSE(129)

I would like to test the 2nd column (each value has letters followed by numbers
in brackets) here for enrichment via fisher.test.

To that end I am trying to create two factors made up of column 1 (Gene) and
column 2 (TFBS) where each Gene would have several entries matching each TFBS.

My main problem just now is that I can't split the TFBS column into separate
strings (at the moment that 2nd column is all one string for each Gene).

Here's where I am just now:

# convert the 2nd column from factor to character
test <- as.character(dataIn[,2])
# split the first element into individual strings (only the first element
# for now, because I'm just trying to get things working)
test2 <- unlist(strsplit(test[1], ' '))
# get rid of numbers and brackets
test3 <- unlist(strsplit(test2, '\\([0-9]\\)'))

Now this does not behave as I hoped. It gives me:

> test3
[1] "PPARA"                  "HNF4(20)"               "HNF4(96)"
[4] "AHRARNT(104)"           "CACBINDINGPROTEIN(149)" "T3R(167)"
[7] "HLF(191)"

i.e. it only removes the numbers and brackets from the first entry and not the
others.

Could someone point out my mistake please?

Once I have all the TFBS (letters only) for each Gene I would then count how
often each TFBS occurs and use these counts in a fisher.test to test for
enrichment of TFBS in the list I have. I'm rather muddled here though and would
appreciate advice on whether this is the right approach.

Thanks

Iain
> sessionInfo()
R version 2.9.0 (2009-04-17)
x86_64-pc-linux-gnu

locale:
LC_CTYPE=en_GB.UTF-8;LC_NUMERIC=C;LC_TIME=en_GB.UTF-8;LC_COLLATE=en_GB.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_GB.UTF-8;LC_PAPER=en_GB.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_GB.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     







Gabor Grothendieck

2009-Jun-20 15:12 UTC


Try this. We read in the data and split TFBS on "(", ") " or ")", giving s.
We then reshape each element of s into a two-column matrix (name, number),
prepending the Gene name as column 1. Finally we convert the result to a
data frame and make the third column numeric.

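Lines <- "Gene,TFBS
NUDC,PPARA(1) HNF4(20) HNF4(96) AHRARNT(104) CACBINDINGPROTEIN(149) T3R(167) HLF(191)
RPA2,STAT4(57) HEB(251)
TAF12,PAX3(53) YY1(92) BRCA(99) GLI(101)
EIF3I,NERF(10) P300(10)
TRAPPC3,HIC1(3) PAX5(17) PAX5(110) NRF1(119) HIC1(122)
TRAPPC3,EGR(26) ZNF219(27) SP3(32) EGR(32) NFKAPPAB65(89) NFKAPPAB(89) RFX(121) ZTA(168)
NDUFS5,WHN(14) ATF(57) EGR3(59) PAX5(99) SF1(108) NRSE(146)
TIE1,NRSE(129)"

# read the two columns, keeping strings as character rather than factor
DF <- read.csv(textConnection(Lines), as.is = TRUE)

# split each TFBS string on "(", ") " or ")" so that names and numbers alternate
s <- strsplit(DF$TFBS, "\\(|\\) |\\)")
# for row i, one (name, number) pair per matrix row, with the gene prepended
f <- function(i) cbind(DF[i, "Gene"], matrix(s[[i]], ncol = 2, byrow = TRUE))
DF2 <- as.data.frame(do.call(rbind, lapply(seq_along(s), f)),
                     stringsAsFactors = FALSE)
# as.character() guards against a factor column, where as.numeric() alone
# would return the level codes instead of the numbers
DF2[[3]] <- as.numeric(as.character(DF2[[3]]))
View(DF2)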
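
On the mistake asked about in the original post: the pattern '\\([0-9]\\)' matches a bracket pair holding exactly one digit, so only "(1)" matched and entries such as "(20)" or "(104)" were left alone. Adding a + quantifier ([0-9]+) should match any run of digits. The block below is only a minimal sketch of that fix, followed by one possible way to set up the counting and fisher.test step. The background counts (bg_hits, bg_total) are invented purely for illustration and are not from this thread, and whether a 2 x 2 Fisher test against such a background is the right design depends on the reference set you actually have.

```r
# TFBS strings copied from the first two rows of the question's data.
tfbs <- c(
  "PPARA(1) HNF4(20) HNF4(96) AHRARNT(104) CACBINDINGPROTEIN(149) T3R(167) HLF(191)",
  "STAT4(57) HEB(251)"
)

# "[0-9]+" matches one or more digits; gsub() removes every "(digits)" group.
stripped <- gsub("\\([0-9]+\\)", "", tfbs)

# Split on runs of spaces to get one vector of TFBS names per gene.
names_only <- strsplit(stripped, " +")

# Count how often each TFBS name occurs across all genes in the list.
counts <- table(unlist(names_only))

# HYPOTHETICAL background: 50 HNF4 sites among 1000 sites in a reference set.
# These two numbers are made up for illustration only.
bg_hits  <- 50
bg_total <- 1000

hits  <- counts[["HNF4"]]   # HNF4 sites in the gene list
total <- sum(counts)        # all sites in the gene list

# 2 x 2 table: rows = HNF4 / other TFBS, columns = gene list / background.
tab <- matrix(c(hits, total - hits, bg_hits, bg_total - bg_hits), nrow = 2,
              dimnames = list(c("HNF4", "other"), c("list", "background")))

# One-sided test: is HNF4 over-represented in the list relative to background?
ft <- fisher.test(tab, alternative = "greater")
ft$p.value
```

To test every TFBS rather than just HNF4, the same table could be built inside a loop over names(counts), in which case the resulting p-values would normally need a multiple-testing adjustment such as p.adjust().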


