thr3ads.net - R help - [R] splitting multiple data in one column into multiple rows with one entry per column [Jul 2009]

If this information is useful, please help other people find it:
Share via:

Felix Müller-Sarnowski

2009-Jul-26 19:26 UTC

[R] splitting multiple data in one column into multiple rows with one entry per column

Dear R colleagues,

I annotated a list of single nuclotide polymorphiosms (SNP) with the
corresponding genes using biomaRt. The result is the following
data.frame (pasted from R):

snp                                 ensembl_gene_id
1      rs8032583
2      rs1071600                                 ENSG00000101605
3      rs13406898                                 ENSG00000167165
4      rs7030479                                 ENSG00000107249
5      rs1244414                                 ENSG00000165629
6      rs1005636                                 ENSG00000230681
7      rs927913                 ENSG00000151655;ENSG00000227546
8      rs4832680
9      rs4435168 ENSG00000229164;ENSG00000225227;ENSG00000211817
10     rs7035549
11     rs12707538                                 ENSG00000186472

As you can see, the SNP with the identifier rs4435168 corresponds to 3
gene ids, rs927913 corresponds to 2 gene ids. As I'd like to perform a
join of several data.frames using the ensembl_gene_id later on, I'd
like to split columns with multiple gene identifiers into rows with
only one ensembl gene identifier each. So for the example of rs4435168
it should look like this (faked output):

snp                   ensembl_gene_id
...
9      rs4435168 ENSG00000229164
10    rs4435168 ENSG00000225227
11    rs4435168 ENSG00000211817
...

This is just a simple example. Finally there will be a lot of other
columns, which should be replicated like the snp column.

Does anyone know how to do this? I tried strsplit, which splits nicely
the multiple entries in column ensembl_gene_id. But how to go on?

I'd appreciate any kind of help very much!
Best regards from Munich,
Felix

jim holtman

2009-Jul-26 21:14 UTC

head link

[R] splitting multiple data in one column into multiple rows with one entry per column

Try this:
> x <- read.table(textConnection("snp  ensembl_gene_id+  rs8032583
+  rs1071600                                 ENSG00000101605
+  rs13406898                                 ENSG00000167165
+  rs7030479                                 ENSG00000107249
+  rs1244414                                 ENSG00000165629
+  rs1005636                                 ENSG00000230681
+  rs927913                 ENSG00000151655;ENSG00000227546
+  rs4832680
+  rs4435168 ENSG00000229164;ENSG00000225227;ENSG00000211817
+  rs7035549
+  rs12707538                                 ENSG00000186472"),
header=TRUE, fill=TRUE)> closeAllConnections()
> x.new <- do.call(rbind, apply(x, 1, function(.row){+     .ids <- unlist(strsplit(.row[2], ';'))
+     # check for no data in second column; substitute a blank
+     if (length(.ids) == 0) return(cbind(.row[1], ""))
+     else return(cbind(.row[1], .ids))
+ }))> x.new                              .ids
snp              "rs8032583"  ""
snp              "rs1071600"  "ENSG00000101605"
snp              "rs13406898" "ENSG00000167165"
snp              "rs7030479"  "ENSG00000107249"
snp              "rs1244414"  "ENSG00000165629"
snp              "rs1005636"  "ENSG00000230681"
ensembl_gene_id1 "rs927913"   "ENSG00000151655"
ensembl_gene_id2 "rs927913"   "ENSG00000227546"
snp              "rs4832680"  ""
ensembl_gene_id1 "rs4435168"  "ENSG00000229164"
ensembl_gene_id2 "rs4435168"  "ENSG00000225227"
ensembl_gene_id3 "rs4435168"  "ENSG00000211817"
snp              "rs7035549"  ""
snp              "rs12707538"
"ENSG00000186472">

On Sun, Jul 26, 2009 at 3:26 PM, Felix
M?ller-Sarnowski<drflxms at googlemail.com> wrote:> Dear R colleagues,
>
> I annotated a list of single nuclotide polymorphiosms (SNP) with the
> corresponding genes using biomaRt. The result is the following
> data.frame (pasted from R):
>
> snp ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ensembl_gene_id
> 1 ? ? ?rs8032583
> 2 ? ? ?rs1071600 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ENSG00000101605
> 3 ? ? ?rs13406898 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ENSG00000167165
> 4 ? ? ?rs7030479 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ENSG00000107249
> 5 ? ? ?rs1244414 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ENSG00000165629
> 6 ? ? ?rs1005636 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ENSG00000230681
> 7 ? ? ?rs927913 ? ? ? ? ? ? ? ? ENSG00000151655;ENSG00000227546
> 8 ? ? ?rs4832680
> 9 ? ? ?rs4435168 ENSG00000229164;ENSG00000225227;ENSG00000211817
> 10 ? ? rs7035549
> 11 ? ? rs12707538 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ENSG00000186472
>
> As you can see, the SNP with the identifier rs4435168 corresponds to 3
> gene ids, rs927913 corresponds to 2 gene ids. As I'd like to perform a
> join of several data.frames using the ensembl_gene_id later on, I'd
> like to split columns with multiple gene identifiers into rows with
> only one ensembl gene identifier each. So for the example of rs4435168
> it should look like this (faked output):
>
> snp ? ? ? ? ? ? ? ? ? ensembl_gene_id
> ...
> 9 ? ? ?rs4435168 ENSG00000229164
> 10 ? ?rs4435168 ENSG00000225227
> 11 ? ?rs4435168 ENSG00000211817
> ...
>
> This is just a simple example. Finally there will be a lot of other
> columns, which should be replicated like the snp column.
>
> Does anyone know how to do this? I tried strsplit, which splits nicely
> the multiple entries in column ensembl_gene_id. But how to go on?
>
> I'd appreciate any kind of help very much!
> Best regards from Munich,
> Felix
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?

Maybe Matching Threads

Error using getBM() to query BioMart archives

R help - Jul 2009 - splitting multiple data in one column into multiple rows with one entry per column

[R] splitting multiple data in one column into multiple rows with one entry per column

[R] splitting multiple data in one column into multiple rows with one entry per column

Maybe Matching Threads