Felix Müller-Sarnowski
2009-Jul-26 19:26 UTC
[R] splitting multiple data in one column into multiple rows with one entry per column
Dear R colleagues, I annotated a list of single nuclotide polymorphiosms (SNP) with the corresponding genes using biomaRt. The result is the following data.frame (pasted from R): snp ensembl_gene_id 1 rs8032583 2 rs1071600 ENSG00000101605 3 rs13406898 ENSG00000167165 4 rs7030479 ENSG00000107249 5 rs1244414 ENSG00000165629 6 rs1005636 ENSG00000230681 7 rs927913 ENSG00000151655;ENSG00000227546 8 rs4832680 9 rs4435168 ENSG00000229164;ENSG00000225227;ENSG00000211817 10 rs7035549 11 rs12707538 ENSG00000186472 As you can see, the SNP with the identifier rs4435168 corresponds to 3 gene ids, rs927913 corresponds to 2 gene ids. As I'd like to perform a join of several data.frames using the ensembl_gene_id later on, I'd like to split columns with multiple gene identifiers into rows with only one ensembl gene identifier each. So for the example of rs4435168 it should look like this (faked output): snp ensembl_gene_id ... 9 rs4435168 ENSG00000229164 10 rs4435168 ENSG00000225227 11 rs4435168 ENSG00000211817 ... This is just a simple example. Finally there will be a lot of other columns, which should be replicated like the snp column. Does anyone know how to do this? I tried strsplit, which splits nicely the multiple entries in column ensembl_gene_id. But how to go on? I'd appreciate any kind of help very much! Best regards from Munich, Felix
jim holtman
2009-Jul-26 21:14 UTC
[R] splitting multiple data in one column into multiple rows with one entry per column
Try this:> x <- read.table(textConnection("snp ensembl_gene_id+ rs8032583 + rs1071600 ENSG00000101605 + rs13406898 ENSG00000167165 + rs7030479 ENSG00000107249 + rs1244414 ENSG00000165629 + rs1005636 ENSG00000230681 + rs927913 ENSG00000151655;ENSG00000227546 + rs4832680 + rs4435168 ENSG00000229164;ENSG00000225227;ENSG00000211817 + rs7035549 + rs12707538 ENSG00000186472"), header=TRUE, fill=TRUE)> closeAllConnections() > x.new <- do.call(rbind, apply(x, 1, function(.row){+ .ids <- unlist(strsplit(.row[2], ';')) + # check for no data in second column; substitute a blank + if (length(.ids) == 0) return(cbind(.row[1], "")) + else return(cbind(.row[1], .ids)) + }))> x.new.ids snp "rs8032583" "" snp "rs1071600" "ENSG00000101605" snp "rs13406898" "ENSG00000167165" snp "rs7030479" "ENSG00000107249" snp "rs1244414" "ENSG00000165629" snp "rs1005636" "ENSG00000230681" ensembl_gene_id1 "rs927913" "ENSG00000151655" ensembl_gene_id2 "rs927913" "ENSG00000227546" snp "rs4832680" "" ensembl_gene_id1 "rs4435168" "ENSG00000229164" ensembl_gene_id2 "rs4435168" "ENSG00000225227" ensembl_gene_id3 "rs4435168" "ENSG00000211817" snp "rs7035549" "" snp "rs12707538" "ENSG00000186472">On Sun, Jul 26, 2009 at 3:26 PM, Felix M?ller-Sarnowski<drflxms at googlemail.com> wrote:> Dear R colleagues, > > I annotated a list of single nuclotide polymorphiosms (SNP) with the > corresponding genes using biomaRt. The result is the following > data.frame (pasted from R): > > snp ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ensembl_gene_id > 1 ? ? ?rs8032583 > 2 ? ? ?rs1071600 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ENSG00000101605 > 3 ? ? ?rs13406898 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ENSG00000167165 > 4 ? ? ?rs7030479 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ENSG00000107249 > 5 ? ? ?rs1244414 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ENSG00000165629 > 6 ? ? ?rs1005636 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ENSG00000230681 > 7 ? ? ?rs927913 ? ? ? ? ? ? ? ? ENSG00000151655;ENSG00000227546 > 8 ? ? ?rs4832680 > 9 ? ? ?rs4435168 ENSG00000229164;ENSG00000225227;ENSG00000211817 > 10 ? ? rs7035549 > 11 ? ? rs12707538 ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ENSG00000186472 > > As you can see, the SNP with the identifier rs4435168 corresponds to 3 > gene ids, rs927913 corresponds to 2 gene ids. As I'd like to perform a > join of several data.frames using the ensembl_gene_id later on, I'd > like to split columns with multiple gene identifiers into rows with > only one ensembl gene identifier each. So for the example of rs4435168 > it should look like this (faked output): > > snp ? ? ? ? ? ? ? ? ? ensembl_gene_id > ... > 9 ? ? ?rs4435168 ENSG00000229164 > 10 ? ?rs4435168 ENSG00000225227 > 11 ? ?rs4435168 ENSG00000211817 > ... > > This is just a simple example. Finally there will be a lot of other > columns, which should be replicated like the snp column. > > Does anyone know how to do this? I tried strsplit, which splits nicely > the multiple entries in column ensembl_gene_id. But how to go on? > > I'd appreciate any kind of help very much! > Best regards from Munich, > Felix > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >-- Jim Holtman Cincinnati, OH +1 513 646 9390 What is the problem that you are trying to solve?