Bogdan Tanasa
2017-Aug-22 23:57 UTC
[R] splitting a dataframe in R based on multiple gene names in a specific column
I would appreciate a suggestion on how to do the following:

I'm working with a dataframe in R that contains multiple gene names in a specific column, e.g.:

> df.sample.gene[15:20,2:8]
     Chr     Start       End Ref Alt Func.refGene           Gene.refGene
284 chr2  16080996  16080996   C   T ncRNA_exonic                 GACAT3
448 chr2 113979920 113979920   C   T ncRNA_exonic LINC01191,LOC100499194
465 chr2 131279347 131279347   C   G ncRNA_exonic              LOC440910
525 chr2 223777758 223777758   T   A       exonic                  AP1S3
626 chr3  99794575  99794575   G   A       exonic                 COL8A1
643 chr3 132601066 132601066   A   G       exonic                  ACKR4

How could I obtain a dataframe where each line that has multiple gene names (in the field Gene.refGene) is replicated, with only one gene name per line? For example, for the second row:

448 chr2 113979920 113979920 C T ncRNA_exonic LINC01191,LOC100499194

we should get in the final output (which contains all the rows):

448 chr2 113979920 113979920 C T ncRNA_exonic LINC01191
448 chr2 113979920 113979920 C T ncRNA_exonic LOC100499194

Thanks a lot!

-- bogdan
Jim Lemon
2017-Aug-23 00:50 UTC
[R] splitting a dataframe in R based on multiple gene names in a specific column
Hi Bogdan,

Messy, and very specific to your problem:

df.sample.gene<-read.table(
 text="Chr Start End Ref Alt Func.refGene Gene.refGene
284 chr2 16080996 16080996 C T ncRNA_exonic GACAT3
448 chr2 113979920 113979920 C T ncRNA_exonic LINC01191,LOC100499194
465 chr2 131279347 131279347 C G ncRNA_exonic LOC440910
525 chr2 223777758 223777758 T A exonic AP1S3
626 chr3 99794575 99794575 G A exonic COL8A1
643 chr3 132601066 132601066 A G exonic ACKR4
655 chr3 132601999 132601999 A G exonic BCDF5,CDFG6",
 header=TRUE,stringsAsFactors=FALSE)

multgenes<-grep(",",df.sample.gene$Gene.refGene)
rep_genes<-strsplit(df.sample.gene$Gene.refGene[multgenes],",")
ngenes<-unlist(lapply(rep_genes,length))
dup_row<-function(x) {
 newrows<-x
 lastcol<-dim(x)[2]
 rep_genes<-unlist(strsplit(x[,lastcol],","))
 for(i in 2:length(rep_genes)) newrows<-rbind(newrows,x)
 newrows$Gene.refGene<-rep_genes
 return(newrows)
}
for(multgene in multgenes)
 df.sample.gene<-rbind(df.sample.gene,dup_row(df.sample.gene[multgene,]))
df.sample.gene<-df.sample.gene[-multgenes,]
df.sample.gene

I added a second line with multiple genes to make sure that it would work with more than one line.

Jim
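For comparison, the loop-based approach above can also be written without explicit loops in base R. The following is only a sketch: the helper name split_gene_rows is hypothetical (it does not appear in the thread), and it assumes the gene column is a character vector with a plain comma separator. Unlike the loop version, it keeps the split rows in their original position, and it should also cope with a data frame in which no row contains a comma.

```r
# Sample data from the thread; the first field becomes the row names
# because the header has one fewer entry than the data rows.
df.sample.gene <- read.table(
  text = "Chr Start End Ref Alt Func.refGene Gene.refGene
284 chr2 16080996 16080996 C T ncRNA_exonic GACAT3
448 chr2 113979920 113979920 C T ncRNA_exonic LINC01191,LOC100499194
465 chr2 131279347 131279347 C G ncRNA_exonic LOC440910
525 chr2 223777758 223777758 T A exonic AP1S3
626 chr3 99794575 99794575 G A exonic COL8A1
643 chr3 132601066 132601066 A G exonic ACKR4
655 chr3 132601999 132601999 A G exonic BCDF5,CDFG6",
  header = TRUE, stringsAsFactors = FALSE)

# Hypothetical helper (not from the thread): repeat each row once per
# gene name found in `col`, then overwrite `col` with the single names.
split_gene_rows <- function(df, col = "Gene.refGene", sep = ",") {
  genes <- strsplit(df[[col]], sep, fixed = TRUE)  # list of name vectors
  n <- lengths(genes)                              # genes per row
  out <- df[rep(seq_len(nrow(df)), times = n), , drop = FALSE]
  out[[col]] <- unlist(genes)
  out
}

df.split <- split_gene_rows(df.sample.gene)
df.split
```

Because rows are repeated by indexing, R makes the duplicated row names unique (for example 448 and 448.1), so the original row number is still recognisable. An empty string in the gene column would produce zero rows for that line, which may or may not be what is wanted.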
Jeff Newmiller
2017-Aug-25 19:26 UTC
[R] splitting a dataframe in R based on multiple gene names in a specific column
If row numbers can be dispensed with, then tidyr makes this easy with the unnest function:

#####
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
library(purrr)
library(tidyr)
df.sample.gene<-read.table(
 text="Chr Start End Ref Alt Func.refGene Gene.refGene
284 chr2 16080996 16080996 C T ncRNA_exonic GACAT3
448 chr2 113979920 113979920 C T ncRNA_exonic LINC01191,LOC100499194
465 chr2 131279347 131279347 C G ncRNA_exonic LOC440910
525 chr2 223777758 223777758 T A exonic AP1S3
626 chr3 99794575 99794575 G A exonic COL8A1
643 chr3 132601066 132601066 A G exonic ACKR4
655 chr3 132601999 132601999 A G exonic BCDF5,CDFG6",
 header=TRUE,stringsAsFactors=FALSE)

df.sample.out <- (   df.sample.gene
                 %>% mutate( Gene.refGene = strsplit( Gene.refGene , "," ) )
                 %>% unnest( Gene.refGene )
                 )
df.sample.out
#>    Chr     Start       End Ref Alt Func.refGene Gene.refGene
#> 1 chr2  16080996  16080996   C   T ncRNA_exonic       GACAT3
#> 2 chr2 113979920 113979920   C   T ncRNA_exonic    LINC01191
#> 3 chr2 113979920 113979920   C   T ncRNA_exonic LOC100499194
#> 4 chr2 131279347 131279347   C   G ncRNA_exonic    LOC440910
#> 5 chr2 223777758 223777758   T   A       exonic        AP1S3
#> 6 chr3  99794575  99794575   G   A       exonic       COL8A1
#> 7 chr3 132601066 132601066   A   G       exonic        ACKR4
#> 8 chr3 132601999 132601999   A   G       exonic        BCDF5
#> 9 chr3 132601999 132601999   A   G       exonic        CDFG6
#####
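If the original row numbers do matter, one possible variation on the tidyr approach is to copy them into an ordinary column before splitting. The sketch below uses tidyr's separate_rows(), which splits a delimited column into one row per value; the column name row_id is an assumption introduced here for illustration, and recent tidyr releases also provide separate_longer_delim() for the same purpose.

```r
library(tidyr)

# Sample data from the thread; the leading field is read as row names.
df.sample.gene <- read.table(
  text = "Chr Start End Ref Alt Func.refGene Gene.refGene
284 chr2 16080996 16080996 C T ncRNA_exonic GACAT3
448 chr2 113979920 113979920 C T ncRNA_exonic LINC01191,LOC100499194
465 chr2 131279347 131279347 C G ncRNA_exonic LOC440910
525 chr2 223777758 223777758 T A exonic AP1S3
626 chr3 99794575 99794575 G A exonic COL8A1
643 chr3 132601066 132601066 A G exonic ACKR4
655 chr3 132601999 132601999 A G exonic BCDF5,CDFG6",
  header = TRUE, stringsAsFactors = FALSE)

# Keep the original row numbers in a regular column (hypothetical name
# `row_id`), since row names are not carried through the split.
df.sample.gene$row_id <- rownames(df.sample.gene)

# One output row per comma-separated gene name; other columns repeat.
df.long <- as.data.frame(
  separate_rows(df.sample.gene, Gene.refGene, sep = ",")
)
df.long
```

Both rows derived from the LINC01191,LOC100499194 line then carry row_id 448, so they can be traced back to the source row.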