thr3ads.net - R help - [R] splitting a dataframe in R based on multiple gene names in a specific column [Aug 2017]

Bogdan Tanasa

2017-Aug-22 23:57 UTC

[R] splitting a dataframe in R based on multiple gene names in a specific column

I would appreciate a suggestion on how to do the following:

I'm working with a data frame in R in which one column (Gene.refGene)
can hold multiple comma-separated gene names, e.g.:
> df.sample.gene[15:20,2:8]
     Chr     Start       End Ref Alt Func.refGene           Gene.refGene
284 chr2  16080996  16080996   C   T ncRNA_exonic                 GACAT3
448 chr2 113979920 113979920   C   T ncRNA_exonic LINC01191,LOC100499194
465 chr2 131279347 131279347   C   G ncRNA_exonic              LOC440910
525 chr2 223777758 223777758   T   A       exonic                  AP1S3
626 chr3  99794575  99794575   G   A       exonic                 COL8A1
643 chr3 132601066 132601066   A   G       exonic                  ACKR4

How could I obtain a data frame in which each row that has multiple gene
names (in the field Gene.refGene) is replicated once per gene, with a
single gene name in each copy? For example,

for the second row:

  448 chr2 113979920 113979920   C   T ncRNA_exonic LINC01191,LOC100499194

the final output (which still contains all the other rows) should have:

  448 chr2 113979920 113979920   C   T ncRNA_exonic LINC01191
  448 chr2 113979920 113979920   C   T ncRNA_exonic LOC100499194

Thanks a lot!

-- bogdan


Jim Lemon

2017-Aug-23 00:50 UTC


Hi Bogdan,
Messy, and very specific to your problem:

df.sample.gene<-read.table(
 text="Chr     Start       End Ref Alt Func.refGene  Gene.refGene
 284 chr2  16080996  16080996   C   T ncRNA_exonic  GACAT3
 448 chr2 113979920 113979920   C   T ncRNA_exonic  LINC01191,LOC100499194
 465 chr2 131279347 131279347   C   G ncRNA_exonic  LOC440910
 525 chr2 223777758 223777758   T   A       exonic  AP1S3
 626 chr3  99794575  99794575   G   A       exonic  COL8A1
 643 chr3 132601066 132601066   A   G       exonic  ACKR4
 655 chr3 132601999 132601999   A   G       exonic  BCDF5,CDFG6",
 header=TRUE,stringsAsFactors=FALSE)

# Indices of rows whose Gene.refGene holds more than one gene name
multgenes<-grep(",",df.sample.gene$Gene.refGene)
# Given a one-row data frame x, return one copy of it per gene name
dup_row<-function(x) {
 rep_genes<-unlist(strsplit(x$Gene.refGene,","))
 newrows<-x[rep(1,length(rep_genes)),]
 newrows$Gene.refGene<-rep_genes
 return(newrows)
}
# Append the expanded rows at the end of the data frame
# (so the original row order is not preserved)
for(multgene in multgenes)
 df.sample.gene<-rbind(df.sample.gene,dup_row(df.sample.gene[multgene,]))
# Drop the original multi-gene rows; without the guard, an empty
# multgenes would make df.sample.gene[-multgenes,] drop every row
if(length(multgenes)>0) df.sample.gene<-df.sample.gene[-multgenes,]
df.sample.gene

I added a second line with multiple genes to make sure that it would
work with more than one line.

Jim
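
For comparison, the same row duplication can be sketched in base R without a loop. This is not part of the reply above, only a minimal illustration under the assumption that every Gene.refGene entry is a non-empty, comma-separated string; the name `expanded` is made up for the example. Each row index is repeated once per gene, so the original row order is kept.

```r
# Sample data copied from the reply above
df.sample.gene <- read.table(
  text = "Chr Start End Ref Alt Func.refGene Gene.refGene
284 chr2 16080996 16080996 C T ncRNA_exonic GACAT3
448 chr2 113979920 113979920 C T ncRNA_exonic LINC01191,LOC100499194
465 chr2 131279347 131279347 C G ncRNA_exonic LOC440910
525 chr2 223777758 223777758 T A exonic AP1S3
626 chr3 99794575 99794575 G A exonic COL8A1
643 chr3 132601066 132601066 A G exonic ACKR4
655 chr3 132601999 132601999 A G exonic BCDF5,CDFG6",
  header = TRUE, stringsAsFactors = FALSE)

# One character vector of gene names per row
genes <- strsplit(df.sample.gene$Gene.refGene, ",", fixed = TRUE)

# Repeat each row index once per gene name found in that row
# (assumption: no empty entries, which would yield zero copies)
idx <- rep(seq_len(nrow(df.sample.gene)), lengths(genes))
expanded <- df.sample.gene[idx, ]

# Overwrite the combined names with the individual ones
expanded$Gene.refGene <- unlist(genes)
expanded
```

Duplicated rows get row names made unique by R (for example 448 and 448.1), so the original row identifier remains recognisable.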



Jeff Newmiller

2017-Aug-25 19:26 UTC


If row numbers can be dispensed with, then tidyr makes this easy with 
the unnest function:

#####
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
library(purrr)
library(tidyr)

df.sample.gene<-read.table(
  text="Chr     Start       End Ref Alt Func.refGene  Gene.refGene
  284 chr2  16080996  16080996   C   T ncRNA_exonic  GACAT3
  448 chr2 113979920 113979920   C   T ncRNA_exonic  LINC01191,LOC100499194
  465 chr2 131279347 131279347   C   G ncRNA_exonic  LOC440910
  525 chr2 223777758 223777758   T   A       exonic  AP1S3
  626 chr3  99794575  99794575   G   A       exonic  COL8A1
  643 chr3 132601066 132601066   A   G       exonic  ACKR4
  655 chr3 132601999 132601999   A   G       exonic  BCDF5,CDFG6",
  header=TRUE,stringsAsFactors=FALSE)

df.sample.out <- (   df.sample.gene
                  %>% mutate( Gene.refGene = strsplit( Gene.refGene
                                                     , ","
                                                     )
                            )
                  %>% unnest( Gene.refGene )
                  )
df.sample.out
#>    Chr     Start       End Ref Alt Func.refGene Gene.refGene
#> 1 chr2  16080996  16080996   C   T ncRNA_exonic       GACAT3
#> 2 chr2 113979920 113979920   C   T ncRNA_exonic    LINC01191
#> 3 chr2 113979920 113979920   C   T ncRNA_exonic LOC100499194
#> 4 chr2 131279347 131279347   C   G ncRNA_exonic    LOC440910
#> 5 chr2 223777758 223777758   T   A       exonic        AP1S3
#> 6 chr3  99794575  99794575   G   A       exonic       COL8A1
#> 7 chr3 132601066 132601066   A   G       exonic        ACKR4
#> 8 chr3 132601999 132601999   A   G       exonic        BCDF5
#> 9 chr3 132601999 132601999   A   G       exonic        CDFG6
#####
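
As a possible shortcut, tidyr also provides `separate_rows()`, which splits and unnests in one call. The sketch below is an illustration added for comparison, not part of the reply above; it assumes an installed tidyr version that still exports `separate_rows()` (newer releases document `separate_longer_delim()` as its successor). As with `unnest`, the original row numbers are not carried over.

```r
library(tidyr)

# Sample data copied from the reply above
df.sample.gene <- read.table(
  text = "Chr Start End Ref Alt Func.refGene Gene.refGene
284 chr2 16080996 16080996 C T ncRNA_exonic GACAT3
448 chr2 113979920 113979920 C T ncRNA_exonic LINC01191,LOC100499194
465 chr2 131279347 131279347 C G ncRNA_exonic LOC440910
525 chr2 223777758 223777758 T A exonic AP1S3
626 chr3 99794575 99794575 G A exonic COL8A1
643 chr3 132601066 132601066 A G exonic ACKR4
655 chr3 132601999 132601999 A G exonic BCDF5,CDFG6",
  header = TRUE, stringsAsFactors = FALSE)

# Split Gene.refGene on commas and emit one row per gene name
df.sample.out <- separate_rows(df.sample.gene, Gene.refGene, sep = ",")
df.sample.out
```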


---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                       Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
