thr3ads.net - R help - [R] Sorting Data Frame Without Loop [Jun 2011]

If this information is useful, please help other people find it:
Share via:

hi Berven

2011-Jun-10 08:35 UTC

[R] Sorting Data Frame Without Loop

Hello all!
 
I am currently trying to sort a data frame in a particular way, but I am having
some difficulties with this. Specifically I want to sort the below dataset in
such a way that there is only one line per ProteinID and if there are multiple
GeneID or GeneName entries for a single proteinID, that they be concatenated
with a comma separating them. The way I have done it earlier worked fine for
small datasets, but as I am working with around 30,000 entries, it proved too
slow and I'm not sure how to do it in another way.
Here is an example of the input.  
 




ProteinID

GeneID

GeneName


M-012847-00

NM_057175

NARG1


M-012847-00

NM_057175

TBDN100


M-015544-00

NM_153008

FLJ30277


M-015544-00

NM_153027

FLJ31659


M-024202-00

NM_207330

NIPAL1


M-024202-00

NM_207330

NPAL1
 
Here is an example showing what I want: 




ProteinID

GeneID

GeneName


M-012847-00

NM_057175

NARG1, TBDN100


M-015544-00

NM_153008, NM_153027

FLJ30277, FLJ31659 


M-024202-00

NM_207330

NIPAL1, NPAL1
 
Here is the code I have been using so far. I have only managed to get this to
work by using a loop, which I know is not the best way, but at the moment
I'm stuck.
 
colA <- c("M-012847-00" ,"M-012847-00"
,"M-015544-00", "M-015544-00",
"M-024202-00","M-024202-00")
colB <- c("NM_057175", "NM_057175",
"NM_153008", "NM_153027", "NM_207330",
"NM_207330")
colC <- c( 'NARG1', 'TBDN100', 'FLJ30277',
'FLJ31659', 'NIPAL1', 'NPAL1')
 
dupes <- data.frame(ProteinID=colA, GeneID=colB, GeneName=colC)
 
idVec <- character()
geneIDVec <- character()
geneNameVec <- character()
dataType <- "ProteinID"
annotation <- data.frame()
 
for (id in unique(dupes[[dataType]])) {
       print (id)
       idVec <- c(idVec, id)
       geneIDVec <- c(geneIDVec, paste(unique(dupes$GeneID[dupes[[dataType]]
== id]), collapse=", "))
       geneNameVec <- c(geneNameVec,
paste(unique(dupes$GeneName[dupes[[dataType]] == id]), collapse=", "))
       annotation[[dataType]][annotation[[dataType]] == id] <- NA
}
filtered <- data.frame(ProteinID=idVec, GeneID=geneIDVec,
GeneName=geneNameVec)
 
 
Thanks! 		 	   		  
	[[alternative HTML version deleted]]

jim holtman

2011-Jun-10 17:12 UTC

head link

[R] Sorting Data Frame Without Loop

Is this what you want:
> colA <- c("M-012847-00" ,"M-012847-00"
,"M-015544-00", "M-015544-00",
"M-024202-00","M-024202-00")
> colB <- c("NM_057175", "NM_057175",
"NM_153008", "NM_153027", "NM_207330",
"NM_207330")
> colC <- c( 'NARG1', 'TBDN100', 'FLJ30277',
'FLJ31659', 'NIPAL1', 'NPAL1')
>
> dupes <- data.frame(ProteinID=colA, GeneID=colB, GeneName=colC)
> require(sqldf)
> sqldf("+     select ProteinID
+             , group_concat(GeneID) as GeneID
+             , group_concat(GeneName) as GeneName
+         from dupes
+         group by ProteinID
+ ", method = 'raw')
    ProteinID              GeneID          GeneName
1 M-012847-00 NM_057175,NM_057175     NARG1,TBDN100
2 M-015544-00 NM_153008,NM_153027 FLJ30277,FLJ31659
3 M-024202-00 NM_207330,NM_207330      NIPAL1,NPAL1>

On Fri, Jun 10, 2011 at 4:35 AM, hi Berven <theberve at hotmail.com>
wrote:>
> Hello all!
>
> I am currently trying to sort a data frame in a particular way, but I am
having some difficulties with this. Specifically I want to sort the below
dataset in such a way that there is only one line per ProteinID and if there are
multiple GeneID or GeneName entries for a single proteinID, that they be
concatenated with a comma separating them. The way I have done it earlier worked
fine for small datasets, but as I am working with around 30,000 entries, it
proved too slow and I'm not sure how to do it in another way.
> Here is an example of the input.
>
>
>
>
>
> ProteinID
>
> GeneID
>
> GeneName
>
>
> M-012847-00
>
> NM_057175
>
> NARG1
>
>
> M-012847-00
>
> NM_057175
>
> TBDN100
>
>
> M-015544-00
>
> NM_153008
>
> FLJ30277
>
>
> M-015544-00
>
> NM_153027
>
> FLJ31659
>
>
> M-024202-00
>
> NM_207330
>
> NIPAL1
>
>
> M-024202-00
>
> NM_207330
>
> NPAL1
>
> Here is an example showing what I want:
>
>
>
>
> ProteinID
>
> GeneID
>
> GeneName
>
>
> M-012847-00
>
> NM_057175
>
> NARG1, TBDN100
>
>
> M-015544-00
>
> NM_153008, NM_153027
>
> FLJ30277, FLJ31659
>
>
> M-024202-00
>
> NM_207330
>
> NIPAL1, NPAL1
>
> Here is the code I have been using so far. I have only managed to get this
to work by using a loop, which I know is not the best way, but at the moment
I'm stuck.
>
> colA <- c("M-012847-00" ,"M-012847-00"
,"M-015544-00", "M-015544-00",
"M-024202-00","M-024202-00")
> colB <- c("NM_057175", "NM_057175",
"NM_153008", "NM_153027", "NM_207330",
"NM_207330")
> colC <- c( 'NARG1', 'TBDN100', 'FLJ30277',
'FLJ31659', 'NIPAL1', 'NPAL1')
>
> dupes <- data.frame(ProteinID=colA, GeneID=colB, GeneName=colC)
>
> idVec <- character()
> geneIDVec <- character()
> geneNameVec <- character()
> dataType <- "ProteinID"
> annotation <- data.frame()
>
> for (id in unique(dupes[[dataType]])) {
> ? ? ? print (id)
> ? ? ? idVec <- c(idVec, id)
> ? ? ? geneIDVec <- c(geneIDVec,
paste(unique(dupes$GeneID[dupes[[dataType]] == id]), collapse=", "))
> ? ? ? geneNameVec <- c(geneNameVec,
paste(unique(dupes$GeneName[dupes[[dataType]] == id]), collapse=", "))
> ? ? ? annotation[[dataType]][annotation[[dataType]] == id] <- NA
> }
> filtered <- data.frame(ProteinID=idVec, GeneID=geneIDVec,
GeneName=geneNameVec)
>
>
> Thanks!
> ? ? ? ?[[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?

jim holtman

2011-Jun-10 17:14 UTC

head link

[R] Sorting Data Frame Without Loop

This will give you the unique gene values:
> colA <- c("M-012847-00" ,"M-012847-00"
,"M-015544-00", "M-015544-00",
"M-024202-00","M-024202-00")
> colB <- c("NM_057175", "NM_057175",
"NM_153008", "NM_153027", "NM_207330",
"NM_207330")
> colC <- c( 'NARG1', 'TBDN100', 'FLJ30277',
'FLJ31659', 'NIPAL1', 'NPAL1')
>
> dupes <- data.frame(ProteinID=colA, GeneID=colB, GeneName=colC)
> require(sqldf)
> sqldf("+     select ProteinID
+             , group_concat(distinct GeneID) as GeneID
+             , group_concat(distinct GeneName) as GeneName
+         from dupes
+         group by ProteinID
+ ", method = 'raw')
    ProteinID              GeneID          GeneName
1 M-012847-00           NM_057175     NARG1,TBDN100
2 M-015544-00 NM_153008,NM_153027 FLJ30277,FLJ31659
3 M-024202-00           NM_207330      NIPAL1,NPAL1>

On Fri, Jun 10, 2011 at 4:35 AM, hi Berven <theberve at hotmail.com>
wrote:>
> Hello all!
>
> I am currently trying to sort a data frame in a particular way, but I am
having some difficulties with this. Specifically I want to sort the below
dataset in such a way that there is only one line per ProteinID and if there are
multiple GeneID or GeneName entries for a single proteinID, that they be
concatenated with a comma separating them. The way I have done it earlier worked
fine for small datasets, but as I am working with around 30,000 entries, it
proved too slow and I'm not sure how to do it in another way.
> Here is an example of the input.
>
>
>
>
>
> ProteinID
>
> GeneID
>
> GeneName
>
>
> M-012847-00
>
> NM_057175
>
> NARG1
>
>
> M-012847-00
>
> NM_057175
>
> TBDN100
>
>
> M-015544-00
>
> NM_153008
>
> FLJ30277
>
>
> M-015544-00
>
> NM_153027
>
> FLJ31659
>
>
> M-024202-00
>
> NM_207330
>
> NIPAL1
>
>
> M-024202-00
>
> NM_207330
>
> NPAL1
>
> Here is an example showing what I want:
>
>
>
>
> ProteinID
>
> GeneID
>
> GeneName
>
>
> M-012847-00
>
> NM_057175
>
> NARG1, TBDN100
>
>
> M-015544-00
>
> NM_153008, NM_153027
>
> FLJ30277, FLJ31659
>
>
> M-024202-00
>
> NM_207330
>
> NIPAL1, NPAL1
>
> Here is the code I have been using so far. I have only managed to get this
to work by using a loop, which I know is not the best way, but at the moment
I'm stuck.
>
> colA <- c("M-012847-00" ,"M-012847-00"
,"M-015544-00", "M-015544-00",
"M-024202-00","M-024202-00")
> colB <- c("NM_057175", "NM_057175",
"NM_153008", "NM_153027", "NM_207330",
"NM_207330")
> colC <- c( 'NARG1', 'TBDN100', 'FLJ30277',
'FLJ31659', 'NIPAL1', 'NPAL1')
>
> dupes <- data.frame(ProteinID=colA, GeneID=colB, GeneName=colC)
>
> idVec <- character()
> geneIDVec <- character()
> geneNameVec <- character()
> dataType <- "ProteinID"
> annotation <- data.frame()
>
> for (id in unique(dupes[[dataType]])) {
> ? ? ? print (id)
> ? ? ? idVec <- c(idVec, id)
> ? ? ? geneIDVec <- c(geneIDVec,
paste(unique(dupes$GeneID[dupes[[dataType]] == id]), collapse=", "))
> ? ? ? geneNameVec <- c(geneNameVec,
paste(unique(dupes$GeneName[dupes[[dataType]] == id]), collapse=", "))
> ? ? ? annotation[[dataType]][annotation[[dataType]] == id] <- NA
> }
> filtered <- data.frame(ProteinID=idVec, GeneID=geneIDVec,
GeneName=geneNameVec)
>
>
> Thanks!
> ? ? ? ?[[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?

R help - Jun 2011 - Sorting Data Frame Without Loop

[R] Sorting Data Frame Without Loop

[R] Sorting Data Frame Without Loop

[R] Sorting Data Frame Without Loop