Hello all! I am currently trying to sort a data frame in a particular way, but I am having some difficulties with this. Specifically I want to sort the below dataset in such a way that there is only one line per ProteinID and if there are multiple GeneID or GeneName entries for a single proteinID, that they be concatenated with a comma separating them. The way I have done it earlier worked fine for small datasets, but as I am working with around 30,000 entries, it proved too slow and I'm not sure how to do it in another way. Here is an example of the input. ProteinID GeneID GeneName M-012847-00 NM_057175 NARG1 M-012847-00 NM_057175 TBDN100 M-015544-00 NM_153008 FLJ30277 M-015544-00 NM_153027 FLJ31659 M-024202-00 NM_207330 NIPAL1 M-024202-00 NM_207330 NPAL1 Here is an example showing what I want: ProteinID GeneID GeneName M-012847-00 NM_057175 NARG1, TBDN100 M-015544-00 NM_153008, NM_153027 FLJ30277, FLJ31659 M-024202-00 NM_207330 NIPAL1, NPAL1 Here is the code I have been using so far. I have only managed to get this to work by using a loop, which I know is not the best way, but at the moment I'm stuck. colA <- c("M-012847-00" ,"M-012847-00" ,"M-015544-00", "M-015544-00", "M-024202-00","M-024202-00") colB <- c("NM_057175", "NM_057175", "NM_153008", "NM_153027", "NM_207330", "NM_207330") colC <- c( 'NARG1', 'TBDN100', 'FLJ30277', 'FLJ31659', 'NIPAL1', 'NPAL1') dupes <- data.frame(ProteinID=colA, GeneID=colB, GeneName=colC) idVec <- character() geneIDVec <- character() geneNameVec <- character() dataType <- "ProteinID" annotation <- data.frame() for (id in unique(dupes[[dataType]])) { print (id) idVec <- c(idVec, id) geneIDVec <- c(geneIDVec, paste(unique(dupes$GeneID[dupes[[dataType]] == id]), collapse=", ")) geneNameVec <- c(geneNameVec, paste(unique(dupes$GeneName[dupes[[dataType]] == id]), collapse=", ")) annotation[[dataType]][annotation[[dataType]] == id] <- NA } filtered <- data.frame(ProteinID=idVec, GeneID=geneIDVec, GeneName=geneNameVec) Thanks! [[alternative HTML version deleted]]
Is this what you want:> colA <- c("M-012847-00" ,"M-012847-00" ,"M-015544-00", "M-015544-00", "M-024202-00","M-024202-00") > colB <- c("NM_057175", "NM_057175", "NM_153008", "NM_153027", "NM_207330", "NM_207330") > colC <- c( 'NARG1', 'TBDN100', 'FLJ30277', 'FLJ31659', 'NIPAL1', 'NPAL1') > > dupes <- data.frame(ProteinID=colA, GeneID=colB, GeneName=colC) > require(sqldf) > sqldf("+ select ProteinID + , group_concat(GeneID) as GeneID + , group_concat(GeneName) as GeneName + from dupes + group by ProteinID + ", method = 'raw') ProteinID GeneID GeneName 1 M-012847-00 NM_057175,NM_057175 NARG1,TBDN100 2 M-015544-00 NM_153008,NM_153027 FLJ30277,FLJ31659 3 M-024202-00 NM_207330,NM_207330 NIPAL1,NPAL1>On Fri, Jun 10, 2011 at 4:35 AM, hi Berven <theberve at hotmail.com> wrote:> > Hello all! > > I am currently trying to sort a data frame in a particular way, but I am having some difficulties with this. Specifically I want to sort the below dataset in such a way that there is only one line per ProteinID and if there are multiple GeneID or GeneName entries for a single proteinID, that they be concatenated with a comma separating them. The way I have done it earlier worked fine for small datasets, but as I am working with around 30,000 entries, it proved too slow and I'm not sure how to do it in another way. > Here is an example of the input. > > > > > > ProteinID > > GeneID > > GeneName > > > M-012847-00 > > NM_057175 > > NARG1 > > > M-012847-00 > > NM_057175 > > TBDN100 > > > M-015544-00 > > NM_153008 > > FLJ30277 > > > M-015544-00 > > NM_153027 > > FLJ31659 > > > M-024202-00 > > NM_207330 > > NIPAL1 > > > M-024202-00 > > NM_207330 > > NPAL1 > > Here is an example showing what I want: > > > > > ProteinID > > GeneID > > GeneName > > > M-012847-00 > > NM_057175 > > NARG1, TBDN100 > > > M-015544-00 > > NM_153008, NM_153027 > > FLJ30277, FLJ31659 > > > M-024202-00 > > NM_207330 > > NIPAL1, NPAL1 > > Here is the code I have been using so far. I have only managed to get this to work by using a loop, which I know is not the best way, but at the moment I'm stuck. > > colA <- c("M-012847-00" ,"M-012847-00" ,"M-015544-00", "M-015544-00", "M-024202-00","M-024202-00") > colB <- c("NM_057175", "NM_057175", "NM_153008", "NM_153027", "NM_207330", "NM_207330") > colC <- c( 'NARG1', 'TBDN100', 'FLJ30277', 'FLJ31659', 'NIPAL1', 'NPAL1') > > dupes <- data.frame(ProteinID=colA, GeneID=colB, GeneName=colC) > > idVec <- character() > geneIDVec <- character() > geneNameVec <- character() > dataType <- "ProteinID" > annotation <- data.frame() > > for (id in unique(dupes[[dataType]])) { > ? ? ? print (id) > ? ? ? idVec <- c(idVec, id) > ? ? ? geneIDVec <- c(geneIDVec, paste(unique(dupes$GeneID[dupes[[dataType]] == id]), collapse=", ")) > ? ? ? geneNameVec <- c(geneNameVec, paste(unique(dupes$GeneName[dupes[[dataType]] == id]), collapse=", ")) > ? ? ? annotation[[dataType]][annotation[[dataType]] == id] <- NA > } > filtered <- data.frame(ProteinID=idVec, GeneID=geneIDVec, GeneName=geneNameVec) > > > Thanks! > ? ? ? ?[[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >-- Jim Holtman Data Munger Guru What is the problem that you are trying to solve?
This will give you the unique gene values:> colA <- c("M-012847-00" ,"M-012847-00" ,"M-015544-00", "M-015544-00", "M-024202-00","M-024202-00") > colB <- c("NM_057175", "NM_057175", "NM_153008", "NM_153027", "NM_207330", "NM_207330") > colC <- c( 'NARG1', 'TBDN100', 'FLJ30277', 'FLJ31659', 'NIPAL1', 'NPAL1') > > dupes <- data.frame(ProteinID=colA, GeneID=colB, GeneName=colC) > require(sqldf) > sqldf("+ select ProteinID + , group_concat(distinct GeneID) as GeneID + , group_concat(distinct GeneName) as GeneName + from dupes + group by ProteinID + ", method = 'raw') ProteinID GeneID GeneName 1 M-012847-00 NM_057175 NARG1,TBDN100 2 M-015544-00 NM_153008,NM_153027 FLJ30277,FLJ31659 3 M-024202-00 NM_207330 NIPAL1,NPAL1>On Fri, Jun 10, 2011 at 4:35 AM, hi Berven <theberve at hotmail.com> wrote:> > Hello all! > > I am currently trying to sort a data frame in a particular way, but I am having some difficulties with this. Specifically I want to sort the below dataset in such a way that there is only one line per ProteinID and if there are multiple GeneID or GeneName entries for a single proteinID, that they be concatenated with a comma separating them. The way I have done it earlier worked fine for small datasets, but as I am working with around 30,000 entries, it proved too slow and I'm not sure how to do it in another way. > Here is an example of the input. > > > > > > ProteinID > > GeneID > > GeneName > > > M-012847-00 > > NM_057175 > > NARG1 > > > M-012847-00 > > NM_057175 > > TBDN100 > > > M-015544-00 > > NM_153008 > > FLJ30277 > > > M-015544-00 > > NM_153027 > > FLJ31659 > > > M-024202-00 > > NM_207330 > > NIPAL1 > > > M-024202-00 > > NM_207330 > > NPAL1 > > Here is an example showing what I want: > > > > > ProteinID > > GeneID > > GeneName > > > M-012847-00 > > NM_057175 > > NARG1, TBDN100 > > > M-015544-00 > > NM_153008, NM_153027 > > FLJ30277, FLJ31659 > > > M-024202-00 > > NM_207330 > > NIPAL1, NPAL1 > > Here is the code I have been using so far. I have only managed to get this to work by using a loop, which I know is not the best way, but at the moment I'm stuck. > > colA <- c("M-012847-00" ,"M-012847-00" ,"M-015544-00", "M-015544-00", "M-024202-00","M-024202-00") > colB <- c("NM_057175", "NM_057175", "NM_153008", "NM_153027", "NM_207330", "NM_207330") > colC <- c( 'NARG1', 'TBDN100', 'FLJ30277', 'FLJ31659', 'NIPAL1', 'NPAL1') > > dupes <- data.frame(ProteinID=colA, GeneID=colB, GeneName=colC) > > idVec <- character() > geneIDVec <- character() > geneNameVec <- character() > dataType <- "ProteinID" > annotation <- data.frame() > > for (id in unique(dupes[[dataType]])) { > ? ? ? print (id) > ? ? ? idVec <- c(idVec, id) > ? ? ? geneIDVec <- c(geneIDVec, paste(unique(dupes$GeneID[dupes[[dataType]] == id]), collapse=", ")) > ? ? ? geneNameVec <- c(geneNameVec, paste(unique(dupes$GeneName[dupes[[dataType]] == id]), collapse=", ")) > ? ? ? annotation[[dataType]][annotation[[dataType]] == id] <- NA > } > filtered <- data.frame(ProteinID=idVec, GeneID=geneIDVec, GeneName=geneNameVec) > > > Thanks! > ? ? ? ?[[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >-- Jim Holtman Data Munger Guru What is the problem that you are trying to solve?