Dear useRs, I have a slightly complicated data structure and am stuck trying to extract what I need. I'm pasting an example of this data below. In some cases, there are duplicates in the "gene_id" column because there are two different "sample 1" values for a given "sample 2" value. Where these duplicates exist, I need to average the corresponding "FL_EARLY" values and retain the "FL_LATE" value and replace those two rows with a row containing the "FL_EARLY" average so that I no longer have any "gene_id" duplicates. Seems like this is a job for some version of the apply function, but searching and puzzling over this has not gotten me anywhere. Any help will be much appreciated! Example data: gene_id sample_1 sample_2 FL_EARLY FL_LATE 763938 Eucgr.A00054 fl_S1E fl_S1L 13.170800 22.2605 763979 Eucgr.A00101 fl_S1E fl_S1L 0.367960 14.1202 1273243 Eucgr.A00101 fl_S2 fl_S1L 0.356625 14.1202 764169 Eucgr.A00350 fl_S1E fl_S1L 7.381070 43.9275 1273433 Eucgr.A00350 fl_S2 fl_S1L 10.674500 43.9275 1273669 Eucgr.A00650 fl_S2 fl_S1L 33.669100 50.0169 764480 Eucgr.A00744 fl_S1E fl_S1L 132.429000 747.2770 1273744 Eucgr.A00744 fl_S2 fl_S1L 142.659000 747.2770 764595 Eucgr.A00890 fl_S1E fl_S1L 2.937760 14.9647 764683 Eucgr.A00990 fl_S1E fl_S1L 8.681250 48.5492 1273947 Eucgr.A00990 fl_S2 fl_S1L 10.553300 48.5492 764710 Eucgr.A01020 fl_S1E fl_S1L 0.000000 57.9273 1273974 Eucgr.A01020 fl_S2 fl_S1L 0.000000 57.9273 764756 Eucgr.A01073 fl_S1E fl_S1L 8.504710 101.1870 1274020 Eucgr.A01073 fl_S2 fl_S1L 5.400010 101.1870 764773 Eucgr.A01091 fl_S1E fl_S1L 3.448910 15.7756 764826 Eucgr.A01152 fl_S1E fl_S1L 69.565700 198.2320 764831 Eucgr.A01158 fl_S1E fl_S1L 7.265640 30.9565 764845 Eucgr.A01172 fl_S1E fl_S1L 3.248020 16.9127 764927 Eucgr.A01269 fl_S1E fl_S1L 18.710200 76.6918 --Kelly V. [[alternative HTML version deleted]]
You will have to split() the data and unsplit() it after making the alterations. Have a look at the plyr package for such functions.> -----Original Message----- > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r- > project.org] On Behalf Of Vining, Kelly > Sent: Friday 12 October 2012 5:42 > To: r-help at r-project.org > Subject: [R] average duplicated rows? > > Dear useRs, > > I have a slightly complicated data structure and am stuck trying to > extract what I need. I'm pasting an example of this data below. In some > cases, there are duplicates in the "gene_id" column because there are > two different "sample 1" values for a given "sample 2" value. Where > these duplicates exist, I need to average the corresponding "FL_EARLY" > values and retain the "FL_LATE" value and replace those two rows with a > row containing the "FL_EARLY" average so that I no longer have any > "gene_id" duplicates. > > Seems like this is a job for some version of the apply function, but > searching and puzzling over this has not gotten me anywhere. Any help > will be much appreciated! > > Example data: > > > gene_id sample_1 sample_2 FL_EARLY FL_LATE > 763938 Eucgr.A00054 fl_S1E fl_S1L 13.170800 22.2605 > 763979 Eucgr.A00101 fl_S1E fl_S1L 0.367960 14.1202 > 1273243 Eucgr.A00101 fl_S2 fl_S1L 0.356625 14.1202 > 764169 Eucgr.A00350 fl_S1E fl_S1L 7.381070 43.9275 > 1273433 Eucgr.A00350 fl_S2 fl_S1L 10.674500 43.9275 > 1273669 Eucgr.A00650 fl_S2 fl_S1L 33.669100 50.0169 > 764480 Eucgr.A00744 fl_S1E fl_S1L 132.429000 747.2770 > 1273744 Eucgr.A00744 fl_S2 fl_S1L 142.659000 747.2770 > 764595 Eucgr.A00890 fl_S1E fl_S1L 2.937760 14.9647 > 764683 Eucgr.A00990 fl_S1E fl_S1L 8.681250 48.5492 > 1273947 Eucgr.A00990 fl_S2 fl_S1L 10.553300 48.5492 > 764710 Eucgr.A01020 fl_S1E fl_S1L 0.000000 57.9273 > 1273974 Eucgr.A01020 fl_S2 fl_S1L 0.000000 57.9273 > 764756 Eucgr.A01073 fl_S1E fl_S1L 8.504710 101.1870 > 1274020 Eucgr.A01073 fl_S2 fl_S1L 5.400010 101.1870 > 764773 Eucgr.A01091 fl_S1E fl_S1L 3.448910 15.7756 > 764826 Eucgr.A01152 fl_S1E fl_S1L 69.565700 198.2320 > 764831 Eucgr.A01158 fl_S1E fl_S1L 7.265640 30.9565 > 764845 Eucgr.A01172 fl_S1E fl_S1L 3.248020 16.9127 > 764927 Eucgr.A01269 fl_S1E fl_S1L 18.710200 76.6918 > > > > --Kelly V. > > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting- > guide.html > and provide commented, minimal, self-contained, reproducible code.
Hello, It could be a job for tapply, but I find it more suited for ?ave. dat <- read.table(text = " gene_id sample_1 sample_2 FL_EARLY FL_LATE 763938 Eucgr.A00054 fl_S1E fl_S1L 13.170800 22.2605 763979 Eucgr.A00101 fl_S1E fl_S1L 0.367960 14.1202 1273243 Eucgr.A00101 fl_S2 fl_S1L 0.356625 14.1202 764169 Eucgr.A00350 fl_S1E fl_S1L 7.381070 43.9275 1273433 Eucgr.A00350 fl_S2 fl_S1L 10.674500 43.9275 1273669 Eucgr.A00650 fl_S2 fl_S1L 33.669100 50.0169 764480 Eucgr.A00744 fl_S1E fl_S1L 132.429000 747.2770 1273744 Eucgr.A00744 fl_S2 fl_S1L 142.659000 747.2770 764595 Eucgr.A00890 fl_S1E fl_S1L 2.937760 14.9647 764683 Eucgr.A00990 fl_S1E fl_S1L 8.681250 48.5492 1273947 Eucgr.A00990 fl_S2 fl_S1L 10.553300 48.5492 764710 Eucgr.A01020 fl_S1E fl_S1L 0.000000 57.9273 1273974 Eucgr.A01020 fl_S2 fl_S1L 0.000000 57.9273 764756 Eucgr.A01073 fl_S1E fl_S1L 8.504710 101.1870 1274020 Eucgr.A01073 fl_S2 fl_S1L 5.400010 101.1870 764773 Eucgr.A01091 fl_S1E fl_S1L 3.448910 15.7756 764826 Eucgr.A01152 fl_S1E fl_S1L 69.565700 198.2320 764831 Eucgr.A01158 fl_S1E fl_S1L 7.265640 30.9565 764845 Eucgr.A01172 fl_S1E fl_S1L 3.248020 16.9127 764927 Eucgr.A01269 fl_S1E fl_S1L 18.710200 76.6918 ", header = TRUE) av <- ave(dat$FL_EARLY, dat$gene_id) dat$FLY_EARLY <- av Hope this helps, Rui Barradas Em 12-10-2012 16:41, Vining, Kelly escreveu:> Dear useRs, > > I have a slightly complicated data structure and am stuck trying to extract what I need. I'm pasting an example of this data below. In some cases, there are duplicates in the "gene_id" column because there are two different "sample 1" values for a given "sample 2" value. Where these duplicates exist, I need to average the corresponding "FL_EARLY" values and retain the "FL_LATE" value and replace those two rows with a row containing the "FL_EARLY" average so that I no longer have any "gene_id" duplicates. > > Seems like this is a job for some version of the apply function, but searching and puzzling over this has not gotten me anywhere. Any help will be much appreciated! > > Example data: > > > gene_id sample_1 sample_2 FL_EARLY FL_LATE > 763938 Eucgr.A00054 fl_S1E fl_S1L 13.170800 22.2605 > 763979 Eucgr.A00101 fl_S1E fl_S1L 0.367960 14.1202 > 1273243 Eucgr.A00101 fl_S2 fl_S1L 0.356625 14.1202 > 764169 Eucgr.A00350 fl_S1E fl_S1L 7.381070 43.9275 > 1273433 Eucgr.A00350 fl_S2 fl_S1L 10.674500 43.9275 > 1273669 Eucgr.A00650 fl_S2 fl_S1L 33.669100 50.0169 > 764480 Eucgr.A00744 fl_S1E fl_S1L 132.429000 747.2770 > 1273744 Eucgr.A00744 fl_S2 fl_S1L 142.659000 747.2770 > 764595 Eucgr.A00890 fl_S1E fl_S1L 2.937760 14.9647 > 764683 Eucgr.A00990 fl_S1E fl_S1L 8.681250 48.5492 > 1273947 Eucgr.A00990 fl_S2 fl_S1L 10.553300 48.5492 > 764710 Eucgr.A01020 fl_S1E fl_S1L 0.000000 57.9273 > 1273974 Eucgr.A01020 fl_S2 fl_S1L 0.000000 57.9273 > 764756 Eucgr.A01073 fl_S1E fl_S1L 8.504710 101.1870 > 1274020 Eucgr.A01073 fl_S2 fl_S1L 5.400010 101.1870 > 764773 Eucgr.A01091 fl_S1E fl_S1L 3.448910 15.7756 > 764826 Eucgr.A01152 fl_S1E fl_S1L 69.565700 198.2320 > 764831 Eucgr.A01158 fl_S1E fl_S1L 7.265640 30.9565 > 764845 Eucgr.A01172 fl_S1E fl_S1L 3.248020 16.9127 > 764927 Eucgr.A01269 fl_S1E fl_S1L 18.710200 76.6918 > > > > --Kelly V. > > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.
HI, You can also try this: dat1<-read.table(text=" ??????????? gene_id sample_1 sample_2?? FL_EARLY? FL_LATE 763938? Eucgr.A00054?? fl_S1E?? fl_S1L? 13.170800? 22.2605 763979? Eucgr.A00101?? fl_S1E?? fl_S1L?? 0.367960? 14.1202 1273243 Eucgr.A00101??? fl_S2?? fl_S1L?? 0.356625? 14.1202 764169? Eucgr.A00350?? fl_S1E?? fl_S1L?? 7.381070? 43.9275 1273433 Eucgr.A00350??? fl_S2?? fl_S1L? 10.674500? 43.9275 1273669 Eucgr.A00650??? fl_S2?? fl_S1L? 33.669100? 50.0169 764480? Eucgr.A00744?? fl_S1E?? fl_S1L 132.429000 747.2770 1273744 Eucgr.A00744??? fl_S2?? fl_S1L 142.659000 747.2770 764595? Eucgr.A00890?? fl_S1E?? fl_S1L?? 2.937760? 14.9647 764683? Eucgr.A00990?? fl_S1E?? fl_S1L?? 8.681250? 48.5492 1273947 Eucgr.A00990??? fl_S2?? fl_S1L? 10.553300? 48.5492 764710? Eucgr.A01020?? fl_S1E?? fl_S1L?? 0.000000? 57.9273 1273974 Eucgr.A01020??? fl_S2?? fl_S1L?? 0.000000? 57.9273 764756? Eucgr.A01073?? fl_S1E?? fl_S1L?? 8.504710 101.1870 1274020 Eucgr.A01073??? fl_S2?? fl_S1L?? 5.400010 101.1870 764773? Eucgr.A01091?? fl_S1E?? fl_S1L?? 3.448910? 15.7756 764826? Eucgr.A01152?? fl_S1E?? fl_S1L? 69.565700 198.2320 764831? Eucgr.A01158?? fl_S1E?? fl_S1L?? 7.265640? 30.9565 764845? Eucgr.A01172?? fl_S1E?? fl_S1L?? 3.248020? 16.9127 764927? Eucgr.A01269?? fl_S1E?? fl_S1L? 18.710200? 76.6918 ",sep="",header=TRUE,stringsAsFactors=FALSE) do.call(rbind,lapply(lapply(split(dat1,dat1$gene_id),`[`,4:5),colMeans)) ??????????????? FL_EARLY? FL_LATE #Eucgr.A00054? 13.1708000? 22.2605 #Eucgr.A00101?? 0.3622925? 14.1202 #Eucgr.A00350?? 9.0277850? 43.9275 #Eucgr.A00650? 33.6691000? 50.0169 #Eucgr.A00744 137.5440000 747.2770 #Eucgr.A00890?? 2.9377600? 14.9647 #Eucgr.A00990?? 9.6172750? 48.5492 #Eucgr.A01020?? 0.0000000? 57.9273 #Eucgr.A01073?? 6.9523600 101.1870 #Eucgr.A01091?? 3.4489100? 15.7756 #Eucgr.A01152? 69.5657000 198.2320 #Eucgr.A01158?? 7.2656400? 30.9565 #Eucgr.A01172?? 3.2480200? 16.9127 #Eucgr.A01269? 18.7102000? 76.6918 In addition to aggregate(), ddply() etc. library(data.table) dat2<-data.table(dat1) dat3<-dat2[,list(FL_EARLY=mean(FL_EARLY),FL_LATE=mean(FL_LATE)),list(gene_id)] #aggregate() dat4<-with(dat1,aggregate(cbind(FL_EARLY,FL_LATE),list(gene_id),FUN=mean)) colnames(dat4)<-colnames(dat1)[c(1,4,5)] #ddply() library(plyr) dat5<-ddply(dat1,.(gene_id),colwise(mean,c("FL_EARLY","FL_LATE"))) A.K. ----- Original Message ----- From: "Vining, Kelly" <Kelly.Vining at oregonstate.edu> To: "r-help at r-project.org" <r-help at r-project.org> Cc: Sent: Friday, October 12, 2012 11:41 AM Subject: [R] average duplicated rows? Dear useRs, I have a slightly complicated data structure and am stuck trying to extract what I need. I'm pasting an example of this data below. In some cases, there are duplicates in the "gene_id" column because there are two different "sample 1" values for a given "sample 2" value. Where these duplicates exist, I need to average the corresponding "FL_EARLY" values and retain the "FL_LATE" value and replace those two rows with a row containing the "FL_EARLY" average so that I no longer have any "gene_id" duplicates. Seems like this is a job for some version of the apply function, but searching and puzzling over this has not gotten me anywhere. Any help will be much appreciated! Example data: ? ? ? ? ? ? gene_id sample_1 sample_2? FL_EARLY? FL_LATE 763938? Eucgr.A00054? fl_S1E? fl_S1L? 13.170800? 22.2605 763979? Eucgr.A00101? fl_S1E? fl_S1L? 0.367960? 14.1202 1273243 Eucgr.A00101? ? fl_S2? fl_S1L? 0.356625? 14.1202 764169? Eucgr.A00350? fl_S1E? fl_S1L? 7.381070? 43.9275 1273433 Eucgr.A00350? ? fl_S2? fl_S1L? 10.674500? 43.9275 1273669 Eucgr.A00650? ? fl_S2? fl_S1L? 33.669100? 50.0169 764480? Eucgr.A00744? fl_S1E? fl_S1L 132.429000 747.2770 1273744 Eucgr.A00744? ? fl_S2? fl_S1L 142.659000 747.2770 764595? Eucgr.A00890? fl_S1E? fl_S1L? 2.937760? 14.9647 764683? Eucgr.A00990? fl_S1E? fl_S1L? 8.681250? 48.5492 1273947 Eucgr.A00990? ? fl_S2? fl_S1L? 10.553300? 48.5492 764710? Eucgr.A01020? fl_S1E? fl_S1L? 0.000000? 57.9273 1273974 Eucgr.A01020? ? fl_S2? fl_S1L? 0.000000? 57.9273 764756? Eucgr.A01073? fl_S1E? fl_S1L? 8.504710 101.1870 1274020 Eucgr.A01073? ? fl_S2? fl_S1L? 5.400010 101.1870 764773? Eucgr.A01091? fl_S1E? fl_S1L? 3.448910? 15.7756 764826? Eucgr.A01152? fl_S1E? fl_S1L? 69.565700 198.2320 764831? Eucgr.A01158? fl_S1E? fl_S1L? 7.265640? 30.9565 764845? Eucgr.A01172? fl_S1E? fl_S1L? 3.248020? 16.9127 764927? Eucgr.A01269? fl_S1E? fl_S1L? 18.710200? 76.6918 --Kelly V. ??? [[alternative HTML version deleted]] ______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.