thr3ads.net - R help - [R] average duplicated rows? [Oct 2012]

If this information is useful, please help other people find it:
Share via:

Vining, Kelly

2012-Oct-12 15:41 UTC

[R] average duplicated rows?

Dear useRs,

I have a slightly complicated data structure and am stuck trying to extract what
I need. I'm pasting an example of this data below. In some cases, there are
duplicates in the "gene_id" column because there are two different
"sample 1" values for a given "sample 2" value. Where these
duplicates exist, I need to average the corresponding "FL_EARLY"
values and retain the "FL_LATE" value and replace those two rows with
a row containing the "FL_EARLY" average so that I no longer have any
"gene_id" duplicates.

Seems like this is a job for some version of the apply function, but searching
and puzzling over this has not gotten me anywhere. Any help will be much
appreciated!

Example data:


             gene_id sample_1 sample_2   FL_EARLY  FL_LATE
763938  Eucgr.A00054   fl_S1E   fl_S1L  13.170800  22.2605
763979  Eucgr.A00101   fl_S1E   fl_S1L   0.367960  14.1202
1273243 Eucgr.A00101    fl_S2   fl_S1L   0.356625  14.1202
764169  Eucgr.A00350   fl_S1E   fl_S1L   7.381070  43.9275
1273433 Eucgr.A00350    fl_S2   fl_S1L  10.674500  43.9275
1273669 Eucgr.A00650    fl_S2   fl_S1L  33.669100  50.0169
764480  Eucgr.A00744   fl_S1E   fl_S1L 132.429000 747.2770
1273744 Eucgr.A00744    fl_S2   fl_S1L 142.659000 747.2770
764595  Eucgr.A00890   fl_S1E   fl_S1L   2.937760  14.9647
764683  Eucgr.A00990   fl_S1E   fl_S1L   8.681250  48.5492
1273947 Eucgr.A00990    fl_S2   fl_S1L  10.553300  48.5492
764710  Eucgr.A01020   fl_S1E   fl_S1L   0.000000  57.9273
1273974 Eucgr.A01020    fl_S2   fl_S1L   0.000000  57.9273
764756  Eucgr.A01073   fl_S1E   fl_S1L   8.504710 101.1870
1274020 Eucgr.A01073    fl_S2   fl_S1L   5.400010 101.1870
764773  Eucgr.A01091   fl_S1E   fl_S1L   3.448910  15.7756
764826  Eucgr.A01152   fl_S1E   fl_S1L  69.565700 198.2320
764831  Eucgr.A01158   fl_S1E   fl_S1L   7.265640  30.9565
764845  Eucgr.A01172   fl_S1E   fl_S1L   3.248020  16.9127
764927  Eucgr.A01269   fl_S1E   fl_S1L  18.710200  76.6918



--Kelly V.


	[[alternative HTML version deleted]]

Pieter Schoonees

2012-Oct-12 16:18 UTC

head link

[R] average duplicated rows?

You will have to split() the data and unsplit() it after making the alterations.
Have a look at the plyr package for such functions.
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
> project.org] On Behalf Of Vining, Kelly
> Sent: Friday 12 October 2012 5:42
> To: r-help at r-project.org
> Subject: [R] average duplicated rows?
> 
> Dear useRs,
> 
> I have a slightly complicated data structure and am stuck trying to
> extract what I need. I'm pasting an example of this data below. In some
> cases, there are duplicates in the "gene_id" column because there
are
> two different "sample 1" values for a given "sample 2"
value. Where
> these duplicates exist, I need to average the corresponding
"FL_EARLY"
> values and retain the "FL_LATE" value and replace those two rows
with a
> row containing the "FL_EARLY" average so that I no longer have
any
> "gene_id" duplicates.
> 
> Seems like this is a job for some version of the apply function, but
> searching and puzzling over this has not gotten me anywhere. Any help
> will be much appreciated!
> 
> Example data:
> 
> 
>              gene_id sample_1 sample_2   FL_EARLY  FL_LATE
> 763938  Eucgr.A00054   fl_S1E   fl_S1L  13.170800  22.2605
> 763979  Eucgr.A00101   fl_S1E   fl_S1L   0.367960  14.1202
> 1273243 Eucgr.A00101    fl_S2   fl_S1L   0.356625  14.1202
> 764169  Eucgr.A00350   fl_S1E   fl_S1L   7.381070  43.9275
> 1273433 Eucgr.A00350    fl_S2   fl_S1L  10.674500  43.9275
> 1273669 Eucgr.A00650    fl_S2   fl_S1L  33.669100  50.0169
> 764480  Eucgr.A00744   fl_S1E   fl_S1L 132.429000 747.2770
> 1273744 Eucgr.A00744    fl_S2   fl_S1L 142.659000 747.2770
> 764595  Eucgr.A00890   fl_S1E   fl_S1L   2.937760  14.9647
> 764683  Eucgr.A00990   fl_S1E   fl_S1L   8.681250  48.5492
> 1273947 Eucgr.A00990    fl_S2   fl_S1L  10.553300  48.5492
> 764710  Eucgr.A01020   fl_S1E   fl_S1L   0.000000  57.9273
> 1273974 Eucgr.A01020    fl_S2   fl_S1L   0.000000  57.9273
> 764756  Eucgr.A01073   fl_S1E   fl_S1L   8.504710 101.1870
> 1274020 Eucgr.A01073    fl_S2   fl_S1L   5.400010 101.1870
> 764773  Eucgr.A01091   fl_S1E   fl_S1L   3.448910  15.7756
> 764826  Eucgr.A01152   fl_S1E   fl_S1L  69.565700 198.2320
> 764831  Eucgr.A01158   fl_S1E   fl_S1L   7.265640  30.9565
> 764845  Eucgr.A01172   fl_S1E   fl_S1L   3.248020  16.9127
> 764927  Eucgr.A01269   fl_S1E   fl_S1L  18.710200  76.6918
> 
> 
> 
> --Kelly V.
> 
> 
> 	[[alternative HTML version deleted]]
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-
> guide.html
> and provide commented, minimal, self-contained, reproducible code.

Rui Barradas

2012-Oct-12 17:10 UTC

head link

[R] average duplicated rows?

Hello,

It could be a job for tapply, but I find it more suited for ?ave.


dat <- read.table(text = "
  gene_id sample_1 sample_2   FL_EARLY  FL_LATE
763938  Eucgr.A00054   fl_S1E   fl_S1L  13.170800  22.2605
763979  Eucgr.A00101   fl_S1E   fl_S1L   0.367960  14.1202
1273243 Eucgr.A00101    fl_S2   fl_S1L   0.356625  14.1202
764169  Eucgr.A00350   fl_S1E   fl_S1L   7.381070  43.9275
1273433 Eucgr.A00350    fl_S2   fl_S1L  10.674500  43.9275
1273669 Eucgr.A00650    fl_S2   fl_S1L  33.669100  50.0169
764480  Eucgr.A00744   fl_S1E   fl_S1L 132.429000 747.2770
1273744 Eucgr.A00744    fl_S2   fl_S1L 142.659000 747.2770
764595  Eucgr.A00890   fl_S1E   fl_S1L   2.937760  14.9647
764683  Eucgr.A00990   fl_S1E   fl_S1L   8.681250  48.5492
1273947 Eucgr.A00990    fl_S2   fl_S1L  10.553300  48.5492
764710  Eucgr.A01020   fl_S1E   fl_S1L   0.000000  57.9273
1273974 Eucgr.A01020    fl_S2   fl_S1L   0.000000  57.9273
764756  Eucgr.A01073   fl_S1E   fl_S1L   8.504710 101.1870
1274020 Eucgr.A01073    fl_S2   fl_S1L   5.400010 101.1870
764773  Eucgr.A01091   fl_S1E   fl_S1L   3.448910  15.7756
764826  Eucgr.A01152   fl_S1E   fl_S1L  69.565700 198.2320
764831  Eucgr.A01158   fl_S1E   fl_S1L   7.265640  30.9565
764845  Eucgr.A01172   fl_S1E   fl_S1L   3.248020  16.9127
764927  Eucgr.A01269   fl_S1E   fl_S1L  18.710200  76.6918
", header = TRUE)

av <- ave(dat$FL_EARLY, dat$gene_id)
dat$FLY_EARLY <- av


Hope this helps,

Rui Barradas
Em 12-10-2012 16:41, Vining, Kelly escreveu:> Dear useRs,
>
> I have a slightly complicated data structure and am stuck trying to extract
what I need. I'm pasting an example of this data below. In some cases, there
are duplicates in the "gene_id" column because there are two different
"sample 1" values for a given "sample 2" value. Where these
duplicates exist, I need to average the corresponding "FL_EARLY"
values and retain the "FL_LATE" value and replace those two rows with
a row containing the "FL_EARLY" average so that I no longer have any
"gene_id" duplicates.
>
> Seems like this is a job for some version of the apply function, but
searching and puzzling over this has not gotten me anywhere. Any help will be
much appreciated!
>
> Example data:
>
>
>               gene_id sample_1 sample_2   FL_EARLY  FL_LATE
> 763938  Eucgr.A00054   fl_S1E   fl_S1L  13.170800  22.2605
> 763979  Eucgr.A00101   fl_S1E   fl_S1L   0.367960  14.1202
> 1273243 Eucgr.A00101    fl_S2   fl_S1L   0.356625  14.1202
> 764169  Eucgr.A00350   fl_S1E   fl_S1L   7.381070  43.9275
> 1273433 Eucgr.A00350    fl_S2   fl_S1L  10.674500  43.9275
> 1273669 Eucgr.A00650    fl_S2   fl_S1L  33.669100  50.0169
> 764480  Eucgr.A00744   fl_S1E   fl_S1L 132.429000 747.2770
> 1273744 Eucgr.A00744    fl_S2   fl_S1L 142.659000 747.2770
> 764595  Eucgr.A00890   fl_S1E   fl_S1L   2.937760  14.9647
> 764683  Eucgr.A00990   fl_S1E   fl_S1L   8.681250  48.5492
> 1273947 Eucgr.A00990    fl_S2   fl_S1L  10.553300  48.5492
> 764710  Eucgr.A01020   fl_S1E   fl_S1L   0.000000  57.9273
> 1273974 Eucgr.A01020    fl_S2   fl_S1L   0.000000  57.9273
> 764756  Eucgr.A01073   fl_S1E   fl_S1L   8.504710 101.1870
> 1274020 Eucgr.A01073    fl_S2   fl_S1L   5.400010 101.1870
> 764773  Eucgr.A01091   fl_S1E   fl_S1L   3.448910  15.7756
> 764826  Eucgr.A01152   fl_S1E   fl_S1L  69.565700 198.2320
> 764831  Eucgr.A01158   fl_S1E   fl_S1L   7.265640  30.9565
> 764845  Eucgr.A01172   fl_S1E   fl_S1L   3.248020  16.9127
> 764927  Eucgr.A01269   fl_S1E   fl_S1L  18.710200  76.6918
>
>
>
> --Kelly V.
>
>
> 	[[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

arun

2012-Oct-12 18:22 UTC

head link

[R] average duplicated rows?

HI,

You can also try this:
dat1<-read.table(text="
??????????? gene_id sample_1 sample_2?? FL_EARLY? FL_LATE
763938? Eucgr.A00054?? fl_S1E?? fl_S1L? 13.170800? 22.2605
763979? Eucgr.A00101?? fl_S1E?? fl_S1L?? 0.367960? 14.1202
1273243 Eucgr.A00101??? fl_S2?? fl_S1L?? 0.356625? 14.1202
764169? Eucgr.A00350?? fl_S1E?? fl_S1L?? 7.381070? 43.9275
1273433 Eucgr.A00350??? fl_S2?? fl_S1L? 10.674500? 43.9275
1273669 Eucgr.A00650??? fl_S2?? fl_S1L? 33.669100? 50.0169
764480? Eucgr.A00744?? fl_S1E?? fl_S1L 132.429000 747.2770
1273744 Eucgr.A00744??? fl_S2?? fl_S1L 142.659000 747.2770
764595? Eucgr.A00890?? fl_S1E?? fl_S1L?? 2.937760? 14.9647
764683? Eucgr.A00990?? fl_S1E?? fl_S1L?? 8.681250? 48.5492
1273947 Eucgr.A00990??? fl_S2?? fl_S1L? 10.553300? 48.5492
764710? Eucgr.A01020?? fl_S1E?? fl_S1L?? 0.000000? 57.9273
1273974 Eucgr.A01020??? fl_S2?? fl_S1L?? 0.000000? 57.9273
764756? Eucgr.A01073?? fl_S1E?? fl_S1L?? 8.504710 101.1870
1274020 Eucgr.A01073??? fl_S2?? fl_S1L?? 5.400010 101.1870
764773? Eucgr.A01091?? fl_S1E?? fl_S1L?? 3.448910? 15.7756
764826? Eucgr.A01152?? fl_S1E?? fl_S1L? 69.565700 198.2320
764831? Eucgr.A01158?? fl_S1E?? fl_S1L?? 7.265640? 30.9565
764845? Eucgr.A01172?? fl_S1E?? fl_S1L?? 3.248020? 16.9127
764927? Eucgr.A01269?? fl_S1E?? fl_S1L? 18.710200? 76.6918 
",sep="",header=TRUE,stringsAsFactors=FALSE)


do.call(rbind,lapply(lapply(split(dat1,dat1$gene_id),`[`,4:5),colMeans))
??????????????? FL_EARLY? FL_LATE
#Eucgr.A00054? 13.1708000? 22.2605
#Eucgr.A00101?? 0.3622925? 14.1202
#Eucgr.A00350?? 9.0277850? 43.9275
#Eucgr.A00650? 33.6691000? 50.0169
#Eucgr.A00744 137.5440000 747.2770
#Eucgr.A00890?? 2.9377600? 14.9647
#Eucgr.A00990?? 9.6172750? 48.5492
#Eucgr.A01020?? 0.0000000? 57.9273
#Eucgr.A01073?? 6.9523600 101.1870
#Eucgr.A01091?? 3.4489100? 15.7756
#Eucgr.A01152? 69.5657000 198.2320
#Eucgr.A01158?? 7.2656400? 30.9565
#Eucgr.A01172?? 3.2480200? 16.9127
#Eucgr.A01269? 18.7102000? 76.6918


In addition to aggregate(), ddply() etc.

library(data.table)
dat2<-data.table(dat1)
dat3<-dat2[,list(FL_EARLY=mean(FL_EARLY),FL_LATE=mean(FL_LATE)),list(gene_id)]
#aggregate()
dat4<-with(dat1,aggregate(cbind(FL_EARLY,FL_LATE),list(gene_id),FUN=mean))
colnames(dat4)<-colnames(dat1)[c(1,4,5)]
#ddply()
library(plyr)
dat5<-ddply(dat1,.(gene_id),colwise(mean,c("FL_EARLY","FL_LATE")))

A.K.

----- Original Message -----
From: "Vining, Kelly" <Kelly.Vining at oregonstate.edu>
To: "r-help at r-project.org" <r-help at r-project.org>
Cc: 
Sent: Friday, October 12, 2012 11:41 AM
Subject: [R] average duplicated rows?

Dear useRs,

I have a slightly complicated data structure and am stuck trying to extract what
I need. I'm pasting an example of this data below. In some cases, there are
duplicates in the "gene_id" column because there are two different
"sample 1" values for a given "sample 2" value. Where these
duplicates exist, I need to average the corresponding "FL_EARLY"
values and retain the "FL_LATE" value and replace those two rows with
a row containing the "FL_EARLY" average so that I no longer have any
"gene_id" duplicates.

Seems like this is a job for some version of the apply function, but searching
and puzzling over this has not gotten me anywhere. Any help will be much
appreciated!

Example data:


? ? ? ? ? ?  gene_id sample_1 sample_2?  FL_EARLY? FL_LATE
763938? Eucgr.A00054?  fl_S1E?  fl_S1L? 13.170800? 22.2605
763979? Eucgr.A00101?  fl_S1E?  fl_S1L?  0.367960? 14.1202
1273243 Eucgr.A00101? ? fl_S2?  fl_S1L?  0.356625? 14.1202
764169? Eucgr.A00350?  fl_S1E?  fl_S1L?  7.381070? 43.9275
1273433 Eucgr.A00350? ? fl_S2?  fl_S1L? 10.674500? 43.9275
1273669 Eucgr.A00650? ? fl_S2?  fl_S1L? 33.669100? 50.0169
764480? Eucgr.A00744?  fl_S1E?  fl_S1L 132.429000 747.2770
1273744 Eucgr.A00744? ? fl_S2?  fl_S1L 142.659000 747.2770
764595? Eucgr.A00890?  fl_S1E?  fl_S1L?  2.937760? 14.9647
764683? Eucgr.A00990?  fl_S1E?  fl_S1L?  8.681250? 48.5492
1273947 Eucgr.A00990? ? fl_S2?  fl_S1L? 10.553300? 48.5492
764710? Eucgr.A01020?  fl_S1E?  fl_S1L?  0.000000? 57.9273
1273974 Eucgr.A01020? ? fl_S2?  fl_S1L?  0.000000? 57.9273
764756? Eucgr.A01073?  fl_S1E?  fl_S1L?  8.504710 101.1870
1274020 Eucgr.A01073? ? fl_S2?  fl_S1L?  5.400010 101.1870
764773? Eucgr.A01091?  fl_S1E?  fl_S1L?  3.448910? 15.7756
764826? Eucgr.A01152?  fl_S1E?  fl_S1L? 69.565700 198.2320
764831? Eucgr.A01158?  fl_S1E?  fl_S1L?  7.265640? 30.9565
764845? Eucgr.A01172?  fl_S1E?  fl_S1L?  3.248020? 16.9127
764927? Eucgr.A01269?  fl_S1E?  fl_S1L? 18.710200? 76.6918



--Kelly V.


??? [[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

R help - Oct 2012 - average duplicated rows?

[R] average duplicated rows?

[R] average duplicated rows?

[R] average duplicated rows?

[R] average duplicated rows?