Hi--

This is a question with a trivial and obvious answer, I'm sure, but I can't seem to find it in the help files and books that I have handy. I have a dataframe consisting of two columns, "Gene_Name," a list of gene symbols, and "Number," a numeric measure of how frequently a tag representing that gene showed up in a SAGE library. Several of the genes are represented by multiple tags, and therefore are present more than once in the list, e.g.:

1167 Zcchc8 6
1168 Zcwpw1 5
1169 Zdhhc18 6
1170 Zdhhc20 5
1171 Zdhhc3 6
1172 Zdhhc3 5
1173 Zeb2 9
1174 Zeb2 6

What I want is to collapse the list by gene name, such that duplicates are summed up and appear only once in the final version:

Zcchc8 6
Zcwpw1 5
Zdhhc18 6
Zdhhc20 5
Zdhhc3 11
Zeb2 15

The only way I can figure out to do this is via rowsum:

> rowsum (Number,Gene_Name)

gives me exactly what I want, *except* that in the end, I am left with a matrix containing the Number values and with the Gene_Names used as row names (the output therefore looks exactly as printed above) -- what I want is a dataframe equivalent to the starting table, with numbered rows and separate, accessible columns containing the Gene_Name and Number values.

I was able to put such a dataframe together manually, by cobbling together the row names of the above list with the values:

> genes.unique <- data.frame (rownames (rowsum(Number,Gene_Name)), rowsum(Number,Gene_Name))

but then I have to manually replace the row names of the dataframe with numbers, to get back to what I wanted in the first place.

I hope this makes some sort of sense. Is there an easier way to do this? Thanks in advance!

Charlie Murtaugh

====
L. Charles Murtaugh
Assistant Professor
University of Utah
Dept. of Human Genetics
15 N. 2030 E. Rm. 2100
Salt Lake City, UT 84112
tel 801-581-5958
fax 801-581-6463
email murtaugh@genetics.utah.edu
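For reference, the rowsum route described above can be finished in base R without renumbering the rows by hand. This is only a sketch: the data frame name `genes` is an assumption, and the rows are the eight sample rows from the message. Passing row.names = NULL to data.frame() asks for the default 1..n row names instead of the gene names carried over from the matrix.

```r
# Sample rows copied from the message; `genes` is a hypothetical name.
genes <- data.frame(
  Gene_Name = c("Zcchc8", "Zcwpw1", "Zdhhc18", "Zdhhc20",
                "Zdhhc3", "Zdhhc3", "Zeb2", "Zeb2"),
  Number = c(6, 5, 6, 5, 6, 5, 9, 6),
  stringsAsFactors = FALSE
)

# rowsum() returns a one-column matrix whose row names are the gene names.
sums <- rowsum(genes$Number, genes$Gene_Name)

# Rebuild a two-column data frame; row.names = NULL gives numbered rows.
genes.unique <- data.frame(
  Gene_Name = rownames(sums),
  Number = sums[, 1],
  row.names = NULL
)
print(genes.unique)
```

Computing the rowsum once and reusing it also avoids the duplicated call in the original one-liner.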
Try this (where x is your data frame):

aggregate(list(Number=x$Number), by=list(Gene_Name=x$Gene_Name), sum)

On 25/03/2008, Charles Murtaugh <murtaugh at genetics.utah.edu> wrote:
> What I want is to collapse the list by gene name, such that duplicates are summed up and appear only once in the final version
> [...]
> Is there an easier way to do this? Thanks in advance!

--
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40" S 49° 16' 22" O
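As a quick check of the aggregate() suggestion, here is a sketch run on the eight sample rows from the question. The data frame name `x` follows the reply; the sample values are the ones posted, and the formula form shown at the end is an equivalent spelling available in current versions of R, not something taken from the thread.

```r
# Sample rows copied from the question; `x` is the name used in the reply.
x <- data.frame(
  Gene_Name = c("Zcchc8", "Zcwpw1", "Zdhhc18", "Zdhhc20",
                "Zdhhc3", "Zdhhc3", "Zeb2", "Zeb2"),
  Number = c(6, 5, 6, 5, 6, 5, 9, 6),
  stringsAsFactors = FALSE
)

# The call from the reply: sum Number within each Gene_Name.
collapsed <- aggregate(list(Number = x$Number),
                       by = list(Gene_Name = x$Gene_Name),
                       FUN = sum)
print(collapsed)

# Equivalent formula interface (assumes a reasonably recent R).
collapsed2 <- aggregate(Number ~ Gene_Name, data = x, FUN = sum)
```

The result is an ordinary data frame with numbered rows and separate Gene_Name and Number columns, which is the shape the question asks for.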
See the reshape package.
library(reshape)
# xx is the original data frame; the id column must match its name, "Gene_Name"
yy <- melt(xx, id="Gene_Name")
cast(yy, Gene_Name ~ variable, sum)