Ana Marija
2019-Nov-08 15:02 UTC
[R] how to find number of unique rows for combination of r columns
I tried it, but I got this error:

    > udt <- unique(dt[c("chr", "pos", "gene_id")])
    Error in `[.data.table`(dt, c("chr", "pos", "gene_id")) :
      When i is a data.table (or character vector), the columns to join by
      must be specified using 'on=' argument (see ?data.table), by keying x
      (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing
      column names between x and i (i.e., a natural join). Keyed joins might
      have further speed benefits on very large data due to x being sorted
      in RAM.

On Fri, Nov 8, 2019 at 8:58 AM Gerrit Eichner <gerrit.eichner at math.uni-giessen.de> wrote:
>
> Hi, Ana,
>
> doesn't
>
>     udt <- unique(dt[c("chr", "pos", "gene_id")])
>     nrow(udt)
>
> get close to what you want?
>
> Hth -- Gerrit
>
> ---------------------------------------------------------------------
> Dr. Gerrit Eichner                   Mathematical Institute, Room 212
> gerrit.eichner at math.uni-giessen.de   Justus-Liebig-University Giessen
> Tel: +49-(0)641-99-32104          Arndtstr. 2, 35392 Giessen, Germany
> http://www.uni-giessen.de/eichner
> ---------------------------------------------------------------------
>
> On 08.11.2019 at 15:38, Ana Marija wrote:
> > Hello,
> >
> > I have a data frame like this:
> >
> > > head(dt,20)
> >      chr    pos         gene_id pval_nominal  pval_ret       wl      wr
> >  1: chr1  54490 ENSG00000227232    0.6084950 0.7837780 31.62278 21.2838
> >  2: chr1  58814 ENSG00000227232    0.2952110 0.8975820 31.62278 21.2838
> >  3: chr1  60351 ENSG00000227232    0.4397880 0.8679590 31.62278 21.2838
> >  4: chr1  61920 ENSG00000227232    0.3195280 0.6018090 31.62278 21.2838
> >  5: chr1  63671 ENSG00000227232    0.2377390 0.9880390 31.62278 21.2838
> >  6: chr1  64931 ENSG00000227232    0.2766790 0.9070370 31.62278 21.2838
> >  7: chr1  81587 ENSG00000227232    0.6057930 0.6167630 31.62278 21.2838
> >  8: chr1 115746 ENSG00000227232    0.4078770 0.7799110 31.62278 21.2838
> >  9: chr1 135203 ENSG00000227232    0.4078770 0.9299130 31.62278 21.2838
> > 10: chr1 138593 ENSG00000227232    0.8464560 0.5696060 31.62278 21.2838
> >
> > it is very big,
> > > dim(dt)
> > [1] 73719122        8
> >
> > To count the number of unique rows for all 3 columns (chr, pos and
> > gene_id), I could just join those 3 columns and then count. But how
> > would I find the number of unique rows for these 3 columns without
> > joining them?
> >
> > Thanks
> > Ana
> >
> > ______________________________________________
> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
Gerrit Eichner
2019-Nov-08 15:19 UTC
[R] how to find number of unique rows for combination of r columns
It seems as if dt is not a (base R) data frame but a data table.
I assume you will have to transform dt into a data frame (maybe
with as.data.frame) to be able to apply unique in the suggested
way. However, I am not familiar with data tables. Perhaps somebody
else can provide a better-informed answer.

Regards -- Gerrit

---------------------------------------------------------------------
Dr. Gerrit Eichner                   Mathematical Institute, Room 212
gerrit.eichner at math.uni-giessen.de   Justus-Liebig-University Giessen
Tel: +49-(0)641-99-32104          Arndtstr. 2, 35392 Giessen, Germany
http://www.uni-giessen.de/eichner
---------------------------------------------------------------------
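The conversion route suggested here can be sketched in base R as follows. This is only an illustration under stated assumptions: the four-row toy data frame and the names `df`, `key_cols`, `udf`, `n_unique` and `n_unique_alt` are made up for the example, and `as.data.frame(df)` stands in for `as.data.frame(dt)` on the real object.

```r
# Toy stand-in for the real table (hypothetical values, base R only).
df <- data.frame(
  chr          = c("chr1", "chr1", "chr1", "chr2"),
  pos          = c(54490L, 54490L, 58814L, 54490L),
  gene_id      = rep("ENSG00000227232", 4),
  pval_nominal = c(0.61, 0.30, 0.44, 0.32),
  stringsAsFactors = FALSE
)

# On the real object this would be as.data.frame(dt); on a plain data
# frame the call simply returns a data frame again.
df <- as.data.frame(df)

key_cols <- c("chr", "pos", "gene_id")

# For a data frame, df[key_cols] selects columns, so unique() sees
# only the three key columns and drops repeated combinations.
udf <- unique(df[key_cols])
n_unique <- nrow(udf)

# Same count without keeping the de-duplicated rows around.
n_unique_alt <- sum(!duplicated(df[key_cols]))

print(n_unique)
print(n_unique_alt)
```

Both counts should agree; the `duplicated()` form may be handy when only the number is needed and not the unique rows themselves.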
Ana Marija
2019-Nov-08 15:30 UTC
[R] how to find number of unique rows for combination of r columns
Thank you so much! Converting it to a data frame resolved the issue!
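For a table of this size (73,719,122 rows), converting with as.data.frame typically makes a full copy, so it may be worth counting directly on the data.table instead. The following is a minimal sketch, assuming the data.table package is installed; the toy table and the names `cols`, `n_unique` and `udt` are hypothetical and only serve the illustration.

```r
library(data.table)

# Toy stand-in for the real 73-million-row table (hypothetical values).
dt <- data.table(
  chr          = c("chr1", "chr1", "chr1", "chr2"),
  pos          = c(54490L, 54490L, 58814L, 54490L),
  gene_id      = rep("ENSG00000227232", 4),
  pval_nominal = c(0.61, 0.30, 0.44, 0.32)
)

cols <- c("chr", "pos", "gene_id")

# uniqueN() counts distinct combinations of the 'by' columns
# without first building a de-duplicated copy of the table.
n_unique <- uniqueN(dt, by = cols)

# If the unique rows themselves are needed: select columns in j.
# A bare dt[cols] is read as a row/join lookup, which is what
# triggered the error quoted in this thread.
udt <- unique(dt[, cols, with = FALSE])

print(n_unique)
print(nrow(udt))
```

The key difference from a base R data frame is that a single unnamed argument inside `[` is treated by data.table as the row index `i`, so column selection has to go in the second position.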