Boris Steipe
2019-Nov-08 15:49 UTC
[R] how to find number of unique rows for combination of r columns
Are you trying to eliminate duplicated rows from your dataframe? Because that would be better achieved with duplicated().

B.

> On 2019-11-08, at 10:32, Ana Marija <sokovic.anamarija at gmail.com> wrote:
>
> Would you know how I would extract just these unique rows from my
> original data frame?
> This gives me only those 3 columns, and I want all columns
> from the original data frame.
>
> > head(udt)
>    chr   pos         gene_id
> 1 chr1 54490 ENSG00000227232
> 2 chr1 58814 ENSG00000227232
> 3 chr1 60351 ENSG00000227232
> 4 chr1 61920 ENSG00000227232
> 5 chr1 63671 ENSG00000227232
> 6 chr1 64931 ENSG00000227232
>
> > head(dt)
>     chr   pos         gene_id pval_nominal pval_ret       wl      wr      META
> 1: chr1 54490 ENSG00000227232     0.608495 0.783778 31.62278 21.2838 0.7475480
> 2: chr1 58814 ENSG00000227232     0.295211 0.897582 31.62278 21.2838 0.6031214
> 3: chr1 60351 ENSG00000227232     0.439788 0.867959 31.62278 21.2838 0.6907182
> 4: chr1 61920 ENSG00000227232     0.319528 0.601809 31.62278 21.2838 0.4032200
> 5: chr1 63671 ENSG00000227232     0.237739 0.988039 31.62278 21.2838 0.7482519
> 6: chr1 64931 ENSG00000227232     0.276679 0.907037 31.62278 21.2838 0.5974800
>
> On Fri, Nov 8, 2019 at 9:30 AM Ana Marija <sokovic.anamarija at gmail.com> wrote:
>>
>> Thank you so much! Converting it to a data frame resolved the issue!
>>
>> On Fri, Nov 8, 2019 at 9:19 AM Gerrit Eichner
>> <gerrit.eichner at math.uni-giessen.de> wrote:
>>>
>>> It seems as if dt is not a (base R) data frame but a
>>> data table. I assume you will have to transform dt
>>> into a data frame (maybe with as.data.frame) to be
>>> able to apply unique in the suggested way. However,
>>> I am not familiar with data tables. Perhaps somebody
>>> else can provide a more profound guess.
>>>
>>> Regards -- Gerrit
>>>
>>> ---------------------------------------------------------------------
>>> Dr. Gerrit Eichner                   Mathematical Institute, Room 212
>>> gerrit.eichner at math.uni-giessen.de   Justus-Liebig-University Giessen
>>> Tel: +49-(0)641-99-32104          Arndtstr. 2, 35392 Giessen, Germany
>>> http://www.uni-giessen.de/eichner
>>> ---------------------------------------------------------------------
>>>
>>> On 08.11.2019 at 16:02, Ana Marija wrote:
>>>> I tried it but I got this error:
>>>> > udt <- unique(dt[c("chr", "pos", "gene_id")])
>>>> Error in `[.data.table`(dt, c("chr", "pos", "gene_id")) :
>>>>   When i is a data.table (or character vector), the columns to join by
>>>>   must be specified using 'on=' argument (see ?data.table), by keying x
>>>>   (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing
>>>>   column names between x and i (i.e., a natural join). Keyed joins might
>>>>   have further speed benefits on very large data due to x being sorted
>>>>   in RAM.
>>>>
>>>> On Fri, Nov 8, 2019 at 8:58 AM Gerrit Eichner
>>>> <gerrit.eichner at math.uni-giessen.de> wrote:
>>>>>
>>>>> Hi, Ana,
>>>>>
>>>>> doesn't
>>>>>
>>>>> udt <- unique(dt[c("chr", "pos", "gene_id")])
>>>>> nrow(udt)
>>>>>
>>>>> get close to what you want?
>>>>>
>>>>> Hth -- Gerrit
>>>>>
>>>>> On 08.11.2019 at 15:38, Ana Marija wrote:
>>>>>> Hello,
>>>>>>
>>>>>> I have a data frame like this:
>>>>>>
>>>>>> > head(dt,20)
>>>>>>      chr    pos         gene_id pval_nominal  pval_ret       wl      wr
>>>>>>  1: chr1  54490 ENSG00000227232    0.6084950 0.7837780 31.62278 21.2838
>>>>>>  2: chr1  58814 ENSG00000227232    0.2952110 0.8975820 31.62278 21.2838
>>>>>>  3: chr1  60351 ENSG00000227232    0.4397880 0.8679590 31.62278 21.2838
>>>>>>  4: chr1  61920 ENSG00000227232    0.3195280 0.6018090 31.62278 21.2838
>>>>>>  5: chr1  63671 ENSG00000227232    0.2377390 0.9880390 31.62278 21.2838
>>>>>>  6: chr1  64931 ENSG00000227232    0.2766790 0.9070370 31.62278 21.2838
>>>>>>  7: chr1  81587 ENSG00000227232    0.6057930 0.6167630 31.62278 21.2838
>>>>>>  8: chr1 115746 ENSG00000227232    0.4078770 0.7799110 31.62278 21.2838
>>>>>>  9: chr1 135203 ENSG00000227232    0.4078770 0.9299130 31.62278 21.2838
>>>>>> 10: chr1 138593 ENSG00000227232    0.8464560 0.5696060 31.62278 21.2838
>>>>>>
>>>>>> It is very big:
>>>>>> > dim(dt)
>>>>>> [1] 73719122        8
>>>>>>
>>>>>> To count the number of unique rows for all 3 columns (chr, pos and gene_id)
>>>>>> I could just join those 3 columns and then count. But how would I find
>>>>>> the number of unique rows for these 4 columns without joining them?
>>>>>>
>>>>>> Thanks
>>>>>> Ana
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
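A minimal base R sketch of the two steps discussed above: counting the distinct (chr, pos, gene_id) combinations, and keeping every column for one row per combination. The toy data frame `df` and the names `key_cols`, `n_unique` and `first_rows` are made up for illustration; they are not the poster's data.

```r
# Toy stand-in for the real table (hypothetical values, not the poster's data).
df <- data.frame(
  chr     = rep("chr1", 6),
  pos     = c(54490, 58814, 54490, 60351, 58814, 54490),
  gene_id = rep("ENSG00000227232", 6),
  META    = c(0.75, 0.60, 0.70, 0.40, 0.60, 0.10),
  stringsAsFactors = FALSE
)
key_cols <- c("chr", "pos", "gene_id")

# Number of distinct (chr, pos, gene_id) combinations.
n_unique <- nrow(unique(df[key_cols]))

# One row per combination (its first occurrence), with all original columns.
first_rows <- df[!duplicated(df[key_cols]), ]
```

If dt stays a data.table, `unique(dt, by = key_cols)` and `uniqueN(dt, by = key_cols)` should, as far as I know, give the same results without converting 73 million rows. The quoted error seems to arise because a character vector in the first position of `dt[...]` is treated as a row lookup (a join), not as a column selection.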
Ana Marija
2019-Nov-08 16:30 UTC
[R] how to find number of unique rows for combination of r columns
I am first trying to identify how many duplicate rows there are, as determined by the values in the first 3 columns. I now know that about 20000 rows are non-unique. But I would like to extract all 8 columns for those non-unique rows and see what is going on with the META value I have in them.

I know about the duplicated() function, as well as about unique().
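A small sketch of how the count of non-unique rows can be read off duplicated(), assuming a base R data frame. The toy data `df` and the names `later_copy`, `n_extra_copies` and `n_repeated_keys` are hypothetical. Note that duplicated() flags only the second and later occurrence of each key, so the number it yields is the count of extra copies, not the count of all rows that share a key.

```r
# Toy stand-in for the real table (hypothetical values).
df <- data.frame(
  chr     = rep("chr1", 6),
  pos     = c(54490, 58814, 54490, 60351, 58814, 54490),
  gene_id = rep("ENSG00000227232", 6),
  META    = c(0.75, 0.60, 0.70, 0.40, 0.60, 0.10),
  stringsAsFactors = FALSE
)
key_cols <- c("chr", "pos", "gene_id")

# TRUE for every row whose key has already appeared further up.
later_copy <- duplicated(df[key_cols])

# Rows beyond the first occurrence of each key.
n_extra_copies <- sum(later_copy)

# Number of distinct keys that occur more than once.
n_repeated_keys <- nrow(unique(df[later_copy, key_cols]))
```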
Boris Steipe
2019-Nov-08 16:49 UTC
[R] how to find number of unique rows for combination of r columns
Good. duplicated() returns a logical (boolean) index vector that you can use to extract the non-unique rows.

B.
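One way the suggestion could look in practice, sketched in base R on a toy data frame (all names and values here are made up for illustration). Because duplicated() skips the first occurrence of each key, the sketch combines a forward and a backward pass (`fromLast = TRUE`) so that every member of a repeated group is kept, then summarises how much META varies within each group.

```r
# Toy stand-in for the real table (hypothetical values).
df <- data.frame(
  chr     = rep("chr1", 6),
  pos     = c(54490, 58814, 54490, 60351, 58814, 54490),
  gene_id = rep("ENSG00000227232", 6),
  META    = c(0.75, 0.60, 0.70, 0.40, 0.60, 0.10),
  stringsAsFactors = FALSE
)
key_cols <- c("chr", "pos", "gene_id")

# Logical index: TRUE for every row whose key occurs more than once.
in_dup_group <- duplicated(df[key_cols]) |
  duplicated(df[key_cols], fromLast = TRUE)

# All columns for those rows, sorted so that equal keys sit next to each other.
dups <- df[in_dup_group, ]
dups <- dups[order(dups$chr, dups$pos, dups$gene_id), ]

# Spread of META within each repeated key (0 means the copies agree).
meta_spread <- aggregate(
  META ~ chr + pos + gene_id,
  data = dups,
  FUN  = function(x) max(x) - min(x)
)
```

On a table of the size mentioned in the thread, the two duplicated() passes are likely to be the expensive part; the data.table method `duplicated(dt, by = key_cols)` accepts the same `fromLast` argument and may be worth trying before converting to a data frame.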