thr3ads.net - R help - [R] how to find number of unique rows for combination of r columns [Nov 2019]

Boris Steipe

2019-Nov-08 15:49 UTC

[R] how to find number of unique rows for combination of r columns

Are you trying to eliminate duplicated rows from your dataframe? Because that
would be better achieved with duplicated().


B.
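
A minimal base R sketch of the duplicated() suggestion, assuming the table has first been converted to a plain data frame (as reported further down the thread). The toy values and the names df, key_cols, is_dup, n_unique and first_rows are invented for illustration and are not from the thread:

```r
# Toy data frame standing in for the real 73-million-row table.
df <- data.frame(
  chr     = c("chr1", "chr1", "chr1", "chr1"),
  pos     = c(54490L, 58814L, 54490L, 60351L),
  gene_id = rep("ENSG00000227232", 4),
  META    = c(0.75, 0.60, 0.41, 0.69),
  stringsAsFactors = FALSE
)

key_cols <- c("chr", "pos", "gene_id")

# duplicated() flags each repeat of a key combination after its first occurrence.
is_dup <- duplicated(df[key_cols])

# Number of distinct (chr, pos, gene_id) combinations.
n_unique <- sum(!is_dup)

# First row of each combination, with all columns of the original retained.
first_rows <- df[!is_dup, ]
```

Because the logical vector indexes the original data frame, the result keeps every column rather than only the three key columns.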



> On 2019-11-08, at 10:32, Ana Marija <sokovic.anamarija at gmail.com> wrote:
> 
> Would you know how I could extract just these unique rows from my
> original data frame? This gives me only those 3 columns, and I want
> all columns from the original data frame.
> 
>> head(udt)
>   chr   pos         gene_id
> 1 chr1 54490 ENSG00000227232
> 2 chr1 58814 ENSG00000227232
> 3 chr1 60351 ENSG00000227232
> 4 chr1 61920 ENSG00000227232
> 5 chr1 63671 ENSG00000227232
> 6 chr1 64931 ENSG00000227232
> 
>> head(dt)
>     chr   pos         gene_id pval_nominal pval_ret       wl      wr      META
> 1: chr1 54490 ENSG00000227232     0.608495 0.783778 31.62278 21.2838 0.7475480
> 2: chr1 58814 ENSG00000227232     0.295211 0.897582 31.62278 21.2838 0.6031214
> 3: chr1 60351 ENSG00000227232     0.439788 0.867959 31.62278 21.2838 0.6907182
> 4: chr1 61920 ENSG00000227232     0.319528 0.601809 31.62278 21.2838 0.4032200
> 5: chr1 63671 ENSG00000227232     0.237739 0.988039 31.62278 21.2838 0.7482519
> 6: chr1 64931 ENSG00000227232     0.276679 0.907037 31.62278 21.2838 0.5974800
> 
> On Fri, Nov 8, 2019 at 9:30 AM Ana Marija <sokovic.anamarija at gmail.com> wrote:
>> 
>> Thank you so much! Converting it to data frame resolved the issue!
>> 
>> On Fri, Nov 8, 2019 at 9:19 AM Gerrit Eichner
>> <gerrit.eichner at math.uni-giessen.de> wrote:
>>> 
>>> It seems as if dt is not a (base R) data frame but a
>>> data table. I assume, you will have to transform dt
>>> into a data frame (maybe with as.data.frame) to be
>>> able to apply unique in the suggested way. However,
>>> I am not familiar with data tables. Perhaps somebody
>>> else can provide a more profound guess.
>>> 
>>>  Regards  --  Gerrit
>>> 
>>> ---------------------------------------------------------------------
>>> Dr. Gerrit Eichner                   Mathematical Institute, Room 212
>>> gerrit.eichner at math.uni-giessen.de   Justus-Liebig-University Giessen
>>> Tel: +49-(0)641-99-32104          Arndtstr. 2, 35392 Giessen, Germany
>>> http://www.uni-giessen.de/eichner
>>> ---------------------------------------------------------------------
>>> 
>>> Am 08.11.2019 um 16:02 schrieb Ana Marija:
>>>> I tried it but I got this error:
>>>>> udt <- unique(dt[c("chr", "pos", "gene_id")])
>>>> Error in `[.data.table`(dt, c("chr", "pos", "gene_id")) :
>>>>   When i is a data.table (or character vector), the columns to join by
>>>> must be specified using 'on=' argument (see ?data.table), by keying x
>>>> (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing
>>>> column names between x and i (i.e., a natural join). Keyed joins might
>>>> have further speed benefits on very large data due to x being sorted
>>>> in RAM.
>>>> 
>>>> On Fri, Nov 8, 2019 at 8:58 AM Gerrit Eichner
>>>> <gerrit.eichner at math.uni-giessen.de> wrote:
>>>>> 
>>>>> Hi, Ana,
>>>>> 
>>>>> doesn't
>>>>> 
>>>>> udt <- unique(dt[c("chr", "pos", "gene_id")])
>>>>> nrow(udt)
>>>>> 
>>>>> get close to what you want?
>>>>> 
>>>>>   Hth  --  Gerrit
>>>>> 
>>>>> 
>>>>> Am 08.11.2019 um 15:38 schrieb Ana Marija:
>>>>>> Hello,
>>>>>> 
>>>>>> I have a data frame like this:
>>>>>> 
>>>>>>> head(dt,20)
>>>>>>      chr    pos         gene_id pval_nominal  pval_ret       wl      wr
>>>>>>  1: chr1  54490 ENSG00000227232    0.6084950 0.7837780 31.62278 21.2838
>>>>>>  2: chr1  58814 ENSG00000227232    0.2952110 0.8975820 31.62278 21.2838
>>>>>>  3: chr1  60351 ENSG00000227232    0.4397880 0.8679590 31.62278 21.2838
>>>>>>  4: chr1  61920 ENSG00000227232    0.3195280 0.6018090 31.62278 21.2838
>>>>>>  5: chr1  63671 ENSG00000227232    0.2377390 0.9880390 31.62278 21.2838
>>>>>>  6: chr1  64931 ENSG00000227232    0.2766790 0.9070370 31.62278 21.2838
>>>>>>  7: chr1  81587 ENSG00000227232    0.6057930 0.6167630 31.62278 21.2838
>>>>>>  8: chr1 115746 ENSG00000227232    0.4078770 0.7799110 31.62278 21.2838
>>>>>>  9: chr1 135203 ENSG00000227232    0.4078770 0.9299130 31.62278 21.2838
>>>>>> 10: chr1 138593 ENSG00000227232    0.8464560 0.5696060 31.62278 21.2838
>>>>>> 
>>>>>> it is very big,
>>>>>>> dim(dt)
>>>>>> [1] 73719122        8
>>>>>> 
>>>>>> To count the number of unique rows for all 3 columns (chr, pos and
>>>>>> gene_id) I could just join those 3 columns and then count. But how
>>>>>> would I find the number of unique rows for these 4 columns without
>>>>>> joining them?
>>>>>> 
>>>>>> Thanks
>>>>>> Ana
>>>>>> 
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>> 
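
The error quoted above arises because `[.data.table` treats a character vector in the first position as a row lookup (i), not as a column selection, as the message itself indicates. A minimal sketch of ways to stay within data.table instead of converting, assuming the data.table package is installed; the toy values and the names key_cols, n_unique, udt and first_rows are invented for illustration:

```r
library(data.table)

# Toy data.table standing in for the real one.
dt <- data.table(
  chr     = c("chr1", "chr1", "chr1", "chr1"),
  pos     = c(54490L, 58814L, 54490L, 60351L),
  gene_id = rep("ENSG00000227232", 4),
  META    = c(0.75, 0.60, 0.41, 0.69)
)

key_cols <- c("chr", "pos", "gene_id")

# Count distinct key combinations directly.
n_unique <- uniqueN(dt, by = key_cols)

# Column selection by name goes in the second position (j), with with = FALSE.
udt <- unique(dt[, key_cols, with = FALSE])

# One row per key combination, keeping all columns of the original table.
first_rows <- unique(dt, by = key_cols)
```

On a table with tens of millions of rows, avoiding the as.data.frame() copy may matter, though that has not been measured here.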

Ana Marija

2019-Nov-08 16:30 UTC

I am first trying to identify how many duplicate rows there are, as determined
by the values in the first 3 columns. I now know that about 20000 rows are
non-unique. But I would like to extract all 8 columns for those non-unique
rows and see what is going on with the META values in them.

I know about the duplicated() function as well as about unique().


Boris Steipe

2019-Nov-08 16:49 UTC

Good. duplicated() returns a logical index vector that you can use to extract the
non-unique rows.

B.
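
A minimal base R sketch of that indexing, assuming a plain data frame. Note that duplicated() alone marks only the second and later occurrences of a key, so the sketch combines it with a second pass using fromLast = TRUE to pick up every row that shares its key with another row. The toy values and the names df, key_cols, is_repeated and dups are invented for illustration:

```r
# Toy data frame; two key combinations occur twice.
df <- data.frame(
  chr     = rep("chr1", 5),
  pos     = c(54490L, 58814L, 54490L, 60351L, 58814L),
  gene_id = rep("ENSG00000227232", 5),
  META    = c(0.75, 0.60, 0.41, 0.69, 0.60),
  stringsAsFactors = FALSE
)

key_cols <- c("chr", "pos", "gene_id")

# TRUE for every row whose key occurs more than once, including the first occurrence.
is_repeated <- duplicated(df[key_cols]) |
  duplicated(df[key_cols], fromLast = TRUE)

# All columns for the non-unique rows.
dups <- df[is_repeated, ]

# Sort by the key columns so rows sharing a key sit next to each other,
# which makes it easier to compare their META values.
dups <- dups[do.call(order, unname(as.list(dups[key_cols]))), ]
```

Here dups holds four rows: the two rows at position 54490, whose META values differ, followed by the two rows at position 58814, whose META values agree.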

