thr3ads.net - R help - [R] how to find number of unique rows for combination of r columns [Nov 2019]

Ana Marija

2019-Nov-08 15:02 UTC

[R] how to find number of unique rows for combination of r columns

I tried it but I got this error:

> udt <- unique(dt[c("chr", "pos", "gene_id")])
Error in `[.data.table`(dt, c("chr", "pos", "gene_id")) :
  When i is a data.table (or character vector), the columns to join by
must be specified using 'on=' argument (see ?data.table), by keying x
(i.e. sorted, and, marked as sorted, see ?setkey), or by sharing
column names between x and i (i.e., a natural join). Keyed joins might
have further speed benefits on very large data due to x being sorted
in RAM.
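
The error message suggests that data.table reads a character vector in the first argument (i) as a join request rather than as a list of column names. A minimal sketch of selecting the columns in the second argument (j) instead, assuming the data.table package is installed; the toy table and its values are made up for illustration and only mimic the column names from the thread:

```r
library(data.table)

# Toy stand-in for the poster's table (values invented for illustration).
dt <- data.table(
  chr          = c("chr1", "chr1", "chr1", "chr2"),
  pos          = c(54490L, 54490L, 58814L, 54490L),
  gene_id      = rep("ENSG00000227232", 4),
  pval_nominal = c(0.61, 0.30, 0.44, 0.32)
)

# Column selection goes in j (after the comma), not in i.
udt <- unique(dt[, .(chr, pos, gene_id)])

# Number of distinct (chr, pos, gene_id) combinations.
n_unique <- nrow(udt)
print(n_unique)
```

Rows 1 and 2 of the toy table share the same chr, pos and gene_id, so they collapse into a single row in udt.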

On Fri, Nov 8, 2019 at 8:58 AM Gerrit Eichner
<gerrit.eichner at math.uni-giessen.de> wrote:
>
> Hi, Ana,
>
> doesn't
>
> udt <- unique(dt[c("chr", "pos", "gene_id")])
> nrow(udt)
>
> get close to what you want?
>
>   Hth  --  Gerrit
>
> ---------------------------------------------------------------------
> Dr. Gerrit Eichner                   Mathematical Institute, Room 212
> gerrit.eichner at math.uni-giessen.de   Justus-Liebig-University Giessen
> Tel: +49-(0)641-99-32104          Arndtstr. 2, 35392 Giessen, Germany
> http://www.uni-giessen.de/eichner
> ---------------------------------------------------------------------
>
> On 08.11.2019 at 15:38, Ana Marija wrote:
> > Hello,
> >
> > I have a data frame like this:
> >
> >> head(dt,20)
> >      chr    pos         gene_id pval_nominal  pval_ret       wl      wr
> >  1: chr1  54490 ENSG00000227232    0.6084950 0.7837780 31.62278 21.2838
> >  2: chr1  58814 ENSG00000227232    0.2952110 0.8975820 31.62278 21.2838
> >  3: chr1  60351 ENSG00000227232    0.4397880 0.8679590 31.62278 21.2838
> >  4: chr1  61920 ENSG00000227232    0.3195280 0.6018090 31.62278 21.2838
> >  5: chr1  63671 ENSG00000227232    0.2377390 0.9880390 31.62278 21.2838
> >  6: chr1  64931 ENSG00000227232    0.2766790 0.9070370 31.62278 21.2838
> >  7: chr1  81587 ENSG00000227232    0.6057930 0.6167630 31.62278 21.2838
> >  8: chr1 115746 ENSG00000227232    0.4078770 0.7799110 31.62278 21.2838
> >  9: chr1 135203 ENSG00000227232    0.4078770 0.9299130 31.62278 21.2838
> > 10: chr1 138593 ENSG00000227232    0.8464560 0.5696060 31.62278 21.2838
> >
> > it is very big,
> >> dim(dt)
> > [1] 73719122        8
> >
> > To count the number of unique rows across the 3 columns chr, pos and gene_id,
> > I could just join (paste together) those 3 columns and then count. But how
> > would I find the number of unique rows for these columns without joining them?
> >
> > Thanks
> > Ana
> >
> > ______________________________________________
> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>

Gerrit Eichner

2019-Nov-08 15:19 UTC

It seems that dt is not a (base R) data frame but a
data.table. I assume you will have to convert dt
into a data frame (for example with as.data.frame) before
unique can be applied in the suggested way. However,
I am not familiar with data tables; perhaps somebody
else can offer a better-informed suggestion.

  Regards  --  Gerrit

---------------------------------------------------------------------
Dr. Gerrit Eichner                   Mathematical Institute, Room 212
gerrit.eichner at math.uni-giessen.de   Justus-Liebig-University Giessen
Tel: +49-(0)641-99-32104          Arndtstr. 2, 35392 Giessen, Germany
http://www.uni-giessen.de/eichner
---------------------------------------------------------------------
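
The conversion suggested above can be sketched as follows. This is only an illustration under assumptions: the data.table package is installed, and the toy table (values invented) stands in for the poster's data. Once dt is a plain data frame, single-bracket indexing with a character vector selects columns, so the originally proposed call works.

```r
library(data.table)

# Toy stand-in for the poster's table (values invented for illustration).
dt <- data.table(
  chr          = c("chr1", "chr1", "chr1", "chr2"),
  pos          = c(54490L, 54490L, 58814L, 54490L),
  gene_id      = rep("ENSG00000227232", 4),
  pval_nominal = c(0.61, 0.30, 0.44, 0.32)
)

# Convert to a base R data frame.
df <- as.data.frame(dt)

# On a data frame, df[c(...)] picks columns by name.
udf <- unique(df[c("chr", "pos", "gene_id")])

# Number of distinct (chr, pos, gene_id) combinations.
n_unique_df <- nrow(udf)
print(n_unique_df)
```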


Ana Marija

2019-Nov-08 15:30 UTC

Thank you so much! Converting it to data frame resolved the issue!
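
For a table with roughly 73 million rows, converting to a data frame probably means holding a second copy in memory. A hedged alternative that stays within data.table is sketched below, assuming the package is installed; the toy table and the variable names (key_cols, first_rows, combo_counts) are made up for illustration.

```r
library(data.table)

# Toy stand-in for the poster's table (values invented for illustration).
dt <- data.table(
  chr          = c("chr1", "chr1", "chr1", "chr2"),
  pos          = c(54490L, 54490L, 58814L, 54490L),
  gene_id      = rep("ENSG00000227232", 4),
  pval_nominal = c(0.61, 0.30, 0.44, 0.32)
)

key_cols <- c("chr", "pos", "gene_id")

# Count distinct combinations of the key columns directly.
n_unique <- uniqueN(dt, by = key_cols)

# Keep the first row of each combination, with all columns retained.
first_rows <- unique(dt, by = key_cols)

# How many rows fall into each combination.
combo_counts <- dt[, .N, by = key_cols]

print(n_unique)
print(combo_counts)
```

uniqueN returns only the count, unique(dt, by = ...) returns the deduplicated rows, and the grouped .N shows how often each combination occurs.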

