thr3ads.net - R help - [R] how to find number of unique rows for combination of r columns [Nov 2019]

If this information is useful, please help other people find it:
Share via:

Ana Marija

2019-Nov-08 18:29 UTC

[R] how to find number of unique rows for combination of r columns

Thank you so much!!!

On Fri, Nov 8, 2019 at 11:40 AM Bert Gunter <bgunter.4567 at gmail.com>
wrote:>
> Correction:
> df <- data.frame(a = 1:3, b = letters[c(1,1,2)], d = LETTERS[c(1,1,2)])
> df[!duplicated(df[,2:3]), ]  ## Note the ! sign
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along
and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip
)
>
>
> On Fri, Nov 8, 2019 at 7:59 AM Bert Gunter <bgunter.4567 at
gmail.com> wrote:
>>
>> Sorry, but you ask basic questions.You really need to spend some more
time with an R tutorial or two. This list is not meant to replace your own
learning efforts.
>>
>> You also do not seem to be reading the docs carefully. Under ?unique,
it links ?duplicated and tells you that it gives indices of duplicated rows of a
data frame. These then can be used by subscripting to remove those rows from the
data frame. Here is a reproducible example:
>>
>> df <- data.frame(a = 1:3, b = letters[c(1,1,2)], d =
LETTERS[c(1,1,2)])
>> df[-duplicated(df[,2:3]), ]  ## Note the - sign
>>
>> If you prefer, the "Tidyverse" world has what are purported
to be more user-friendly versions of such data handling functionality that you
can use instead.
>>
>>
>> Bert
>>
>> On Fri, Nov 8, 2019 at 7:38 AM Ana Marija <sokovic.anamarija at
gmail.com> wrote:
>>>
>>> would you know how would I extract from my original data frame,
just
>>> these unique rows?
>>> because this gives me only those 3 columns, and I want all columns
>>> from the original data frame
>>>
>>> > head(udt)
>>>    chr   pos         gene_id
>>> 1 chr1 54490 ENSG00000227232
>>> 2 chr1 58814 ENSG00000227232
>>> 3 chr1 60351 ENSG00000227232
>>> 4 chr1 61920 ENSG00000227232
>>> 5 chr1 63671 ENSG00000227232
>>> 6 chr1 64931 ENSG00000227232
>>>
>>> > head(dt)
>>>     chr   pos         gene_id pval_nominal pval_ret       wl     
wr      META
>>> 1: chr1 54490 ENSG00000227232     0.608495 0.783778 31.62278
21.2838 0.7475480
>>> 2: chr1 58814 ENSG00000227232     0.295211 0.897582 31.62278
21.2838 0.6031214
>>> 3: chr1 60351 ENSG00000227232     0.439788 0.867959 31.62278
21.2838 0.6907182
>>> 4: chr1 61920 ENSG00000227232     0.319528 0.601809 31.62278
21.2838 0.4032200
>>> 5: chr1 63671 ENSG00000227232     0.237739 0.988039 31.62278
21.2838 0.7482519
>>> 6: chr1 64931 ENSG00000227232     0.276679 0.907037 31.62278
21.2838 0.5974800
>>>
>>> On Fri, Nov 8, 2019 at 9:30 AM Ana Marija <sokovic.anamarija at
gmail.com> wrote:
>>> >
>>> > Thank you so much! Converting it to data frame resolved the
issue!
>>> >
>>> > On Fri, Nov 8, 2019 at 9:19 AM Gerrit Eichner
>>> > <gerrit.eichner at math.uni-giessen.de> wrote:
>>> > >
>>> > > It seems as if dt is not a (base R) data frame but a
>>> > > data table. I assume, you will have to transform dt
>>> > > into a data frame (maybe with as.data.frame) to be
>>> > > able to apply unique in the suggested way. However,
>>> > > I am not familiar with data tables. Perhaps somebody
>>> > > else can provide a more profound guess.
>>> > >
>>> > >   Regards  --  Gerrit
>>> > >
>>> > >
---------------------------------------------------------------------
>>> > > Dr. Gerrit Eichner                   Mathematical
Institute, Room 212
>>> > > gerrit.eichner at math.uni-giessen.de  
Justus-Liebig-University Giessen
>>> > > Tel: +49-(0)641-99-32104          Arndtstr. 2, 35392
Giessen, Germany
>>> > > http://www.uni-giessen.de/eichner
>>> > >
---------------------------------------------------------------------
>>> > >
>>> > > Am 08.11.2019 um 16:02 schrieb Ana Marija:
>>> > > > I tried it but I got this error:
>>> > > >> udt <- unique(dt[c("chr",
"pos", "gene_id")])
>>> > > > Error in `[.data.table`(dt, c("chr",
"pos", "gene_id")) :
>>> > > >    When i is a data.table (or character vector), the
columns to join by
>>> > > > must be specified using 'on=' argument (see
?data.table), by keying x
>>> > > > (i.e. sorted, and, marked as sorted, see ?setkey),
or by sharing
>>> > > > column names between x and i (i.e., a natural join).
Keyed joins might
>>> > > > have further speed benefits on very large data due
to x being sorted
>>> > > > in RAM.
>>> > > >
>>> > > > On Fri, Nov 8, 2019 at 8:58 AM Gerrit Eichner
>>> > > > <gerrit.eichner at math.uni-giessen.de> wrote:
>>> > > >>
>>> > > >> Hi, Ana,
>>> > > >>
>>> > > >> doesn't
>>> > > >>
>>> > > >> udt <- unique(dt[c("chr",
"pos", "gene_id")])
>>> > > >> nrow(udt)
>>> > > >>
>>> > > >> get close to what you want?
>>> > > >>
>>> > > >>    Hth  --  Gerrit
>>> > > >>
>>> > > >>
---------------------------------------------------------------------
>>> > > >> Dr. Gerrit Eichner                  
Mathematical Institute, Room 212
>>> > > >> gerrit.eichner at math.uni-giessen.de  
Justus-Liebig-University Giessen
>>> > > >> Tel: +49-(0)641-99-32104          Arndtstr. 2,
35392 Giessen, Germany
>>> > > >> http://www.uni-giessen.de/eichner
>>> > > >>
---------------------------------------------------------------------
>>> > > >>
>>> > > >> Am 08.11.2019 um 15:38 schrieb Ana Marija:
>>> > > >>> Hello,
>>> > > >>>
>>> > > >>> I have a data frame like this:
>>> > > >>>
>>> > > >>>> head(dt,20)
>>> > > >>>        chr    pos         gene_id
pval_nominal  pval_ret       wl      wr
>>> > > >>>    1: chr1  54490 ENSG00000227232   
0.6084950 0.7837780 31.62278 21.2838
>>> > > >>>    2: chr1  58814 ENSG00000227232   
0.2952110 0.8975820 31.62278 21.2838
>>> > > >>>    3: chr1  60351 ENSG00000227232   
0.4397880 0.8679590 31.62278 21.2838
>>> > > >>>    4: chr1  61920 ENSG00000227232   
0.3195280 0.6018090 31.62278 21.2838
>>> > > >>>    5: chr1  63671 ENSG00000227232   
0.2377390 0.9880390 31.62278 21.2838
>>> > > >>>    6: chr1  64931 ENSG00000227232   
0.2766790 0.9070370 31.62278 21.2838
>>> > > >>>    7: chr1  81587 ENSG00000227232   
0.6057930 0.6167630 31.62278 21.2838
>>> > > >>>    8: chr1 115746 ENSG00000227232   
0.4078770 0.7799110 31.62278 21.2838
>>> > > >>>    9: chr1 135203 ENSG00000227232   
0.4078770 0.9299130 31.62278 21.2838
>>> > > >>> 10: chr1 138593 ENSG00000227232    0.8464560
0.5696060 31.62278 21.2838
>>> > > >>>
>>> > > >>> it is very big,
>>> > > >>>> dim(dt)
>>> > > >>> [1] 73719122        8
>>> > > >>>
>>> > > >>> To count number of unique rows for all 3
columns: chr, pos and gene_id
>>> > > >>> I could just join those 3 columns and than
count. But how would I find
>>> > > >>> unique number of rows for these 4 columns
without joining them?
>>> > > >>>
>>> > > >>> Thanks
>>> > > >>> Ana
>>> > > >>>
>>> > > >>>
______________________________________________
>>> > > >>> R-help at r-project.org mailing list -- To
UNSUBSCRIBE and more, see
>>> > > >>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> > > >>> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
>>> > > >>> and provide commented, minimal,
self-contained, reproducible code.
>>> > > >>>
>>> > > >>
>>> > > >> ______________________________________________
>>> > > >> R-help at r-project.org mailing list -- To
UNSUBSCRIBE and more, see
>>> > > >> https://stat.ethz.ch/mailman/listinfo/r-help
>>> > > >> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
>>> > > >> and provide commented, minimal, self-contained,
reproducible code.
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more,
see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.

Martin Morgan

2019-Nov-08 21:33 UTC

head link

[R] how to find number of unique rows for combination of r columns

With this example
> df = data.frame(a = c(1, 1, 2, 2), b = c(1, 1, 2, 3), value = 1:4)
> df  a b value
1 1 1     1
2 1 1     2
3 2 2     3
4 2 3     4

The approach to drop duplicates in the first and second columns has as a
consequence the arbitrary choice of 'value' for the duplicate entries --
why chose a value of '1' rather than '2' (or the average of 1
and 2, or a list containing all possible values, or...) for the rows duplicated
in columns a and b?
> df[!duplicated(df[,1:2]),]  a b value
1 1 1     1
3 2 2     3
4 2 3     4

In base R one might
> aggregate(value ~ a + b, df, mean)  a b value
1 1 1   1.5
2 2 2   3.0
3 2 3   4.0> aggregate(value ~ a + b, df, list)  a b value
1 1 1  1, 2
2 2 2     3
3 2 3     4

but handling several value-like columns would be hard(?)

Using library(dplyr), I have
> group_by(df, a, b) %>% summarize(mean_value = mean(value))# A tibble: 3 x 3
# Groups:   a [2]
      a     b mean_value
  <dbl> <dbl>      <dbl>
1     1     1        1.5
2     2     2        3
3     2     3        4

or
> group_by(df, a, b) %>% summarize(values = list(value))# A tibble: 3 x 3
# Groups:   a [2]
      a     b values
  <dbl> <dbl> <list>
1     1     1 <int [2]>
2     2     2 <int [1]>
3     2     3 <int [1]>

summarizing multiple columns with dplyr
> df$v1 = 1:4
> df$v2 = 4:1
>  group_by(df, a, b) %>% summarize(v1_mean = mean(v1), v2_median =
median(v2))# A tibble: 3 x 4
# Groups:   a [2]
      a     b v1_mean v2_median
  <dbl> <dbl>   <dbl>     <dbl>
1     1     1     1.5       3.5
2     2     2     3         2
3     2     3     4         1

I do not know how performant this would be with data of your size.

Martin Morgan

?On 11/8/19, 1:39 PM, "R-help on behalf of Ana Marija"
<r-help-bounces at r-project.org on behalf of sokovic.anamarija at
gmail.com> wrote:

    Thank you so much!!!
    
    On Fri, Nov 8, 2019 at 11:40 AM Bert Gunter <bgunter.4567 at
gmail.com> wrote:
    >
    > Correction:
    > df <- data.frame(a = 1:3, b = letters[c(1,1,2)], d =
LETTERS[c(1,1,2)])
    > df[!duplicated(df[,2:3]), ]  ## Note the ! sign
    >
    > Bert Gunter
    >
    > "The trouble with having an open mind is that people keep coming
along and sticking things into it."
    > -- Opus (aka Berkeley Breathed in his "Bloom County" comic
strip )
    >
    >
    > On Fri, Nov 8, 2019 at 7:59 AM Bert Gunter <bgunter.4567 at
gmail.com> wrote:
    >>
    >> Sorry, but you ask basic questions.You really need to spend some
more time with an R tutorial or two. This list is not meant to replace your own
learning efforts.
    >>
    >> You also do not seem to be reading the docs carefully. Under
?unique, it links ?duplicated and tells you that it gives indices of duplicated
rows of a data frame. These then can be used by subscripting to remove those
rows from the data frame. Here is a reproducible example:
    >>
    >> df <- data.frame(a = 1:3, b = letters[c(1,1,2)], d =
LETTERS[c(1,1,2)])
    >> df[-duplicated(df[,2:3]), ]  ## Note the - sign
    >>
    >> If you prefer, the "Tidyverse" world has what are
purported to be more user-friendly versions of such data handling functionality
that you can use instead.
    >>
    >>
    >> Bert
    >>
    >> On Fri, Nov 8, 2019 at 7:38 AM Ana Marija <sokovic.anamarija at
gmail.com> wrote:
    >>>
    >>> would you know how would I extract from my original data frame,
just
    >>> these unique rows?
    >>> because this gives me only those 3 columns, and I want all
columns
    >>> from the original data frame
    >>>
    >>> > head(udt)
    >>>    chr   pos         gene_id
    >>> 1 chr1 54490 ENSG00000227232
    >>> 2 chr1 58814 ENSG00000227232
    >>> 3 chr1 60351 ENSG00000227232
    >>> 4 chr1 61920 ENSG00000227232
    >>> 5 chr1 63671 ENSG00000227232
    >>> 6 chr1 64931 ENSG00000227232
    >>>
    >>> > head(dt)
    >>>     chr   pos         gene_id pval_nominal pval_ret       wl   
wr      META
    >>> 1: chr1 54490 ENSG00000227232     0.608495 0.783778 31.62278
21.2838 0.7475480
    >>> 2: chr1 58814 ENSG00000227232     0.295211 0.897582 31.62278
21.2838 0.6031214
    >>> 3: chr1 60351 ENSG00000227232     0.439788 0.867959 31.62278
21.2838 0.6907182
    >>> 4: chr1 61920 ENSG00000227232     0.319528 0.601809 31.62278
21.2838 0.4032200
    >>> 5: chr1 63671 ENSG00000227232     0.237739 0.988039 31.62278
21.2838 0.7482519
    >>> 6: chr1 64931 ENSG00000227232     0.276679 0.907037 31.62278
21.2838 0.5974800
    >>>
    >>> On Fri, Nov 8, 2019 at 9:30 AM Ana Marija <sokovic.anamarija
at gmail.com> wrote:
    >>> >
    >>> > Thank you so much! Converting it to data frame resolved
the issue!
    >>> >
    >>> > On Fri, Nov 8, 2019 at 9:19 AM Gerrit Eichner
    >>> > <gerrit.eichner at math.uni-giessen.de> wrote:
    >>> > >
    >>> > > It seems as if dt is not a (base R) data frame but a
    >>> > > data table. I assume, you will have to transform dt
    >>> > > into a data frame (maybe with as.data.frame) to be
    >>> > > able to apply unique in the suggested way. However,
    >>> > > I am not familiar with data tables. Perhaps somebody
    >>> > > else can provide a more profound guess.
    >>> > >
    >>> > >   Regards  --  Gerrit
    >>> > >
    >>> > >
---------------------------------------------------------------------
    >>> > > Dr. Gerrit Eichner                   Mathematical
Institute, Room 212
    >>> > > gerrit.eichner at math.uni-giessen.de  
Justus-Liebig-University Giessen
    >>> > > Tel: +49-(0)641-99-32104          Arndtstr. 2, 35392
Giessen, Germany
    >>> > > http://www.uni-giessen.de/eichner
    >>> > >
---------------------------------------------------------------------
    >>> > >
    >>> > > Am 08.11.2019 um 16:02 schrieb Ana Marija:
    >>> > > > I tried it but I got this error:
    >>> > > >> udt <- unique(dt[c("chr",
"pos", "gene_id")])
    >>> > > > Error in `[.data.table`(dt, c("chr",
"pos", "gene_id")) :
    >>> > > >    When i is a data.table (or character vector),
the columns to join by
    >>> > > > must be specified using 'on=' argument
(see ?data.table), by keying x
    >>> > > > (i.e. sorted, and, marked as sorted, see
?setkey), or by sharing
    >>> > > > column names between x and i (i.e., a natural
join). Keyed joins might
    >>> > > > have further speed benefits on very large data
due to x being sorted
    >>> > > > in RAM.
    >>> > > >
    >>> > > > On Fri, Nov 8, 2019 at 8:58 AM Gerrit Eichner
    >>> > > > <gerrit.eichner at math.uni-giessen.de>
wrote:
    >>> > > >>
    >>> > > >> Hi, Ana,
    >>> > > >>
    >>> > > >> doesn't
    >>> > > >>
    >>> > > >> udt <- unique(dt[c("chr",
"pos", "gene_id")])
    >>> > > >> nrow(udt)
    >>> > > >>
    >>> > > >> get close to what you want?
    >>> > > >>
    >>> > > >>    Hth  --  Gerrit
    >>> > > >>
    >>> > > >>
---------------------------------------------------------------------
    >>> > > >> Dr. Gerrit Eichner                  
Mathematical Institute, Room 212
    >>> > > >> gerrit.eichner at math.uni-giessen.de  
Justus-Liebig-University Giessen
    >>> > > >> Tel: +49-(0)641-99-32104          Arndtstr.
2, 35392 Giessen, Germany
    >>> > > >> http://www.uni-giessen.de/eichner
    >>> > > >>
---------------------------------------------------------------------
    >>> > > >>
    >>> > > >> Am 08.11.2019 um 15:38 schrieb Ana Marija:
    >>> > > >>> Hello,
    >>> > > >>>
    >>> > > >>> I have a data frame like this:
    >>> > > >>>
    >>> > > >>>> head(dt,20)
    >>> > > >>>        chr    pos         gene_id
pval_nominal  pval_ret       wl      wr
    >>> > > >>>    1: chr1  54490 ENSG00000227232   
0.6084950 0.7837780 31.62278 21.2838
    >>> > > >>>    2: chr1  58814 ENSG00000227232   
0.2952110 0.8975820 31.62278 21.2838
    >>> > > >>>    3: chr1  60351 ENSG00000227232   
0.4397880 0.8679590 31.62278 21.2838
    >>> > > >>>    4: chr1  61920 ENSG00000227232   
0.3195280 0.6018090 31.62278 21.2838
    >>> > > >>>    5: chr1  63671 ENSG00000227232   
0.2377390 0.9880390 31.62278 21.2838
    >>> > > >>>    6: chr1  64931 ENSG00000227232   
0.2766790 0.9070370 31.62278 21.2838
    >>> > > >>>    7: chr1  81587 ENSG00000227232   
0.6057930 0.6167630 31.62278 21.2838
    >>> > > >>>    8: chr1 115746 ENSG00000227232   
0.4078770 0.7799110 31.62278 21.2838
    >>> > > >>>    9: chr1 135203 ENSG00000227232   
0.4078770 0.9299130 31.62278 21.2838
    >>> > > >>> 10: chr1 138593 ENSG00000227232   
0.8464560 0.5696060 31.62278 21.2838
    >>> > > >>>
    >>> > > >>> it is very big,
    >>> > > >>>> dim(dt)
    >>> > > >>> [1] 73719122        8
    >>> > > >>>
    >>> > > >>> To count number of unique rows for all 3
columns: chr, pos and gene_id
    >>> > > >>> I could just join those 3 columns and
than count. But how would I find
    >>> > > >>> unique number of rows for these 4
columns without joining them?
    >>> > > >>>
    >>> > > >>> Thanks
    >>> > > >>> Ana
    >>> > > >>>
    >>> > > >>>
______________________________________________
    >>> > > >>> R-help at r-project.org mailing list --
To UNSUBSCRIBE and more, see
    >>> > > >>>
https://stat.ethz.ch/mailman/listinfo/r-help
    >>> > > >>> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
    >>> > > >>> and provide commented, minimal,
self-contained, reproducible code.
    >>> > > >>>
    >>> > > >>
    >>> > > >>
______________________________________________
    >>> > > >> R-help at r-project.org mailing list -- To
UNSUBSCRIBE and more, see
    >>> > > >> https://stat.ethz.ch/mailman/listinfo/r-help
    >>> > > >> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
    >>> > > >> and provide commented, minimal,
self-contained, reproducible code.
    >>>
    >>> ______________________________________________
    >>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and
more, see
    >>> https://stat.ethz.ch/mailman/listinfo/r-help
    >>> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
    >>> and provide commented, minimal, self-contained, reproducible
code.
    
    ______________________________________________
    R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
    https://stat.ethz.ch/mailman/listinfo/r-help
    PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
    and provide commented, minimal, self-contained, reproducible code.

Rui Barradas

2019-Nov-08 23:02 UTC

head link

[R] how to find number of unique rows for combination of r columns

Hello,

If performance is important, and with 73M rows it probably is, take a 
look at this StackOverflow post.

[1] https://stackoverflow.com/a/36058634/8245406


Hope this helps,

Rui Barradas

?s 21:33 de 08/11/19, Martin Morgan escreveu:> With this example
> 
>> df = data.frame(a = c(1, 1, 2, 2), b = c(1, 1, 2, 3), value = 1:4)
>> df
>    a b value
> 1 1 1     1
> 2 1 1     2
> 3 2 2     3
> 4 2 3     4
> 
> The approach to drop duplicates in the first and second columns has as a
consequence the arbitrary choice of 'value' for the duplicate entries --
why chose a value of '1' rather than '2' (or the average of 1
and 2, or a list containing all possible values, or...) for the rows duplicated
in columns a and b?
> 
>> df[!duplicated(df[,1:2]),]
>    a b value
> 1 1 1     1
> 3 2 2     3
> 4 2 3     4
> 
> In base R one might
> 
>> aggregate(value ~ a + b, df, mean)
>    a b value
> 1 1 1   1.5
> 2 2 2   3.0
> 3 2 3   4.0
>> aggregate(value ~ a + b, df, list)
>    a b value
> 1 1 1  1, 2
> 2 2 2     3
> 3 2 3     4
> 
> but handling several value-like columns would be hard(?)
> 
> Using library(dplyr), I have
> 
>> group_by(df, a, b) %>% summarize(mean_value = mean(value))
> # A tibble: 3 x 3
> # Groups:   a [2]
>        a     b mean_value
>    <dbl> <dbl>      <dbl>
> 1     1     1        1.5
> 2     2     2        3
> 3     2     3        4
> 
> or
> 
>> group_by(df, a, b) %>% summarize(values = list(value))
> # A tibble: 3 x 3
> # Groups:   a [2]
>        a     b values
>    <dbl> <dbl> <list>
> 1     1     1 <int [2]>
> 2     2     2 <int [1]>
> 3     2     3 <int [1]>
> 
> summarizing multiple columns with dplyr
> 
>> df$v1 = 1:4
>> df$v2 = 4:1
>>   group_by(df, a, b) %>% summarize(v1_mean = mean(v1), v2_median =
median(v2))
> # A tibble: 3 x 4
> # Groups:   a [2]
>        a     b v1_mean v2_median
>    <dbl> <dbl>   <dbl>     <dbl>
> 1     1     1     1.5       3.5
> 2     2     2     3         2
> 3     2     3     4         1
> 
> I do not know how performant this would be with data of your size.
> 
> Martin Morgan
> 
> ?On 11/8/19, 1:39 PM, "R-help on behalf of Ana Marija"
<r-help-bounces at r-project.org on behalf of sokovic.anamarija at
gmail.com> wrote:
> 
>      Thank you so much!!!
>      
>      On Fri, Nov 8, 2019 at 11:40 AM Bert Gunter <bgunter.4567 at
gmail.com> wrote:
>      >
>      > Correction:
>      > df <- data.frame(a = 1:3, b = letters[c(1,1,2)], d =
LETTERS[c(1,1,2)])
>      > df[!duplicated(df[,2:3]), ]  ## Note the ! sign
>      >
>      > Bert Gunter
>      >
>      > "The trouble with having an open mind is that people keep
coming along and sticking things into it."
>      > -- Opus (aka Berkeley Breathed in his "Bloom County"
comic strip )
>      >
>      >
>      > On Fri, Nov 8, 2019 at 7:59 AM Bert Gunter <bgunter.4567 at
gmail.com> wrote:
>      >>
>      >> Sorry, but you ask basic questions.You really need to spend
some more time with an R tutorial or two. This list is not meant to replace your
own learning efforts.
>      >>
>      >> You also do not seem to be reading the docs carefully. Under
?unique, it links ?duplicated and tells you that it gives indices of duplicated
rows of a data frame. These then can be used by subscripting to remove those
rows from the data frame. Here is a reproducible example:
>      >>
>      >> df <- data.frame(a = 1:3, b = letters[c(1,1,2)], d =
LETTERS[c(1,1,2)])
>      >> df[-duplicated(df[,2:3]), ]  ## Note the - sign
>      >>
>      >> If you prefer, the "Tidyverse" world has what are
purported to be more user-friendly versions of such data handling functionality
that you can use instead.
>      >>
>      >>
>      >> Bert
>      >>
>      >> On Fri, Nov 8, 2019 at 7:38 AM Ana Marija
<sokovic.anamarija at gmail.com> wrote:
>      >>>
>      >>> would you know how would I extract from my original data
frame, just
>      >>> these unique rows?
>      >>> because this gives me only those 3 columns, and I want
all columns
>      >>> from the original data frame
>      >>>
>      >>> > head(udt)
>      >>>    chr   pos         gene_id
>      >>> 1 chr1 54490 ENSG00000227232
>      >>> 2 chr1 58814 ENSG00000227232
>      >>> 3 chr1 60351 ENSG00000227232
>      >>> 4 chr1 61920 ENSG00000227232
>      >>> 5 chr1 63671 ENSG00000227232
>      >>> 6 chr1 64931 ENSG00000227232
>      >>>
>      >>> > head(dt)
>      >>>     chr   pos         gene_id pval_nominal pval_ret      
wl      wr      META
>      >>> 1: chr1 54490 ENSG00000227232     0.608495 0.783778
31.62278 21.2838 0.7475480
>      >>> 2: chr1 58814 ENSG00000227232     0.295211 0.897582
31.62278 21.2838 0.6031214
>      >>> 3: chr1 60351 ENSG00000227232     0.439788 0.867959
31.62278 21.2838 0.6907182
>      >>> 4: chr1 61920 ENSG00000227232     0.319528 0.601809
31.62278 21.2838 0.4032200
>      >>> 5: chr1 63671 ENSG00000227232     0.237739 0.988039
31.62278 21.2838 0.7482519
>      >>> 6: chr1 64931 ENSG00000227232     0.276679 0.907037
31.62278 21.2838 0.5974800
>      >>>
>      >>> On Fri, Nov 8, 2019 at 9:30 AM Ana Marija
<sokovic.anamarija at gmail.com> wrote:
>      >>> >
>      >>> > Thank you so much! Converting it to data frame
resolved the issue!
>      >>> >
>      >>> > On Fri, Nov 8, 2019 at 9:19 AM Gerrit Eichner
>      >>> > <gerrit.eichner at math.uni-giessen.de> wrote:
>      >>> > >
>      >>> > > It seems as if dt is not a (base R) data frame
but a
>      >>> > > data table. I assume, you will have to
transform dt
>      >>> > > into a data frame (maybe with as.data.frame) to
be
>      >>> > > able to apply unique in the suggested way.
However,
>      >>> > > I am not familiar with data tables. Perhaps
somebody
>      >>> > > else can provide a more profound guess.
>      >>> > >
>      >>> > >   Regards  --  Gerrit
>      >>> > >
>      >>> > >
---------------------------------------------------------------------
>      >>> > > Dr. Gerrit Eichner                  
Mathematical Institute, Room 212
>      >>> > > gerrit.eichner at math.uni-giessen.de  
Justus-Liebig-University Giessen
>      >>> > > Tel: +49-(0)641-99-32104          Arndtstr. 2,
35392 Giessen, Germany
>      >>> > > http://www.uni-giessen.de/eichner
>      >>> > >
---------------------------------------------------------------------
>      >>> > >
>      >>> > > Am 08.11.2019 um 16:02 schrieb Ana Marija:
>      >>> > > > I tried it but I got this error:
>      >>> > > >> udt <- unique(dt[c("chr",
"pos", "gene_id")])
>      >>> > > > Error in `[.data.table`(dt,
c("chr", "pos", "gene_id")) :
>      >>> > > >    When i is a data.table (or character
vector), the columns to join by
>      >>> > > > must be specified using 'on='
argument (see ?data.table), by keying x
>      >>> > > > (i.e. sorted, and, marked as sorted, see
?setkey), or by sharing
>      >>> > > > column names between x and i (i.e., a
natural join). Keyed joins might
>      >>> > > > have further speed benefits on very large
data due to x being sorted
>      >>> > > > in RAM.
>      >>> > > >
>      >>> > > > On Fri, Nov 8, 2019 at 8:58 AM Gerrit
Eichner
>      >>> > > > <gerrit.eichner at
math.uni-giessen.de> wrote:
>      >>> > > >>
>      >>> > > >> Hi, Ana,
>      >>> > > >>
>      >>> > > >> doesn't
>      >>> > > >>
>      >>> > > >> udt <- unique(dt[c("chr",
"pos", "gene_id")])
>      >>> > > >> nrow(udt)
>      >>> > > >>
>      >>> > > >> get close to what you want?
>      >>> > > >>
>      >>> > > >>    Hth  --  Gerrit
>      >>> > > >>
>      >>> > > >>
---------------------------------------------------------------------
>      >>> > > >> Dr. Gerrit Eichner                  
Mathematical Institute, Room 212
>      >>> > > >> gerrit.eichner at math.uni-giessen.de 
Justus-Liebig-University Giessen
>      >>> > > >> Tel: +49-(0)641-99-32104         
Arndtstr. 2, 35392 Giessen, Germany
>      >>> > > >> http://www.uni-giessen.de/eichner
>      >>> > > >>
---------------------------------------------------------------------
>      >>> > > >>
>      >>> > > >> Am 08.11.2019 um 15:38 schrieb Ana
Marija:
>      >>> > > >>> Hello,
>      >>> > > >>>
>      >>> > > >>> I have a data frame like this:
>      >>> > > >>>
>      >>> > > >>>> head(dt,20)
>      >>> > > >>>        chr    pos         gene_id
pval_nominal  pval_ret       wl      wr
>      >>> > > >>>    1: chr1  54490 ENSG00000227232 
0.6084950 0.7837780 31.62278 21.2838
>      >>> > > >>>    2: chr1  58814 ENSG00000227232 
0.2952110 0.8975820 31.62278 21.2838
>      >>> > > >>>    3: chr1  60351 ENSG00000227232 
0.4397880 0.8679590 31.62278 21.2838
>      >>> > > >>>    4: chr1  61920 ENSG00000227232 
0.3195280 0.6018090 31.62278 21.2838
>      >>> > > >>>    5: chr1  63671 ENSG00000227232 
0.2377390 0.9880390 31.62278 21.2838
>      >>> > > >>>    6: chr1  64931 ENSG00000227232 
0.2766790 0.9070370 31.62278 21.2838
>      >>> > > >>>    7: chr1  81587 ENSG00000227232 
0.6057930 0.6167630 31.62278 21.2838
>      >>> > > >>>    8: chr1 115746 ENSG00000227232 
0.4078770 0.7799110 31.62278 21.2838
>      >>> > > >>>    9: chr1 135203 ENSG00000227232 
0.4078770 0.9299130 31.62278 21.2838
>      >>> > > >>> 10: chr1 138593 ENSG00000227232   
0.8464560 0.5696060 31.62278 21.2838
>      >>> > > >>>
>      >>> > > >>> it is very big,
>      >>> > > >>>> dim(dt)
>      >>> > > >>> [1] 73719122        8
>      >>> > > >>>
>      >>> > > >>> To count number of unique rows for
all 3 columns: chr, pos and gene_id
>      >>> > > >>> I could just join those 3 columns
and than count. But how would I find
>      >>> > > >>> unique number of rows for these 4
columns without joining them?
>      >>> > > >>>
>      >>> > > >>> Thanks
>      >>> > > >>> Ana
>      >>> > > >>>
>      >>> > > >>>
______________________________________________
>      >>> > > >>> R-help at r-project.org mailing
list -- To UNSUBSCRIBE and more, see
>      >>> > > >>>
https://stat.ethz.ch/mailman/listinfo/r-help
>      >>> > > >>> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
>      >>> > > >>> and provide commented, minimal,
self-contained, reproducible code.
>      >>> > > >>>
>      >>> > > >>
>      >>> > > >>
______________________________________________
>      >>> > > >> R-help at r-project.org mailing list
-- To UNSUBSCRIBE and more, see
>      >>> > > >>
https://stat.ethz.ch/mailman/listinfo/r-help
>      >>> > > >> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
>      >>> > > >> and provide commented, minimal,
self-contained, reproducible code.
>      >>>
>      >>> ______________________________________________
>      >>> R-help at r-project.org mailing list -- To UNSUBSCRIBE
and more, see
>      >>> https://stat.ethz.ch/mailman/listinfo/r-help
>      >>> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
>      >>> and provide commented, minimal, self-contained,
reproducible code.
>      
>      ______________________________________________
>      R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>      https://stat.ethz.ch/mailman/listinfo/r-help
>      PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
>      and provide commented, minimal, self-contained, reproducible code.
>      
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

R help - Nov 2019 - how to find number of unique rows for combination of r columns

[R] how to find number of unique rows for combination of r columns

[R] how to find number of unique rows for combination of r columns

[R] how to find number of unique rows for combination of r columns