Ana Marija
2019-Nov-08 18:29 UTC
[R] how to find number of unique rows for combination of r columns
Thank you so much!!! On Fri, Nov 8, 2019 at 11:40 AM Bert Gunter <bgunter.4567 at gmail.com> wrote:> > Correction: > df <- data.frame(a = 1:3, b = letters[c(1,1,2)], d = LETTERS[c(1,1,2)]) > df[!duplicated(df[,2:3]), ] ## Note the ! sign > > Bert Gunter > > "The trouble with having an open mind is that people keep coming along and sticking things into it." > -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip ) > > > On Fri, Nov 8, 2019 at 7:59 AM Bert Gunter <bgunter.4567 at gmail.com> wrote: >> >> Sorry, but you ask basic questions.You really need to spend some more time with an R tutorial or two. This list is not meant to replace your own learning efforts. >> >> You also do not seem to be reading the docs carefully. Under ?unique, it links ?duplicated and tells you that it gives indices of duplicated rows of a data frame. These then can be used by subscripting to remove those rows from the data frame. Here is a reproducible example: >> >> df <- data.frame(a = 1:3, b = letters[c(1,1,2)], d = LETTERS[c(1,1,2)]) >> df[-duplicated(df[,2:3]), ] ## Note the - sign >> >> If you prefer, the "Tidyverse" world has what are purported to be more user-friendly versions of such data handling functionality that you can use instead. >> >> >> Bert >> >> On Fri, Nov 8, 2019 at 7:38 AM Ana Marija <sokovic.anamarija at gmail.com> wrote: >>> >>> would you know how would I extract from my original data frame, just >>> these unique rows? >>> because this gives me only those 3 columns, and I want all columns >>> from the original data frame >>> >>> > head(udt) >>> chr pos gene_id >>> 1 chr1 54490 ENSG00000227232 >>> 2 chr1 58814 ENSG00000227232 >>> 3 chr1 60351 ENSG00000227232 >>> 4 chr1 61920 ENSG00000227232 >>> 5 chr1 63671 ENSG00000227232 >>> 6 chr1 64931 ENSG00000227232 >>> >>> > head(dt) >>> chr pos gene_id pval_nominal pval_ret wl wr META >>> 1: chr1 54490 ENSG00000227232 0.608495 0.783778 31.62278 21.2838 0.7475480 >>> 2: chr1 58814 ENSG00000227232 0.295211 0.897582 31.62278 21.2838 0.6031214 >>> 3: chr1 60351 ENSG00000227232 0.439788 0.867959 31.62278 21.2838 0.6907182 >>> 4: chr1 61920 ENSG00000227232 0.319528 0.601809 31.62278 21.2838 0.4032200 >>> 5: chr1 63671 ENSG00000227232 0.237739 0.988039 31.62278 21.2838 0.7482519 >>> 6: chr1 64931 ENSG00000227232 0.276679 0.907037 31.62278 21.2838 0.5974800 >>> >>> On Fri, Nov 8, 2019 at 9:30 AM Ana Marija <sokovic.anamarija at gmail.com> wrote: >>> > >>> > Thank you so much! Converting it to data frame resolved the issue! >>> > >>> > On Fri, Nov 8, 2019 at 9:19 AM Gerrit Eichner >>> > <gerrit.eichner at math.uni-giessen.de> wrote: >>> > > >>> > > It seems as if dt is not a (base R) data frame but a >>> > > data table. I assume, you will have to transform dt >>> > > into a data frame (maybe with as.data.frame) to be >>> > > able to apply unique in the suggested way. However, >>> > > I am not familiar with data tables. Perhaps somebody >>> > > else can provide a more profound guess. >>> > > >>> > > Regards -- Gerrit >>> > > >>> > > --------------------------------------------------------------------- >>> > > Dr. Gerrit Eichner Mathematical Institute, Room 212 >>> > > gerrit.eichner at math.uni-giessen.de Justus-Liebig-University Giessen >>> > > Tel: +49-(0)641-99-32104 Arndtstr. 2, 35392 Giessen, Germany >>> > > http://www.uni-giessen.de/eichner >>> > > --------------------------------------------------------------------- >>> > > >>> > > Am 08.11.2019 um 16:02 schrieb Ana Marija: >>> > > > I tried it but I got this error: >>> > > >> udt <- unique(dt[c("chr", "pos", "gene_id")]) >>> > > > Error in `[.data.table`(dt, c("chr", "pos", "gene_id")) : >>> > > > When i is a data.table (or character vector), the columns to join by >>> > > > must be specified using 'on=' argument (see ?data.table), by keying x >>> > > > (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing >>> > > > column names between x and i (i.e., a natural join). Keyed joins might >>> > > > have further speed benefits on very large data due to x being sorted >>> > > > in RAM. >>> > > > >>> > > > On Fri, Nov 8, 2019 at 8:58 AM Gerrit Eichner >>> > > > <gerrit.eichner at math.uni-giessen.de> wrote: >>> > > >> >>> > > >> Hi, Ana, >>> > > >> >>> > > >> doesn't >>> > > >> >>> > > >> udt <- unique(dt[c("chr", "pos", "gene_id")]) >>> > > >> nrow(udt) >>> > > >> >>> > > >> get close to what you want? >>> > > >> >>> > > >> Hth -- Gerrit >>> > > >> >>> > > >> --------------------------------------------------------------------- >>> > > >> Dr. Gerrit Eichner Mathematical Institute, Room 212 >>> > > >> gerrit.eichner at math.uni-giessen.de Justus-Liebig-University Giessen >>> > > >> Tel: +49-(0)641-99-32104 Arndtstr. 2, 35392 Giessen, Germany >>> > > >> http://www.uni-giessen.de/eichner >>> > > >> --------------------------------------------------------------------- >>> > > >> >>> > > >> Am 08.11.2019 um 15:38 schrieb Ana Marija: >>> > > >>> Hello, >>> > > >>> >>> > > >>> I have a data frame like this: >>> > > >>> >>> > > >>>> head(dt,20) >>> > > >>> chr pos gene_id pval_nominal pval_ret wl wr >>> > > >>> 1: chr1 54490 ENSG00000227232 0.6084950 0.7837780 31.62278 21.2838 >>> > > >>> 2: chr1 58814 ENSG00000227232 0.2952110 0.8975820 31.62278 21.2838 >>> > > >>> 3: chr1 60351 ENSG00000227232 0.4397880 0.8679590 31.62278 21.2838 >>> > > >>> 4: chr1 61920 ENSG00000227232 0.3195280 0.6018090 31.62278 21.2838 >>> > > >>> 5: chr1 63671 ENSG00000227232 0.2377390 0.9880390 31.62278 21.2838 >>> > > >>> 6: chr1 64931 ENSG00000227232 0.2766790 0.9070370 31.62278 21.2838 >>> > > >>> 7: chr1 81587 ENSG00000227232 0.6057930 0.6167630 31.62278 21.2838 >>> > > >>> 8: chr1 115746 ENSG00000227232 0.4078770 0.7799110 31.62278 21.2838 >>> > > >>> 9: chr1 135203 ENSG00000227232 0.4078770 0.9299130 31.62278 21.2838 >>> > > >>> 10: chr1 138593 ENSG00000227232 0.8464560 0.5696060 31.62278 21.2838 >>> > > >>> >>> > > >>> it is very big, >>> > > >>>> dim(dt) >>> > > >>> [1] 73719122 8 >>> > > >>> >>> > > >>> To count number of unique rows for all 3 columns: chr, pos and gene_id >>> > > >>> I could just join those 3 columns and than count. But how would I find >>> > > >>> unique number of rows for these 4 columns without joining them? >>> > > >>> >>> > > >>> Thanks >>> > > >>> Ana >>> > > >>> >>> > > >>> ______________________________________________ >>> > > >>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >>> > > >>> https://stat.ethz.ch/mailman/listinfo/r-help >>> > > >>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html >>> > > >>> and provide commented, minimal, self-contained, reproducible code. >>> > > >>> >>> > > >> >>> > > >> ______________________________________________ >>> > > >> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >>> > > >> https://stat.ethz.ch/mailman/listinfo/r-help >>> > > >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html >>> > > >> and provide commented, minimal, self-contained, reproducible code. >>> >>> ______________________________________________ >>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >>> https://stat.ethz.ch/mailman/listinfo/r-help >>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html >>> and provide commented, minimal, self-contained, reproducible code.
Martin Morgan
2019-Nov-08 21:33 UTC
[R] how to find number of unique rows for combination of r columns
With this example> df = data.frame(a = c(1, 1, 2, 2), b = c(1, 1, 2, 3), value = 1:4) > dfa b value 1 1 1 1 2 1 1 2 3 2 2 3 4 2 3 4 The approach to drop duplicates in the first and second columns has as a consequence the arbitrary choice of 'value' for the duplicate entries -- why chose a value of '1' rather than '2' (or the average of 1 and 2, or a list containing all possible values, or...) for the rows duplicated in columns a and b?> df[!duplicated(df[,1:2]),]a b value 1 1 1 1 3 2 2 3 4 2 3 4 In base R one might> aggregate(value ~ a + b, df, mean)a b value 1 1 1 1.5 2 2 2 3.0 3 2 3 4.0> aggregate(value ~ a + b, df, list)a b value 1 1 1 1, 2 2 2 2 3 3 2 3 4 but handling several value-like columns would be hard(?) Using library(dplyr), I have> group_by(df, a, b) %>% summarize(mean_value = mean(value))# A tibble: 3 x 3 # Groups: a [2] a b mean_value <dbl> <dbl> <dbl> 1 1 1 1.5 2 2 2 3 3 2 3 4 or> group_by(df, a, b) %>% summarize(values = list(value))# A tibble: 3 x 3 # Groups: a [2] a b values <dbl> <dbl> <list> 1 1 1 <int [2]> 2 2 2 <int [1]> 3 2 3 <int [1]> summarizing multiple columns with dplyr> df$v1 = 1:4 > df$v2 = 4:1 > group_by(df, a, b) %>% summarize(v1_mean = mean(v1), v2_median = median(v2))# A tibble: 3 x 4 # Groups: a [2] a b v1_mean v2_median <dbl> <dbl> <dbl> <dbl> 1 1 1 1.5 3.5 2 2 2 3 2 3 2 3 4 1 I do not know how performant this would be with data of your size. Martin Morgan ?On 11/8/19, 1:39 PM, "R-help on behalf of Ana Marija" <r-help-bounces at r-project.org on behalf of sokovic.anamarija at gmail.com> wrote: Thank you so much!!! On Fri, Nov 8, 2019 at 11:40 AM Bert Gunter <bgunter.4567 at gmail.com> wrote: > > Correction: > df <- data.frame(a = 1:3, b = letters[c(1,1,2)], d = LETTERS[c(1,1,2)]) > df[!duplicated(df[,2:3]), ] ## Note the ! sign > > Bert Gunter > > "The trouble with having an open mind is that people keep coming along and sticking things into it." > -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip ) > > > On Fri, Nov 8, 2019 at 7:59 AM Bert Gunter <bgunter.4567 at gmail.com> wrote: >> >> Sorry, but you ask basic questions.You really need to spend some more time with an R tutorial or two. This list is not meant to replace your own learning efforts. >> >> You also do not seem to be reading the docs carefully. Under ?unique, it links ?duplicated and tells you that it gives indices of duplicated rows of a data frame. These then can be used by subscripting to remove those rows from the data frame. Here is a reproducible example: >> >> df <- data.frame(a = 1:3, b = letters[c(1,1,2)], d = LETTERS[c(1,1,2)]) >> df[-duplicated(df[,2:3]), ] ## Note the - sign >> >> If you prefer, the "Tidyverse" world has what are purported to be more user-friendly versions of such data handling functionality that you can use instead. >> >> >> Bert >> >> On Fri, Nov 8, 2019 at 7:38 AM Ana Marija <sokovic.anamarija at gmail.com> wrote: >>> >>> would you know how would I extract from my original data frame, just >>> these unique rows? >>> because this gives me only those 3 columns, and I want all columns >>> from the original data frame >>> >>> > head(udt) >>> chr pos gene_id >>> 1 chr1 54490 ENSG00000227232 >>> 2 chr1 58814 ENSG00000227232 >>> 3 chr1 60351 ENSG00000227232 >>> 4 chr1 61920 ENSG00000227232 >>> 5 chr1 63671 ENSG00000227232 >>> 6 chr1 64931 ENSG00000227232 >>> >>> > head(dt) >>> chr pos gene_id pval_nominal pval_ret wl wr META >>> 1: chr1 54490 ENSG00000227232 0.608495 0.783778 31.62278 21.2838 0.7475480 >>> 2: chr1 58814 ENSG00000227232 0.295211 0.897582 31.62278 21.2838 0.6031214 >>> 3: chr1 60351 ENSG00000227232 0.439788 0.867959 31.62278 21.2838 0.6907182 >>> 4: chr1 61920 ENSG00000227232 0.319528 0.601809 31.62278 21.2838 0.4032200 >>> 5: chr1 63671 ENSG00000227232 0.237739 0.988039 31.62278 21.2838 0.7482519 >>> 6: chr1 64931 ENSG00000227232 0.276679 0.907037 31.62278 21.2838 0.5974800 >>> >>> On Fri, Nov 8, 2019 at 9:30 AM Ana Marija <sokovic.anamarija at gmail.com> wrote: >>> > >>> > Thank you so much! Converting it to data frame resolved the issue! >>> > >>> > On Fri, Nov 8, 2019 at 9:19 AM Gerrit Eichner >>> > <gerrit.eichner at math.uni-giessen.de> wrote: >>> > > >>> > > It seems as if dt is not a (base R) data frame but a >>> > > data table. I assume, you will have to transform dt >>> > > into a data frame (maybe with as.data.frame) to be >>> > > able to apply unique in the suggested way. However, >>> > > I am not familiar with data tables. Perhaps somebody >>> > > else can provide a more profound guess. >>> > > >>> > > Regards -- Gerrit >>> > > >>> > > --------------------------------------------------------------------- >>> > > Dr. Gerrit Eichner Mathematical Institute, Room 212 >>> > > gerrit.eichner at math.uni-giessen.de Justus-Liebig-University Giessen >>> > > Tel: +49-(0)641-99-32104 Arndtstr. 2, 35392 Giessen, Germany >>> > > http://www.uni-giessen.de/eichner >>> > > --------------------------------------------------------------------- >>> > > >>> > > Am 08.11.2019 um 16:02 schrieb Ana Marija: >>> > > > I tried it but I got this error: >>> > > >> udt <- unique(dt[c("chr", "pos", "gene_id")]) >>> > > > Error in `[.data.table`(dt, c("chr", "pos", "gene_id")) : >>> > > > When i is a data.table (or character vector), the columns to join by >>> > > > must be specified using 'on=' argument (see ?data.table), by keying x >>> > > > (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing >>> > > > column names between x and i (i.e., a natural join). Keyed joins might >>> > > > have further speed benefits on very large data due to x being sorted >>> > > > in RAM. >>> > > > >>> > > > On Fri, Nov 8, 2019 at 8:58 AM Gerrit Eichner >>> > > > <gerrit.eichner at math.uni-giessen.de> wrote: >>> > > >> >>> > > >> Hi, Ana, >>> > > >> >>> > > >> doesn't >>> > > >> >>> > > >> udt <- unique(dt[c("chr", "pos", "gene_id")]) >>> > > >> nrow(udt) >>> > > >> >>> > > >> get close to what you want? >>> > > >> >>> > > >> Hth -- Gerrit >>> > > >> >>> > > >> --------------------------------------------------------------------- >>> > > >> Dr. Gerrit Eichner Mathematical Institute, Room 212 >>> > > >> gerrit.eichner at math.uni-giessen.de Justus-Liebig-University Giessen >>> > > >> Tel: +49-(0)641-99-32104 Arndtstr. 2, 35392 Giessen, Germany >>> > > >> http://www.uni-giessen.de/eichner >>> > > >> --------------------------------------------------------------------- >>> > > >> >>> > > >> Am 08.11.2019 um 15:38 schrieb Ana Marija: >>> > > >>> Hello, >>> > > >>> >>> > > >>> I have a data frame like this: >>> > > >>> >>> > > >>>> head(dt,20) >>> > > >>> chr pos gene_id pval_nominal pval_ret wl wr >>> > > >>> 1: chr1 54490 ENSG00000227232 0.6084950 0.7837780 31.62278 21.2838 >>> > > >>> 2: chr1 58814 ENSG00000227232 0.2952110 0.8975820 31.62278 21.2838 >>> > > >>> 3: chr1 60351 ENSG00000227232 0.4397880 0.8679590 31.62278 21.2838 >>> > > >>> 4: chr1 61920 ENSG00000227232 0.3195280 0.6018090 31.62278 21.2838 >>> > > >>> 5: chr1 63671 ENSG00000227232 0.2377390 0.9880390 31.62278 21.2838 >>> > > >>> 6: chr1 64931 ENSG00000227232 0.2766790 0.9070370 31.62278 21.2838 >>> > > >>> 7: chr1 81587 ENSG00000227232 0.6057930 0.6167630 31.62278 21.2838 >>> > > >>> 8: chr1 115746 ENSG00000227232 0.4078770 0.7799110 31.62278 21.2838 >>> > > >>> 9: chr1 135203 ENSG00000227232 0.4078770 0.9299130 31.62278 21.2838 >>> > > >>> 10: chr1 138593 ENSG00000227232 0.8464560 0.5696060 31.62278 21.2838 >>> > > >>> >>> > > >>> it is very big, >>> > > >>>> dim(dt) >>> > > >>> [1] 73719122 8 >>> > > >>> >>> > > >>> To count number of unique rows for all 3 columns: chr, pos and gene_id >>> > > >>> I could just join those 3 columns and than count. But how would I find >>> > > >>> unique number of rows for these 4 columns without joining them? >>> > > >>> >>> > > >>> Thanks >>> > > >>> Ana >>> > > >>> >>> > > >>> ______________________________________________ >>> > > >>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >>> > > >>> https://stat.ethz.ch/mailman/listinfo/r-help >>> > > >>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html >>> > > >>> and provide commented, minimal, self-contained, reproducible code. >>> > > >>> >>> > > >> >>> > > >> ______________________________________________ >>> > > >> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >>> > > >> https://stat.ethz.ch/mailman/listinfo/r-help >>> > > >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html >>> > > >> and provide commented, minimal, self-contained, reproducible code. >>> >>> ______________________________________________ >>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >>> https://stat.ethz.ch/mailman/listinfo/r-help >>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html >>> and provide commented, minimal, self-contained, reproducible code. ______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Rui Barradas
2019-Nov-08 23:02 UTC
[R] how to find number of unique rows for combination of r columns
Hello, If performance is important, and with 73M rows it probably is, take a look at this StackOverflow post. [1] https://stackoverflow.com/a/36058634/8245406 Hope this helps, Rui Barradas ?s 21:33 de 08/11/19, Martin Morgan escreveu:> With this example > >> df = data.frame(a = c(1, 1, 2, 2), b = c(1, 1, 2, 3), value = 1:4) >> df > a b value > 1 1 1 1 > 2 1 1 2 > 3 2 2 3 > 4 2 3 4 > > The approach to drop duplicates in the first and second columns has as a consequence the arbitrary choice of 'value' for the duplicate entries -- why chose a value of '1' rather than '2' (or the average of 1 and 2, or a list containing all possible values, or...) for the rows duplicated in columns a and b? > >> df[!duplicated(df[,1:2]),] > a b value > 1 1 1 1 > 3 2 2 3 > 4 2 3 4 > > In base R one might > >> aggregate(value ~ a + b, df, mean) > a b value > 1 1 1 1.5 > 2 2 2 3.0 > 3 2 3 4.0 >> aggregate(value ~ a + b, df, list) > a b value > 1 1 1 1, 2 > 2 2 2 3 > 3 2 3 4 > > but handling several value-like columns would be hard(?) > > Using library(dplyr), I have > >> group_by(df, a, b) %>% summarize(mean_value = mean(value)) > # A tibble: 3 x 3 > # Groups: a [2] > a b mean_value > <dbl> <dbl> <dbl> > 1 1 1 1.5 > 2 2 2 3 > 3 2 3 4 > > or > >> group_by(df, a, b) %>% summarize(values = list(value)) > # A tibble: 3 x 3 > # Groups: a [2] > a b values > <dbl> <dbl> <list> > 1 1 1 <int [2]> > 2 2 2 <int [1]> > 3 2 3 <int [1]> > > summarizing multiple columns with dplyr > >> df$v1 = 1:4 >> df$v2 = 4:1 >> group_by(df, a, b) %>% summarize(v1_mean = mean(v1), v2_median = median(v2)) > # A tibble: 3 x 4 > # Groups: a [2] > a b v1_mean v2_median > <dbl> <dbl> <dbl> <dbl> > 1 1 1 1.5 3.5 > 2 2 2 3 2 > 3 2 3 4 1 > > I do not know how performant this would be with data of your size. > > Martin Morgan > > ?On 11/8/19, 1:39 PM, "R-help on behalf of Ana Marija" <r-help-bounces at r-project.org on behalf of sokovic.anamarija at gmail.com> wrote: > > Thank you so much!!! > > On Fri, Nov 8, 2019 at 11:40 AM Bert Gunter <bgunter.4567 at gmail.com> wrote: > > > > Correction: > > df <- data.frame(a = 1:3, b = letters[c(1,1,2)], d = LETTERS[c(1,1,2)]) > > df[!duplicated(df[,2:3]), ] ## Note the ! sign > > > > Bert Gunter > > > > "The trouble with having an open mind is that people keep coming along and sticking things into it." > > -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip ) > > > > > > On Fri, Nov 8, 2019 at 7:59 AM Bert Gunter <bgunter.4567 at gmail.com> wrote: > >> > >> Sorry, but you ask basic questions.You really need to spend some more time with an R tutorial or two. This list is not meant to replace your own learning efforts. > >> > >> You also do not seem to be reading the docs carefully. Under ?unique, it links ?duplicated and tells you that it gives indices of duplicated rows of a data frame. These then can be used by subscripting to remove those rows from the data frame. Here is a reproducible example: > >> > >> df <- data.frame(a = 1:3, b = letters[c(1,1,2)], d = LETTERS[c(1,1,2)]) > >> df[-duplicated(df[,2:3]), ] ## Note the - sign > >> > >> If you prefer, the "Tidyverse" world has what are purported to be more user-friendly versions of such data handling functionality that you can use instead. > >> > >> > >> Bert > >> > >> On Fri, Nov 8, 2019 at 7:38 AM Ana Marija <sokovic.anamarija at gmail.com> wrote: > >>> > >>> would you know how would I extract from my original data frame, just > >>> these unique rows? > >>> because this gives me only those 3 columns, and I want all columns > >>> from the original data frame > >>> > >>> > head(udt) > >>> chr pos gene_id > >>> 1 chr1 54490 ENSG00000227232 > >>> 2 chr1 58814 ENSG00000227232 > >>> 3 chr1 60351 ENSG00000227232 > >>> 4 chr1 61920 ENSG00000227232 > >>> 5 chr1 63671 ENSG00000227232 > >>> 6 chr1 64931 ENSG00000227232 > >>> > >>> > head(dt) > >>> chr pos gene_id pval_nominal pval_ret wl wr META > >>> 1: chr1 54490 ENSG00000227232 0.608495 0.783778 31.62278 21.2838 0.7475480 > >>> 2: chr1 58814 ENSG00000227232 0.295211 0.897582 31.62278 21.2838 0.6031214 > >>> 3: chr1 60351 ENSG00000227232 0.439788 0.867959 31.62278 21.2838 0.6907182 > >>> 4: chr1 61920 ENSG00000227232 0.319528 0.601809 31.62278 21.2838 0.4032200 > >>> 5: chr1 63671 ENSG00000227232 0.237739 0.988039 31.62278 21.2838 0.7482519 > >>> 6: chr1 64931 ENSG00000227232 0.276679 0.907037 31.62278 21.2838 0.5974800 > >>> > >>> On Fri, Nov 8, 2019 at 9:30 AM Ana Marija <sokovic.anamarija at gmail.com> wrote: > >>> > > >>> > Thank you so much! Converting it to data frame resolved the issue! > >>> > > >>> > On Fri, Nov 8, 2019 at 9:19 AM Gerrit Eichner > >>> > <gerrit.eichner at math.uni-giessen.de> wrote: > >>> > > > >>> > > It seems as if dt is not a (base R) data frame but a > >>> > > data table. I assume, you will have to transform dt > >>> > > into a data frame (maybe with as.data.frame) to be > >>> > > able to apply unique in the suggested way. However, > >>> > > I am not familiar with data tables. Perhaps somebody > >>> > > else can provide a more profound guess. > >>> > > > >>> > > Regards -- Gerrit > >>> > > > >>> > > --------------------------------------------------------------------- > >>> > > Dr. Gerrit Eichner Mathematical Institute, Room 212 > >>> > > gerrit.eichner at math.uni-giessen.de Justus-Liebig-University Giessen > >>> > > Tel: +49-(0)641-99-32104 Arndtstr. 2, 35392 Giessen, Germany > >>> > > http://www.uni-giessen.de/eichner > >>> > > --------------------------------------------------------------------- > >>> > > > >>> > > Am 08.11.2019 um 16:02 schrieb Ana Marija: > >>> > > > I tried it but I got this error: > >>> > > >> udt <- unique(dt[c("chr", "pos", "gene_id")]) > >>> > > > Error in `[.data.table`(dt, c("chr", "pos", "gene_id")) : > >>> > > > When i is a data.table (or character vector), the columns to join by > >>> > > > must be specified using 'on=' argument (see ?data.table), by keying x > >>> > > > (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing > >>> > > > column names between x and i (i.e., a natural join). Keyed joins might > >>> > > > have further speed benefits on very large data due to x being sorted > >>> > > > in RAM. > >>> > > > > >>> > > > On Fri, Nov 8, 2019 at 8:58 AM Gerrit Eichner > >>> > > > <gerrit.eichner at math.uni-giessen.de> wrote: > >>> > > >> > >>> > > >> Hi, Ana, > >>> > > >> > >>> > > >> doesn't > >>> > > >> > >>> > > >> udt <- unique(dt[c("chr", "pos", "gene_id")]) > >>> > > >> nrow(udt) > >>> > > >> > >>> > > >> get close to what you want? > >>> > > >> > >>> > > >> Hth -- Gerrit > >>> > > >> > >>> > > >> --------------------------------------------------------------------- > >>> > > >> Dr. Gerrit Eichner Mathematical Institute, Room 212 > >>> > > >> gerrit.eichner at math.uni-giessen.de Justus-Liebig-University Giessen > >>> > > >> Tel: +49-(0)641-99-32104 Arndtstr. 2, 35392 Giessen, Germany > >>> > > >> http://www.uni-giessen.de/eichner > >>> > > >> --------------------------------------------------------------------- > >>> > > >> > >>> > > >> Am 08.11.2019 um 15:38 schrieb Ana Marija: > >>> > > >>> Hello, > >>> > > >>> > >>> > > >>> I have a data frame like this: > >>> > > >>> > >>> > > >>>> head(dt,20) > >>> > > >>> chr pos gene_id pval_nominal pval_ret wl wr > >>> > > >>> 1: chr1 54490 ENSG00000227232 0.6084950 0.7837780 31.62278 21.2838 > >>> > > >>> 2: chr1 58814 ENSG00000227232 0.2952110 0.8975820 31.62278 21.2838 > >>> > > >>> 3: chr1 60351 ENSG00000227232 0.4397880 0.8679590 31.62278 21.2838 > >>> > > >>> 4: chr1 61920 ENSG00000227232 0.3195280 0.6018090 31.62278 21.2838 > >>> > > >>> 5: chr1 63671 ENSG00000227232 0.2377390 0.9880390 31.62278 21.2838 > >>> > > >>> 6: chr1 64931 ENSG00000227232 0.2766790 0.9070370 31.62278 21.2838 > >>> > > >>> 7: chr1 81587 ENSG00000227232 0.6057930 0.6167630 31.62278 21.2838 > >>> > > >>> 8: chr1 115746 ENSG00000227232 0.4078770 0.7799110 31.62278 21.2838 > >>> > > >>> 9: chr1 135203 ENSG00000227232 0.4078770 0.9299130 31.62278 21.2838 > >>> > > >>> 10: chr1 138593 ENSG00000227232 0.8464560 0.5696060 31.62278 21.2838 > >>> > > >>> > >>> > > >>> it is very big, > >>> > > >>>> dim(dt) > >>> > > >>> [1] 73719122 8 > >>> > > >>> > >>> > > >>> To count number of unique rows for all 3 columns: chr, pos and gene_id > >>> > > >>> I could just join those 3 columns and than count. But how would I find > >>> > > >>> unique number of rows for these 4 columns without joining them? > >>> > > >>> > >>> > > >>> Thanks > >>> > > >>> Ana > >>> > > >>> > >>> > > >>> ______________________________________________ > >>> > > >>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > >>> > > >>> https://stat.ethz.ch/mailman/listinfo/r-help > >>> > > >>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > >>> > > >>> and provide commented, minimal, self-contained, reproducible code. > >>> > > >>> > >>> > > >> > >>> > > >> ______________________________________________ > >>> > > >> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > >>> > > >> https://stat.ethz.ch/mailman/listinfo/r-help > >>> > > >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > >>> > > >> and provide commented, minimal, self-contained, reproducible code. > >>> > >>> ______________________________________________ > >>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > >>> https://stat.ethz.ch/mailman/listinfo/r-help > >>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > >>> and provide commented, minimal, self-contained, reproducible code. > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >