thr3ads.net - R help - [R] fast way to find most common value across columns dataframe [Oct 2020]

Bert Gunter

2020-Oct-31 16:40 UTC

[R] fast way to find most common value across columns dataframe

As usual, a web search ("find statistical mode in R") brought up something
that is possibly useful. Did you try this before posting? If not, please
do so in future, and let us know what your results were if you subsequently
post here.

Here's what SO suggested:

Mode <- function(x) {
   ux <- unique(x)
   ux[which.max(tabulate(match(x, ux)))]
}

# ergo:
apply(as.matrix(df),1,Mode)

Note that all the work in Mode is done by functions that call .Internal
code (unique, match, tabulate, which.max). You can time it against Jim's
code for your use case, but I'm pretty sure it will be faster than yours.
However, note that this returns only the *first* mode if there is more than
one, while Jim's code alerts you to multiple modes.
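
A minimal sketch of how the first-mode behaviour could be changed to flag ties instead. The names mode_or_tie and tie_label are hypothetical and not from the thread or from prettyR; the label ">1 mode" only mimics the output Jim shows. The id column n is dropped before counting, on the assumption that it should not take part in the vote.

```r
# Hypothetical helper (not from the thread): like Mode() above, but
# returns a label when more than one value has the highest count.
mode_or_tie <- function(x, tie_label = ">1 mode") {
  ux <- unique(x)
  counts <- tabulate(match(x, ux))   # count of each distinct value
  top <- ux[counts == max(counts)]   # all values that reach the maximum
  if (length(top) > 1L) tie_label else top
}

# Small made-up example; column n is an id and is dropped before counting.
df <- data.frame(n = 1:3,
                 a = c("A", "B", "C"),
                 b = c("A", "C", "C"),
                 c = c("B", "C", "D"),
                 d = c("B", "D", "C"),
                 stringsAsFactors = FALSE)

res <- apply(as.matrix(df[-1]), 1, mode_or_tie)
# Row 1 is a tie (A and B occur twice each); rows 2 and 3 give "C".
print(unname(res))
```

This still calls the function once per row, so it should cost about the same as Mode() plus one comparison; whether that matters would need timing on the real data.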

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)


On Sat, Oct 31, 2020 at 2:29 AM Jim Lemon <drjimlemon at gmail.com> wrote:
> Hi Luigi,
> If I understand your request:
>
> library(prettyR)
> apply(as.matrix(df),1,Mode)
> [1] "C"       "B"       "D"       ">1 mode" ">1 mode" ">1 mode" "D"
> [8] "C"       "B"       ">1 mode"
>
> Jim
>
> On Sat, Oct 31, 2020 at 7:56 PM Luigi Marongiu <marongiu.luigi at gmail.com> wrote:
>
> > Hello,
> > I have a large dataframe (1 000 000 rows, 1000 columns) where the
> > columns contain a character. I would like to determine the most common
> > character for each row.
> > In the example below, I can parse one row at the time and find the
> > most common character (apart for ties...). But I think this will be
> > very slow and memory consuming.
> > Is there a way to run it more efficiently?
> > Thank you
> >
> > ```
> > V = c("A", "B", "C", "D")
> > df = data.frame(n = 1:10,
> >        col_01 = sample(V, 10, replace = TRUE, prob = NULL),
> >        col_02 = sample(V, 10, replace = TRUE, prob = NULL),
> >        col_03 = sample(V, 10, replace = TRUE, prob = NULL),
> >        col_04 = sample(V, 10, replace = TRUE, prob = NULL),
> >        col_05 = sample(V, 10, replace = TRUE, prob = NULL),
> >        stringsAsFactors = FALSE)
> >
> > q = vector()
> > for(i in 1:nrow(df)) {
> >   x = as.vector(t(df[i,2:ncol(df)]))
> >   q[i] =    names(which.max(table(x)))
> > }
> > df$most = q
> > ```
> >
> > ______________________________________________
> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >

Luigi Marongiu

2020-Oct-31 17:12 UTC

Thank you. The problem was not finding the mode but applying it the R
way (I tend to loop over each row of a data frame, which I believe is
NOT the R way).
I'll try both suggestions.
Best regards
Luigi

Rui Barradas

2020-Oct-31 18:14 UTC

Hello,

Here is a comparative test of 3 options.

cumstats::Mode returns a list with two members:

Values: all the modes.
Frequency: their frequency.

The mode values must be extracted from that list afterwards. cumstats::Mode
is by far the slowest but returns more information.

The function below is from this StackOverflow post [1]. It's the fastest
but returns only one mode, the first found.


set.seed(2020)
V <- LETTERS
df <- replicate(100, sample(V, 1000, replace = TRUE))
df <- as.data.frame(t(df))

Mode <- function(x) {
   ux <- unique(x)
   ux[which.max(tabulate(match(x, ux)))]
}

res1 <- apply(df, 1, prettyR::Mode)
res2 <- apply(df, 1, cumstats::Mode)
res3 <- apply(df, 1, Mode)

head(res1)
res2vals <- lapply(res2, '[[', 1)
head(res2vals)
head(res3)

library(microbenchmark)

mb <- microbenchmark(
   pre = apply(df, 1, prettyR::Mode),
   cum = apply(df, 1, cumstats::Mode),
   so = apply(df, 1, Mode),
   times = 10
)
print(mb, unit = "relative", order = "median")




[1] https://stackoverflow.com/a/8189441/8245406
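
All three options above still call a function once per row. When the cells hold only a handful of distinct values (four letters in the original question), one alternative worth timing is to count each value across the whole matrix at once and then pick the largest count per row. The following is only a sketch under that assumption; the names m, vals, counts, row_mode and has_tie are made up for illustration, and it was not part of the benchmark in this thread.

```r
# Sketch only: assumes every cell holds one of a few distinct values.
set.seed(2020)
m <- matrix(sample(LETTERS[1:4], 20 * 6, replace = TRUE), nrow = 20)

vals <- sort(unique(as.vector(m)))

# One vectorised pass per distinct value: counts[i, j] is how often
# vals[j] occurs in row i.
counts <- vapply(vals, function(v) rowSums(m == v), numeric(nrow(m)))

# Column index of the largest count per row; ties go to the first
# value in sorted order (not the first value seen in the row).
best <- max.col(counts, ties.method = "first")
row_mode <- vals[best]

# Rows with more than one mode can be flagged separately.
top <- counts[cbind(seq_len(nrow(m)), best)]
has_tie <- rowSums(counts == top) > 1

head(data.frame(row_mode, has_tie))
```

Each comparison m == v allocates a logical matrix the same size as m, so for data as large as 1 000 000 x 1000 the rows would probably have to be processed in blocks. For rows without ties the result should agree with Mode(); for tied rows the two approaches may pick different values.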


Hope this helps,

Rui Barradas
