Luigi Marongiu
2020-Oct-31 08:56 UTC
[R] fast way to find most common value across columns dataframe
Hello,
I have a large dataframe (1 000 000 rows, 1000 columns) where the columns contain a character. I would like to determine the most common character for each row.
In the example below, I can parse one row at a time and find the most common character (apart from ties...). But I think this will be very slow and memory consuming.
Is there a way to run it more efficiently?
Thank you

```
V = c("A", "B", "C", "D")
df = data.frame(n = 1:10,
                col_01 = sample(V, 10, replace = TRUE, prob = NULL),
                col_02 = sample(V, 10, replace = TRUE, prob = NULL),
                col_03 = sample(V, 10, replace = TRUE, prob = NULL),
                col_04 = sample(V, 10, replace = TRUE, prob = NULL),
                col_05 = sample(V, 10, replace = TRUE, prob = NULL),
                stringsAsFactors = FALSE)

q = vector()
for(i in 1:nrow(df)) {
  x = as.vector(t(df[i,2:ncol(df)]))
  q[i] = names(which.max(table(x)))
}
df$most = q
```
Jim Lemon
2020-Oct-31 09:28 UTC
[R] fast way to find most common value across columns dataframe
Hi Luigi,
If I understand your request:

library(prettyR)
apply(as.matrix(df),1,Mode)
[1] "C" "B" "D" ">1 mode" ">1 mode" ">1 mode" "D"
[8] "C" "B" ">1 mode"

Jim
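One detail about the call above: `as.matrix(df)` also converts the `n` id column to character, so it takes part in the count. That should not change the result for the example data, but dropping it avoids the extra conversion. The following is a minimal base-R sketch under that assumption; `row_mode` is a hypothetical helper written for illustration, not prettyR's `Mode`, and the toy data frame is made up to keep the result deterministic.

```r
# Hypothetical helper: most frequent value in x, or a marker string when tied.
row_mode <- function(x, tie = ">1 mode") {
  counts <- table(x)
  top <- names(counts)[counts == max(counts)]
  if (length(top) > 1) tie else top
}

# Toy data in the same shape as the original post: id column plus character columns.
df <- data.frame(n = 1:3,
                 col_01 = c("A", "B", "C"),
                 col_02 = c("A", "C", "C"),
                 col_03 = c("B", "B", "D"),
                 col_04 = c("A", "C", "C"),
                 stringsAsFactors = FALSE)

# Drop the id column before converting to a matrix, then apply row-wise.
m <- as.matrix(df[, -1])
df$most <- apply(m, 1, row_mode)
```

This still calls an R function once per row, so it is likely to be slow on a million rows; it only illustrates excluding the id column and flagging ties.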
Bert Gunter
2020-Oct-31 16:40 UTC
[R] fast way to find most common value across columns dataframe
As usual, a web search ("find statistical mode in R") brought up something that is possibly useful. Did you try this before posting? If not, please do so in future and let us know what your results were if you subsequently post here.

Here's what SO suggested:

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# ergo:
apply(as.matrix(df),1,Mode)

Note that all the functionality in Mode is via .Internal functions. So you can determine whether this is faster than Jim's code for your use case, but I'm pretty sure it will be faster than yours.

However, note that this gives only the value of the *first* mode if there is more than one, while Jim's code alerts you to multiple modes.

Bert Gunter

"The trouble with having an open mind is that people keep coming along and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)
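When the set of possible values is small and known in advance, the per-row function call can be avoided altogether by counting each value across columns with `rowSums` and picking the largest count with `max.col`. The following is a sketch under those assumptions (the values in `V` are taken from the original post, the toy data is invented, and nothing here has been benchmarked against the `apply` versions).

```r
V <- c("A", "B", "C", "D")

# Deterministic toy data: id column plus four character columns.
df <- data.frame(n = 1:3,
                 col_01 = c("A", "B", "D"),
                 col_02 = c("A", "C", "D"),
                 col_03 = c("B", "B", "D"),
                 col_04 = c("A", "C", "C"),
                 stringsAsFactors = FALSE)

# Character matrix without the id column.
m <- as.matrix(df[, -1])

# counts[i, j] = number of times V[j] occurs in row i.
counts <- vapply(V, function(v) rowSums(m == v), numeric(nrow(m)))

# Column index of the largest count per row; "first" mirrors which.max().
idx <- max.col(counts, ties.method = "first")
df$most <- V[idx]

# Flag rows where more than one value reaches the maximum count.
row_max <- counts[cbind(seq_len(nrow(counts)), idx)]
df$tied <- rowSums(counts == row_max) > 1
```

Like the `Mode` function above, this reports the first mode, and the extra `tied` column marks rows with more than one. Each `m == v` comparison allocates a logical matrix as large as the data, so for 1 000 000 rows it would probably be necessary to process the rows in blocks.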