Evan Cooch
2017-Jun-21 16:11 UTC
[R] selecting dataframe columns based on substring of col name(s)
Suppose I have the following sort of dataframe, where each column name has a common structure: prefix, followed by a number (for this example, col1, col2, col3 and col4): d = data.frame( col1=runif(10), col2=runif(10), col3=runif(10),col4=runif(10)) What I haven't been able to suss out is how to efficiently 'extract/manipulate/play with' columns from the data frame, making use of this common structure. Suppose, for example, I want to 'work with' col2, col3, and col4. Now, I could subset the dataframe d in any number of ways -- for example piece <- d[,c("col2","col3","col4")] Works as expected, but for *big* problems (where I might have dozens -> hundreds of columns -- often the case with big design matrices output by some linear models program or another), having to write them all out using c("col2","col3",...."colXXXXX") takes a lot of time. What I'm wondering about is if there is a way to simply select over the "changing part" of the column name (you can do this relatively easily in a data step in SAS, for example). Heuristically, something like: piece <- df[,col2:col4] where the heuristic col2:col4 is interpreted as col2 -> col4 (parse the prefix 'col', and then simply select over the changing suffic -- i.e., column number). Now, if I use the "to" function in the lessR package, I can get there from here fairly easily: piece <- d[,to("col",4,from=2,same.size=FALSE)] But, is there a better way? Beyond 'efficiency' (ease of implementation), part of what constitutes 'better' might be something in base R, rather than relying on a package? Thanks in advance...
Jeff Newmiller
2017-Jun-21 18:54 UTC
[R] selecting dataframe columns based on substring of col name(s)
d[ , paste( "col", 2:4 ) ] or d[ , sprintf( "col%d", 2:4 ) ] or d[ , grep( "^col[234]$", names( d ) ] Each approach has different ways of being flexible. -- Sent from my phone. Please excuse my brevity. On June 21, 2017 9:11:10 AM PDT, Evan Cooch <evan.cooch at gmail.com> wrote:>Suppose I have the following sort of dataframe, where each column name >has a common structure: prefix, followed by a number (for this example, > >col1, col2, col3 and col4): > > d = data.frame( col1=runif(10), col2=runif(10), >col3=runif(10),col4=runif(10)) > >What I haven't been able to suss out is how to efficiently >'extract/manipulate/play with' columns from the data frame, making use >of this common structure. > >Suppose, for example, I want to 'work with' col2, col3, and col4. Now, >I >could subset the dataframe d in any number of ways -- for example > >piece <- d[,c("col2","col3","col4")] > >Works as expected, but for *big* problems (where I might have dozens -> > >hundreds of columns -- often the case with big design matrices output >by >some linear models program or another), having to write them all out >using c("col2","col3",...."colXXXXX") takes a lot of time. What I'm >wondering about is if there is a way to simply select over the >"changing >part" of the column name (you can do this relatively easily in a data >step in SAS, for example). Heuristically, something like: > >piece <- df[,col2:col4] > >where the heuristic col2:col4 is interpreted as col2 -> col4 (parse the > >prefix 'col', and then simply select over the changing suffic -- i.e., >column number). > >Now, if I use the "to" function in the lessR package, I can get there >from here fairly easily: > >piece <- d[,to("col",4,from=2,same.size=FALSE)] > >But, is there a better way? Beyond 'efficiency' (ease of >implementation), part of what constitutes 'better' might be something >in >base R, rather than relying on a package? > >Thanks in advance... > >______________________________________________ >R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >https://stat.ethz.ch/mailman/listinfo/r-help >PLEASE do read the posting guide >http://www.R-project.org/posting-guide.html >and provide commented, minimal, self-contained, reproducible code.
Bert Gunter
2017-Jun-21 19:08 UTC
[R] selecting dataframe columns based on substring of col name(s)
Assume there 100 columns, named col1, col2,..., col100 in data frame d + maybe some more columns with various names preceding them. You want col21 to col72. nm <- names(d) d[, which(nm == "col21"): which(nm == "col72") ] ## NB : if all you have is col1 to col100 the d[, 23:72] works fine. See any good tutorial on R for how to index matrix like structures in R. Cheers, Bert Bert Gunter "The trouble with having an open mind is that people keep coming along and sticking things into it." -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip ) On Wed, Jun 21, 2017 at 9:11 AM, Evan Cooch <evan.cooch at gmail.com> wrote:> Suppose I have the following sort of dataframe, where each column name has a > common structure: prefix, followed by a number (for this example, col1, > col2, col3 and col4): > > d = data.frame( col1=runif(10), col2=runif(10), > col3=runif(10),col4=runif(10)) > > What I haven't been able to suss out is how to efficiently > 'extract/manipulate/play with' columns from the data frame, making use of > this common structure. > > Suppose, for example, I want to 'work with' col2, col3, and col4. Now, I > could subset the dataframe d in any number of ways -- for example > > piece <- d[,c("col2","col3","col4")] > > Works as expected, but for *big* problems (where I might have dozens -> > hundreds of columns -- often the case with big design matrices output by > some linear models program or another), having to write them all out using > c("col2","col3",...."colXXXXX") takes a lot of time. What I'm wondering > about is if there is a way to simply select over the "changing part" of the > column name (you can do this relatively easily in a data step in SAS, for > example). Heuristically, something like: > > piece <- df[,col2:col4] > > where the heuristic col2:col4 is interpreted as col2 -> col4 (parse the > prefix 'col', and then simply select over the changing suffic -- i.e., > column number). > > Now, if I use the "to" function in the lessR package, I can get there from > here fairly easily: > > piece <- d[,to("col",4,from=2,same.size=FALSE)] > > But, is there a better way? Beyond 'efficiency' (ease of implementation), > part of what constitutes 'better' might be something in base R, rather than > relying on a package? > > Thanks in advance... > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.
David Winsemius
2017-Jun-21 19:28 UTC
[R] selecting dataframe columns based on substring of col name(s)
> On Jun 21, 2017, at 9:11 AM, Evan Cooch <evan.cooch at gmail.com> wrote: > > Suppose I have the following sort of dataframe, where each column name has a common structure: prefix, followed by a number (for this example, col1, col2, col3 and col4): > > d = data.frame( col1=runif(10), col2=runif(10), col3=runif(10),col4=runif(10)) > > What I haven't been able to suss out is how to efficiently 'extract/manipulate/play with' columns from the data frame, making use of this common structure. > > Suppose, for example, I want to 'work with' col2, col3, and col4. Now, I could subset the dataframe d in any number of ways -- for example > > piece <- d[,c("col2","col3","col4")] > > Works as expected, but for *big* problems (where I might have dozens -> hundreds of columns -- often the case with big design matrices output by some linear models program or another), having to write them all out using c("col2","col3",...."colXXXXX") takes a lot of time. What I'm wondering about is if there is a way to simply select over the "changing part" of the column name (you can do this relatively easily in a data step in SAS, for example). Heuristically, something like: > > piece <- df[,col2:col4] > > where the heuristic col2:col4 is interpreted as col2 -> col4 (parse the prefix 'col', and then simply select over the changing suffic -- i.e., column number). > > Now, if I use the "to" function in the lessR package, I can get there from here fairly easily: > > piece <- d[,to("col",4,from=2,same.size=FALSE)] > > But, is there a better way? Beyond 'efficiency' (ease of implementation), part of what constitutes 'better' might be something in base R, rather than relying on a package?After staring at the code for the base function subset with a thought to hacking it to do this I realized that should be already part of the evaluation result from its current form: names(airquality) #[1] "Ozone" "Solar.R" "Wind" "Temp" "Month" "Day" subset(airquality, Temp > 90, # this is the row selection select = Ozone:Solar.R) # and this selects columns #-------- Ozone Solar.R 42 NA 259 43 NA 250 69 97 267 70 97 272 75 NA 291 102 NA 222 120 76 203 121 118 225 122 84 237 123 85 188 124 96 167 125 78 197 126 73 183 127 91 189 Bert's advice to work with the numbers is good, but conversion to numeric designations of columns inside the `select`-expression is actually what is occurring inside `subset`. -- David Winsemius Alameda, CA, USA
Evan Cooch
2017-Jun-22 12:47 UTC
[R] selecting dataframe columns based on substring of col name(s)
Thanks to all the good suggestions/solutions to the original problem. On 6/21/2017 3:28 PM, David Winsemius wrote:>> On Jun 21, 2017, at 9:11 AM, Evan Cooch <evan.cooch at gmail.com> wrote: >> >> Suppose I have the following sort of dataframe, where each column name has a common structure: prefix, followed by a number (for this example, col1, col2, col3 and col4): >> >> d = data.frame( col1=runif(10), col2=runif(10), col3=runif(10),col4=runif(10)) >> >> What I haven't been able to suss out is how to efficiently 'extract/manipulate/play with' columns from the data frame, making use of this common structure. >> >> Suppose, for example, I want to 'work with' col2, col3, and col4. Now, I could subset the dataframe d in any number of ways -- for example >> >> piece <- d[,c("col2","col3","col4")] >> >> Works as expected, but for *big* problems (where I might have dozens -> hundreds of columns -- often the case with big design matrices output by some linear models program or another), having to write them all out using c("col2","col3",...."colXXXXX") takes a lot of time. What I'm wondering about is if there is a way to simply select over the "changing part" of the column name (you can do this relatively easily in a data step in SAS, for example). Heuristically, something like: >> >> piece <- df[,col2:col4] >> >> where the heuristic col2:col4 is interpreted as col2 -> col4 (parse the prefix 'col', and then simply select over the changing suffic -- i.e., column number). >> >> Now, if I use the "to" function in the lessR package, I can get there from here fairly easily: >> >> piece <- d[,to("col",4,from=2,same.size=FALSE)] >> >> But, is there a better way? Beyond 'efficiency' (ease of implementation), part of what constitutes 'better' might be something in base R, rather than relying on a package? > After staring at the code for the base function subset with a thought to hacking it to do this I realized that should be already part of the evaluation result from its current form: > > names(airquality) > #[1] "Ozone" "Solar.R" "Wind" "Temp" "Month" "Day" > > subset(airquality, > Temp > 90, # this is the row selection > select = Ozone:Solar.R) # and this selects columns > #-------- > Ozone Solar.R > 42 NA 259 > 43 NA 250 > 69 97 267 > 70 97 272 > 75 NA 291 > 102 NA 222 > 120 76 203 > 121 118 225 > 122 84 237 > 123 85 188 > 124 96 167 > 125 78 197 > 126 73 183 > 127 91 189 > > Bert's advice to work with the numbers is good, but conversion to numeric designations of columns inside the `select`-expression is actually what is occurring inside `subset`. >[[alternative HTML version deleted]]