I have a fairly large data set with six variables set up like the following
dummy:
# Create fake data
df <- data.frame(code = c(rep(1001, 8), rep(1002, 8)),
year = rep(c(rep(2011, 4), rep(2012, 4)), 2),
fund = rep(c("10E", "10E",
"10E", "27E"), 4),
func = rep(c("110000", "122000",
"214000", "158000"), 4),
obj = rep("100", 16),
amount = round(rnorm(16, 50000, 10000)))
What I would like to do is sum the amount variable by code and year, filtering
rows using different wildcard searches in each of three columns: "1?E"
in fund, "1??????" in func, and "???" in obj. I'm OK
turning these into regular expressions:
# Set parameters
par.fund <- "10E"; par.func <- "100000"; par.obj <-
"000"
par.fund <- glob2rx(gsub("0", "?", par.fund))
par.func <- glob2rx(gsub("0", "?", par.func))
par.obj <- glob2rx(gsub("0", "?", par.obj))
The problem occurs when I try to apply multiple greps across columns. I'd
prefer to use data.table since it's so much faster than plyr and I have 159
different sets of parameters to run through, but I get the same error setting it
up either way:
# Doesn't work
library(data.table)
dt <- data.table(df)
eval(parse(text=paste(
"dt2 <- dt[", "grep('", par.fund, "',
fund) & ",
"grep('", par.func, "', func) & grep('",
par.obj, "', obj)",
", sum(amount), by=c('code', 'year')]" ,
sep="")))
# Warning message:
# In grep("^1.E$", fund) & grep("^1.....$", func) :
# longer object length is not a multiple of shorter object length
# Also doesn't work
library(plyr)
eval(parse(text=paste(
"df2 <- ddply(df[grep('", par.fund, "', df$fund)
& ",
"grep('", par.func, "', df$func) &
grep('", par.obj, "', df$obj), ]",
", .(code, year), summarize, amount = sum(amount))" ,
sep="")))
# Warning message:
# In grep("^1.E$", df$fund) & grep("^1.....$",
df$func) :
# longer object length is not a multiple of shorter object length
Clearly, the problem is how I'm trying to combine greps in subsetting rows,
but I haven't been able to find a solution that works. Any
thoughts-preferably something that works with data.table?
DB
Daniel Bush
School Finance Consultant
School Financial Services
Wisconsin Department of Public Instruction
PO Box 7841 | Madison, WI 53707-7841
daniel.bush -at- dpi.wi.gov | sfs.dpi.wi.gov
Ph: 608-267-9212 | Fax: 608-266-2840
[[alternative HTML version deleted]]
HI,
Not sure whether this helps.
If you take out the grep(",par.obj,..), it works without any warning.
eval(parse(text=paste(
? "dt2 <- dt[", "grep('", par.fund, "',
fund) & ",
? "grep('", par.func, "', func)",
? ", sum(amount), by=c('code', 'year')]" ,
sep="")))
?dt[grep('^1.E$',fund) &
grep('^1.....$',func),sum(amount),by=c('code','year')]
#?? code year???? V1
#1: 1001 2011 185482
#2: 1001 2012 189367
#3: 1002 2011 238098
#4: 1002 2012 211499
aggregate(amount~code+year,data=df,sum)
#? code year amount
#1 1001 2011 185482
#2 1002 2011 238098
#3 1001 2012 189367
#4 1002 2012 211499
In the df, you provided, there is only value of obj.
levels(df$obj)
#[1] "100"
A.K.
----- Original Message -----
From: "Bush, Daniel P. DPI" <Daniel.Bush at dpi.wi.gov>
To: "'r-help at r-project.org'" <r-help at
r-project.org>
Cc:
Sent: Thursday, March 14, 2013 5:43 PM
Subject: [R] Grep with wildcards across multiple columns
I have a fairly large data set with six variables set up like the following
dummy:
# Create fake data
df <- data.frame(code? = c(rep(1001, 8), rep(1002, 8)),
? ? ? ? ? ? ? ? year? = rep(c(rep(2011, 4), rep(2012, 4)), 2),
? ? ? ? ? ? ? ? fund? = rep(c("10E", "10E",
"10E", "27E"), 4),
? ? ? ? ? ? ? ? func? = rep(c("110000", "122000",
"214000", "158000"), 4),
? ? ? ? ? ? ? ? obj? ? = rep("100", 16),
? ? ? ? ? ? ? ? amount = round(rnorm(16, 50000, 10000)))
What I would like to do is sum the amount variable by code and year, filtering
rows using different wildcard searches in each of three columns: "1?E"
in fund, "1??????" in func, and "???" in obj. I'm OK
turning these into regular expressions:
# Set parameters
par.fund <- "10E"; par.func <- "100000"; par.obj <-
"000"
par.fund <- glob2rx(gsub("0", "?", par.fund))
par.func <- glob2rx(gsub("0", "?", par.func))
par.obj <- glob2rx(gsub("0", "?", par.obj))
The problem occurs when I try to apply multiple greps across columns. I'd
prefer to use data.table since it's so much faster than plyr and I have 159
different sets of parameters to run through, but I get the same error setting it
up either way:
# Doesn't work
library(data.table)
dt <- data.table(df)
eval(parse(text=paste(
? "dt2 <- dt[", "grep('", par.fund, "',
fund) & ",
? "grep('", par.func, "', func) & grep('",
par.obj, "', obj)",
? ", sum(amount), by=c('code', 'year')]" ,
sep="")))
# Warning message:
#? In grep("^1.E$", fund) & grep("^1.....$", func) :
#? longer object length is not a multiple of shorter object length
# Also doesn't work
library(plyr)
eval(parse(text=paste(
? "df2 <- ddply(df[grep('", par.fund, "', df$fund)
& ",
? "grep('", par.func, "', df$func) &
grep('", par.obj, "', df$obj), ]",
? ", .(code, year), summarize, amount = sum(amount))" ,
sep="")))
# Warning message:
#? In grep("^1.E$", df$fund) & grep("^1.....$",
df$func) :
#? longer object length is not a multiple of shorter object length
Clearly, the problem is how I'm trying to combine greps in subsetting rows,
but I haven't been able to find a solution that works. Any
thoughts-preferably something that works with data.table?
DB
Daniel Bush
School Finance Consultant
School Financial Services
Wisconsin Department of Public Instruction
PO Box 7841 | Madison, WI 53707-7841
daniel.bush -at- dpi.wi.gov | sfs.dpi.wi.gov
Ph: 608-267-9212 | Fax: 608-266-2840
??? [[alternative HTML version deleted]]
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
grep(pattern, textVector) returns of the integer indices of the elements
of textVector that match the pattern. E.g.,
> grep("T",
c("One","Two","Three","Four"))
[1] 2 3
The '&' operator naturally operates on logical vectors of the same
length
(If you give it numbers it silently converts 0 to FALSE and other numbers
to TRUE.)
The two don't fit together. You could use grepl(), which returns a logical
vector the length of textVector, as in
grepl(p1,v1) & grepl(p2,v2)
to figure which entries in the table have v1 matching p1 and v2 matching p2.
Or, you could use
intersect(grep(p1,v1), grep(p2,v2))
if you want to stick with integer indices.
By the way, the eval(parse(text=paste(...))) business is a good way to make
hard-to-read code and hard-to-read means hard-to-fix. Just write out the
expression.
> paste(
+ "dt2 <- dt[", "grep('", par.fund, "',
fund) & ",
+ "grep('", par.func, "', func) &
grep('", par.obj, "', obj)",
+ ", sum(amount), by=c('code', 'year')]" ,
sep="")
[1] "dt2 <- dt[grep('^1.E$', fund) &
grep('^1.....$', func) & grep('^...$', obj), sum(amount),
by=c('code', 'year')]"
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at
r-project.org] On Behalf
> Of Bush, Daniel P. DPI
> Sent: Thursday, March 14, 2013 2:43 PM
> To: 'r-help at r-project.org'
> Subject: [R] Grep with wildcards across multiple columns
>
> I have a fairly large data set with six variables set up like the following
dummy:
>
> # Create fake data
> df <- data.frame(code = c(rep(1001, 8), rep(1002, 8)),
> year = rep(c(rep(2011, 4), rep(2012, 4)), 2),
> fund = rep(c("10E", "10E",
"10E", "27E"), 4),
> func = rep(c("110000", "122000",
"214000", "158000"), 4),
> obj = rep("100", 16),
> amount = round(rnorm(16, 50000, 10000)))
>
> What I would like to do is sum the amount variable by code and year,
filtering rows using
> different wildcard searches in each of three columns: "1?E" in
fund, "1??????" in func,
> and "???" in obj. I'm OK turning these into regular
expressions:
>
> # Set parameters
> par.fund <- "10E"; par.func <- "100000"; par.obj
<- "000"
> par.fund <- glob2rx(gsub("0", "?", par.fund))
> par.func <- glob2rx(gsub("0", "?", par.func))
> par.obj <- glob2rx(gsub("0", "?", par.obj))
>
> The problem occurs when I try to apply multiple greps across columns.
I'd prefer to use
> data.table since it's so much faster than plyr and I have 159 different
sets of parameters
> to run through, but I get the same error setting it up either way:
>
> # Doesn't work
> library(data.table)
> dt <- data.table(df)
> eval(parse(text=paste(
> "dt2 <- dt[", "grep('", par.fund, "',
fund) & ",
> "grep('", par.func, "', func) &
grep('", par.obj, "', obj)",
> ", sum(amount), by=c('code', 'year')]" ,
sep="")))
> # Warning message:
> # In grep("^1.E$", fund) & grep("^1.....$", func)
:
> # longer object length is not a multiple of shorter object length
>
> # Also doesn't work
> library(plyr)
> eval(parse(text=paste(
> "df2 <- ddply(df[grep('", par.fund, "',
df$fund) & ",
> "grep('", par.func, "', df$func) &
grep('", par.obj, "', df$obj), ]",
> ", .(code, year), summarize, amount = sum(amount))" ,
sep="")))
> # Warning message:
> # In grep("^1.E$", df$fund) & grep("^1.....$",
df$func) :
> # longer object length is not a multiple of shorter object length
>
> Clearly, the problem is how I'm trying to combine greps in subsetting
rows, but I haven't
> been able to find a solution that works. Any thoughts-preferably something
that works
> with data.table?
>
> DB
>
> Daniel Bush
> School Finance Consultant
> School Financial Services
> Wisconsin Department of Public Instruction
> PO Box 7841 | Madison, WI 53707-7841
> daniel.bush -at- dpi.wi.gov | sfs.dpi.wi.gov
> Ph: 608-267-9212 | Fax: 608-266-2840
>
>
>
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.