Simon Kiss
2010-Oct-10 16:35 UTC
[R] Create single vector after looping through multiple data frames with GREP
Hello all,
I changed the subject line of the e-mail, because the question I'm
posing now is different from the first one. I hope that this is proper
etiquette. However, the original chain is included below.
I've incorporated bits of both Ethan and Brian's code into the script
below, but there's one aspect I can't get my head around. I'm
totally new to programming with control structures. The reproducible code below
creates a list containing 19 data frames, one each for the "Most Important
Problem" survey data for Canada.
What I'd like at this stage is a loop where I can search through all the
data frames for rows containing the search term and then bind the rows together
in a plotable (sp?) format.
At the bottom of the code below, you'll find my first attempt to make use of
a search string and to put it into a plotable format. It only partially works.
I can only get the numbers for one year, where I'd like to be able to get a
string of numbers for several years. But, on the upside, grep appears to do the
trick in terms of selecting rows.
Can anyone suggest a solution?
Yours truly,
Simon Kiss
#This is the reproducible code to set-up all the data frames
require("XML")
library(XML)
#This gets the data from the web and lists them
mylist <- paste ("http://www.queensu.ca/cora/_trends/mip_",
c(1987:2001,2003:2006), ".htm", sep="")
alltables <- lapply(mylist, readHTMLTable)
#convert to dataframes
r<-lapply(alltables, function(x) {as.data.frame(x)} )
#This is just some house-cleaning; structuring all the tables so they are uniform
r[[1]][3]<-r[[1]][2]
r[[1]][2]<-c(" ")
r[[2]][4]<-r[[2]][2]
r[[2]][5]<-r[[2]][3]
r[[2]][2:3]<-c(" ")
r[[3]][4:5]<-r[[3]][3:4]
r[[3]][3]<-c(" ")
#This loop deletes some superfluous columns and rows, turns the first column
#into character strings and the data into numeric
for (i in 1:19) {
  n.rows<-dim(r[[i]])[1]
  #note: 15:n.rows-3 parses as (15:n.rows)-3, i.e. rows 12 to n.rows-3
  r[[i]] <- r[[i]][(15:n.rows)-3, 1:5]
  n.rows<-dim(r[[i]])[1]
  row.names(r[[i]]) <-NULL
  names(r[[i]]) <- c("Response", "Q1", "Q2", "Q3", "Q4")
  r[[i]][, 1]<-as.character(r[[i]][,1])
  #r[[i]][,2:5]<-as.numeric(as.character(r[[i]][,2:5]))
  r[[i]][, 2:5]<-lapply(r[[i]][, 2:5], function(x) {as.numeric(as.character(x))})
  #n.rows<-dim(r[[i]])[1]
  #r[[i]]<-r[[i]][9
}
#This code is my first attempt at introducing a search string, getting the rows,
#binding and plotting;
economy<-r[[10]][grep('Economy', r[[10]][,1]),]
economy_2<-r[[11]][grep('Economy', r[[11]][,1]),]
test<-cbind(economy, economy_2)
plot(as.numeric(test), type='l')
#here's another attempt I'm trying....
economy<-data.frame
for (i in 15:19) {
economy[i,] <-r[[i]][grep('Economy', r[[i]][,1]), ]
}
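
One possible way to collect the matching rows across several years, sketched on made-up data. The `mock` list and its values are hypothetical stand-ins for `r`, not the scraped survey numbers, and `pieces` and `economy_series` are names introduced only for this illustration. The second attempt above stops at the first assignment because `economy<-data.frame` stores the function itself rather than an empty data frame; building the pieces with `lapply` and binding them once sidesteps indexing into an object that does not exist yet.

```r
# Hypothetical stand-in for the list `r`: three small data frames with the
# same columns as the cleaned tables (values are invented for illustration).
mock <- list(
  data.frame(Response = c("Economy", "Health"), Q1 = c(10, 5), Q2 = c(11, 6),
             Q3 = c(12, 7), Q4 = c(13, 8), stringsAsFactors = FALSE),
  data.frame(Response = c("Health", "Economy"), Q1 = c(4, 20), Q2 = c(3, 21),
             Q3 = c(2, 22), Q4 = c(1, 23), stringsAsFactors = FALSE),
  data.frame(Response = c("Economy", "Taxes"), Q1 = c(30, 9), Q2 = c(31, 9),
             Q3 = c(32, 9), Q4 = c(33, 9), stringsAsFactors = FALSE)
)

# Pull the matching row(s) out of each data frame, then bind them once.
pieces <- lapply(mock, function(df) df[grep("Economy", df$Response), ])
economy <- do.call(rbind, pieces)

# Flatten row by row so the quarters run in time order: Q1..Q4 of the first
# table, then Q1..Q4 of the second, and so on.
economy_series <- as.vector(t(as.matrix(economy[, 2:5])))
```

With the real list, `mock` would be replaced by something like `r[15:19]`, and `plot(economy_series, type='l')` should then draw one line across all selected years, provided each table has exactly one matching row.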
Begin forwarded message:
> From: Simon Kiss <sjkiss at gmail.com>
> Date: October 7, 2010 4:59:46 PM EDT
> To: Simon Kiss <simonjkiss at yahoo.ca>
> Subject: Fwd: [R] Converting scraped data
>
>
>
> Begin forwarded message:
>
>> From: Ethan Brown <ethancbrown at gmail.com>
>> Date: October 6, 2010 4:22:41 PM GMT-04:00
>> To: Simon Kiss <sjkiss at gmail.com>
>> Cc: r-help at r-project.org
>> Subject: Re: [R] Converting scraped data
>>
>> Hi Simon,
>>
>> You'll notice the "test" data.frame has a whole mix of characters in
>> the columns you're interested in, including a "-" for missing values, and
>> that the columns you're interested in are in fact factors.
>>
>> as.numeric(factor) returns the internal integer codes of the factor, not
>> the values of its levels. (See ?levels and ?factor.) That's why it's giving
>> you those irrelevant integers. I always end up using something like this
>> handy code snippet to deal with the situation:
>>
>> unfactor <- function(factors)
>> # From http://psychlab2.ucr.edu/rwiki/index.php/R_Code_Snippets#unfactor
>> # Transform a factor back into its factor names
>> {
>>   return(levels(factors)[factors])
>> }
>>
>> Then, to get your data to where you want it, I'd do this:
>>
>> require(XML)
>> theurl <- "http://www.queensu.ca/cora/_trends/mip_2006.htm"
>> tables <- readHTMLTable(theurl)
>> n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
>> class(tables)
>> test<-data.frame(tables, stringsAsFactors=FALSE)
>>
>>
>> result <- test[11:42, 1:5] #Extract the actual data we want
>> names(result) <- c("Response", "Q1", "Q2", "Q3", "Q4")
>> for(i in 2:5) {
>>   # Convert factor columns to numeric
>>   result[,i] <- as.numeric(unfactor(result[,i]))
>> }
>> result
>>
>> From here you should be able to plot or do whatever else you want.
>>
>> Hope this helps,
>> Ethan Brown
>>
>>
>> On Wed, Oct 6, 2010 at 9:52 AM, Simon Kiss <sjkiss at gmail.com> wrote:
>>> Dear Colleagues,
>>> I used this code to scrape data from the URL contained within. This code
>>> should be reproducible.
>>>
>>> require("XML")
>>> library(XML)
>>> theurl <- "http://www.queensu.ca/cora/_trends/mip_2006.htm"
>>> tables <- readHTMLTable(theurl)
>>> n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
>>> class(tables)
>>> test<-data.frame(tables, stringsAsFactors=FALSE)
>>> test[16,c(2:5)]
>>> as.numeric(test[16,c(2:5)])
>>> quartz()
>>> plot(c(1:4), test[15, c(2:5)])
>>>
>>> Calling the values from the row of interest using test[16, c(2:5)] can bring
>>> them up as represented on the screen, but plotting them or coercing them to
>>> numeric changes the values in a way that doesn't make sense to me. My
>>> intuition is that there is something going on with the way the characters
>>> are coded or classed when they're scraped into R. I've looked around the
>>> help files for converting from character to numeric but can't find a
>>> solution.
>>>
>>> I also tried this:
>>>
>>> as.numeric(as.character(test[16,c(2:5)])) and that also changed the values
>>> from what they originally were.
>>>
>>> I'm grateful for any suggestions.
>>> Yours, Simon Kiss
>>>
>>>
>>>
>>> *********************************
>>> Simon J. Kiss, PhD
>>> Assistant Professor, Wilfrid Laurier University
>>> 73 George Street
>>> Brantford, Ontario, Canada
>>> N3T 2C9
>>> Cell: +1 519 761 7606
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
*********************************
Simon J. Kiss, PhD
Assistant Professor, Wilfrid Laurier University
73 George Street
Brantford, Ontario, Canada
N3T 2C9
Cell: +1 519 761 7606
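
The factor pitfall described in Ethan's quoted reply above can be reproduced in a few lines. This is only a sketch on invented values: `f`, `codes` and `values` are hypothetical names, and "-" stands in for the missing-value marker found in the scraped tables.

```r
# A factor whose labels look like numbers, with "-" marking a missing value
# (invented values, not the survey data).
f <- factor(c("45", "12", "-", "30"))

# Coercing the factor directly yields the internal level codes (1..4 here),
# which have nothing to do with the printed labels.
codes <- as.integer(f)

# Going through character first recovers the labels as numbers; "-" cannot
# be parsed, so it becomes NA (the coercion warning is silenced on purpose).
values <- suppressWarnings(as.numeric(as.character(f)))
```

The `lapply(..., function(x) {as.numeric(as.character(x))})` line in the clean-up loop relies on the same conversion, applied one column at a time.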
Michael Bedward
2010-Oct-11 05:19 UTC
[R] Create single vector after looping through multiple data frames with GREP
Hi Simon,
The function below should do it or at least get you started...
getPlotData <- function (datalist, response, times)
{
  qdata <- sapply(datalist[times],
    function(df) {
      irow <- grepl(response, df$Response)
      df[irow, 2:5]
    }
  )
  # qdata is a matrix with rows Q1:Q4 and cols for times;
  # we turn it into a two col matrix with col 1 = time index
  # and col 2 = value
  time.index <- seq(4 * ncol(qdata))
  out <- cbind(time.index, as.numeric(qdata))
  rownames(out) <- paste(time.index, rownames(qdata), sep=".")
  colnames(out) <- c("time", response)
  out
}
#Example, get data for times 10:15 where Response contains "Economy"
x <- getPlotData(r, "Economy", 10:15)
Michael
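
The function above appears to assume exactly one matching row per data frame. A possible variant that tolerates zero or several matches is sketched below on invented data. `getSeries` and `demo` are hypothetical names, not part of the thread's code, and `fixed = TRUE` is a choice made here to treat the search term as a literal string rather than a regular expression.

```r
# Hypothetical variant of the helper above: returns a plain numeric vector,
# padding with NA when a table has no matching row and keeping only the
# first match when there are several.
getSeries <- function(datalist, response, times) {
  per_table <- lapply(datalist[times], function(df) {
    hit <- which(grepl(response, df$Response, fixed = TRUE))
    if (length(hit) == 0) {
      return(rep(NA_real_, 4))           # no match: four missing quarters
    }
    as.numeric(unlist(df[hit[1], 2:5]))  # first match only
  })
  unlist(per_table)
}

# Invented example data, not the survey figures.
demo <- list(
  data.frame(Response = c("Economy", "Economy/Jobs"), Q1 = c(1, 9),
             Q2 = c(2, 9), Q3 = c(3, 9), Q4 = c(4, 9),
             stringsAsFactors = FALSE),
  data.frame(Response = "Health", Q1 = 5, Q2 = 6, Q3 = 7, Q4 = 8,
             stringsAsFactors = FALSE)
)
series <- getSeries(demo, "Economy", 1:2)
```

Here `series` holds the four quarters of the first table followed by four `NA`s for the second, so a line plot would show a gap for the year with no match instead of failing.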