Simon Kiss
2010-Oct-10 16:35 UTC
[R] Create single vector after looping through multiple data frames with GREP
Hello all,

I changed the subject line of the e-mail because the question I'm posing now is different from the first one. I hope that this is proper etiquette. However, the original chain is included below.

I've incorporated bits of both Ethan and Brian's code into the script below, but there's one aspect I can't get my head around. I'm totally new to programming with control structures. The reproducible code below creates a list containing 19 data frames, one each for the "Most Important Problem" survey data for Canada.

What I'd like at this stage is a loop where I can search through all the data frames for rows containing the search term and then bind the rows together in a plottable format.

At the bottom of the code below, you'll find my first attempt to make use of a search string and to put it into a plottable format. It only partially works. I can only get the numbers for one year, where I'd like to be able to get a string of numbers for several years. But, on the upside, grep appears to do the trick in terms of selecting rows.

Can anyone suggest a solution?
Yours truly,
Simon Kiss

#This is the reproducible code to set-up all the data frames
require("XML")
library(XML)
#This gets the data from the web and lists them
mylist <- paste ("http://www.queensu.ca/cora/_trends/mip_",
c(1987:2001,2003:2006), ".htm", sep="")
alltables <- lapply(mylist, readHTMLTable)

#convert to dataframes
r<-lapply(alltables, function(x) {as.data.frame(x)} )

#This is just some house-cleaning; structuring all the tables so they are uniform
r[[1]][3]<-r[[1]][2]
r[[1]][2]<-c(" ")
r[[2]][4]<-r[[2]][2]
r[[2]][5]<-r[[2]][3]
r[[2]][2:3]<-c(" ")
r[[3]][4:5]<-r[[3]][3:4]
r[[3]][3]<-c(" ")

#This loop deletes some superfluous columns and rows, turns the first column in to character strings and the data into numeric
for (i in 1:19) {
  n.rows<-dim(r[[i]])[1]
  r[[i]] <- r[[i]][15:n.rows-3, 1:5]
  n.rows<-dim(r[[i]])[1]
  row.names(r[[i]]) <-NULL
  names(r[[i]]) <- c("Response", "Q1", "Q2", "Q3", "Q4")

  r[[i]][, 1]<-as.character(r[[i]][,1])
  #r[[i]][,2:5]<-as.numeric(as.character(r[[i]][,2:5]))
  r[[i]][, 2:5]<-lapply(r[[i]][, 2:5], function(x) {as.numeric(as.character(x))})
  #n.rows<-dim(r[[i]])[1]
  #r[[i]]<-r[[i]][9
}

#This code is my first attempt at introducing a search string, getting the rows, binding and plotting;
economy<-r[[10]][grep('Economy', r[[10]][,1]),]
economy_2<-r[[11]][grep('Economy', r[[11]][,1]),]
test<-cbind(economy, economy_2)
plot(as.numeric(test), type='l')

#here's another attempt I'm trying....
economy<-data.frame
for (i in 15:19) {
  economy[i,] <-r[[i]][grep('Economy', r[[i]][,1]), ]
}

Begin forwarded message:

> From: Simon Kiss <sjkiss at gmail.com>
> Date: October 7, 2010 4:59:46 PM EDT
> To: Simon Kiss <simonjkiss at yahoo.ca>
> Subject: Fwd: [R] Converting scraped data
>
> Begin forwarded message:
>
>> From: Ethan Brown <ethancbrown at gmail.com>
>> Date: October 6, 2010 4:22:41 PM GMT-04:00
>> To: Simon Kiss <sjkiss at gmail.com>
>> Cc: r-help at r-project.org
>> Subject: Re: [R] Converting scraped data
>>
>> Hi Simon,
>>
>> You'll notice the "test" data.frame has a whole mix of characters in
>> the columns you're interested in, including a "-" for missing values, and
>> that the columns you're interested in are in fact factors.
>>
>> as.numeric(factor) returns the level of the factor, not the value of
>> the level. (See ?levels and ?factor)--that's why it's giving you those
>> irrelevant integers. I always end up using something like this handy
>> code snippet to deal with the situation:
>>
>> unfactor <- function(factors)
>> # From http://psychlab2.ucr.edu/rwiki/index.php/R_Code_Snippets#unfactor
>> # Transform a factor back into its factor names
>> {
>>   return(levels(factors)[factors])
>> }
>>
>> Then, to get your data to where you want it, I'd do this:
>>
>> require(XML)
>> theurl <- "http://www.queensu.ca/cora/_trends/mip_2006.htm"
>> tables <- readHTMLTable(theurl)
>> n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
>> class(tables)
>> test<-data.frame(tables, stringsAsFactors=FALSE)
>>
>> result <- test[11:42, 1:5] #Extract the actual data we want
>> names(result) <- c("Response", "Q1", "Q2","Q3","Q4")
>> for(i in 2:5) {
>>   # Convert factor columns to numeric
>>   result[,i] <- as.numeric(unfactor(result[,i]))
>> }
>> result
>>
>> From here you should be able to plot or do whatever else you want.
>>
>> Hope this helps,
>> Ethan Brown
>>
>> On Wed, Oct 6, 2010 at 9:52 AM, Simon Kiss <sjkiss at gmail.com> wrote:
>>> Dear Colleagues,
>>> I used this code to scrape data from the URL contained within. This code
>>> should be reproducible.
>>>
>>> require("XML")
>>> library(XML)
>>> theurl <- "http://www.queensu.ca/cora/_trends/mip_2006.htm"
>>> tables <- readHTMLTable(theurl)
>>> n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
>>> class(tables)
>>> test<-data.frame(tables, stringsAsFactors=FALSE)
>>> test[16,c(2:5)]
>>> as.numeric(test[16,c(2:5)])
>>> quartz()
>>> plot(c(1:4), test[15, c(2:5)])
>>>
>>> Calling the values from the row of interest using test[16, c(2:5)] brings
>>> them up as represented on the screen, but plotting them or coercing them to
>>> numeric changes the values in a way that doesn't make sense to me. My
>>> intuition is that there is something going on with the way the characters
>>> are coded or classed when they're scraped into R. I've looked around the
>>> help files for converting from character to numeric but can't find a
>>> solution.
>>>
>>> I also tried this:
>>>
>>> as.numeric(as.character(test[16,c(2:5)]))
>>>
>>> and that also changed the values from what they originally were.
>>>
>>> I'm grateful for any suggestions.
>>> Yours, Simon Kiss
>>>
>>> *********************************
>>> Simon J. Kiss, PhD
>>> Assistant Professor, Wilfrid Laurier University
>>> 73 George Street
>>> Brantford, Ontario, Canada
>>> N3T 2C9
>>> Cell: +1 519 761 7606
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
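
Two R behaviours that come up in the code in this message can be checked in isolation. The sketch below uses made-up toy values, not the survey data. The first part illustrates Ethan's point that as.numeric() on a factor returns level codes rather than the printed values. The second part shows how an expression of the form 15:n.rows-3 is parsed; the post does not say which rows were intended, so this only demonstrates the precedence rule and is not a correction.

```r
# Toy factor standing in for a scraped column (hypothetical values).
f <- factor(c("10", "25", "-", "7"))

# as.numeric() on a factor returns the internal level codes, not the labels.
codes <- as.numeric(f)

# Going through character recovers the printed values; "-" becomes NA.
# as.numeric() would warn about the "-" entry, so the warning is silenced.
vals <- suppressWarnings(as.numeric(as.character(f)))

# Same result via the levels, as the unfactor() helper does.
vals2 <- suppressWarnings(as.numeric(levels(f)[f]))

# Operator precedence: ":" binds tighter than binary "-".
n.rows <- 20                 # hypothetical row count
a <- 15:n.rows - 3           # parsed as (15:n.rows) - 3, i.e. 12 to 17
b <- 15:(n.rows - 3)         # 15 to 17
```

If rows 15 up to the third-from-last were intended, the parenthesised form is the one that expresses it; if rows 12 onward were intended, the original form already does that.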
Michael Bedward
2010-Oct-11 05:19 UTC
[R] Create single vector after looping through multiple data frames with GREP
Hi Simon,

The function below should do it or at least get you started...

getPlotData <- function (datalist, response, times) {
  qdata <- sapply(datalist[times],
    function(df) {
      irow <- grepl(response, df$Response)
      df[irow, 2:5]
    }
  )

  # qdata is a matrix with rows Q1:Q4 and cols for times;
  # we turn it into a two col matrix with col 1 = time index
  # and col 2 = value
  time.index <- seq(4 * ncol(qdata))
  out <- cbind(time.index, as.numeric(qdata))
  rownames(out) <- paste(time.index, rownames(qdata), sep=".")
  colnames(out) <- c("time", response)
  out
}

#Example, get data for times 10:15 where Response contains "Economy"
x <- getPlotData(r, "Economy", 10:15)

Michael
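
For comparison with getPlotData above, here is a minimal sketch of the same idea written as "collect the matching rows, then bind them". It is an alternative illustration, not the method from the thread. The list `toy`, the helper `make_year`, the function `collect_rows` and the `years` labels are all hypothetical stand-ins so the example runs without the web data; it assumes each data frame has a character column `Response` and numeric columns Q1 to Q4, as in Simon's cleaned list `r`.

```r
# Hypothetical stand-in for the list `r`: one small data frame per year.
make_year <- function(offset) {
  data.frame(
    Response = c("Economy", "Health care", "Unemployment"),
    Q1 = c(10, 20, 30) + offset,
    Q2 = c(11, 21, 31) + offset,
    Q3 = c(12, 22, 32) + offset,
    Q4 = c(13, 23, 33) + offset,
    stringsAsFactors = FALSE
  )
}
toy   <- list(make_year(0), make_year(100), make_year(200))
years <- c(2004, 2005, 2006)   # hypothetical labels for the list elements

# Pull the rows whose Response matches `pattern` from every data frame
# and stack them into one data frame with a Year column.
collect_rows <- function(datalist, pattern, years) {
  rows <- lapply(seq_along(datalist), function(i) {
    df  <- datalist[[i]]
    hit <- df[grepl(pattern, df$Response), , drop = FALSE]
    if (nrow(hit) == 0) return(NULL)   # no match in this year
    cbind(Year = years[i], hit)
  })
  rows <- Filter(Negate(is.null), rows)  # drop years without a match
  do.call(rbind, rows)
}

economy <- collect_rows(toy, "Economy", years)

# Flatten to one value per quarter in time order: transpose first so
# that the quarters of each year stay together.
series <- as.vector(t(as.matrix(economy[, c("Q1", "Q2", "Q3", "Q4")])))
```

With the real data, plot(series, type = "l") would then draw the quarterly values as a single line. Building a list with lapply and binding once at the end avoids assigning into an uninitialised object inside a for loop, which is where the economy[i,] attempt in the original post runs into trouble.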