Hi, I'm looking for help extracting some information from the Zillow website. I'd like to do this for the general case, where I manually change the address by modifying the URL (see code below). With the URL containing the address, I'd like to be able to extract the same information each time. The specific information I'd like to extract includes the homedetails URL, the price (Zestimate), the number of beds, the number of baths, and the Sqft. All of this information is shown in a bubble on the web page.

I use the code below to try to do this, but it's not working. I know the information I'm interested in is there, because if I print out "doc", I see it all in one area. I've attached the relevant section of "doc" that shows and highlights all the information I'm interested in (note that either URL highlighted in doc is fine):
http://r.789695.n4.nabble.com/file/n3561075/relevant-section-of-doc.pdf

I'm guessing my XPath statements are wrong, or that getNodeSet needs something else to get at information contained in a bubble on a web page. Any suggestions or ideas would be GREATLY appreciated.

library(XML)
url <- "http://www.zillow.com/homes/511 W Lafayette St, Norristown, PA_rb"
doc <- htmlTreeParse(url, useInternalNodes=TRUE, isURL=TRUE)
f1 <- getNodeSet(doc, "//a[contains(@href,'homedetails')]")
f2 <- getNodeSet(doc, "//span[contains(@class,'price')]")
f3 <- getNodeSet(doc, "//LIST[@Beds]")
f4 <- getNodeSet(doc, "//LIST[@Baths]")
f5 <- getNodeSet(doc, "//LIST[@Sqft]")
g1 <- sapply(f1, xmlValue)
g2 <- sapply(f2, xmlValue)
g3 <- sapply(f3, xmlValue)
g4 <- sapply(f4, xmlValue)
g5 <- sapply(f5, xmlValue)
print(f1)

--
View this message in context: http://r.789695.n4.nabble.com/Need-help-reading-website-info-with-XML-package-and-XPath-tp3561075p3561075.html
Sent from the R help mailing list archive at Nabble.com.
Martin Morgan
2011-May-31 16:12 UTC
[R] Need help reading website info with XML package and XPath
On 05/30/2011 09:04 AM, eric wrote:
> Hi, I'm looking for help extracting some information from the Zillow website.
> I'd like to do this for the general case where I manually change the address
> by modifying the URL (see code below). With the URL containing the address,
> I'd like to be able to extract the same information each time. The specific
> information I'd like to extract includes the homedetails URL, the price
> (Zestimate), the number of beds, the number of baths, and the Sqft. All this
> information is shown in a bubble on the web page.
>
> I use the code below to try to do this, but it's not working. I know the
> information I'm interested in is there, because if I print out "doc", I see
> it all in one area. I've attached the relevant section of "doc" that shows
> and highlights all the information I'm interested in (note that either URL
> highlighted in doc is fine).
> http://r.789695.n4.nabble.com/file/n3561075/relevant-section-of-doc.pdf

Hi Eric -- the problem is that the highlighted text is not in the XML per se, but embedded in a comment. You can extract the text of the comment as

  getNodeSet(doc, 'string(//div[@id="resurrection-page-state"]/comment())')

You could go on to put some of that text into another XML document and use XPath on that, but... you're really 'screen scraping' here, which doesn't really showcase what XML is about. If you're trying to learn to use XML, then I'd suggest choosing a simpler example. If you're trying to corner the housing market (or whatever one does to housing markets), then you'll want to find a better data source.

Hope that helps,

Martin

> I'm guessing my XPath statements are wrong, or that getNodeSet needs
> something else to get at information contained in a bubble on a web page.
> Any suggestions or ideas would be GREATLY appreciated.
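[Editor's note: the suggestion above -- pull the comment's text out and re-parse it as a second HTML document -- can be sketched as follows. This is only a sketch, not a tested solution: it assumes the comment body is well-formed HTML, the div id "resurrection-page-state" is taken from the reply and Zillow's markup may since have changed, and the code needs a live network connection to run.]

```r
library(XML)

url <- "http://www.zillow.com/homes/511 W Lafayette St, Norristown, PA_rb"
doc <- htmlTreeParse(url, useInternalNodes=TRUE, isURL=TRUE)

## 1. XPath string() collapses the comment node to a plain character
##    string, so getNodeSet() returns the comment's text, not a node set.
txt <- getNodeSet(doc,
    'string(//div[@id="resurrection-page-state"]/comment())')

## 2. Re-parse that string as its own HTML document (asText=TRUE tells
##    htmlParse the first argument is content, not a file name or URL).
bubble <- htmlParse(txt, asText=TRUE)

## 3. Query the new document with the XPath patterns from the question.
hrefs <- xpathSApply(bubble,
    "//a[contains(@href,'homedetails')]", xmlGetAttr, "href")
price <- xpathSApply(bubble,
    "//span[contains(@class,'price')]", xmlValue)
```

The beds/baths/Sqft queries from the original post ("//LIST[@Beds]" etc.) would need the same treatment, with XPath adjusted to whatever elements actually appear inside the comment.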
> library(XML)
> url <- "http://www.zillow.com/homes/511 W Lafayette St, Norristown, PA_rb"
> doc <- htmlTreeParse(url, useInternalNodes=TRUE, isURL=TRUE)
> f1 <- getNodeSet(doc, "//a[contains(@href,'homedetails')]")
> f2 <- getNodeSet(doc, "//span[contains(@class,'price')]")
> f3 <- getNodeSet(doc, "//LIST[@Beds]")
> f4 <- getNodeSet(doc, "//LIST[@Baths]")
> f5 <- getNodeSet(doc, "//LIST[@Sqft]")
> g1 <- sapply(f1, xmlValue)
> g2 <- sapply(f2, xmlValue)
> g3 <- sapply(f3, xmlValue)
> g4 <- sapply(f4, xmlValue)
> g5 <- sapply(f5, xmlValue)
> print(f1)
>
> --
> View this message in context: http://r.789695.n4.nabble.com/Need-help-reading-website-info-with-XML-package-and-XPath-tp3561075p3561075.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793