Hi, I'm looking for help extracting some information from the Zillow website. I'd like to do this for the general case, where I manually change the address by modifying the URL (see code below). With the URL containing the address, I'd like to be able to extract the same information each time. The specific information I'd like to extract includes the homedetails URL, the price (Zestimate), the number of beds, the number of baths, and the Sqft. All of this information is shown in a bubble on the web page.

I use the code below to try to do this, but it's not working. I know the information I'm interested in is there, because if I print out "doc", I see it all in one area. I've attached the relevant section of "doc" that shows and highlights all the information I'm interested in (note that either URL highlighted in doc is fine):
http://r.789695.n4.nabble.com/file/n3561075/relevant-section-of-doc.pdf

I'm guessing my XPath statements are wrong, or that getNodeSet needs something else to get at information contained in a bubble on a web page. Any suggestions or ideas would be GREATLY appreciated.

library(XML)
url <- "http://www.zillow.com/homes/511 W Lafayette St, Norristown, PA_rb"
doc <- htmlTreeParse(url, useInternalNodes=TRUE, isURL=TRUE)
f1 <- getNodeSet(doc, "//a[contains(@href,'homedetails')]")
f2 <- getNodeSet(doc, "//span[contains(@class,'price')]")
f3 <- getNodeSet(doc, "//LIST[@Beds]")
f4 <- getNodeSet(doc, "//LIST[@Baths]")
f5 <- getNodeSet(doc, "//LIST[@Sqft]")
g1 <- sapply(f1, xmlValue)
g2 <- sapply(f2, xmlValue)
g3 <- sapply(f3, xmlValue)
g4 <- sapply(f4, xmlValue)
g5 <- sapply(f5, xmlValue)
print(f1)

--
View this message in context: http://r.789695.n4.nabble.com/Need-help-reading-website-info-with-XML-package-and-XPath-tp3561075p3561075.html
Sent from the R help mailing list archive at Nabble.com.
Martin Morgan
2011-May-31 16:12 UTC
[R] Need help reading website info with XML package and XPath
On 05/30/2011 09:04 AM, eric wrote:
> Hi, I'm looking for help extracting some information from the Zillow website.
> I'd like to do this for the general case where I manually change the address
> by modifying the URL (see code below). With the URL containing the address,
> I'd like to be able to extract the same information each time. The specific
> information I'd like to extract includes the homedetails URL, the price
> (Zestimate), the number of beds, the number of baths, and the Sqft. All this
> information is shown in a bubble on the web page.
>
> I use the code below to try to do this, but it's not working. I know the
> information I'm interested in is there, because if I print out "doc", I see
> it all in one area. I've attached the relevant section of "doc" that shows
> and highlights all the information I'm interested in (note that either URL
> highlighted in doc is fine).
> http://r.789695.n4.nabble.com/file/n3561075/relevant-section-of-doc.pdf

Hi Eric -- the problem is that the highlighted text is not in the XML per se, but embedded in a comment. You can extract the text of the comment as

  getNodeSet(doc, 'string(//div[@id="resurrection-page-state"]/comment())')

You could go on to put some of that text into another XML document and use XPath on that, but... you're really 'screen scraping' here, which doesn't really showcase what XML is about. If you're trying to learn to use XML, then I'd suggest choosing a simpler example. If you're trying to corner the housing market (or whatever one does to housing markets), then you'll want to find a better data source.

Hope that helps,

Martin

> I'm guessing my XPath statements are wrong, or that getNodeSet needs
> something else to get at information contained in a bubble on a web page.
> Any suggestions or ideas would be GREATLY appreciated.
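[Editor's note: the suggestion above -- pull the comment's text out and re-parse it as a second HTML document -- can be sketched as follows. This is only a sketch, not a tested solution: it assumes the comment body is well-formed HTML, the div id "resurrection-page-state" is taken from the reply and Zillow's markup may since have changed, and the code needs a live network connection to run.]

```r
library(XML)

url <- "http://www.zillow.com/homes/511 W Lafayette St, Norristown, PA_rb"
doc <- htmlTreeParse(url, useInternalNodes=TRUE, isURL=TRUE)

## 1. XPath string() collapses the comment node to a plain character
##    string, so getNodeSet() returns the comment's text, not a node set.
txt <- getNodeSet(doc,
    'string(//div[@id="resurrection-page-state"]/comment())')

## 2. Re-parse that string as its own HTML document (asText=TRUE tells
##    htmlParse the first argument is content, not a file name or URL).
bubble <- htmlParse(txt, asText=TRUE)

## 3. Query the new document with the XPath patterns from the question.
hrefs <- xpathSApply(bubble,
    "//a[contains(@href,'homedetails')]", xmlGetAttr, "href")
price <- xpathSApply(bubble,
    "//span[contains(@class,'price')]", xmlValue)
```

The beds/baths/Sqft queries from the original post ("//LIST[@Beds]" etc.) would need the same treatment, with XPath adjusted to whatever elements actually appear inside the comment.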
> library(XML)
> url <- "http://www.zillow.com/homes/511 W Lafayette St, Norristown, PA_rb"
> doc <- htmlTreeParse(url, useInternalNodes=TRUE, isURL=TRUE)
> f1 <- getNodeSet(doc, "//a[contains(@href,'homedetails')]")
> f2 <- getNodeSet(doc, "//span[contains(@class,'price')]")
> f3 <- getNodeSet(doc, "//LIST[@Beds]")
> f4 <- getNodeSet(doc, "//LIST[@Baths]")
> f5 <- getNodeSet(doc, "//LIST[@Sqft]")
> g1 <- sapply(f1, xmlValue)
> g2 <- sapply(f2, xmlValue)
> g3 <- sapply(f3, xmlValue)
> g4 <- sapply(f4, xmlValue)
> g5 <- sapply(f5, xmlValue)
> print(f1)
>
> --
> View this message in context: http://r.789695.n4.nabble.com/Need-help-reading-website-info-with-XML-package-and-XPath-tp3561075p3561075.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793