search for: htmltreeparse

Displaying 20 results from an estimated 34 matches for "htmltreeparse".

2007 Nov 18
4
Read HTML table
You can use htmlTreeParse and xpathApply from the XML library. something like: xpathApply( htmlTreeParse("http://blabla", useInt=T), "//td", function(x) xmlValue(x)) should do it. Gamma wrote: > > anyone care to explain how to read a html table, it's streaming data > (updated every sec...
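A fuller sketch of the xpathApply approach suggested above; the URL is a placeholder, and the XML package must be installed from CRAN:

```r
# Sketch only: parse a page and pull out table cells.
# "http://example.com/table.html" is a placeholder URL.
library(XML)

doc <- htmlTreeParse("http://example.com/table.html",
                     useInternalNodes = TRUE)

# Text of every <td> cell, via XPath
cells <- xpathSApply(doc, "//td", xmlValue)

# readHTMLTable() goes one step further and returns data frames,
# one per <table> element on the page
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)
```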
2012 Jun 08
0
XML htmlTreeParse fails with no obvious error
Hi all, Sorry for the rather uninformative subject, but the error I get is not very informative either. When using the XML and RCurl package to retrieve the content of an html page, htmlTreeParse fails, printing out the beginning of the HTML: Error in htmlTreeParse(getURL(url)) : File <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml&quo...
2011 Aug 25
1
R hangs after htmlTreeParse
...2&maxrow=10&startdate=2001-01-01&enddate=2011-08-25&article=2&pagenumber=1&isphrase=no&query=IIM&searchfield=&section=&kdaterange=30&date1mm=01&date1dd=01&date1yyyy=2001&date2mm=08&date2dd=25&date2yyyy=2011") .x<-getURL(myurl) htmlTreeParse(.x, asText=T) This prints approximately 15 lines of the output from the html document and then mysteriously stops. The command line prompt does not reappear and force quit is the only option. I'm running R 2.13 on Mac OS X 10.6 and the latest versions of XML and RCurl are installed. Yours, Simo...
2008 Nov 04
2
How to suppress errors from htmlTreeParse() function in XML package?
...is just letting me know that the html code is malformed, but for my purposes I can ignore that output. Is there a way to achieve this? ### Example: library(RCurl); library(XML) doc <- getURL('http://www.google.co.uk/search?q=%22R%20Project %22&as_qdr=d1&num=100') html.tree <- htmlTreeParse(doc, useInternalNodes = TRUE) ### Output - this is what I would like to suppress Tag nobr invalid htmlParseEntityRef: expecting ';' htmlParseEntityRef: expecting ';' ### etc. I attempted to use try(expr, silent=TRUE) but that didn't work for me: > try(htmlTreeParse(doc, us...
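The messages in question come from libxml2 and are printed outside R's condition system, which is why try() and suppressWarnings() have no effect. A hedged sketch of the usual workaround, passing a no-op error handler:

```r
# Sketch: a no-op error handler silences libxml2's parser chatter.
# `doc` is assumed to hold the HTML source, as in the post above.
library(XML)

html.tree <- htmlTreeParse(doc, useInternalNodes = TRUE,
                           error = function(...) {})
```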
2009 Dec 31
3
XML and RCurl: problem with encoding (htmlTreeParse)
Hi, I'm trying to get data from web page and modify it in R. I have a problem with encoding. I'm not able to get encoding right in htmlTreeParse command. See below > library(RCurl) > library(XML) > > site <- getURL("http://www.aarresaari.net/jobboard/jobs.html") > txt <- readLines(tc <- textConnection(site)); close(tc) > txt <- htmlTreeParse(txt, error=function(...){}, useInternalNodes = TRUE) > &...
2010 Mar 15
1
XML: Slower parsing over time with htmlTreeParse()
...eader of my previous post! >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Dear List, has anyone of you experienced a significant increase in the time it takes to parse an URL via "htmlTreeParse()" when this function is called repeatedly every minute over a couple of hours? Initially, a single parse takes about 0.5 seconds on my machine (Quad Core, 2.67 GHz, 8 MB RAM, Windows 7 64 Bit). After some time, this can go up to 15 seconds or more. I've tried garbage collec...
2010 Jul 03
1
XML and RCurl: problem with encoding (htmlTreeParse)
Hi All, First method:- >library(XML) >theurl <- "http://home.sina.com" >download.file(theurl, "tmp.html") >txt <- readLines("tmp.html") >txt <- htmlTreeParse(txt, error=function(...){}, useInternalNodes = TRUE) >g <- xpathSApply(txt, "//p", function(x) xmlValue(x)) >head(grep(" ", g, value=T)) [1] " | | ENGLISH" " " [3] " ()"...
2010 Mar 15
0
RMySQL: Slower parsing over time with htmlTreeParse()
Dear List, has anyone of you experienced a significant increase in the time it takes to parse an URL via "htmlTreeParse()" when this function is called repeatedly every minute over a couple of hours? Initially, a single parse takes about 0.5 seconds on my machine (Quad Core, 2.67 GHz, 8 MB RAM, Windows 7 64 Bit). After some time, this can go up to 15 seconds or more. I've tried garbage collec...
2008 Oct 06
3
Extracting text from html code using the RCurl package.
...her way to achieve this? This is the code i am using: > library(RCurl) > my.url <- 'https://stat.ethz.ch/mailman/listinfo/r-help' > html.file <- getURI(my.url, ssl.verifyhost = FALSE, ssl.verifypeer = FALSE, followlocation = TRUE) > print(html.file) I thought perhaps the htmlTreeParse() function from the XML package might help, but I just don't know what to do next with it: > library(XML) > htmlTreeParse(html.file) Many thanks for any help you can provide, Tony Breyal > sessionInfo() R version 2.7.2 (2008-08-25) i386-pc-mingw32 locale: LC_COLLATE=English_United...
2009 Nov 25
2
XML package example code?
I'm interested in parsing an html page. I should use XML, right? Could somebody show me some example code? Is there a tutorial for this package?
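A minimal sketch of what such example code might look like, assuming the XML package and a placeholder URL:

```r
library(XML)

# Parse the page into an internal document
doc <- htmlTreeParse("http://example.com", useInternalNodes = TRUE)

# XPath queries then extract whatever is needed
title <- xpathSApply(doc, "//title", xmlValue)        # page title
links <- xpathSApply(doc, "//a", xmlGetAttr, "href")  # link targets
```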
2011 Oct 26
1
Webscraping - How to Scrape Out Text Into R As If Copied & Pasted From Webpage?
...on the page, pasted into a text file, and then read in the text file with read.csv(). # this is the actual page I'm trying to acquire text from: web.pg <- readLines("http://www.airweb.org/?page=574") # then parsed in hopes of an easier structure to work with: web.pg <- htmlTreeParse(file=web.pg, ignoreBlanks=TRUE) Now I have a lovely html tree, but don't know the best way to get just the text components (job descriptions, job titles, etc...) as they appear on the web site. I'd like to do a little text mining and make a wordcloud using the text. Can anybody suggest...
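One hedged way to approximate "text as copied from the page" is to collect all text nodes under the body element and drop the whitespace-only ones; a sketch using the URL from the post:

```r
library(XML)

doc <- htmlTreeParse("http://www.airweb.org/?page=574",
                     useInternalNodes = TRUE)

# All text nodes under <body>, stripped of surrounding whitespace
txt <- xpathSApply(doc, "//body//text()", xmlValue)
txt <- gsub("^\\s+|\\s+$", "", txt)
txt <- txt[nzchar(txt)]   # keep only non-empty strings
```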
2016 Jan 18
3
Extracting data from a web page
...column (semana, puntuacion), bearing in mind that there may be weeks with no score (in the example, the second week). For the moment I am obtaining it as follows: url_jugador<-"http://localhost:8080/jugadores/Luis" txt_jugador <- getURL(url_jugador) doc<-htmlTreeParse(txt_jugador, useInternalNodes = TRUE) puntos_nodo<- xpathApply(doc, "//table[@class='points']/tr") puntos_nodo [[1]] <tr> <td class="semana">1</td> <td class="neg"/> <td> <div class="bar">6</div&gt...
2011 Sep 05
2
htmlParse hangs or crashes
Dear colleagues, each time I use htmlParse, R crashes or hangs. The url I'd like to parse is included below as is the results of a series of basic commands that describe what I'm experiencing. The results of sessionInfo() are attached at the bottom of the message. The thing is, htmlTreeParse appears to work just fine, although it doesn't appear to contain the information I need (the URLs of the articles linked to on this search page). Regardless, I'd still like to understand why htmlParse doesn't work. Thank you for any insight. Yours, Simon Kiss myurl<-c("http:...
2009 May 12
2
import HTML tables
Hello, I was wondering if there is a function in R that imports tables directly from an HTML document. I know there are functions (say, getURL() from {RCurl}) that download the entire page source, but here I am referring to something like Google Docs' importHTML() function (if you don't know this function, go check it, it's very useful). Anyway, if someone knows of something that does this
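The closest equivalent in R is readHTMLTable() from the XML package; a sketch with an illustrative URL:

```r
library(XML)

# Returns a list with one data frame per <table> element on the page
tabs <- readHTMLTable("http://example.com/page-with-tables.html",
                      stringsAsFactors = FALSE)
```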
2012 Feb 29
2
Using a FOR LOOP to name objects
...a for loop to name objects in each iteration. As in the following example (which doesn't work quite well) my_list<-c("A","B","C","D","E","F") for(i in c(1:length(my_list))){ url<- "http://finance.yahoo.com" doc = htmlTreeParse(url, useInternalNodes = T) tab_nodes = xpathApply(doc, "//table[@cellpadding = '3']") *my_list[i]*=lapply(tab_nodes, readHTMLTable) #problem is in this line names(*my_list[i]*)=c("Ins","outs") } The problem is that in iteration #1, I need the info...
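The idiomatic fix is usually not to create variables named A, B, C, ... at all, but to collect each iteration's result in a named list; a sketch under that assumption, keeping the XPath from the post:

```r
library(XML)

my_list <- c("A", "B", "C", "D", "E", "F")
results <- setNames(vector("list", length(my_list)), my_list)

for (i in seq_along(my_list)) {
  url <- "http://finance.yahoo.com"   # placeholder, as in the post
  doc <- htmlTreeParse(url, useInternalNodes = TRUE)
  tab_nodes <- xpathApply(doc, "//table[@cellpadding = '3']")
  results[[i]] <- lapply(tab_nodes, readHTMLTable)
}
# Each iteration's tables are then results[["A"]], results[["B"]], ...
```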
2012 Apr 21
1
how to write html output (webscraped using RCurl package) into file?
i want "http://scop.berkeley.edu/astral/pdbstyle/?id=d1fjgc2&output=html",showing information in webpage to be written in .txt file as it is(i don't want any html tag) i am using "RCurl" package >marathi<-htmlTreeParse("http://scop.berkeley.edu/astral/pdbstyle/?id=d1fjgc2&output=html") >marathi >kasam<-marathi$children$html[["body"]][["pre"]][["text"]] >kasam > write(kasam,"papita.txt") Error in cat(list(...), file, sep, fill, labels, append) :...
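The cat() error above usually means the object being written is an XML node rather than a character vector; converting it with xmlValue() first is a plausible fix (sketch, untested against that server):

```r
library(XML)

marathi <- htmlTreeParse("http://scop.berkeley.edu/astral/pdbstyle/?id=d1fjgc2&output=html")
kasam   <- marathi$children$html[["body"]][["pre"]][["text"]]

# xmlValue() extracts the plain text; writeLines() then saves it
writeLines(xmlValue(kasam), "papita.txt")
```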
2008 Dec 17
1
Extract Data from a Webpage
...like this: http://oasasapps.oasas.state.ny.us/portal/pls/portal/oasasrep.providersearch.take_to_rpt?P1=3489&P2=11490 Based on searching R-help archives, it seems like the XML package might have something useful for this task. I can load the XML package and supply the url as an argument to htmlTreeParse(), but I don't know how to go from there. thanks, Chuck Cleland > sessionInfo() R version 2.8.0 Patched (2008-12-04 r47066) i386-pc-mingw32 locale: LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=Englis...
2012 Feb 10
1
Bug with memory allocation when loading Rdata files iteratively?
...illing the process eventually). It just seems like removing the object via rm() and firing gc() do not have any effect, so the memory consumption of each loaded R object accumulates until there's no more memory left :-/ Possibly, this is also related to XML package functionality (mainly htmlTreeParse and getNodeSet), but I also experience the described behavior when simply iteratively loading and removing Rdata files. I've put together a little example that illustrates the memory ballooning mentioned above which you can find here: http://stackoverflow.com/questions/9220849/significan...
2008 Dec 31
1
Chinese characters encoding problem with XML
XML is a good tool reading data from web within R. But I wonder how could get the encoding correctly. library(XML) url <- 'http://www.szitic.com/docc/jz-lmzq.html' xml <- htmlTreeParse(url, useInternal=TRUE) q <- "//tbody/tr/td" dat <- unlist(xpathApply(xml, q, xmlValue)) df <- as.data.frame(t(matrix(dat, 4))) dt<-as.character(df[15,1]) The first column of df is dates in Chinese. dt is one of the Chinese dates. When I copied the content of dt into the ema...
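htmlTreeParse() accepts an encoding argument; declaring the page's charset explicitly is often enough to keep non-ASCII text intact. A sketch; "GB2312" is an assumption for a Chinese-language page, so check the page's meta charset declaration first:

```r
library(XML)

url <- 'http://www.szitic.com/docc/jz-lmzq.html'
# "GB2312" is a guess; verify against the page's <meta> declaration
xml <- htmlTreeParse(url, useInternal = TRUE, encoding = "GB2312")
```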
2011 May 30
1
Need help reading website info with XML package and XPath
...uessing my xpath statements are wrong or getNodeSet needs something else to get to information contained in a bubble on a webpage. Any suggestions or ideas would be GREATLY appreciated. library(XML) url <- "http://www.zillow.com/homes/511 W Lafayette St, Norristown, PA_rb" doc <- htmlTreeParse(url, useInternalNode=TRUE, isURL=TRUE) f1 <- getNodeSet(doc, "//a[contains(@href,'homedetails')]") f2 <- getNodeSet(doc, "//span[contains(@class,'price')]") f3 <- getNodeSet(doc, "//LIST[@Beds]") f4 <- getNodeSet(doc, "//LIST[@Baths]&qu...