Hi,

I am working on developing a web crawler in R and I need some help with removing JavaScript and style sheets from the HTML document of a web page.

I tried using the XML package, hence the function xpathApply:

    library(XML)
    txt <- xpathApply(html,
                      "//body//text()[not(ancestor::script)][not(ancestor::style)]",
                      xmlValue)

The output comes out as text lines, without any HTML tags. I want the HTML tags to remain intact and strip only the JavaScript and styles from the document.

Any help would be highly appreciated. Thanks in advance.
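(For reference, a self-contained version of the snippet above. The URL is a placeholder, and htmlParse, from the same XML package, is one way to obtain the html object used here:)

    library(XML)

    # Placeholder URL; any HTML page will do.
    html <- htmlParse("http://www.example.com")

    # Keep only the text nodes under <body> that sit outside
    # <script> and <style> elements.
    txt <- xpathApply(html,
                      "//body//text()[not(ancestor::script)][not(ancestor::style)]",
                      xmlValue)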
Duncan Temple Lang
2011-Mar-29 15:57 UTC
[R] Scrap java scripts and styles from an html document
On 3/28/11 11:38 PM, antujsrv wrote:

> The output comes out as text lines, without any HTML tags. I want the
> HTML tags to remain intact and strip only the JavaScript and styles
> from the document.

Well then you would be best served to use that approach, i.e. find the nodes named script and style and then remove them from the tree. Then you have the document as a single object rather than a bunch of individual elements. So:

    nodes = xpathApply(html, "//body//script | //body//style")
    removeNodes(nodes)
    saveXML(html)

But you don't say what you want to end up with, what you are doing with the resulting content, or why you have to remove the JavaScript content, etc.

 D.
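(A complete, runnable version of the approach above, for anyone following along. The URL is a placeholder; htmlParse, getNodeSet, removeNodes, and saveXML are all from the XML package:)

    library(XML)

    # Parse a live page into an HTML document tree (placeholder URL).
    html <- htmlParse("http://www.example.com")

    # Locate every script and style node under <body> ...
    nodes <- getNodeSet(html, "//body//script | //body//style")

    # ... and drop them from the tree in place.
    removeNodes(nodes)

    # Serialize the pruned tree back to an HTML string, tags intact.
    cleaned <- saveXML(html)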
Hi,

I am working on developing a web crawler. Removing JavaScript and styles is part of cleaning the HTML document. What I want is a cleaned HTML document with only the HTML tags and the textual information, so that I can figure out the pattern of the web page. This is being done to extract relevant information from the page, such as the comments for a particular product.

For example, amazon.com has all such comments within the [...] and [...] tags, with regularly occurring [...] for breaks. So the tags which appear most often help us locate the required information. Different websites have different patterns, but it is likely that the tags that occur most often will have the relevant information enclosed in them. So, once the HTML page is cleaned, it is easy to roll the tags up and, knowing their frequency of occurrence, target the information.

If there are any suggestions to help, please let me know. I would be more than pleased.

Regards,
Antuj
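(A sketch of the tag-frequency idea described above, building on the cleaned document from the earlier example. html is assumed to be the pruned tree; the counting itself is plain base R:)

    library(XML)

    # Name of every element remaining under <body> in the cleaned tree.
    tags <- xpathSApply(html, "//body//*", xmlName)

    # Tabulate the tags, most frequent first; the most frequent tags are
    # candidate containers for repeated content such as product comments.
    sort(table(tags), decreasing = TRUE)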