Ilio Fornasero
2018-May-23 13:14 UTC
[R] Using R htmlParse() for manipulating URLs to access multiple pages
I am trying to scrape a manual from the web. For privacy reasons I cannot give the exact URL here, but the structure is as follows:

https://home.lala.com/bibi/blabla/chapter_i_organization/101_contracts/whatever/,DanaInfo=intranet.lala.com+
https://home.lala.com/bibi/blabla/chapter_i_organization/125_bills/,DanaInfo=intranet.lala.com+
https://home.lala.com/bibi/blabla/chapter_vii_operational_modalities/701_wonderwall_18_oasis/701_wonderwall_18_oasis/

and so forth.

Of course, I don't want to scrape the individual URLs one by one. Hence, I am trying to parse the base URL and work onward from there:

baseurl <- htmlParse("https://home.lala.com/bibi/blabla/", encoding = "UTF-8")
xpath <- "//div[@id='Page']/strong[2]"
GetAllPages <- as.numeric(xpathSApply(baseurl, xpath, xmlValue))

However, it does not work at all:

> GetAllPages
numeric(0)

Any hint?
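One possible direction, offered only as a minimal sketch: an empty result such as numeric(0) means the XPath expression matched no nodes in whatever document was parsed. Rather than looking for a page count, the links on the base page could be collected and turned into the list of URLs to visit. The HTML below is a hypothetical stand-in, since the real page markup is not known, and the names index_html, base_url, chapter_paths and chapter_urls are assumptions made for illustration. The sketch parses HTML text with asText = TRUE, which is how a page fetched separately (for instance through an authenticated session) could be handed to htmlParse().

```r
library(XML)

# Hypothetical index page standing in for the real base URL; the real
# markup is unknown, so the link layout below is an assumption.
index_html <- '
<html><body>
  <div id="Page">
    <a href="chapter_i_organization/101_contracts/">101 Contracts</a>
    <a href="chapter_i_organization/125_bills/">125 Bills</a>
    <a href="/about/">About</a>
  </div>
</body></html>'

base_url <- "https://home.lala.com/bibi/blabla/"

# Parse the HTML text itself rather than handing htmlParse() a URL.
doc <- htmlParse(index_html, asText = TRUE, encoding = "UTF-8")

# Collect every href inside the div, then keep only the chapter links.
hrefs <- as.character(xpathSApply(doc, "//div[@id='Page']//a/@href"))
chapter_paths <- grep("^chapter_", hrefs, value = TRUE)

# Build absolute URLs by prefixing the base URL to each relative path.
chapter_urls <- paste0(base_url, chapter_paths)

print(chapter_urls)
```

Each element of chapter_urls could then be fetched and parsed in turn, for example inside lapply(). Whether the real base page exposes its chapters as relative links in this way would need to be checked against the actual markup.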