henrique monte
2017-Feb-22 22:52 UTC
[R] Web scraping - Having trouble figuring out how to approach this problem
Sometimes I need to get some data from the web, organize it into a data frame, and I waste a lot of time doing it manually. I've been trying to figure out how to streamline this process. I've tried a few R scraping approaches but couldn't get them to work, and I thought there might be an easier way. Can anyone help me out with this?

Fictional example:

Here's a webpage with countries listed by continent:
https://simple.wikipedia.org/wiki/List_of_countries_by_continents

Each country name is also a link that leads to another webpage specific to that country, e.g. https://simple.wikipedia.org/wiki/Angola.

As a final result I would like a data frame with one observation (row) per country listed and 4 variables (columns): ID = country name; Continent = the continent it belongs to; Language = official language (from the country's own webpage); and Population = the most recent population count (also from the country's own webpage).

...

The main issue I'm trying to figure out is handling several webpages. Would it be possible to scrape the first page to get the countries as a list together with the links to their webpages, and then write a function that runs a scraping command on each of those links to pull out the specific data I'm looking for?
Jeff Newmiller
2017-Feb-23 18:03 UTC
[R] Web scraping - Having trouble figuring out how to approach this problem
The answer is yes, and it does not seem like a big step from where you are now, so seeing what you already know how to do (a reproducible example, or RE) would help focus the assistance. There are quite a few ways to do this kind of thing, and what you already know would be clarified by an RE.

--
Sent from my phone. Please excuse my brevity.

On February 22, 2017 2:52:55 PM PST, henrique monte <henrique.monte66 at gmail.com> wrote:
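The two-step approach described in the question can be sketched with the rvest package. The sketch below is untested against the live pages, and the selectors in it (the "li > a" node set and the "infobox" table class) are assumptions about the page structure; inspect the actual HTML with your browser's developer tools and adjust them.

```r
## A sketch of the two-step approach using the 'rvest' package.
## NOTE: the selectors ("li > a", the "infobox" class, the row labels)
## are guesses about the page structure -- verify against the real HTML.
library(rvest)

list_url <- "https://simple.wikipedia.org/wiki/List_of_countries_by_continents"
listing  <- read_html(list_url)

## Step 1: collect the country names and the links to their pages
links <- html_nodes(listing, "li > a")   # refine to just the country lists
countries <- data.frame(
  ID   = html_text(links),
  href = paste0("https://simple.wikipedia.org", html_attr(links, "href")),
  stringsAsFactors = FALSE
)

## Helper: pull one labelled row out of a Wikipedia-style infobox table
get_field <- function(page, label) {
  node <- html_node(page, xpath = paste0(
    "//table[contains(@class, 'infobox')]",
    "//tr[th[contains(., '", label, "')]]/td"))
  html_text(node)
}

## Step 2: a function that scrapes one country page
scrape_country <- function(url) {
  page <- read_html(url)
  Sys.sleep(1)  # be polite: pause between requests
  data.frame(
    Language   = get_field(page, "Official language"),
    Population = get_field(page, "Population"),
    stringsAsFactors = FALSE
  )
}

## Step 3: apply the function to every link and bind the rows together
details <- do.call(rbind, lapply(countries$href, scrape_country))
result  <- cbind(countries["ID"], details)
```

The Continent column would come from the first page as well: each country link sits under a continent heading there, so you can record which heading a link falls under while extracting the links in step 1.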