Hi all,

I'm working on scraping some website data to build a database. In most cases, I can use the XML package to get the dataset. However, some websites don't give an explicit address for the downloadable tables.

To be more specific, I'm interested in the website http://ets.aeso.ca/ The data I want to scrape is the "Pool Weekly Summary" report under the "Historical" category. However, after clicking "Historical" and choosing the "Pool Weekly Summary" item on the website, the address is still http://ets.aeso.ca/ and doesn't change.

In this case, I guess I need to tell R to first click the "Historical" button and then choose the item before scraping the data. But the question is how?

Any suggestions are welcome.

Guang
Tyler Ritchie
2012-Mar-05 20:40 UTC
[R] How to choose a button and scrape the website data
That website uses JavaScript to submit the form (and doesn't work in Chrome). You could build a JavaScript interpreter in R, have it parse the page, and then use the page's own JavaScript to submit the form, but R just isn't the right tool for that type of interaction. Performing the task as you describe it is possible, just not reasonable with R.

There are better tools for automating web pages, such as Automato [1] or Sikuli [2].

But better still would be to query the site directly. Checking the source of the page, each of the different report types stems from a different URL. Passing that URL arguments in the form of:

beginDate=03012012&endDate=03032012&SelectFormat=CSV

returns values from March 1st to 3rd of this year as a CSV (the dates are written as MMDDYYYY). To find the URLs of interest, view the page source and search for "Select a Report".

Easier still might be to contact AESO and ask them for the data.

[1] http://automa.to/
[2] http://sikuli.org/

-Tyler

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
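For anyone trying the direct-query route Tyler describes, here is a minimal sketch in base R of how the arguments might be assembled. It is only an illustration under stated assumptions: build_report_query() is a hypothetical helper name, report_url is a placeholder (the real per-report URL is not given in this thread and has to be found in the page source), and joining the arguments to the URL with "?" is assumed rather than confirmed.

```r
# Build the argument string in the form given in the thread:
#   beginDate=MMDDYYYY&endDate=MMDDYYYY&SelectFormat=CSV
# build_report_query() is a hypothetical helper, not part of any package.
build_report_query <- function(begin_date, end_date, fmt = "CSV") {
  paste0(
    "beginDate=", format(as.Date(begin_date), format = "%m%d%Y"),
    "&endDate=", format(as.Date(end_date), format = "%m%d%Y"),
    "&SelectFormat=", fmt
  )
}

# March 1st to 3rd, 2012, matching the example in the thread.
query <- build_report_query("2012-03-01", "2012-03-03")

# Placeholder only: replace with the report URL found by viewing the
# page source and searching for "Select a Report".
report_url <- "http://ets.aeso.ca/REPORT_PATH_FROM_PAGE_SOURCE"

# Assumption: the arguments are sent as a query string after "?".
full_url <- paste0(report_url, "?", query)

# Once report_url is real, the CSV could be read directly (untested here):
# weekly <- read.csv(full_url)
```

Taking the dates as ISO strings and formatting them inside the helper keeps the MMDDYYYY detail in one place, so looping over many weeks only needs a vector of start and end dates.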
Thank you, Tyler.

I just had a quick read of automa.to and sikuli.org, and both seem very promising. Since I anticipate many other cases where a similar issue will arise, I don't mind spending some time learning a tool that is efficient for this purpose. Any suggestions?