Dear R-users,

I have to read data from a worksheet that is available on the Internet. I have been doing this by copying the worksheet from the browser, but I would like to be able to copy the data automatically using the url command.

When using the "url" command, however, the result is the source code, I mean, HTML code. I see that the data I need is in the source code, but before thinking about reading the data from the HTML code I wonder if there is a package or another way to extract these data, since reading from the code would demand a lot of work and may not be very accurate.

Below one can see the page from which I am trying to export the data:

dados <- url("http://www.mar.mil.br/dhn/chm/meteo/prev/dados/pnboia/sc1201_arquivos/sheet002.htm", "r")

I am looking forward to any help. Thanks in advance,
Nilza Barros
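[For context, a minimal sketch of what the call above does, using only base R: url() opens a connection, so reading from it yields the page's raw HTML source rather than the worksheet values, which is the behaviour described in the question.]

# url() gives a connection; readLines() returns raw HTML lines, not the data
con <- url("http://www.mar.mil.br/dhn/chm/meteo/prev/dados/pnboia/sc1201_arquivos/sheet002.htm", open = "r")
html_source <- readLines(con)   # character vector of HTML markup
close(con)
head(html_source)               # shows tags and script, hence the need for a parser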
CIURANA EUGENE (R)
2012-Feb-12 12:44 UTC
[R] [R-sig-DB] Reading data from a worksheet on the Internet
On Sat, 11 Feb 2012 22:49:07 -0200, Nilza BARROS wrote:

> I have to read data from a worksheet that is available on the Internet. I
> have been doing this by copying the worksheet from the browser, but I
> would like to be able to copy the data automatically using the url
> command.
>
> But when using the "url" command the result is the source code, I mean,
> HTML code. I see that the data I need is in the source code, but before
> thinking about reading the data from the HTML code I wonder if there is a
> package or another way to extract these data, since reading from the code
> will demand a lot of work and may not be so accurate.
>
> Below one can see the page from which I am trying to export the data:
>
> dados <- url("http://www.mar.mil.br/dhn/chm/meteo/prev/dados/pnboia/sc1201_arquivos/sheet002.htm", "r")

Hi Nilza,

The URL that you posted points at a document that has another document within it, in a frame. These files are Excel dumps into HTML. To view the actual data you need the URIs for each data set. Those appear at the bottom of the listing, under sc1201_arquivos/sheet001.htm and sheet002.htm. Your code must fetch these files, not the one at http://www.mar.mil.br/dhn/chm/meteo/prev/dados/pnboia/sc1202.htm [1], which only "wraps" them. Most of what you see in the file that you linked isn't HTML; it's JavaScript and style information for the data living in the two separate HTML documents.

You can do this in R using the RCurl and XML libraries, by pulling the specific files for each data source. If this is a one-time thing, I'd suggest just coding something simple that loads the data for each file. If this is something you'll execute periodically, you'll need a bit more code to extract the internal data sheets (e.g. the "planilhas" at the bottom), then extract the actual data.

Let me know if you want this as a one-time thing or as a reusable program. If you don't know how to use RCurl and XML to parse HTML, I'll be happy to help with that too. I'd just like to know more about the scope of your question.

Cheers,

pr3d

--
pr3d4t0r at #R, ##java, #awk, #pyton irc.freeenode.net

Links:
------
[1] http://www.mar.mil.br/dhn/chm/meteo/prev/dados/pnboia/sc1202.htm
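[A minimal sketch of the RCurl/XML approach described in the reply above, for the one-time case. It assumes the XML package's readHTMLTable() can parse the Excel-generated HTML of the inner sheet; the sheet URL comes from the thread, and the table index may need adjusting.]

library(RCurl)
library(XML)

# Fetch one of the inner sheets directly, not the wrapper page sc1202.htm
sheet_url <- "http://www.mar.mil.br/dhn/chm/meteo/prev/dados/pnboia/sc1201_arquivos/sheet001.htm"
raw_html  <- getURL(sheet_url)

# Parse the Excel-generated HTML and pull out its tables
doc    <- htmlParse(raw_html, asText = TRUE)
tables <- readHTMLTable(doc, stringsAsFactors = FALSE)

dados <- tables[[1]]   # first table on the sheet; adjust the index if needed
str(dados)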
Nilza BARROS
2012-Feb-12 18:24 UTC
[R] [R-sig-DB] Reading data from a worksheet on the Internet
Hi,

I really appreciate your help. I definitely need a reusable program, since I have been asking someone to extract these data from the Internet every day; that is why I am trying to write a program to do it.

Regarding the URL I sent: I have just realized that although I wrote the one referring to only one worksheet (PLANILHA2), when I paste it into my browser it shows the page with both worksheets.

I am going to read about the RCurl and XML libraries, but I hope you can help me too.

Thanks in advance,
Nilza Barros

On Sun, Feb 12, 2012 at 10:42 AM, CIURANA EUGENE (R) <r.user@ciurana.eu> wrote:

> [quoted reply snipped]

--
Abraço,
Nilza Barros
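[A hedged sketch of what the reusable version suggested in the reply might look like: it discovers the sheet URLs from the wrapper page instead of hard-coding them. The assumption that sc1202.htm lists the sheets in <frame> tags, and the XPath used here, follow from the reply's description and may need adjusting if the page uses links or tabs instead.]

library(RCurl)
library(XML)

wrapper_url <- "http://www.mar.mil.br/dhn/chm/meteo/prev/dados/pnboia/sc1202.htm"
wrapper_doc <- htmlParse(getURL(wrapper_url), asText = TRUE)

# Collect the src attribute of every frame and resolve it against the wrapper's directory
frame_srcs <- xpathSApply(wrapper_doc, "//frame", xmlGetAttr, "src")
sheet_urls <- paste(dirname(wrapper_url), frame_srcs, sep = "/")

# Read the first HTML table of every sheet into a list of data frames
sheets <- lapply(sheet_urls, function(u) {
  readHTMLTable(htmlParse(getURL(u), asText = TRUE), stringsAsFactors = FALSE)[[1]]
})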