thr3ads.net - R help - [R] is there a way to extract data from web pages through some R function ? [Jul 2009]

mauede at alice.it

2009-Jul-01 11:53 UTC

[R] is there a way to extract data from web pages through some R function ?

I deal with a huge amount of biological data stored in different databases.
The databases belonging to the Bioconductor project can be accessed through
Bioconductor packages.
Unfortunately, some useful data is stored in databases such as miRDB and
miRecords, which offer only an
interactive HTML interface. See for instance
 mirdb.org/cgi-bin/search.cgi, 
 mirecords.umn.edu/miRecords/interactions.php?species=Homo+sapiens&mirna_acc=Any&targetgene_type=refseq_acc&targetgene_info=&v=yes&search_int=Search

Downloading data manually from the web pages is a painstaking, time-consuming,
and error-prone activity.
I came across a Python script that downloads (dumps) whole web pages into a
text file that is then parsed.
This is possible because Python has a library for accessing web pages.
But I have no experience with Python programming, nor do I like a programming
language whose syntax is indentation-sensitive.

I am *hoping* that there is some way to connect to web pages (HTML) from R.
Is there?

Thank you very much for any suggestion.
Maura




mauede at alice.it

2009-Jul-01 15:17 UTC

[R] Is there a way to extract some fields data from HTML pages through any R function ?


Greg Hirson

2009-Jul-01 15:41 UTC

[R] is there a way to extract data from web pages through some R function ?

Maura,

Try the RCurl package, specifically the functions getURL and getForm.
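
As a rough illustration of those two functions, here is a minimal sketch. The
helper names fetch_page and query_mirecords are hypothetical, the "http://"
scheme is an assumption, and the form fields are simply copied from the
miRecords URL quoted in the question; the site may no longer accept them.

```r
# Sketch only: assumes the RCurl package is installed.
# Calls are namespaced (RCurl::), so nothing touches the network
# until one of the helpers is actually invoked.

# Plain GET: returns the page's HTML as one character string.
fetch_page <- function(url) {
  RCurl::getURL(url, followlocation = TRUE)
}

# GET with form parameters: getForm builds the query string from the
# named arguments. Field names are taken from the URL in the question.
query_mirecords <- function(species = "Homo sapiens", mirna_acc = "Any") {
  RCurl::getForm("http://mirecords.umn.edu/miRecords/interactions.php",
                 species = species,
                 mirna_acc = mirna_acc,
                 targetgene_type = "refseq_acc",
                 targetgene_info = "",
                 v = "yes",
                 search_int = "Search")
}
```

Either helper returns raw HTML text, which still has to be parsed before any
fields can be extracted.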

Greg

-- 
Greg Hirson
ghirson at ucdavis.edu

Graduate Student
Agricultural and Environmental Chemistry

1106 Robert Mondavi Institute North
One Shields Avenue
Davis, CA 95616

Martin Morgan

2009-Jul-01 15:51 UTC

[R] Is there a way to extract some fields data from HTML pages through any R function ?

Hi Maura --

mauede at alice.it wrote:
> I am *hoping* that there exists some sort of web pages, HTML connection
> from R ... is there ??

Tools in R for this are the RCurl package and the XML package.

  library(RCurl)
  library(XML)

Typically this involves manual exploration of the web form to find the names
of its fields. You might then query the web form

  result <- postForm("http://mirdb.org/cgi-bin/search.cgi",
                     searchType="miRNA", species="Human",
                     searchBox="hsa-let-7a",
                     submitButton="Go")

and parse the results into a convenient structure

  html <- htmlTreeParse(result, asText=TRUE, useInternalNodes=TRUE)

You can then use XPath (w3.org/TR/xpath, especially section
2.5) to explore and extract information, e.g.,

  ## second table, first row
  getNodeSet(html, "//table[2]/tr[1]")
  ## keep a reference to the second table; makes subsequent paths shorter
  tbl <- getNodeSet(html, "//table[2]")[[1]]
  ## helper: text of every node matching 'path'; [-1] drops the first
  ## match (presumably the header row)
  xget <- function(xml, path)
      unlist(xpathApply(xml, path, xmlValue))[-1]
  ## one column per table cell position
  df <- data.frame(TargetRank=as.numeric(xget(tbl, "./tr/td[2]")),
                   TargetScore=as.numeric(xget(tbl, "./tr/td[3]")),
                   miRNAName=xget(tbl, "./tr/td[4]"),
                   GeneSymbol=xget(tbl, "./tr/td[5]"),
                   GeneDescription=xget(tbl, "./tr/td[6]"))

There are many ways through this latter part, probably some much cleaner
than presented above. There are fairly extensive examples on each of the
relevant help pages, e.g., ?postForm.
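
One possibly cleaner route, sketched here under assumptions, is to let
readHTMLTable from the XML package convert a whole table at once instead of
pulling columns out one XPath at a time. The helper name table_from_html and
the sample page fragment are made up for illustration; a real miRDB result
page will have a different layout, so the table index would need checking.

```r
# Sketch only: assumes the XML package is installed.

# Hypothetical helper: parse an HTML string and return the table at
# position 'which' as a data frame, using the first row as the header.
table_from_html <- function(html_text, which = 1) {
  doc <- XML::htmlParse(html_text, asText = TRUE)
  XML::readHTMLTable(doc, which = which, header = TRUE,
                     stringsAsFactors = FALSE)
}

# Made-up fragment standing in for a downloaded result page.
sample_html <- paste0(
  "<html><body><table>",
  "<tr><th>Rank</th><th>Score</th><th>Gene</th></tr>",
  "<tr><td>1</td><td>99</td><td>GENE_A</td></tr>",
  "<tr><td>2</td><td>97</td><td>GENE_B</td></tr>",
  "</table></body></html>")
```

Calling table_from_html(sample_html) should give a two-row data frame with
character columns Rank, Score and Gene; numeric columns can then be converted
with as.numeric, as in the code above. The string returned by postForm could
be passed in place of sample_html.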

Martin
