mauede at alice.it
2009-Jul-01 11:53 UTC
[R] is there a way to extract data from web pages through some R function?
I deal with a huge amount of biology data stored in different databases. The databases belonging to the Bioconductor organization can be accessed through Bioconductor packages. Unfortunately, some useful data is stored in databases such as miRDB, miRecords, etc., which offer only an interactive HTML interface. See for instance
http://mirdb.org/cgi-bin/search.cgi
http://mirecords.umn.edu/miRecords/interactions.php?species=Homo+sapiens&mirna_acc=Any&targetgene_type=refseq_acc&targetgene_info=&v=yes&search_int=Search

Downloading data manually from the web pages is a painstaking, time-consuming, and error-prone activity. I came across a Python script that downloads (dumps) whole web pages into a text file that is then parsed. This is possible because Python has a library to access web pages. But I have no experience with Python programming, nor do I like a programming language whose syntax is indentation-sensitive.

I am *hoping* that there is some way to connect to web pages and HTML from R. Is there?

Thank you very much for any suggestion.
Maura
Greg Hirson
2009-Jul-01 15:41 UTC
[R] is there a way to extract data from web pages through some R function?
Maura,

Try the RCurl package, specifically the functions getURL and getForm.

Greg
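As a minimal sketch of the two functions Greg mentions: getURL downloads a page as a single character string, and getForm submits form fields as a GET request. The helper names fetch_page and fetch_interactions are hypothetical, and the field names and values are simply copied from the query string of the miRecords URL in the question. Whether that server still responds, and in what format, is an assumption.

```r
library(RCurl)

# Hypothetical helper: download a page as one character string (HTTP GET).
fetch_page <- function(url) {
  getURL(url)
}

# Hypothetical helper: submit the miRecords search as a GET form.
# Field names and values are taken from the query string of the URL
# quoted in the question; they are not checked against the live site.
fetch_interactions <- function(species = "Homo sapiens") {
  getForm("http://mirecords.umn.edu/miRecords/interactions.php",
          .params = c(species         = species,
                      mirna_acc       = "Any",
                      targetgene_type = "refseq_acc",
                      targetgene_info = "",
                      v               = "yes",
                      search_int      = "Search"))
}
```

Calling fetch_interactions() needs network access and returns the raw HTML of the result page, which still has to be parsed before any fields can be extracted.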
Martin Morgan
2009-Jul-01 15:51 UTC
[R] Is there a way to extract some field data from HTML pages through any R function?
Hi Maura --

Tools in R for this are the RCurl package and the XML package.

    library(RCurl)
    library(XML)

Typically this involves manual exploration of the web form. Then you might query the web form

    result <- postForm("http://mirdb.org/cgi-bin/search.cgi",
                       searchType="miRNA", species="Human",
                       searchBox="hsa-let-7a", submitButton="Go")

and parse the results into a convenient structure

    html <- htmlTreeParse(result, asText=TRUE, useInternalNodes=TRUE)

You can then use XPath (http://www.w3.org/TR/xpath, especially section 2.5) to explore and extract information, e.g.,

    ## second table, first row
    getNodeSet(html, "//table[2]/tr[1]")

    ## second table, makes subsequent paths shorter
    tbl <- getNodeSet(html, "//table[2]")[[1]]

    ## a helper function: text of every node matching 'path';
    ## [-1] drops the first match (here, the header row)
    xget <- function(xml, path)
        unlist(xpathApply(xml, path, xmlValue))[-1]

    df <- data.frame(TargetRank=as.numeric(xget(tbl, "./tr/td[2]")),
                     TargetScore=as.numeric(xget(tbl, "./tr/td[3]")),
                     miRNAName=xget(tbl, "./tr/td[4]"),
                     GeneSymbol=xget(tbl, "./tr/td[5]"),
                     GeneDescription=xget(tbl, "./tr/td[6]"))

There are many ways through this latter part, probably some much cleaner than presented above. There are fairly extensive examples on each of the relevant help pages, e.g., ?postForm.

Martin
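The XPath step can be tried without network access by parsing a small inline page. The sketch below is only an illustration: the HTML string, its values, and the helper name column are all made up, and the real miRDB result page may be laid out differently. It selects the second table and skips the header row inside the XPath expression with position() > 1, instead of dropping the first element afterwards.

```r
library(XML)

# Made-up stand-in for a result page; none of these values are real data.
page <- '<html><body>
<table><tr><td>navigation</td></tr></table>
<table>
<tr><td>Rank</td><td>Score</td><td>Symbol</td></tr>
<tr><td>1</td><td>99</td><td>GENE_A</td></tr>
<tr><td>2</td><td>97</td><td>GENE_B</td></tr>
</table>
</body></html>'

# Parse the text into an internal document that XPath queries can run on.
doc <- htmlParse(page, asText = TRUE)

# Hypothetical helper: text of the i-th cell in every non-header row
# of the second table.
column <- function(doc, i) {
  path <- sprintf("//table[2]/tr[position() > 1]/td[%d]", i)
  xpathSApply(doc, path, xmlValue)
}

df <- data.frame(Rank   = as.numeric(column(doc, 1)),
                 Score  = as.numeric(column(doc, 2)),
                 Symbol = column(doc, 3),
                 stringsAsFactors = FALSE)
```

Putting the row filter in the path keeps the helper free of index arithmetic, at the cost of assuming the header is always the first row of the table.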