I would like to track in which journals articles about a particular disease are being published. Creating a PubMed search is trivial. The search provides data, but obviously not as an R data frame. I can get the search to export the data as an XML feed, and the XML package seems to be able to read it:

xmlTreeParse("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/erss.cgi?rss_guid=0_JYbpsax0ZAAPnOd7nFAX-29fXDpTk5t8M4hx9ytT-",
             isURL = TRUE)

But getting from there to a data frame in which one column would be the name of the journal and another column would be the year (to keep things simple) seems to be beyond my capabilities.

Has anyone ever done this, and could you share your script? Are there any published examples where the end result is a data frame?

I guess what I am looking for is an easy and simple way to parse the feed and extract the data. Alternatively, how does one turn an RSS feed into a CSV file?

--
Farrel Buchinsky
GrandCentral Tel: (412) 567-7870
On Dec 13, 2007, at 9:03 PM, Farrel Buchinsky wrote:

> I would like to track in which journals articles about a particular disease
> are being published. Creating a PubMed search is trivial. The search
> provides data, but obviously not as an R data frame. I can get the search
> to export the data as an XML feed, and the XML package seems to be able to
> read it:
>
> xmlTreeParse("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/erss.cgi?rss_guid=0_JYbpsax0ZAAPnOd7nFAX-29fXDpTk5t8M4hx9ytT-",
>              isURL = TRUE)
>
> But getting from there to a data frame in which one column would be the
> name of the journal and another column would be the year (to keep things
> simple) seems to be beyond my capabilities.

If you're comfortable with Python (or Perl, Ruby, etc.), it'd be easier to just extract the required stuff from the raw feed; using ElementTree in Python makes this a trivial task. Once you have the raw data you can read it into R.

-------------------------------------------------------------------
Rajarshi Guha <rguha at indiana.edu>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
A committee is a group that keeps the minutes and loses hours.
    -- Milton Berle
On Dec 13, 2007 9:03 PM, Farrel Buchinsky <fjbuch at gmail.com> wrote:

> I would like to track in which journals articles about a particular disease
> are being published. Creating a PubMed search is trivial. The search
> provides data, but obviously not as an R data frame. I can get the search
> to export the data as an XML feed, and the XML package seems to be able to
> read it:
>
> xmlTreeParse("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/erss.cgi?rss_guid=0_JYbpsax0ZAAPnOd7nFAX-29fXDpTk5t8M4hx9ytT-",
>              isURL = TRUE)
>
> But getting from there to a data frame in which one column would be the
> name of the journal and another column would be the year (to keep things
> simple) seems to be beyond my capabilities.
>
> Has anyone ever done this, and could you share your script? Are there any
> published examples where the end result is a data frame?
>
> I guess what I am looking for is an easy and simple way to parse the feed
> and extract the data. Alternatively, how does one turn an RSS feed into a
> CSV file?

Try this:

library(XML)
doc <- xmlTreeParse("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/erss.cgi?rss_guid=0_JYbpsax0ZAAPnOd7nFAX-29fXDpTk5t8M4hx9ytT-",
                    isURL = TRUE, useInternalNodes = TRUE)
sapply(c("//author", "//category"), xpathApply, doc = doc, fun = xmlValue)
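[Editor's note: the sapply/xpathApply line above pulls the node values out of the feed but stops short of the journal-by-year data frame the original question asked for. The following is a minimal sketch of that last step, assuming the CRAN XML package is installed; it runs against an invented two-item stand-in for the feed so it works offline, and the real PubMed RSS will of course carry different tag contents.]

```r
library(XML)  # CRAN package, not part of base R

# Invented stand-in for the PubMed RSS feed: two <item>s, where <category>
# holds the journal name and <pubDate> a date string containing the year.
rss <- '<rss><channel>
  <item><category>Laryngoscope</category>
        <pubDate>Mon, 12 Nov 2007 00:00:00 GMT</pubDate></item>
  <item><category>J Virol</category>
        <pubDate>Tue, 03 Apr 2006 00:00:00 GMT</pubDate></item>
</channel></rss>'

doc <- xmlTreeParse(rss, asText = TRUE, useInternalNodes = TRUE)
journal <- unlist(xpathApply(doc, "//item/category", xmlValue))
# Extract the single 4-digit run (the year) from each pubDate string.
year    <- sub(".*([0-9]{4}).*", "\\1",
               unlist(xpathApply(doc, "//item/pubDate", xmlValue)))
free(doc)

df <- data.frame(Journal = journal, Year = year, stringsAsFactors = FALSE)
df
# From here, write.csv(df, "journals.csv") answers the CSV question too.
```

For the live feed, replace the inline `rss` string with the `isURL = TRUE` call shown above; the two xpathApply lines stay the same.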
or just try looking in the annotate package from Bioconductor

Gabor Grothendieck wrote:
> On Dec 13, 2007 9:03 PM, Farrel Buchinsky <fjbuch at gmail.com> wrote:
>> I would like to track in which journals articles about a particular
>> disease are being published. Creating a PubMed search is trivial. The
>> search provides data, but obviously not as an R data frame. I can get the
>> search to export the data as an XML feed, and the XML package seems to be
>> able to read it:
>>
>> xmlTreeParse("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/erss.cgi?rss_guid=0_JYbpsax0ZAAPnOd7nFAX-29fXDpTk5t8M4hx9ytT-",
>>              isURL = TRUE)
>>
>> But getting from there to a data frame in which one column would be the
>> name of the journal and another column would be the year (to keep things
>> simple) seems to be beyond my capabilities.
>>
>> Has anyone ever done this, and could you share your script? Are there any
>> published examples where the end result is a data frame?
>>
>> I guess what I am looking for is an easy and simple way to parse the feed
>> and extract the data. Alternatively, how does one turn an RSS feed into a
>> CSV file?
>
> Try this:
>
> library(XML)
> doc <- xmlTreeParse("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/erss.cgi?rss_guid=0_JYbpsax0ZAAPnOd7nFAX-29fXDpTk5t8M4hx9ytT-",
>                     isURL = TRUE, useInternalNodes = TRUE)
> sapply(c("//author", "//category"), xpathApply, doc = doc, fun = xmlValue)

--
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org
On Dec 13, 2007 11:35 PM, Robert Gentleman <rgentlem at fhcrc.org> wrote:
> or just try looking in the annotate package from Bioconductor

Yip. annotate seems to be the most streamlined way to do this.

1) How does one turn the list that is created into a data frame whose column names are along the lines of date, title, journal, authors, etc.?

2) I have already created a standing search in PubMed using My NCBI. There are many ways I could feed those results to the pubmed() function. The most brute-force way is to run the search, output the data as a UI List, and paste that into the pubmed() call. A way with more finesse would be to create an RSS feed based on my search and then give the RSS feed URL to the pubmed() function. Or perhaps one could just plop the query inside the pubmed() call:

pubmed(somefunction('("Laryngeal Neoplasms"[MeSH] AND "Papilloma"[MeSH]) OR
    ((("recurrence"[TIAB] NOT Medline[SB]) OR "recurrence"[MeSH Terms] OR
    recurrent[Text Word]) AND respiratory[All Fields] AND
    (("papilloma"[TIAB] NOT Medline[SB]) OR "papilloma"[MeSH Terms] OR
    papillomatosis[Text Word]))'))

Does "somefunction" exist?

If there are any further questions, do you think I should migrate this conversation to the Bioconductor mailing list?

Farrel Buchinsky
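[Editor's note: as far as I can tell no ready-made "somefunction" exists, but NCBI's ESearch utility does exactly this job: it turns a PubMed query string into a list of PMIDs, which can then be handed to annotate's pubmed(). Below is a hedged sketch assuming the CRAN XML package; the URL layout follows the eutils endpoints already used in this thread, and `esearch.url`/`pmids.for` are names invented here, not part of any package.]

```r
library(XML)  # CRAN package, not part of base R

# Build an ESearch URL for an arbitrary PubMed query string.
# retmax caps the number of IDs returned; URLencode escapes quotes,
# brackets, and spaces in the query.
esearch.url <- function(query, retmax = 100) {
  paste("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?",
        "db=pubmed&retmax=", retmax,
        "&term=", URLencode(query, reserved = TRUE),
        sep = "")
}

# Fetch the IDs for a query (requires network access).
# ESearch replies with <eSearchResult><IdList><Id>...</Id>...</IdList>...
pmids.for <- function(query) {
  doc <- xmlTreeParse(esearch.url(query), isURL = TRUE,
                      useInternalNodes = TRUE)
  on.exit(free(doc))
  unlist(xpathApply(doc, "//IdList/Id", xmlValue))
}

# Intended use (not run here, needs the network and Bioconductor):
# ids <- pmids.for('"Laryngeal Neoplasms"[MeSH] AND "Papilloma"[MeSH]')
# library(annotate)
# pubmed(ids)
```

So "somefunction" would be pmids.for(), with the complex query passed as a single quoted string.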
On Dec 16, 2007 2:53 PM, David Winsemius <dwinsemius at comcast.net> wrote:
> # in the debugging phase I needed to set useInternalNodes = TRUE to see
> # the tags. Never did find a way to "print" them when internal.

I assume you mean FALSE. See:

?saveXML
David Winsemius wrote:
> On 15 Dec 2007, you wrote in gmane.comp.lang.r.general:
>
>> If we can assume that the abstract is always the 4th paragraph then we
>> can try something like this:
>>
>> library(XML)
>> doc <- xmlTreeParse("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/erss.cgi?rss_guid=0_JYbpsax0ZAAPnOd7nFAX-29fXDpTk5t8M4hx9ytT-",
>>                     isURL = TRUE, useInternalNodes = TRUE, trim = TRUE)
>>
>> out <- cbind(
>>     Author = unlist(xpathApply(doc, "//author", xmlValue)),
>>     PMID = gsub(".*:", "", unlist(xpathApply(doc, "//guid", xmlValue))),
>>     Abstract = unlist(xpathApply(doc, "//description",
>>         function(x) {
>>             on.exit(free(doc2))
>>             doc2 <- htmlTreeParse(xmlValue(x)[[1]], asText = TRUE,
>>                                   useInternalNodes = TRUE, trim = TRUE)
>>             xpathApply(doc2, "//p[4]", xmlValue)
>>         }
>>     )))
>> free(doc)
>> substring(out, 1, 25) # display first 25 chars of each field
>>
>> The last line produces (it may look messed up in this email):
>>
>>> substring(out, 1, 25) # display it
>>      Author                      PMID       Abstract
>> [1,] " Goon P, Sonnex C, Jani P" "18046565" "Human papillomaviruses (H"
>> [2,] " Rad MH, Alizadeh E, Ilkh" "17978930" "Recurrent laryngeal papil"
>> [3,] " Lee LA, Cheng AJ, Fang T" "17975511" "OBJECTIVES:: Papillomas o"
>> [4,] " Gerein V, Schmandt S, Ba" "17935912" "BACKGROUND: Human papillo"
>> [snip]
>
> It looked beautifully regular in my newsreader. It is helpful to see an
> example showing the indexed access to nodes. It was also helpful to see
> the example of substring for column display. Thank you (for this and all
> of your other contributions).
>
> I find upon further browsing that the pmfetch access point is obsolete.
> Experimentation with the PubMed eFetch server access point results in
> fully XML-tagged results:
>
> e.fetch.doc <- function() {
>     fetch.stem <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?"
>     src.mode <- "db=pubmed&retmode=xml&"
>     request <- "id=11045395"
>     doc <- xmlTreeParse(paste(fetch.stem, src.mode, request, sep = ""),
>                         isURL = TRUE, useInternalNodes = TRUE)
> }
> # in the debugging phase I needed to set useInternalNodes = TRUE to see
> # the tags. Never did find a way to "print" them when internal.

saveXML(node) will return a string giving the XML content of that node as a tree.

> doc <- e.fetch.doc()
> get.info <- function(doc) {
>     df <- cbind(
>         Abstract = unlist(xpathApply(doc, "//AbstractText", xmlValue)),
>         Journal = unlist(xpathApply(doc, "//Title", xmlValue)),
>         Pmid = unlist(xpathApply(doc, "//PMID", xmlValue))
>     )
>     return(df)
> }
>
> # this works
> > substring(get.info(doc), 1, 25)
>      Abstract                    Journal                     Pmid
> [1,] "We studied the prevalence" "Pediatric nephrology (Ber" "11045395"
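[Editor's note: get.info() above yields a one-row matrix for a single PMID. The same xpathApply idiom scales to several records fetched in one eFetch call, and a Year column can be pulled from //PubDate/Year, which gets back to the original journal-by-year question. This is an offline sketch assuming the CRAN XML package; the inline XML is an invented two-record stand-in for an eFetch result, stripped down to the handful of tags used here (real records carry many more fields).]

```r
library(XML)  # CRAN package, not part of base R

# Invented miniature of an eFetch PubmedArticleSet with two records.
xml <- '<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><PMID>111</PMID>
    <Article><Journal><Title>Pediatric nephrology</Title>
      <JournalIssue><PubDate><Year>2000</Year></PubDate></JournalIssue>
    </Journal></Article></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><PMID>222</PMID>
    <Article><Journal><Title>Laryngoscope</Title>
      <JournalIssue><PubDate><Year>2007</Year></PubDate></JournalIssue>
    </Journal></Article></MedlineCitation></PubmedArticle>
</PubmedArticleSet>'

doc <- xmlTreeParse(xml, asText = TRUE, useInternalNodes = TRUE)
# One xpathApply per column; nodes come back in document order, so the
# columns line up record by record.
df <- data.frame(
  Pmid    = unlist(xpathApply(doc, "//PMID", xmlValue)),
  Journal = unlist(xpathApply(doc, "//Journal/Title", xmlValue)),
  Year    = unlist(xpathApply(doc, "//PubDate/Year", xmlValue)),
  stringsAsFactors = FALSE)
free(doc)
df
```

Against the live service, multiple IDs go into one request as a comma-separated list (e.g. "id=11045395,18046565" in David's `request` string), and the rest of the code is unchanged.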