I would like to track in which journals articles about a particular disease are being published. Creating a PubMed search is trivial. The search provides data, but obviously not as an R data frame. I can get the search to export the data as an XML feed, and the XML package seems to be able to read it:

xmlTreeParse("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/erss.cgi?rss_guid=0_JYbpsax0ZAAPnOd7nFAX-29fXDpTk5t8M4hx9ytT-",
             isURL = TRUE)

But getting from there to a data frame in which one column would be the name of the journal and another column would be the year (to keep things simple) seems to be beyond my capabilities.

Has anyone ever done this, and could you share your script? Are there any published examples where the end result is a data frame?

I guess what I am looking for is an easy and simple way to parse the feed and extract the data. Alternatively, how does one turn an RSS feed into a CSV file?

--
Farrel Buchinsky
GrandCentral Tel: (412) 567-7870
On Dec 13, 2007, at 9:03 PM, Farrel Buchinsky wrote:

> I would like to track in which journals articles about a particular disease
> are being published. Creating a PubMed search is trivial. The search
> provides data, but obviously not as an R data frame. I can get the search
> to export the data as an XML feed, and the XML package seems to be able to
> read it:
>
> xmlTreeParse("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/erss.cgi?rss_guid=0_JYbpsax0ZAAPnOd7nFAX-29fXDpTk5t8M4hx9ytT-",
>              isURL = TRUE)
>
> But getting from there to a data frame in which one column would be the
> name of the journal and another column would be the year (to keep things
> simple) seems to be beyond my capabilities.

If you're comfortable with Python (or Perl, Ruby, etc.), it'd be easier to just extract the required stuff from the raw feed; using ElementTree in Python makes this a trivial task. Once you have the raw data you can read it into R.

-------------------------------------------------------------------
Rajarshi Guha <rguha at indiana.edu>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
A committee is a group that keeps the minutes and loses hours.
    -- Milton Berle
On Dec 13, 2007 9:03 PM, Farrel Buchinsky <fjbuch at gmail.com> wrote:

> I would like to track in which journals articles about a particular disease
> are being published. Creating a PubMed search is trivial. The search
> provides data, but obviously not as an R data frame. I can get the search
> to export the data as an XML feed, and the XML package seems to be able to
> read it:
>
> xmlTreeParse("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/erss.cgi?rss_guid=0_JYbpsax0ZAAPnOd7nFAX-29fXDpTk5t8M4hx9ytT-",
>              isURL = TRUE)
>
> But getting from there to a data frame in which one column would be the
> name of the journal and another column would be the year (to keep things
> simple) seems to be beyond my capabilities.
>
> Has anyone ever done this, and could you share your script? Are there any
> published examples where the end result is a data frame?
>
> I guess what I am looking for is an easy and simple way to parse the feed
> and extract the data. Alternatively, how does one turn an RSS feed into a
> CSV file?

Try this:

library(XML)
doc <- xmlTreeParse("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/erss.cgi?rss_guid=0_JYbpsax0ZAAPnOd7nFAX-29fXDpTk5t8M4hx9ytT-",
                    isURL = TRUE, useInternalNodes = TRUE)
sapply(c("//author", "//category"), xpathApply, doc = doc, fun = xmlValue)
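[Editor's note: the sapply/xpathApply line above pulls the node values out of the feed but stops short of the journal-by-year data frame the original question asked for. The following is a minimal sketch of that last step, assuming the CRAN XML package is installed; it runs against an invented two-item stand-in for the feed so it works offline, and the real PubMed RSS will of course carry different tag contents.]

```r
library(XML)  # CRAN package, not part of base R

# Invented stand-in for the PubMed RSS feed: two <item>s, where <category>
# holds the journal name and <pubDate> a date string containing the year.
rss <- '<rss><channel>
  <item><category>Laryngoscope</category>
        <pubDate>Mon, 12 Nov 2007 00:00:00 GMT</pubDate></item>
  <item><category>J Virol</category>
        <pubDate>Tue, 03 Apr 2006 00:00:00 GMT</pubDate></item>
</channel></rss>'

doc <- xmlTreeParse(rss, asText = TRUE, useInternalNodes = TRUE)
journal <- unlist(xpathApply(doc, "//item/category", xmlValue))
# Extract the single 4-digit run (the year) from each pubDate string.
year    <- sub(".*([0-9]{4}).*", "\\1",
               unlist(xpathApply(doc, "//item/pubDate", xmlValue)))
free(doc)

df <- data.frame(Journal = journal, Year = year, stringsAsFactors = FALSE)
df
# From here, write.csv(df, "journals.csv") answers the CSV question too.
```

For the live feed, replace the inline `rss` string with the `isURL = TRUE` call shown above; the two xpathApply lines stay the same.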
or just try looking in the annotate package from Bioconductor

Gabor Grothendieck wrote:
> On Dec 13, 2007 9:03 PM, Farrel Buchinsky <fjbuch at gmail.com> wrote:
>> I would like to track in which journals articles about a particular
>> disease are being published. Creating a PubMed search is trivial. The
>> search provides data, but obviously not as an R data frame. I can get the
>> search to export the data as an XML feed, and the XML package seems to be
>> able to read it:
>>
>> xmlTreeParse("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/erss.cgi?rss_guid=0_JYbpsax0ZAAPnOd7nFAX-29fXDpTk5t8M4hx9ytT-",
>>              isURL = TRUE)
>>
>> But getting from there to a data frame in which one column would be the
>> name of the journal and another column would be the year (to keep things
>> simple) seems to be beyond my capabilities.
>>
>> Has anyone ever done this, and could you share your script? Are there any
>> published examples where the end result is a data frame?
>>
>> I guess what I am looking for is an easy and simple way to parse the feed
>> and extract the data. Alternatively, how does one turn an RSS feed into a
>> CSV file?
>
> Try this:
>
> library(XML)
> doc <- xmlTreeParse("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/erss.cgi?rss_guid=0_JYbpsax0ZAAPnOd7nFAX-29fXDpTk5t8M4hx9ytT-",
>                     isURL = TRUE, useInternalNodes = TRUE)
> sapply(c("//author", "//category"), xpathApply, doc = doc, fun = xmlValue)

--
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org
On Dec 13, 2007 11:35 PM, Robert Gentleman <rgentlem at fhcrc.org> wrote:
> or just try looking in the annotate package from Bioconductor

Yip. annotate seems to be the most streamlined way to do this.

1) How does one turn the list that is created into a data frame whose column names are along the lines of date, title, journal, authors, etc.?

2) I have already created a standing search in PubMed using My NCBI. There are many ways I could feed those results to the pubmed() function. The most brute-force way is to run the search, output the data as a UI List, and paste that into the pubmed() call. A way with more finesse would be to create an RSS feed based on my search and then give the RSS feed URL to the pubmed() function. Or perhaps one could just plop the query inside the pubmed() call:

pubmed(somefunction('("Laryngeal Neoplasms"[MeSH] AND "Papilloma"[MeSH]) OR
    ((("recurrence"[TIAB] NOT Medline[SB]) OR "recurrence"[MeSH Terms] OR
    recurrent[Text Word]) AND respiratory[All Fields] AND
    (("papilloma"[TIAB] NOT Medline[SB]) OR "papilloma"[MeSH Terms] OR
    papillomatosis[Text Word]))'))

Does "somefunction" exist?

If there are any further questions, do you think I should migrate this conversation to the Bioconductor mailing list?

Farrel Buchinsky
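[Editor's note: as far as I can tell no ready-made "somefunction" exists, but NCBI's ESearch utility does exactly this job: it turns a PubMed query string into a list of PMIDs, which can then be handed to annotate's pubmed(). Below is a hedged sketch assuming the CRAN XML package; the URL layout follows the eutils endpoints already used in this thread, and `esearch.url`/`pmids.for` are names invented here, not part of any package.]

```r
library(XML)  # CRAN package, not part of base R

# Build an ESearch URL for an arbitrary PubMed query string.
# retmax caps the number of IDs returned; URLencode escapes quotes,
# brackets, and spaces in the query.
esearch.url <- function(query, retmax = 100) {
  paste("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?",
        "db=pubmed&retmax=", retmax,
        "&term=", URLencode(query, reserved = TRUE),
        sep = "")
}

# Fetch the IDs for a query (requires network access).
# ESearch replies with <eSearchResult><IdList><Id>...</Id>...</IdList>...
pmids.for <- function(query) {
  doc <- xmlTreeParse(esearch.url(query), isURL = TRUE,
                      useInternalNodes = TRUE)
  on.exit(free(doc))
  unlist(xpathApply(doc, "//IdList/Id", xmlValue))
}

# Intended use (not run here, needs the network and Bioconductor):
# ids <- pmids.for('"Laryngeal Neoplasms"[MeSH] AND "Papilloma"[MeSH]')
# library(annotate)
# pubmed(ids)
```

So "somefunction" would be pmids.for(), with the complex query passed as a single quoted string.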
On Dec 16, 2007 2:53 PM, David Winsemius <dwinsemius at comcast.net> wrote:
> # in the debugging phase I needed to set useInternalNodes = TRUE to see
> # the tags. Never did find a way to "print" them when internal.

I assume you mean FALSE. See:

?saveXML
David Winsemius wrote:
> On 15 Dec 2007, you wrote in gmane.comp.lang.r.general:
>
>> If we can assume that the abstract is always the 4th paragraph then we
>> can try something like this:
>>
>> library(XML)
>> doc <- xmlTreeParse("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/erss.cgi?rss_guid=0_JYbpsax0ZAAPnOd7nFAX-29fXDpTk5t8M4hx9ytT-",
>>                     isURL = TRUE, useInternalNodes = TRUE, trim = TRUE)
>>
>> out <- cbind(
>>     Author = unlist(xpathApply(doc, "//author", xmlValue)),
>>     PMID = gsub(".*:", "", unlist(xpathApply(doc, "//guid", xmlValue))),
>>     Abstract = unlist(xpathApply(doc, "//description",
>>         function(x) {
>>             on.exit(free(doc2))
>>             doc2 <- htmlTreeParse(xmlValue(x)[[1]], asText = TRUE,
>>                                   useInternalNodes = TRUE, trim = TRUE)
>>             xpathApply(doc2, "//p[4]", xmlValue)
>>         }
>>     )))
>> free(doc)
>> substring(out, 1, 25) # display first 25 chars of each field
>>
>> The last line produces (it may look messed up in this email):
>>
>>> substring(out, 1, 25) # display it
>>      Author                      PMID       Abstract
>> [1,] " Goon P, Sonnex C, Jani P" "18046565" "Human papillomaviruses (H"
>> [2,] " Rad MH, Alizadeh E, Ilkh" "17978930" "Recurrent laryngeal papil"
>> [3,] " Lee LA, Cheng AJ, Fang T" "17975511" "OBJECTIVES:: Papillomas o"
>> [4,] " Gerein V, Schmandt S, Ba" "17935912" "BACKGROUND: Human papillo"
>> [snip]
>
> It looked beautifully regular in my newsreader. It is helpful to see an
> example showing the indexed access to nodes. It was also helpful to see
> the example of substring for column display. Thank you (for this and all
> of your other contributions).
>
> I find upon further browsing that the pmfetch access point is obsolete.
> Experimentation with the PubMed eFetch server access point results in
> fully XML-tagged results:
>
> e.fetch.doc <- function() {
>     fetch.stem <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?"
>     src.mode <- "db=pubmed&retmode=xml&"
>     request <- "id=11045395"
>     doc <- xmlTreeParse(paste(fetch.stem, src.mode, request, sep = ""),
>                         isURL = TRUE, useInternalNodes = TRUE)
> }
> # in the debugging phase I needed to set useInternalNodes = TRUE to see
> # the tags. Never did find a way to "print" them when internal.

saveXML(node) will return a string giving the XML content of that node as a tree.

> doc <- e.fetch.doc()
> get.info <- function(doc) {
>     df <- cbind(
>         Abstract = unlist(xpathApply(doc, "//AbstractText", xmlValue)),
>         Journal = unlist(xpathApply(doc, "//Title", xmlValue)),
>         Pmid = unlist(xpathApply(doc, "//PMID", xmlValue))
>     )
>     return(df)
> }
>
> # this works
> > substring(get.info(doc), 1, 25)
>      Abstract                    Journal                     Pmid
> [1,] "We studied the prevalence" "Pediatric nephrology (Ber" "11045395"
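[Editor's note: get.info() above yields a one-row matrix for a single PMID. The same xpathApply idiom scales to several records fetched in one eFetch call, and a Year column can be pulled from //PubDate/Year, which gets back to the original journal-by-year question. This is an offline sketch assuming the CRAN XML package; the inline XML is an invented two-record stand-in for an eFetch result, stripped down to the handful of tags used here (real records carry many more fields).]

```r
library(XML)  # CRAN package, not part of base R

# Invented miniature of an eFetch PubmedArticleSet with two records.
xml <- '<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><PMID>111</PMID>
    <Article><Journal><Title>Pediatric nephrology</Title>
      <JournalIssue><PubDate><Year>2000</Year></PubDate></JournalIssue>
    </Journal></Article></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><PMID>222</PMID>
    <Article><Journal><Title>Laryngoscope</Title>
      <JournalIssue><PubDate><Year>2007</Year></PubDate></JournalIssue>
    </Journal></Article></MedlineCitation></PubmedArticle>
</PubmedArticleSet>'

doc <- xmlTreeParse(xml, asText = TRUE, useInternalNodes = TRUE)
# One xpathApply per column; nodes come back in document order, so the
# columns line up record by record.
df <- data.frame(
  Pmid    = unlist(xpathApply(doc, "//PMID", xmlValue)),
  Journal = unlist(xpathApply(doc, "//Journal/Title", xmlValue)),
  Year    = unlist(xpathApply(doc, "//PubDate/Year", xmlValue)),
  stringsAsFactors = FALSE)
free(doc)
df
```

Against the live service, multiple IDs go into one request as a comma-separated list (e.g. "id=11045395,18046565" in David's `request` string), and the rest of the code is unchanged.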