Hi,
There are many occurrences of the CIK number in the page source. This pulls
out the first node containing it:
node <- getNodeSet(doc[[1]], "//link[@rel='alternate']" )
From there you can extract the number. Here's one way to do it:
strsplit(strsplit(unlist(node)[[5]], "CIK=")[[1]][2],
"&type")[[1]][1]
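
If the positional indexing above feels fragile, the same idea can be sketched with a regular expression in base R. The href string below is a made-up sample that I am assuming resembles the attribute on that link node; it is not copied from the live page, so adjust the pattern if the real value differs.

```r
# Hypothetical href value (an assumption, not taken from the live page).
href <- "/cgi-bin/browse-edgar?action=getcompany&CIK=0000789019&type=&owner=exclude&count=40"

# Keep only the run of digits that follows "CIK=".
cik <- sub(".*CIK=([0-9]+).*", "\\1", href)
cik
```

With the XML package loaded, something like xmlGetAttr(node[[1]], "href") should give you the string to feed into sub(), assuming node[[1]] is the link node returned by getNodeSet().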
Jeff
On Wed, Aug 14, 2013 at 1:34 PM, Sparks, John James <jspark4@uic.edu>
wrote:
> Dear R Helpers,
>
> I would like to pull the CIK number from the web page
>
> http://www.sec.gov/cgi-bin/browse-edgar?CIK=MSFT&Find=Search&owner=exclude&action=getcompany
>
> If you put this web page into your browser you will see the CIK number in
> red on the left side of the page near the top.
>
> When I try the basic
> require(scrapeR)
> require(XML)
> require(RCurl)
> doc <- htmlTreeParse("http://www.sec.gov/cgi-bin/browse-edgar?CIK=MSFT&Find=Search&owner=exclude&action=getcompany")
> str(doc)
>
> I get a large number of items in the data frame that I don't know how to
> interpret. Both
> tables <- readHTMLTable(doc)
>
> and
>
> list<-xmlToList(doc)
>
> result in errors.
>
> Any (positive) guidance would be much appreciated.
>
> --John J. Sparks, Ph.D.