Sorry, I know I should read a little first about this, but I am actually just helping somebody really quick and need help too.

I want to grep the names of all the .txt files mentioned on this HTML web page:

http://www.epa.gov/emap/remap/html/three/data/index.html

Thanks ahead of time.

--
View this message in context: http://r.789695.n4.nabble.com/grep-txt-file-names-from-html-tp4648037.html
Sent from the R help mailing list archive at Nabble.com.
They're all of the form

http.*txt

but the best way to "grep" them (by which I assume you mean extract the file names from the page source) depends on what you plan to do with them and what sort of output you expect.

It isn't even clear whether you plan to do this in R.

Sarah

On Wed, Oct 31, 2012 at 12:56 PM, chuck.01 <CharlieTheBrown77@gmail.com> wrote:

> Sorry, I know I should read a little first about this, but I am actually just
> helping somebody really quick and need help too.
>
> I want to grep the names of all the .txt files mentioned on this HTML web
> page:
>
> http://www.epa.gov/emap/remap/html/three/data/index.html
>
> Thanks ahead of time.

--
Sarah Goslee
http://www.functionaldiversity.org
On Oct 31, 2012, at 9:56 AM, chuck.01 wrote:

> Sorry, I know I should read a little first about this, but I am actually just
> helping somebody really quick and need help too.
>
> I want to grep the names of all the .txt files mentioned on this HTML web
> page:
>
> http://www.epa.gov/emap/remap/html/three/data/index.html

This code identifies the lines in that page's source that contain a quoted URL ending in .txt (that is, '.txt' followed by the closing double quote):

> lines <- readLines(con = url("http://www.epa.gov/emap/remap/html/three/data/index.html"))
Warning message:
In readLines(con = url("http://www.epa.gov/emap/remap/html/three/data/index.html")) :
  incomplete final line found on 'http://www.epa.gov/emap/remap/html/three/data/index.html'

# You can generally ignore that warning.

# The character class includes digits because the paths contain them (e.g. 9396).
> length(grep('"http://[./A-Za-z0-9]+\\.txt"', lines))
[1] 11

It should be fairly straightforward to remove the preceding and trailing material:

> sub('^.*"(http://[./A-Za-z0-9]+\\.txt)".*$', "\\1", lines[grep('"http://[./A-Za-z0-9]+\\.txt"', lines)])
 [1] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/benthic/benmet.txt"
 [2] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/benthic/bencnt.txt"
 [3] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/location/watchr.txt"
 [4] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/location/habbest.txt"
 [5] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/design/sdesign.txt"
 [6] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/wchem/chmval.txt"
 [7] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/fish/fshmet.txt"
 [8] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/fish/fshcnt.txt"
 [9] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/fish/fshnam.txt"
[10] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/tissue/ftmet.txt"
[11] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/tissue/ftorg.txt"

> Thanks ahead of time.
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius, MD
Alameda, CA, USA
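As an alternative to grep() followed by sub(), regmatches() combined with gregexpr() returns the matched substrings directly and copes with more than one link on a line. The following is only a sketch under stated assumptions: `html_lines` is made-up sample input standing in for the result of readLines(), and the real page's markup may well differ.

```r
# Hypothetical sample input standing in for readLines() on the page;
# the actual markup of the page is an assumption.
html_lines <- c(
  '<a href="http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/benthic/benmet.txt">Metrics</a>',
  '<p>No links on this line</p>',
  paste0(
    '<a href="http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/fish/fshmet.txt">A</a> ',
    '<a href="http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/fish/fshcnt.txt">B</a>'
  )
)

# A URL is taken to be "http://" followed by a run of characters that are
# neither a double quote nor whitespace, ending in ".txt".
pattern <- 'http://[^"[:space:]]+\\.txt'

# gregexpr() locates every match on every line; regmatches() extracts them
# as a list with one element per line, which unlist() flattens to a vector.
links <- unlist(regmatches(html_lines, gregexpr(pattern, html_lines)))
print(links)
```

Excluding the double quote from the character class keeps a match from running across two links on the same line, which a greedy pattern such as http.*txt would do.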
Sorry Sarah. I want to store them as a vector for use later, similar to this:

links <- c("http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/benthic/benmet.txt",
           "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/location/watchr.txt",
           "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/wchem/chmval.txt")

Sarah Goslee wrote
> They're all of the form
>
> http.*txt
>
> but the best way to "grep" them (by which I assume you mean extract the
> file names from the page source) depends on what you plan to do with them
> and what sort of output you expect.
>
> It isn't even clear whether you plan to do this in R.
>
> Sarah
>
> On Wed, Oct 31, 2012 at 12:56 PM, chuck.01 <CharlieTheBrown77@> wrote:
>
>> Sorry, I know I should read a little first about this, but I am actually
>> just helping somebody really quick and need help too.
>>
>> I want to grep the names of all the .txt files mentioned on this HTML web
>> page:
>>
>> http://www.epa.gov/emap/remap/html/three/data/index.html
>>
>> Thanks ahead of time.
>
> --
> Sarah Goslee
> http://www.functionaldiversity.org

--
View this message in context: http://r.789695.n4.nabble.com/grep-txt-file-names-from-html-tp4648037p4648043.html
Sent from the R help mailing list archive at Nabble.com.
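Once the links are in a character vector like the one in this message, the bare file names can be derived with basename(), which works on URL strings as well as on file paths. A minimal sketch using the three URLs given above; the variable names `file_names` and `stems` are illustrative, not from the thread.

```r
links <- c(
  "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/benthic/benmet.txt",
  "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/location/watchr.txt",
  "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/wchem/chmval.txt"
)

# basename() drops everything up to and including the last "/".
file_names <- basename(links)

# tools::file_path_sans_ext() removes the ".txt" extension.
stems <- tools::file_path_sans_ext(file_names)

# Naming the vector lets a link be looked up by its short name later.
names(links) <- stems

print(file_names)
print(links[["chmval"]])
```

A later step could pass each element to download.file() or read.table(); the right reader arguments depend on the layout of the files, which this thread does not describe.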