thr3ads.net - R help - [R] grep txt file names from html [Oct 2012]


chuck.01

2012-Oct-31 16:56 UTC

[R] grep txt file names from html

Sorry, I know I should read up on this first, but I am just helping
somebody out quickly and need help myself.

I want to grep the names of all the .txt files mentioned on this HTML web
page:

http://www.epa.gov/emap/remap/html/three/data/index.html

Thanks ahead of time.



--
View this message in context:
http://r.789695.n4.nabble.com/grep-txt-file-names-from-html-tp4648037.html
Sent from the R help mailing list archive at Nabble.com.

Sarah Goslee

2012-Oct-31 17:11 UTC


They're all of the form

http.*txt

but the best way to "grep" them (by which I assume you mean extract the
file names from the page source) depends on what you plan to do with them,
and what sort of output you expect.

It isn't even clear whether you plan to do this in R.

Sarah

Sarah Goslee
http://www.functionaldiversity.org
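
A minimal sketch of how a pattern of that form behaves in R, assuming the page source is already held as a character vector. The `src` lines below are made-up stand-ins, not the contents of the real page. Because `.*` is greedy, a negated character class is the safer choice when a single line holds more than one link.

```r
# Hypothetical stand-in for the page source (not the real EPA page)
src <- c('<a href="http://example.org/data/a.txt">a</a> <a href="http://example.org/data/b.txt">b</a>',
         '<p>No links on this line</p>')

# grepl() flags the lines that contain something of the form http...txt
has_txt <- grepl("http.*txt", src)

# Greedy: a single match running from the first "http" to the last "txt" on the line
greedy <- regmatches(src, gregexpr("http.*txt", src))[[1]]

# Negated class: cannot cross a double quote, so each URL is matched separately
each <- regmatches(src, gregexpr('http[^"]*\\.txt', src))[[1]]

print(has_txt)
print(length(greedy))
print(each)
```

Which variant is appropriate depends on how the links are laid out in the source, which is one reason the intended use matters.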


David Winsemius

2012-Oct-31 17:16 UTC



The code below identifies the lines in the page's source that contain URLs
ending in '.txt"' (that is, .txt followed by the closing double quote):

> lines <- readLines(con=url("http://www.epa.gov/emap/remap/html/three/data/index.html"))
Warning message:
In readLines(con = url("http://www.epa.gov/emap/remap/html/three/data/index.html")) :
  incomplete final line found on 'http://www.epa.gov/emap/remap/html/three/data/index.html'
# You can generally ignore that warning.
> length(grep('\\"http://([./A-Za-z]){1+}\\.txt"', lines) )
[1] 11

It should be fairly straightforward to remove the preceding and trailing material:

> sub('(^.*\\")(http://([./A-Za-z]){1+}\\.txt)(".*$)', "\\2",
+     lines[ grep('\\"http://([./A-Za-z]){1+}\\.txt"', lines) ] )
 [1] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/benthic/benmet.txt"
 [2] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/benthic/bencnt.txt"
 [3] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/location/watchr.txt"
 [4] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/location/habbest.txt"
 [5] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/design/sdesign.txt"
 [6] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/wchem/chmval.txt"
 [7] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/fish/fshmet.txt"
 [8] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/fish/fshcnt.txt"
 [9] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/fish/fshnam.txt"
[10] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/tissue/ftmet.txt"
[11] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/tissue/ftorg.txt"
David Winsemius, MD
Alameda, CA, USA
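
As a hedged alternative to the pattern used above: `{1+}` is not standard interval syntax (POSIX intervals are `{n}`, `{n,}` and `{n,m}`), and the class `[./A-Za-z]` does not list digits even though the paths contain `9396`. A more conventional sketch matches everything up to the closing quote and pulls the URLs out with `regmatches()`. The `lines` vector here is a small made-up sample in the style of the page, so the snippet runs without network access, and `extract_txt_urls` is a hypothetical helper name.

```r
# Hypothetical helper: return every double-quoted http URL that ends in .txt
extract_txt_urls <- function(lines) {
  # An opening quote, "http://", a run of non-quote characters, then ".txt" and the closing quote
  pattern <- '"http://[^"]+\\.txt"'
  hits <- unlist(regmatches(lines, gregexpr(pattern, lines)))
  # Strip the surrounding double quotes
  gsub('"', "", hits, fixed = TRUE)
}

# Made-up sample lines (the real input would come from readLines() on the page)
lines <- c(
  '<td><a href="http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/benthic/benmet.txt">benmet</a></td>',
  '<td><a href="http://www.epa.gov/emap/html/data/index.html">index</a></td>',
  '<td><a href="http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/fish/fshcnt.txt">fshcnt</a></td>'
)

links <- extract_txt_urls(lines)
print(links)
```

Using `gregexpr()` rather than `sub()` also copes with lines that hold more than one link, since every match on a line is returned.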

chuck.01

2012-Oct-31 17:21 UTC


Sorry, Sarah.
I want to store them as a vector for use later,

something similar to this:

links <- c("http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/benthic/benmet.txt",
           "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/location/watchr.txt",
           "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/wchem/chmval.txt")
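
Once the URLs are in a character vector like this, a rough sketch of "use later" might look as follows. `basename()` gives just the file names. The commented-out download step is only an assumption about what comes next, and the format of the files is not known from this thread.

```r
links <- c(
  "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/benthic/benmet.txt",
  "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/location/watchr.txt",
  "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/wchem/chmval.txt"
)

# Just the file names, without the directory part of each URL
file_names <- basename(links)
print(file_names)

# Name the vector so each URL can be looked up by its file name
names(links) <- file_names
print(links[["chmval.txt"]])

# Assumed next step (not run here): fetch each file into the working directory
# for (u in links) download.file(u, destfile = basename(u))
```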




