thr3ads.net - R help - [R] Downloading a directory of text files into R [Jul 2023]

Rui Barradas

2023-Jul-26 05:52 UTC

[R] Downloading a directory of text files into R

At 23:06 on 25/07/2023, Bob Green wrote:
> Hello,
> 
> I am seeking advice as to how I can download the 833 files from this 
> site:"http://home.brisnet.org.au/~bgreen/Data/"
> 
> I want to be able to download them to perform a textual analysis.
> 
> If the 833 files, which are in a Directory with two subfolders were on 
> my computer I could read them through readtext. Using readtext I get the 
> error:
> 
>  > x = readtext("http://home.brisnet.org.au/~bgreen/Data/*")
> Error in download_remote(file, ignore_missing, cache, verbosity) :
>   Remote URL does not end in known extension. Please download the file
> manually.
> 
>  > x = readtext("http://home.brisnet.org.au/~bgreen/Data/Dir/()")
> Error in download_remote(file, ignore_missing, cache, verbosity) :
>   Remote URL does not end in known extension. Please download the file
> manually.
> 
> Any suggestions are appreciated.
> 
> Bob
> 
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

Hello,

The following code downloads all files in the posted link.



suppressPackageStartupMessages({
   library(rvest)
})

# destination directory, change this at will
dest_dir <- "~/Temp"

# first get the two subfolders from the Data webpage
link <- "http://home.brisnet.org.au/~bgreen/Data/"
page <- read_html(link)
page %>%
   html_elements("a") %>%
   html_text() %>%
   grep("/$", ., value = TRUE) -> sub_folder

# create relevant disk sub-directories, if
# they do not exist yet; recursive = TRUE also
# creates dest_dir itself when it is missing
for(subf in sub_folder) {
   d <- file.path(dest_dir, subf)
   if(!dir.exists(d)) {
     success <- dir.create(d, recursive = TRUE)
     msg <- paste("created directory", d, "-", success)
     message(msg)
   }
}

# prepare to download the files
dest_dir <- file.path(dest_dir, sub_folder)
source_url <- paste0(link, sub_folder)

success <- mapply(\(src, dest) {
   # read each Data subfolder
   # and get the file names therein
   # then lapply 'download.file' to each filename
   pg <- read_html(src)
   pg %>%
     html_elements("a") %>%
     html_text() %>%
     grep("\\.txt$", ., value = TRUE) %>%
     lapply(\(x) {
       s <- paste0(src, x)
       d <- file.path(dest, x)
       tryCatch(
         download.file(url = s, destfile = d),
         warning = function(w) w,
         error = function(e) e
       )
     })
}, source_url, dest_dir)

lengths(success)
# http://home.brisnet.org.au/~bgreen/Data/Hanson1/
#                                               84
# http://home.brisnet.org.au/~bgreen/Data/Hanson2/
#                                              749

# matches the question's number
sum(lengths(success))
# [1] 833
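
Note that lengths(success) counts download attempts rather than successful downloads: when a download fails, the tryCatch() above returns the warning or error object in place of the status code. A minimal sketch for telling the two apart follows. The helper names is_ok() and count_ok() are hypothetical, and the mock results list only stands in for the real success object; the sketch relies on download.file() returning 0 on success.

```r
# TRUE only for a plain status code of 0, which is what
# download.file() returns on success; condition objects
# captured by tryCatch() are lists, so they give FALSE
is_ok <- function(res) {
  is.numeric(res) && length(res) == 1L && res == 0
}

# number of successful downloads per subfolder
count_ok <- function(results) {
  vapply(results, function(r) sum(vapply(r, is_ok, logical(1))), integer(1))
}

# mock results with the same shape as 'success'
# (folder names and messages are made up for illustration)
results <- list(
  folderA = list(0L, simpleError("cannot open URL")),
  folderB = list(0L, 0L, simpleWarning("download had nonzero exit status"))
)

ok <- count_ok(results)
attempted <- lengths(results)
failed <- attempted - ok

# one column per subfolder: attempts, successes, failures
print(rbind(attempted, ok, failed))
```

Calling count_ok(success) on the real object should then show whether all 833 attempts actually succeeded.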



Hope this helps,

Rui Barradas

Bob Green

2023-Jul-26 06:12 UTC

[R] Downloading a directory of text files into R

Rui,

Many thanks for your reply and code. I was not expecting that so much
work would be required. It worked perfectly.

The only thing I needed to do was create a Temp folder in the Documents folder.

Thanks again,


Bob
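
Once the files are on disk, they can be read for the textual analysis; as the original question notes, readtext can do this for local files. The sketch below is a base-R alternative that builds a data frame with one row per file. The helper name read_txt_dir() and the doc_id/text column names are illustrative choices, not part of the thread, and the demo writes two small files to a temporary directory instead of relying on ~/Temp.

```r
# read every .txt file below 'dir' (including subfolders)
# into a data frame with one row per file
read_txt_dir <- function(dir) {
  files <- list.files(dir, pattern = "\\.txt$", recursive = TRUE,
                      full.names = TRUE)
  texts <- vapply(files, function(f) {
    # collapse the lines of each file into a single string
    paste(readLines(f, warn = FALSE), collapse = "\n")
  }, character(1), USE.NAMES = FALSE)
  data.frame(doc_id = basename(files), text = texts,
             stringsAsFactors = FALSE)
}

# demo on a throwaway directory with two subfolders
demo_dir <- file.path(tempdir(), "demo_texts")
dir.create(file.path(demo_dir, "sub1"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(demo_dir, "sub2"), recursive = TRUE, showWarnings = FALSE)
writeLines(c("first line", "second line"), file.path(demo_dir, "sub1", "a.txt"))
writeLines("only line", file.path(demo_dir, "sub2", "b.txt"))

docs <- read_txt_dir(demo_dir)
print(docs$doc_id)
print(nchar(docs$text))
```

For the downloaded data, the call would be read_txt_dir("~/Temp"), assuming the destination directory was left unchanged.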

