Hi,

I want to extract information from a number of text files in a folder. The files are named as: 82534.txt, 82555.txt, 8282787.txt etc.

I give below a sample of the kind of information in the text file:

########
#(a lot of preceding text)
2008-10-01 06:30:12 2 of 3
page

#(some lines of text - varies from file to file)
sekvens 890
# lines of text
sNo   start   stop   direction   value
1     70      85     up          60.2
3     60      90     down        71.5
#########

In each of the files that I choose, I want to first go to the appropriate page number. This is the first line in the above text, and the page number is 2 (from "2 of 3"). The date and time preceding the page number vary from file to file, but the next line always has the word "page".

After that, I am interested in the number following the word "sekvens", and also in the table underneath.

Finally, I want to collect all the data in a data frame with the following structure:

fileno   sekvens   sNo   start   stop   direction   value
82534    890       1     70      85     up          60.2
82534    890       3     60      90     down        71.5
82555    ..        ..    ..      ..     ..          ..

There are a number of topics involved here with which I have almost no familiarity. First, the use of regular expressions to specify the files that I want from a folder. Next, how do I locate a particular section (or page) in the text file from the description that I am interested in? Should these files be read in their entirety first, or is it possible to go directly to the section with the relevant text? Next, how do I extract the data in the form that I want?

I have identified the following commands that would be useful for me here: list.files(), readLines(), strsplit().

I would appreciate some help in getting started here. I would certainly benefit from a few hints. I would also appreciate it if I could get some links to references with examples showing how similar problems are tackled.

Thanking you,
Ravi
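For the first question (choosing files by regular expression), a minimal sketch follows. It assumes the files sit in one folder and are named with digits followed by ".txt". The folder `data_dir` and the extra file "notes.txt" are made up for the illustration; the `pattern` argument of list.files() is a regular expression, not a shell wildcard.

```r
# Assumption: data files are named <digits>.txt and live in a single folder.
# A temporary folder with empty files stands in for the real one.
data_dir <- file.path(tempdir(), "sekvens_demo")
dir.create(data_dir, showWarnings = FALSE)
invisible(file.create(file.path(
  data_dir, c("82534.txt", "82555.txt", "8282787.txt", "notes.txt")
)))

# ^ and $ anchor the match, [0-9]+ is one or more digits,
# and \\. is a literal dot.
files <- list.files(data_dir, pattern = "^[0-9]+\\.txt$", full.names = TRUE)

# Drop the folder and the extension to recover the file number.
fileno <- sub("\\.txt$", "", basename(files))
print(fileno)
```

Without the anchors, a pattern such as "[0-9]+" would also match names that merely contain a digit somewhere, so anchoring is usually the safer choice.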
jim holtman
2008-Nov-20 13:35 UTC
[R] extracting data from a list of unformatted text files
Here is a way to process the file. You will have to add the loop, error checking, piecing multiple files together, and determination of the end of the data:

> x <- "I give below a sample of the kind of the information in the text file :
+ ########
+ #(a lot of preceding text)
+ 2008-10-01 06:30:12 2 of 3
+ page
+
+ #(some lines of text - varies from file to file)
+ sekvens 890
+ # lines of text
+ sNo start stop direction value
+ 1 70 85 up 60.2
+ 3 60 90 down 71.5
+ #########
+
+ In each of the files that I choose, I want to first go to the appropriate page number. This is the first line in the above text and the page number is 2 (from 2 of 3). The date and time preceding the page number vary from file to file, but the next line always has the word, page.
+ After that, I am interested in the number following the word, sekvens. Also, the table underneath."
> input <- readLines(textConnection(x))
> closeAllConnections()
> # find 'page'
> pageNo <- grep("^page", input)
> # back up one line and look for "2 of"
> page2 <- grep("2 of ", input[pageNo - 1])
> # compute the start of the data and delete preceding data
> startData <- pageNo[page2]
> input <- tail(input, -startData)
> # find 'sekvens'
> sek.indx <- grep("^sekvens", input)
> # extract the number after it
> sek.value <- sub(".*?(\\d+).*", "\\1", input[sek.indx], perl=TRUE)
> # find start of table
> sNo.indx <- grep("sNo", input)
> # read the data (you did not say how to determine the end, so I will read the three lines)
> values <- read.table(textConnection(input[sNo.indx + (0:2)]), header=TRUE)
> closeAllConnections()
> sek.value
[1] "890"
> values
  sNo start stop direction value
1   1    70   85        up  60.2
2   3    60   90      down  71.5

On Thu, Nov 20, 2008 at 5:18 AM, ravi <rv15i at yahoo.se> wrote:
> Hi,
> I want to extract information from a number of text files in a folder. The files are named as : 82534.txt, 82555.txt, 8282787.txt etc.
> [...]

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
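The reply above leaves the loop over files, the end-of-table detection and the final assembly to the reader. One possible way to put those pieces together is sketched below; it is not from the thread. The helper `extract_page()` and the folder `demo_dir` are hypothetical names, and the sketch assumes that the table ends at the first line after the "sNo" header that does not start with a digit. Both demo files are written from the sample text in the question, so they produce identical rows.

```r
# Hypothetical helper: parse the lines of one file.
# Assumptions: the wanted page is marked by "<page> of <n>" on the line just
# before a line starting with "page"; the table ends at the first line after
# the "sNo" header that does not start with a digit.
extract_page <- function(lines, fileno, page = 2) {
  page_idx <- grep("^page", lines)
  page_idx <- page_idx[page_idx > 1]
  # \\b stops page 2 from matching "12 of 20"
  hit <- grepl(paste0("\\b", page, " of "), lines[page_idx - 1], perl = TRUE)
  if (!any(hit)) return(NULL)               # requested page not found
  lines <- tail(lines, -page_idx[hit][1])   # drop everything up to "page"

  sek_idx  <- grep("^sekvens", lines)[1]
  head_idx <- grep("^sNo", lines)[1]
  if (is.na(sek_idx) || is.na(head_idx)) return(NULL)
  sekvens <- as.numeric(sub("^sekvens\\s+(\\d+).*", "\\1",
                            lines[sek_idx], perl = TRUE))

  # Count the consecutive table rows that follow the header line.
  rest   <- lines[-seq_len(head_idx)]
  is_row <- grepl("^\\s*[0-9]", rest, perl = TRUE)
  n_rows <- if (all(is_row)) length(rest) else which(!is_row)[1] - 1
  if (n_rows == 0) return(NULL)

  tab <- read.table(text = lines[head_idx + 0:n_rows], header = TRUE,
                    stringsAsFactors = FALSE)
  data.frame(fileno = fileno, sekvens = sekvens, tab,
             stringsAsFactors = FALSE)
}

# Demo data: the sample from the question, written to two files.
sample_text <- c(
  "#(a lot of preceding text)",
  "2008-10-01 06:30:12 2 of 3",
  "page",
  "",
  "#(some lines of text - varies from file to file)",
  "sekvens 890",
  "# lines of text",
  "sNo start stop direction value",
  "1 70 85 up 60.2",
  "3 60 90 down 71.5",
  "#########"
)
demo_dir <- file.path(tempdir(), "sekvens_files")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(sample_text, file.path(demo_dir, "82534.txt"))
writeLines(sample_text, file.path(demo_dir, "82555.txt"))

# Loop over the files and stack the per-file pieces into one data frame.
files  <- list.files(demo_dir, pattern = "^[0-9]+\\.txt$", full.names = TRUE)
pieces <- lapply(files, function(f) {
  extract_page(readLines(f), fileno = sub("\\.txt$", "", basename(f)))
})
result <- do.call(rbind, pieces)
print(result)
```

Files in which the requested page, the "sekvens" line or the table is missing yield NULL, which rbind() simply skips. Reading each file whole with readLines() keeps the code simple; for very large files one could read in chunks instead, but that is not shown here.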