thr3ads.net - R help - [R] extracting data from a list of unformatted text files [Nov 2008]


ravi

2008-Nov-20 10:18 UTC

[R] extracting data from a list of unformatted text files

Hi,
I want to extract information from a number of text files in a folder. The files
are named as : 82534.txt, 82555.txt, 8282787.txt etc.

I give below a sample of the kind of the information in the text file :
########
#(a lot of preceding text)
2008-10-01      06:30:12                2 of 3
page

#(some lines of text - varies from file to file)
sekvens    890
# lines of text
sNo     start    stop    direction    value
1       70       85      up           60.2
3       60       90      down         71.5
#########

In each of the files that I choose, I want to first go to the appropriate page
number. This is the first line in the above text and the page number is 2 (from
2 of 3). The date and time preceding the page number vary from file to file, but
the next line always has the word, page.
After that, I am interested in the number following the word, sekvens. Also, the
table underneath.

Finally, I want to collect all the data in a data frame with the following
structure :

fileno   sekvens   sNo   start   stop   direction   value
82534    890       1     70      85     up          60.2
82534    890       3     60      90     down        71.5
82555    ..        ..    ..      ..     ..          ..

There are a number of topics involved here where I have almost no familiarity.
First, the use of regular expressions to specify the files that I want from a
folder. Next, how do I locate a particular section (or page) in the text file
from the description that I am interested in? Should these files be read in
their entirety first, or is it possible to go directly to the section with the
relevant text? Next, how do I extract the data in the form that I want?

I have identified the following commands that would be useful for me here :
list.files(), readLines(), strsplit().
I would appreciate some help in getting started here. I would certainly benefit
from a few hints. I would also appreciate it if I could get some links to
references with examples showing how similar problems are tackled.
Thanking you,
Ravi

jim holtman

2008-Nov-20 13:35 UTC


[R] extracting data from a list of unformatted text files

Here is a way to process the file.  You will have to add the loop,
error checking, piecing multiple files together, and determination of
the end of the data:

> x <- "I give below a sample of the kind of the information in the text file :
+ ########
+ #(a lot of preceding text)
+ 2008-10-01      06:30:12                2 of 3
+ page
+
+ #(some lines of text - varies from file to file)
+ sekvens    890
+ # lines of text
+ sNo     start            stop            direction        value
+ 1        70                85                up                60.2
+ 3        60                90                down            71.5
+ #########
+
+ In each of the files that I choose, I want to first go to the appropriate page number. This is the first line in the above text and the page number is 2 (from 2 of 3). The date and time preceding the page number vary from file to file, but the next line always has the word, page.
+ After that, I am interested in the number following the word, sekvens. Also, the table underneath."
> input <- readLines(textConnection(x))
> closeAllConnections()
> # find 'page'
> pageNo <- grep("^page", input)
> # back up one line and look for "2 of"
> page2 <- grep("2 of ", input[pageNo - 1])
> # compute the start of the data and delete preceding data
> startData <- pageNo[page2]
> input <- tail(input, -startData)
> # find 'sekvens'
> sek.indx <- grep("^sekvens", input)
> # extract number after
> sek.value <- sub(".*?(\\d+).*", "\\1",
+     input[sek.indx], perl=TRUE)
> # find start of table
> sNo.indx <- grep("sNo", input)
> # read the data (you did not say how to determine the end, so I will
> # read the three lines: the header plus two rows)
> values <- read.table(textConnection(input[sNo.indx + (0:2)]),
+     header=TRUE)
> closeAllConnections()
> sek.value
[1] "890"
> values
  sNo start stop direction value
1   1    70   85        up  60.2
2   3    60   90      down  71.5
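
On the question of picking the files with a regular expression, here is a minimal sketch. It assumes the files of interest are exactly those whose names consist of digits followed by .txt. The temporary folder and the extra file notes.txt are made up for the demonstration and are not part of the original problem.

```r
# Hypothetical demo folder; in practice set 'folder' to the real directory.
folder <- tempfile("txtdemo")
dir.create(folder)
file.create(file.path(folder,
    c("82534.txt", "82555.txt", "8282787.txt", "notes.txt")))

# 'pattern' is a regular expression, not a shell wildcard:
# ^ and $ anchor the match to the whole name, [0-9]+ means one or more
# digits, and \\. is a literal dot.
files <- list.files(folder, pattern = "^[0-9]+\\.txt$", full.names = TRUE)

# The file number is the base name with the extension removed.
fileno <- sub("\\.txt$", "", basename(files))
fileno
```

Using full.names = TRUE returns paths that can be passed straight to readLines(), and notes.txt is skipped because its name does not match the pattern.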

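The remaining pieces (the loop, some error checking, and finding the end of the table) might be put together along the following lines. This is only a sketch under assumptions that the thread does not confirm: the table is taken to end at the first line after the header that does not begin with a digit, and a file lacking any of the expected markers is simply skipped. The function name extract_one and the demo files (both filled with the sample text from the question) are hypothetical.

```r
# Parse one file; returns a data frame, or NULL if a marker is missing.
extract_one <- function(path, page = 2) {
  input <- readLines(path)
  # lines starting with 'page' whose previous line contains "<page> of "
  pageNo <- grep("^page", input)
  pageNo <- pageNo[pageNo > 1]
  hit <- pageNo[grepl(paste0("\\b", page, " of "), input[pageNo - 1],
                      perl = TRUE)]
  if (length(hit) == 0) return(NULL)
  input <- tail(input, -hit[1])          # drop everything up to 'page'
  sek.indx <- grep("^sekvens", input)
  sNo.indx <- grep("^sNo", input)
  if (length(sek.indx) == 0 || length(sNo.indx) == 0) return(NULL)
  sek.value <- as.numeric(sub("^sekvens\\s+(\\d+).*", "\\1",
                              input[sek.indx[1]], perl = TRUE))
  # ASSUMPTION: table rows begin with a digit; the first line that
  # does not marks the end of the table.
  after <- input[-seq_len(sNo.indx[1])]
  is.row <- grepl("^[[:space:]]*[0-9]", after)
  n.rows <- if (all(is.row)) length(after) else which(!is.row)[1] - 1
  if (n.rows == 0) return(NULL)
  values <- read.table(text = input[sNo.indx[1] + 0:n.rows],
                       header = TRUE, stringsAsFactors = FALSE)
  data.frame(fileno = sub("\\.txt$", "", basename(path)),
             sekvens = sek.value, values, stringsAsFactors = FALSE)
}

# Demo: two hypothetical files, both holding the sample from the question.
sample.text <- c(
  "(a lot of preceding text)",
  "2008-10-01      06:30:12                2 of 3",
  "page",
  "",
  "(some lines of text)",
  "sekvens    890",
  "# lines of text",
  "sNo     start    stop    direction    value",
  "1       70       85      up           60.2",
  "3       60       90      down         71.5",
  "#########")
folder <- tempfile("txtdemo")
dir.create(folder)
for (f in c("82534.txt", "82555.txt"))
  writeLines(sample.text, file.path(folder, f))

# Loop over the files and stack the per-file results.
files <- list.files(folder, pattern = "^[0-9]+\\.txt$", full.names = TRUE)
parts <- Filter(Negate(is.null), lapply(files, extract_one))
result <- do.call(rbind, parts)
result
```

Each file is read in full with readLines() and then trimmed, which is usually simpler than trying to seek to the right page. If the real tables end in some other way (a blank line, a fixed row count), only the is.row test needs to change.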


-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
