thr3ads.net - R help - [R] reading partial data set [Dec 2011]

bcdc

2011-Dec-07 14:52 UTC

[R] reading partial data set

Hi all,

I'm trying to read a data set into R, but the file is messy, so I have to read
it in pieces. All the data is in a .txt file, with values separated by spaces.
So far, OK. The problem is that not all lines in this file have the same
number of elements, so reading stops with an error, and I lose the lines
that were already read.

ex. of data set:
11   12   13
21   22   23
31   32   33   34
41   42   43   44
51   52   53
61   62   63   64
71   72   73   74   75
81   82
(...)

If I use the following:
> aux <- read.table(file="data.txt", sep=" ", header=FALSE,
+                   colClasses="numeric")
reading stops with the error message:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  line 3 did not have 3 elements
Calls: read.table -> scan

and I lose everything that was read before the error. And since I'm running
my job on a cluster (it's actually a big data set), the error halts my
execution.
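
One way to keep an error like this from halting a batch job is to wrap the call in tryCatch(). The following is only a sketch: the temporary file and its contents are made up for illustration, and returning NULL on failure is an assumption about what the calling script would want.

```r
# Hypothetical sample file: five 3-field lines, then one 4-field line.
path <- tempfile(fileext = ".txt")
writeLines(c("11 12 13", "21 22 23", "31 32 33",
             "41 42 43", "51 52 53", "61 62 63 64"), path)

# tryCatch() converts the read.table() error into a return value,
# so the rest of the script can continue.
aux <- tryCatch(
  read.table(path, sep = " ", header = FALSE, colClasses = "numeric"),
  error = function(e) {
    message("read.table failed: ", conditionMessage(e))
    NULL  # assumption: signal failure to the caller with NULL
  }
)

if (is.null(aux)) message("continuing without the data")
```

This does not recover the lines read before the failure; it only stops the error from ending the run.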

What I tried at first was:
> aux1 <- read.table(file="data.txt", sep=" ", header=FALSE,
+                    colClasses="numeric", nrows=2)
> aux2 <- read.table(file="data.txt", sep=" ", header=FALSE,
+                    colClasses="numeric", skip=2, nrows=2)
> aux3 <- read.table(file="data.txt", sep=" ", header=FALSE,
+                    colClasses="numeric", skip=4, nrows=1)
> (...)
This procedure works. However, I have about 5000 lines to read, and I don't
know in advance which ones are messy. So to keep using the above procedure,
I would have to:

1. try to read data set
2. check error message to find out which line has different size
3. read data set for the block of same sized lines (aux1)
4. read data set skipping the lines read in aux1;
   check error message to find out which line has different size
5. read data set for second block of same sized lines (aux2)
6. read data set skipping the lines read in aux1 and aux2;
   check error message to find out which line has different size
(and so on)

So, if I had only a hundred lines, this would be OK, but I have a few
thousand, and it will take me forever to finish if I need to read
block by block and check manually where the problem is.
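
The manual "check the error message" steps can be avoided by asking R for the number of fields on every line up front. A minimal sketch with count.fields(), assuming whitespace-separated values; the temporary file just reproduces the example data from above:

```r
# Recreate the example data in a temporary file (illustration only).
path <- tempfile(fileext = ".txt")
writeLines(c("11 12 13", "21 22 23", "31 32 33 34", "41 42 43 44",
             "51 52 53", "61 62 63 64", "71 72 73 74 75", "81 82"), path)

# count.fields() returns the number of fields found on each line;
# sep = "" means any run of whitespace separates fields.
n <- count.fields(path, sep = "")
n

# Line numbers whose field count differs from that of the first line.
odd_lines <- which(n != n[1])
odd_lines
```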

My question is: is there any way I can read my data with some "if"s or
"while"s to control read.table? What I'd like to do is something like:

1. read data set while all lines have the same length
2. if a line has a different length from the previous ones, store what was
read in a variable and abort reading
3. start reading data set from the line where it stopped, and read it while
all lines have the same length
4. if a line has a different length from the previous ones, store what was
read in a variable and abort reading
5. start reading data set from the line where it stopped, and read it while
all lines have the same length
6. if a line has a different length from the previous ones, store what was
read in a variable and abort reading
(and so on until the whole data set has been read)

This would make the program run by itself, and solve my problem. It's OK if
it returns several variables; I can just bind them and assemble my data
set as I need, since I know what it should look like in the end.
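
The block-by-block idea described above can be sketched without a read/abort/restart loop: read all lines once, split them into fields, and group consecutive lines that have the same number of fields. The function name read_blocks is hypothetical, and the sketch assumes numeric values separated by whitespace.

```r
# Hypothetical helper: returns a list of numeric matrices, one per run of
# consecutive lines that share the same number of fields.
read_blocks <- function(path) {
  lines  <- trimws(readLines(path))
  lines  <- lines[nzchar(lines)]                 # drop blank lines
  fields <- strsplit(lines, "[[:space:]]+")      # split on runs of whitespace
  runs   <- rle(lengths(fields))                 # runs of equal field counts
  ends   <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  Map(function(s, e) {
    matrix(as.numeric(unlist(fields[s:e])), nrow = e - s + 1, byrow = TRUE)
  }, starts, ends)
}

# Example data from the post, written to a temporary file for illustration.
path <- tempfile(fileext = ".txt")
writeLines(c("11 12 13", "21 22 23", "31 32 33 34", "41 42 43 44",
             "51 52 53", "61 62 63 64", "71 72 73 74 75", "81 82"), path)

blocks <- read_blocks(path)
length(blocks)   # number of same-length blocks found
blocks[[1]]      # first block: the two 3-field lines
```

Each element of the list corresponds to one of the aux1, aux2, ... pieces, so they can be bound together afterwards as needed.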

Thanks in advance for suggestions!
Beatriz


Peter Langfelder

2011-Dec-07 19:26 UTC

On Wed, Dec 7, 2011 at 6:52 AM, bcdc <bia.cdc at gmail.com> wrote:
> My question is: is there any way I can read my data with some "if"s or
> "while"s to control read.table?

I think you are making this too complicated. Look at the help file for
read.table, particularly the argument fill (whose default, AFAICS, is
FALSE). If you set it to TRUE, the function will automatically fill
missing entries with NAs. After reading the file, you can decide what
to do with lines that have some missing values.
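
A minimal sketch of this, using the example data from the original post written to a temporary file. One caveat: read.table sizes the table from the first five lines of input unless col.names is longer, so the sketch passes col.names based on the widest line (found with count.fields). The V1, V2, ... names and the whitespace separator (the default sep = "") are assumptions.

```r
# Example data from the post, written to a temporary file for illustration.
path <- tempfile(fileext = ".txt")
writeLines(c("11 12 13", "21 22 23", "31 32 33 34", "41 42 43 44",
             "51 52 53", "61 62 63 64", "71 72 73 74 75", "81 82"), path)

# Widest line in the file; here line 7 has 5 fields, which is more than
# any of the first five lines.
n_col <- max(count.fields(path, sep = ""))

# fill = TRUE pads short lines with NA; col.names makes the table wide
# enough for the longest line.
aux <- read.table(path, header = FALSE, fill = TRUE,
                  col.names = paste0("V", seq_len(n_col)))
aux

# Rows that were padded can be found afterwards.
incomplete <- which(!complete.cases(aux))
```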

HTH,

Peter

bcdc

2011-Dec-08 07:57 UTC

Thank you!

