thr3ads.net - R help - [R] problem formatting data frames [Jul 2002]

VBMorozov@lbl.gov

2002-Jul-17 14:01 UTC

[R] problem formatting data frames

Dear R-guRus:
I have a problem with the format of my data in R. 
Let's say I have a HUGE text table which consists of columns of 
numerical data, separated by tabs, but in some places rows of text 
(error messages, etc) are inserted in between rows of numerical data. 
Because the data file is so huge and because I have thousands of these 
files, it's impractical to try to go through these files manually and 
remove text rows - I'd like R to do it for me.
The following command works:

MyDataFrame<-data.frame(read.table("MyFile"))

but instead of numerical data in my frame I get "factor" data, because
of these text inserts. How do I filter them out??

Thank you very much,
Vlad.


-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._

Thomas Lumley

2002-Jul-17 15:54 UTC

[R] problem formatting data frames

On Wed, 17 Jul 2002 VBMorozov at lbl.gov wrote:
>
>  Dear R-guRus:
> I have a problem with the format of my data in R.
> Let's say I have a HUGE text table which consists of columns of
> numerical data, separated by tabs, but in some places rows of text
> (error messages, etc) are inserted in between rows of numerical data.
> Because the data file is so huge and because I have thousands of these
> files, it's unpractical to try and go thru these files manually and
> remove text rows - I'd like R to do it for me.
> The following command works:
>
> MyDataFrame<-data.frame(read.table("MyFile"))
>
> but instead of numerical data in my frame I get "factor" data,
> because of these text inserts. How do I filter them out??

The simplest case would be if the error messages always began with the
same character (eg "E").  In that case you could use comment.char="E" in
read.table to say that lines beginning with E are comments.
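
A minimal sketch of the comment.char idea, assuming made-up sample data
(the values and the error message below are hypothetical, not from the
original files). One caveat worth checking against your own data:
read.table treats everything from the comment character to the end of
the line as a comment, wherever it appears, so numbers written like
1E5 would be cut short.

```r
# Hypothetical sample data: tab-separated numbers with one error line
# that starts with "E". Neither the values nor the message come from the thread.
txt <- c("1\t2.5", "ERROR: sensor timeout", "3\t4.5")

# Everything from the first "E" to the end of a line is treated as a comment;
# the line that is left empty is then skipped.
dat <- read.table(text = txt, sep = "\t", comment.char = "E")

str(dat)  # both columns should be numeric rather than factor or character
```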

Otherwise you will probably need to read the file line by line and remove
the error messages.  The most computationally efficient solution would
probably be to use something like Perl to preprocess the file, but you
could do it in R.

Eg

  Read the file as lines of text
  Use grep() to find which lines contain only numbers
  Write those lines to a temporary file
  Read the temporary file with read.table()
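
The four steps above could be sketched roughly as follows. The input
file, its contents and the regular expression are assumptions made for
illustration; the pattern accepts digits, signs, decimal points,
exponent letters and whitespace, and would need adjusting for other
number formats.

```r
# Hypothetical input file, created here only so the sketch is self-contained.
infile <- tempfile()
writeLines(c("1\t2.5e3", "ERROR: disk full", "-3\t4"), infile)

# 1. Read the file as lines of text.
lines <- readLines(infile)

# 2. Find the lines made up only of digits, signs, decimal points,
#    exponent letters (e/E), spaces and tabs.
keep <- grep("^[-+0-9.eE \t]+$", lines)

# 3. Write those lines to a temporary file.
tmp <- tempfile()
writeLines(lines[keep], tmp)

# 4. Read the temporary file with read.table().
dat <- read.table(tmp, sep = "\t")

unlink(c(infile, tmp))  # clean up both temporary files
```

With thousands of files, the same steps could be wrapped in a function
and applied to each file name in turn, for example with lapply().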

Something similar is done by read.fwf(), which reads fixed format data
files.


	-thomas

Don MacQueen

2002-Jul-17 17:58 UTC

[R] problem formatting data frames

Have you looked at the help page for read.table, and noticed the 
comment.char and colClasses arguments?

If the rows of text have sufficient consistency you might be able to 
use the comment.char option.

You could try the colClasses option to force it to read the columns 
as numerical, in which case the non-numeric data should be coerced to 
NA.
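
A hedged sketch of a related route: in current versions of R,
colClasses = "numeric" appears to stop with an error from scan() at the
first non-numeric field rather than producing NA, so the version below
reads every column as character and converts afterwards. The sample
data are made up, and the sketch assumes the text rows have the same
number of tab-separated fields as the data rows (otherwise fill = TRUE
or a line filter would be needed).

```r
# Hypothetical sample: the middle row is a text insert with two fields.
txt <- c("1\t2", "ERROR\tbad row", "3\t4")

# Read everything as character so that no column becomes a factor.
raw <- read.table(text = txt, sep = "\t", colClasses = "character")

# Convert each column; non-numeric entries become NA (warnings silenced).
num <- suppressWarnings(as.data.frame(lapply(raw, as.numeric)))

# Drop the rows that contain any NA.
clean <- num[complete.cases(num), ]
```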

If you are on a unix machine, or have a windows machine with unix 
tools installed, you could use pipe() to pipe the files through, for 
example, a sed script that tosses any rows with non-numeric 
characters (other than 'E' if there is any data formatted in 
scientific notation).
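
One way the pipe() suggestion might look, assuming sed is on the PATH.
The file contents and the sed expression are illustrative guesses, not
a tested script from this thread; the expression deletes any line that
contains a letter other than e or E.

```r
# Hypothetical input file, created here so the sketch is self-contained.
infile <- tempfile()
writeLines(c("1\t2.5E3", "Error: device busy", "3\t4"), infile)

# sed deletes every line containing a letter other than e/E;
# shQuote() protects the file name from the shell.
cmd <- paste("sed -e '/[A-DF-Za-df-z]/d'", shQuote(infile))

# read.table() opens the pipe, reads the filtered lines and closes it again.
dat <- read.table(pipe(cmd), sep = "\t")

unlink(infile)
```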

-Don

-- 
--------------------------------------
Don MacQueen
Environmental Protection Department
Lawrence Livermore National Laboratory
Livermore, CA, USA
--------------------------------------

Tony Plate

2002-Jul-17 18:13 UTC

[R] problem formatting data frames

If you're actually able to read the data files into R (I'd be surprised --
having messages interspersed between the rows of data should cause problems 
with numbers of columns), you could do something like the following:

 > x <- data.frame(a=c(1,"foo",2,1),b=c("bar","foo",3,3))  # create a "problem" data frame
 > x
     a   b
1   1 bar
2 foo foo
3   2   3
4   1   3
 > sapply(x, data.class)  # verify that it contains factor data
        a        b
"factor" "factor"
 > # convert it
 > x1 <- data.frame(lapply(x, function(col) if (is.factor(col))
 +     as.numeric(levels(col))[as.numeric(col)] else col))
Warning messages:
1: NAs introduced by coercion
2: NAs introduced by coercion
 > x1
    a  b
1  1 NA
2 NA NA
3  2  3
4  1  3
 > sapply(x1, data.class)
         a         b
"numeric" "numeric"
 > x1[!apply(is.na(x1), 1, any), ]    # filter rows with any NA's in them
   a b
3 2 3
4 1 3
 >
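
A variant of the same conversion for current R, offered as a sketch
rather than a replacement for the transcript above. Since R 4.0.0,
data.frame() and read.table() keep strings as character by default, so
the is.factor() test would not trigger; going through as.character()
handles both factor and character columns. The helper name to_num is
made up for this example.

```r
# Same "problem" data frame; columns are character in R >= 4.0.0
# and factor in older versions.
x <- data.frame(a = c(1, "foo", 2, 1), b = c("bar", "foo", 3, 3))

# Hypothetical helper: convert a factor or character column to numeric,
# turning non-numeric entries into NA without printing warnings.
to_num <- function(col) suppressWarnings(as.numeric(as.character(col)))

x1 <- as.data.frame(lapply(x, to_num))

# Keep only the rows without any NA.
res <- x1[complete.cases(x1), ]
res
```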

John Fox

2002-Jul-17 23:19 UTC

[R] problem formatting data frames

Dear Vlad,

You could solve this problem in R, but I suspect that it would be easier to 
pre-filter the files before reading them into data frames, using a tool 
such as grep. In particular, if all of the valid data are numeric and all 
of the offending lines have alphabetic characters, then something like the 
following should do the trick

         grep -v '[a-zA-Z]' data.file > filtered.file

(You may have to adjust the regular expression to get exactly what you want.)

As well, since read.table produces a data frame, you don't need to call 
data.frame.

I hope that this helps,
  John

-----------------------------------------------------
John Fox
Department of Sociology
McMaster University
Hamilton, Ontario, Canada L8S 4M4
email: jfox at mcmaster.ca
phone: 905-525-9140x23604
web: www.socsci.mcmaster.ca/jfox
-----------------------------------------------------

Daniel Mastropietro

2002-Jul-18 00:13 UTC

[R] Automatic adjustment of axis ranges

Hello,

I would like to know if it is possible to add new points to a given plot 
and that the axis ranges be automatically adjusted so that all the new 
points are seen in the plot.
For example, I am thinking of something like this:

plot(1:10,1:10);
points(5:20,2*(5:20));	# This would change the x range to 0-20 and the y
			# range to 0-40 so that all the new points are fitted in the graph.

Thanks
Daniel Mastropietro

Andrew C. Ward

2002-Jul-18 01:46 UTC

[R] problem formatting data frames

Hello Vlad,

I usually don't get R to do anything like this for me, and prefer
preparing the data using GNU awk or something similar. awk
or gawk are available for many different operating systems,
including Unix, Linux, and Windows.

I would usually use something like
    gawk ' BEGIN {FS="\t"} {if (NF==8) print $0 } ' file.dat
which only displays lines with 8 tab-separated fields. More
advanced users of R than I am undoubtedly know how to
effect the same thing using scan or read.table.
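
For what it's worth, one possible way to do the same field-count filter
inside R is count.fields() from the utils package. This is only a
sketch on made-up data; the expected field count (3 here, 8 in the gawk
line above) is an assumption you would set for your own files.

```r
# Hypothetical input file with three tab-separated fields per data row.
infile <- tempfile()
writeLines(c("1\t2\t3", "ERROR: bad record", "4\t5\t6"), infile)

# Count tab-separated fields on every line. Quote and comment handling are
# switched off and blank lines are kept so that n lines up with readLines().
n <- count.fields(infile, sep = "\t", quote = "", comment.char = "",
                  blank.lines.skip = FALSE)

# Keep only the lines with the expected number of fields and parse them.
lines <- readLines(infile)
dat <- read.table(text = lines[which(n == 3)], sep = "\t")

unlink(infile)
```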

Regards,

Andrew C. Ward

CAPE Centre,
The University of Queensland
Brisbane Qld 4072 Australia
andreww at cheque.uq.edu.au

Marc Schwartz

2002-Jul-18 03:03 UTC

[R] Automatic adjustment of axis ranges

> -----Original Message-----
> From: owner-r-help at stat.math.ethz.ch
> [mailto:owner-r-help at stat.math.ethz.ch] On Behalf Of Daniel Mastropietro
> Sent: Wednesday, July 17, 2002 7:13 PM
> To: R-help at stat.math.ethz.ch
> Subject: [R] Automatic adjustment of axis ranges
> 
> Hello,
> 
> I would like to know if it is possible to add new points to a given plot
> and that the axis ranges be automatically adjusted so that all the new
> points are seen in the plot.
> For example, I am thinking of something like this:
> 
> plot(1:10,1:10);
> points(5:20,2*(5:20));	# This would change the x range to 0-20 and the y
> range to 0-40 so that all the new points are fitted in the graph.
> 
> Thanks
> Daniel Mastropietro

To the best of my knowledge, once a plot is drawn in the graphics
device, the plot region axis ranges are fixed.

You can certainly add new points, lines, etc. to an existing plot as you
are doing above, but within the existing axis ranges.

If you need to change the axis ranges themselves, I believe that you
actually have to re-draw the plot.

If you know what the minimum and maximum values of the existing and the
new data points are going to be, you can explicitly define the ranges
with the xlim and ylim arguments to the plot() function.  You can either
provide specific numbers or use the range() function to secure the min
and max values from the numeric vectors that you are using as source
data.
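
A short sketch of the xlim/ylim approach using the numbers from the
question. The variable names are made up for the example; range()
accepts several vectors at once and returns the overall minimum and
maximum.

```r
# Existing and new points, taken from the example in the question.
x1 <- 1:10
y1 <- 1:10
x2 <- 5:20
y2 <- 2 * (5:20)

# Compute axis limits that cover both sets of points.
xlim <- range(x1, x2)
ylim <- range(y1, y2)

# Draw the first set with the wider limits, then add the second set.
plot(x1, y1, xlim = xlim, ylim = ylim)
points(x2, y2)
```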

If someone has other ideas, I would certainly be interested in how this
could be done.

HTH.

Marc



Christian Hoffmann

2002-Jul-18 06:32 UTC

[R] elimination of multiple rows

Hi all,

Excuse me for posting a question which may not be true R.

I want to process a text file so that the resulting file contains only one
of possibly multiple consecutive rows. Example (the row numbers do not
belong to the file):

(1) a b c d
(2) a b c d
(3) A b c d
(4) A b c d
(5) A b c d
(6) a b c d
(7) a b c D
..

resulting in:

(1) a b c d
(3) A b c d
(6) a b c d
(7) a b c D
..

(6) could be disposed of also by first sorting the original file.

Does anybody have a script ready, preferably in Perl? I do not know Perl
well enough to write it myself.

Thanks for your help.
--christian

Dr.sc.math.Christian W. Hoffmann
Mathematics and Statistical Computing
Landscape Dynamics and Spatial Development
Swiss Federal Research Institute WSL 
Zuercherstrasse 111
CH-8903 Birmensdorf, Switzerland
phone: ++41-1-739 22 77    fax: ++41-1-739 22 15
e-mail: christian.hoffmann at wsl.ch
www: http://www.wsl.ch/staff/christian.hoffmann/

John Day

2002-Jul-18 13:00 UTC

[R] elimination of multiple rows

Christian,

Sounds like you want something that can run as a preprocess step from the 
command line. If so (and if you're in Unix) you can use 'uniq', which will
remove any adjacent dupe lines, but otherwise leave the file in the 
original order:

 > uniq myfile > umyfile

If you really want to get rid of _all_ dupes then sort first:

 > sort myfile | uniq > umyfile
OR
 > sort -u myfile > umyfile

BTW, 'uniq' has a nifty -c  option which counts the number of dupes and can
be used to create a useful 'histogram', sorted by frequency. Must sort 
first to make this work and then sort again after the uniq to arrange in 
descending order:

sort myfile | uniq -c | sort -r -n
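
If you would rather stay inside R, something along these lines should
give similar results. It is only a sketch on the example rows from the
question; for a real file the lines would come from readLines().

```r
# The seven example rows from the question, without the row numbers.
lines <- c("a b c d", "a b c d", "A b c d", "A b c d", "A b c d",
           "a b c d", "a b c D")

# Like 'uniq': collapse runs of adjacent identical lines, keeping file order.
adjacent <- rle(lines)$values

# Remove all duplicates. Unlike 'sort -u', this keeps the order of
# first appearance instead of sorting.
all_unique <- unique(lines)

# Like 'sort | uniq -c | sort -r -n': counts per distinct line,
# most frequent first.
counts <- sort(table(lines), decreasing = TRUE)
```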

HTH,

John Day
Staff Scientist
Computer Science Innovations
Melbourne, FL
http://www.csi.cc/~jday
