Dear R-guRus:
I have a problem with the format of my data in R. 
Let's say I have a HUGE text table which consists of columns of 
numerical data, separated by tabs, but in some places rows of text 
(error messages, etc) are inserted in between rows of numerical data. 
Because the data file is so huge and because I have thousands of these 
files, it's unpractical to try and go thru these files manually and 
remove text rows - I'd like R to do it for me.
The following command works:
MyDataFrame<-data.frame(read.table("MyFile"))
but instead of numerical data in my frame I get "factor" data, because
of these text inserts. How do I filter them out??
Thank you very much,
Vlad.
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
On Wed, 17 Jul 2002 VBMorozov at lbl.gov wrote:
>
>  Dear R-guRus:
> I have a problem with the format of my data in R.
> Let's say I have a HUGE text table which consists of columns of
> numerical data, separated by tabs, but in some places rows of text
> (error messages, etc) are inserted in between rows of numerical data.
> Because the data file is so huge and because I have thousands of these
> files, it's unpractical to try and go thru these files manually and
> remove text rows - I'd like R to do it for me.
> The following command works:
>
> MyDataFrame<-data.frame(read.table("MyFile"))
>
> but instead of numerical data in my frame I get "factor" data, because
> of these text inserts. How do I filter them out??

The simplest case would be if the error messages always began with the same
character (eg "E"). In that case you could use comment.char="E" in
read.table to say that lines beginning with E are comments.

Otherwise you will probably need to read the file line by line and remove
the error messages. The most computationally efficient solution would
probably be to use something like Perl to preprocess the file, but you
could do it in R. Eg

  Read the file as lines of text
  Use grep() to find which lines contain only numbers
  Write those lines to a temporary file
  Read the temporary file with read.table()

Something similar is done by read.fwf(), which reads fixed format data
files.

	-thomas
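The four steps listed above could be sketched in R roughly as follows. This is only a sketch under assumptions: the function name read_numeric_table and the regular expression for "a number" are made up for illustration, the fields are assumed to be separated by tabs or spaces, and read.table(text = ...) stands in for the temporary file.

```r
# Hypothetical helper (not from the thread): keep only the lines that
# consist entirely of numeric fields, then hand them to read.table().
read_numeric_table <- function(path) {
  lines <- readLines(path)

  # One numeric token: optional sign, digits with optional decimal
  # point, optional exponent (so 5e-1 and 1E3 are accepted).
  num <- "[-+]?([0-9]+\\.?[0-9]*|\\.[0-9]+)([eE][-+]?[0-9]+)?"

  # A whole line of such tokens separated by tabs or spaces.
  pattern <- paste0("^[ \t]*", num, "([ \t]+", num, ")*[ \t]*$")

  keep <- grepl(pattern, lines)
  read.table(text = lines[keep], header = FALSE)
}
```

Lines that contain anything else (an error message, but also a missing value written as NA) are dropped, so the pattern would need adjusting if the real files use such markers.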
Have you looked at the help page for read.table, and noticed the
comment.char and colClasses arguments?

If the rows of text have sufficient consistency you might be able to use
comment.char option. You could try the colClasses option to force it to
read the columns as numerical, in which case the non-numeric data should be
coerced to NA.

If you are on a unix machine, or have a windows machine with unix tools
installed, you could use pipe() to pipe the files through, for example, a
sed script that tosses any rows with non-numeric characters (other than 'E'
if there is any data formatted in scientific notation).

-Don

At 10:01 AM -0400 7/17/02, VBMorozov at lbl.gov wrote:
>  Dear R-guRus:
>I have a problem with the format of my data in R.
>Let's say I have a HUGE text table which consists of columns of
>numerical data, separated by tabs, but in some places rows of text
>(error messages, etc) are inserted in between rows of numerical data.
>Because the data file is so huge and because I have thousands of these
>files, it's unpractical to try and go thru these files manually and
>remove text rows - I'd like R to do it for me.
>The following command works:
>
>MyDataFrame<-data.frame(read.table("MyFile"))
>
>but instead of numerical data in my frame I get "factor" data, because
>of these text inserts. How do I filter them out??
>
>Thank you very much,
>Vlad.

-- 
--------------------------------------
Don MacQueen
Environmental Protection Department
Lawrence Livermore National Laboratory
Livermore, CA, USA
--------------------------------------
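A minimal sketch of the comment.char idea, assuming (hypothetically) that every error line starts with a capital E and that no value is written in scientific notation. The second assumption matters because comment.char cuts a line at the first E wherever it occurs, so a field such as 1E3 would be truncated. Also, as far as I can tell, colClasses="numeric" in current R does not turn stray text into NA: scan() stops with an error at the first non-numeric field, so filtering the lines first looks like the safer route.

```r
# Made-up sample data standing in for one of the real files.
txt <- c("1\t2\t3",
         "ERROR: overflow in channel 2",
         "4\t5\t6")

# Everything from the first "E" to the end of a line is treated as a
# comment; the error line becomes empty and is skipped as a blank line.
d <- read.table(text = txt, sep = "\t", comment.char = "E")
str(d)
```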
If you're actually able to read the data files into R (I'd be surprised --
having messages interspersed between the rows of data should cause problems 
with numbers of columns), you could do something like the following:
 > # create a "problem" data frame; stringsAsFactors=TRUE is needed in
 > # R >= 4.0, where strings are no longer converted to factors by default
 > x <- data.frame(a=c(1,"foo",2,1), b=c("bar","foo",3,3),
 +                 stringsAsFactors=TRUE)
 > x
     a   b
1   1 bar
2 foo foo
3   2   3
4   1   3
 > sapply(x, data.class)  # verify that it contains factor data
        a        b
"factor" "factor"
 > # convert it
 > x1 <- data.frame(lapply(x, function(col) if (is.factor(col)) 
as.numeric(levels(col))[as.numeric(col)] else col))
Warning messages:
1: NAs introduced by coercion
2: NAs introduced by coercion
 > x1
    a  b
1  1 NA
2 NA NA
3  2  3
4  1  3
 > sapply(x1, data.class)
         a         b
"numeric" "numeric"
 > x1[!apply(is.na(x1), 1, any), ]    # filter rows with any NA's in them
   a b
3 2 3
4 1 3
 >
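The same conversion can be written a little more compactly. The sketch below is an assumption-laden variant of the session above, not the poster's code: the helper name drop_non_numeric_rows is invented, and it relies on as.character() so that it works whether the columns arrive as factors or as plain character strings.

```r
# Hypothetical helper: convert every column to numeric (text becomes NA)
# and drop the rows in which any column failed to convert.
drop_non_numeric_rows <- function(df) {
  converted <- as.data.frame(lapply(df, function(col) {
    # as.character() first, so factor codes are never mistaken for values
    suppressWarnings(as.numeric(as.character(col)))
  }))
  converted[complete.cases(converted), , drop = FALSE]
}

x <- data.frame(a = c(1, "foo", 2, 1), b = c("bar", "foo", 3, 3))
clean <- drop_non_numeric_rows(x)
```

Note that rows with a genuine missing value in the original data would be dropped as well, since they are indistinguishable from failed conversions here.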
At 10:01 AM 7/17/2002 -0400, you wrote:
>  Dear R-guRus:
>I have a problem with the format of my data in R.
>Let's say I have a HUGE text table which consists of columns of
>numerical data, separated by tabs, but in some places rows of text
>(error messages, etc) are inserted in between rows of numerical data.
>Because the data file is so huge and because I have thousands of these
>files, it's unpractical to try and go thru these files manually and
>remove text rows - I'd like R to do it for me.
>The following command works:
>
>MyDataFrame<-data.frame(read.table("MyFile"))
>
>but instead of numerical data in my frame I get "factor" data, because
>of these text inserts. How do I filter them out??
>
>Thank you very much,
>Vlad.
>
>
Dear Vlad,
You could solve this problem in R, but I suspect that it would be easier to 
pre-filter the files before reading them into data frames, using a tool 
such as grep. In particular, if all of the valid data are numeric and all 
of the offending lines have alphabetic characters, then something like the 
following should do the trick
         grep -v '[a-zA-Z]' data.file > filtered.file
(You may have to adjust the regular expression to get exactly what you want.)
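For anyone who would rather stay inside R, the same "drop every line that contains a letter" filter could look something like this. It is a sketch only: filter_file is a made-up name, and the directory and file pattern in the commented loop are placeholders for wherever the thousands of files actually live.

```r
# Hypothetical helper: write a copy of infile without the lines that
# contain any alphabetic character (the R analogue of grep -v).
filter_file <- function(infile, outfile) {
  lines <- readLines(infile)
  kept <- grep("[A-Za-z]", lines, invert = TRUE, value = TRUE)
  writeLines(kept, outfile)
  invisible(outfile)
}

# Possible use over many files (paths are placeholders):
# for (f in list.files("data", pattern = "\\.txt$", full.names = TRUE)) {
#   filter_file(f, paste0(f, ".filtered"))
# }
```

As with the grep command, this also throws away numbers written in scientific notation, so the pattern would need loosening if the data contain values like 1e-5.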
As well, since read.table produces a data frame, you don't need to call 
data.frame.
I hope that this helps,
  John
At 10:01 AM 7/17/2002 -0400, VBMorozov at lbl.gov wrote:
>  Dear R-guRus:
>I have a problem with the format of my data in R.
>Let's say I have a HUGE text table which consists of columns of
>numerical data, separated by tabs, but in some places rows of text
>(error messages, etc) are inserted in between rows of numerical data.
>Because the data file is so huge and because I have thousands of these
>files, it's unpractical to try and go thru these files manually and
>remove text rows - I'd like R to do it for me.
>The following command works:
>
>MyDataFrame<-data.frame(read.table("MyFile"))
>
>but instead of numerical data in my frame I get "factor" data, because
>of these text inserts. How do I filter them out??
>
>Thank you very much,
>Vlad.
>
>
-----------------------------------------------------
John Fox
Department of Sociology
McMaster University
Hamilton, Ontario, Canada L8S 4M4
email: jfox at mcmaster.ca
phone: 905-525-9140x23604
web: www.socsci.mcmaster.ca/jfox
-----------------------------------------------------
Hello,

I would like to know if it is possible to add new points to a given plot
and that the axis ranges be automatically adjusted so that all the new
points are seen in the plot.
For example, I am thinking of something like this:

plot(1:10,1:10);
points(5:20,2*(5:20));   # This would change the x range to 0-20 and the y
                         # range to 0-40 so that all the new points are
                         # fitted in the graph.

Thanks
Daniel Mastropietro
Hello Vlad,
I usually don't get R to do anything like this for me, and prefer
preparing the data using GNU awk or something similar. awk
or gawk are available for many different operating systems,
including Unix, Linux, and Windows.
I would usually use something like
    gawk ' BEGIN {FS="\t"} {if (NF==8) print $0 } ' file.dat
which only displays lines with 8 tab-separated fields. More
advanced users of R than I am undoubtedly know how to
effect the same thing using scan or read.table.
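One way the field-count filter might be done in R is sketched below. The function name keep_rows_with_n_fields is invented for illustration, and the default of 8 tab-separated fields is simply carried over from the gawk example; like the gawk one-liner it checks only the number of fields, not whether they are numeric.

```r
# Hypothetical helper: keep only the lines with exactly n_fields
# separator-delimited fields, then read them as a table.
keep_rows_with_n_fields <- function(path, n_fields = 8, sep = "\t") {
  lines <- readLines(path)
  # Count fields per line; note that strsplit() drops a trailing empty
  # field, so a line ending in the separator counts one field fewer.
  n <- lengths(strsplit(lines, sep, fixed = TRUE))
  read.table(text = lines[n == n_fields], sep = sep, header = FALSE)
}
```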
Regards,
Andrew C. Ward
CAPE Centre,
The University of Queensland
Brisbane Qld 4072 Australia
andreww at cheque.uq.edu.au
----- Original Message -----
From: <VBMorozov at lbl.gov>
To: <r-help at stat.math.ethz.ch>
Sent: Thursday, July 18, 2002 12:01 AM
Subject: [R] problem formatting data frames
>
>  Dear R-guRus:
> I have a problem with the format of my data in R.
> Let's say I have a HUGE text table which consists of columns of
> numerical data, separated by tabs, but in some places rows of text
> (error messages, etc) are inserted in between rows of numerical data.
> Because the data file is so huge and because I have thousands of these
> files, it's unpractical to try and go thru these files manually and
> remove text rows - I'd like R to do it for me.
> The following command works:
>
> MyDataFrame<-data.frame(read.table("MyFile"))
>
> but instead of numerical data in my frame I get "factor" data, because
> of these text inserts. How do I filter them out??
>
> Thank you very much,
> Vlad.
>
>
> -----Original Message-----
> From: owner-r-help at stat.math.ethz.ch
> [mailto:owner-r-help at stat.math.ethz.ch] On Behalf Of Daniel Mastropietro
> Sent: Wednesday, July 17, 2002 7:13 PM
> To: R-help at stat.math.ethz.ch
> Subject: [R] Automatic adjustment of axis ranges
>
> Hello,
>
> I would like to know if it is possible to add new points to a given plot
> and that the axis ranges be automatically adjusted so that all the new
> points are seen in the plot.
> For example, I am thinking of something like this:
>
> plot(1:10,1:10);
> points(5:20,2*(5:20));   # This would change the x range to 0-20 and the y
> range to 0-40 so that all the new points are fitted in the graph.
>
> Thanks
> Daniel Mastropietro

To the best of my knowledge, once a plot is drawn in the graphics device,
the plot region axis ranges are fixed. You can certainly add new points,
lines, etc. to an existing plot as you are doing above, but within the
existing axis ranges. If you need to change the axis ranges themselves, I
believe that you actually have to re-draw the plot.

If you know what the minimum and maximum values of the existing and the new
data points are going to be, you can explicitly define the ranges with the
xlim and ylim arguments to the plot() function. You can either provide
specific numbers or use the range() function to secure the min and max
values from the numeric vectors that you are using as source data.

If someone has other ideas, I would certainly be interested in how this
could be done.

HTH.

Marc
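The xlim/ylim suggestion above might look like this in practice. It is a sketch that assumes both sets of points are known before the first call to plot(); the wrapper name plot_with_room is made up for illustration.

```r
# Hypothetical wrapper: compute axis limits that cover both data sets,
# draw the first set with those limits, then add the second set.
plot_with_room <- function(x1, y1, x2, y2) {
  xlim <- range(x1, x2)   # min and max over both x vectors
  ylim <- range(y1, y2)   # min and max over both y vectors
  plot(x1, y1, xlim = xlim, ylim = ylim)
  points(x2, y2, pch = 2)
  invisible(list(xlim = xlim, ylim = ylim))
}

lims <- plot_with_room(1:10, 1:10, 5:20, 2 * (5:20))
```

For the data in the question the limits come out as 1 to 20 on x and 1 to 40 on y.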
Hi all,

Excuse me for posting a question which may not be true R.

I want to process a text file so that the resulting file contains only one
of possibly multiple consecutive rows. Example (the row numbers do not
belong to the file):

(1) a b c d
(2) a b c d
(3) A b c d
(4) A b c d
(5) A b c d
(6) a b c d
(7) a b c D
..

resulting in:

(1) a b c d
(3) A b c d
(6) a b c d
(7) a b c D
..

(6) could be disposed of also by first sorting the original file.

Does anybody have a script ready, preferably in Perl? I do not know Perl
well enough to write it myself.

Thanks for your help.
--christian

Dr.sc.math.Christian W. Hoffmann
Mathematics and Statistical Computing
Landscape Dynamics and Spatial Development
Swiss Federal Research Institute WSL
Zuercherstrasse 111
CH-8903 Birmensdorf, Switzerland
phone: ++41-1-739 22 77    fax: ++41-1-739 22 15
e-mail: christian.hoffmann at wsl.ch
www: http://www.wsl.ch/staff/christian.hoffmann/
Christian,

Sounds like you want something that can run as a preprocess step from the
command line. If so (and if you're in Unix) you can use 'uniq', which will
remove any adjacent dupe lines, but otherwise leave the file in the
original order:

 > uniq myfile > umyfile

If you really want to get rid of _all_ dupes then sort first:

 > sort myfile | uniq > umyfile
OR
 > sort -u myfile > umyfile

BTW, 'uniq' has a nifty -c option which counts the number of dupes and can
be used to create a useful 'histogram', sorted by frequency. Must sort
first to make this work and then sort again after the uniq to arrange in
descending order:

   sort myfile | uniq -c | sort -r -n

HTH,
John Day
Staff Scientist
Computer Science Innovations
Melbourne, FL
http://www.csi.cc/~jday

At 08:32 AM 7/18/02 +0200, you wrote:
>Hi all,
>
>Excuse me for posting a question which may not be true R.
>
>I want to process a text file so that the resulting file contains only one
>of possibly multiple consecutive rows. Example (the row numbers do not
>belong to the file):
>
>(1) a b c d
>(2) a b c d
>(3) A b c d
>(4) A b c d
>(5) A b c d
>(6) a b c d
>(7) a b c D
>..
>
>resulting in:
>
>(1) a b c d
>(3) A b c d
>(6) a b c d
>(7) a b c D
>..
>
>(6) could be disposed of also by first sorting the original file.
>
>Does anybody have a script ready, preferably in Perl? I do not know Perl
>well enough to write it myself.
>
>Thanks for your help.
>--christian
>
>Dr.sc.math.Christian W. Hoffmann
>Mathematics and Statistical Computing
>Landscape Dynamics and Spatial Development
>Swiss Federal Research Institute WSL
>Zuercherstrasse 111
>CH-8903 Birmensdorf, Switzerland
>phone: ++41-1-739 22 77    fax: ++41-1-739 22 15
>e-mail: christian.hoffmann at wsl.ch
>www: http://www.wsl.ch/staff/christian.hoffmann/
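Since the question was asked on r-help, here is a rough R counterpart to the uniq commands above. It is a sketch only; the helper name drop_adjacent_duplicates is invented, and it assumes the file fits in memory as a character vector (for example from readLines()).

```r
# Hypothetical helper: keep a line only if it differs from the line
# immediately before it (what 'uniq' does).
drop_adjacent_duplicates <- function(lines) {
  if (length(lines) < 2) return(lines)
  keep <- c(TRUE, lines[-1] != lines[-length(lines)])
  lines[keep]
}

lines <- c("a b c d", "a b c d", "A b c d", "A b c d",
           "A b c d", "a b c d", "a b c D")

deduped <- drop_adjacent_duplicates(lines)   # like: uniq myfile
all_unique <- unique(lines)                  # all dupes removed, first occurrences kept in order
counts <- sort(table(lines), decreasing = TRUE)  # like: sort | uniq -c | sort -r -n
```

unique() keeps the first occurrence of each line in its original position, so unlike sort -u it does not reorder the file.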