Johan Jackson
2008-Apr-19 06:19 UTC
[R] multiple separators in sep argument for read.table?
Hello,

Is there any way to add multiple separators in the sep= argument in read.table? I would like to be able to create different columns if I see a white space OR a "/".

Thanks in advance,
JJ
Prof Brian Ripley
2008-Apr-19 06:38 UTC
[R] multiple separators in sep argument for read.table?
On Sat, 19 Apr 2008, Johan Jackson wrote:

> Hello,
>
> Is there any way to add multiple separators in the sep= argument in
> read.table? I would like to be able to create different columns if I see a
> white space OR a "/".

No. read.table() uses scan(), and that requires 'sep' to be a single character (if specified).

You can read your dataset by readLines(), change "/" to, say, "\t" by gsub() and then use read.table() on a textConnection() from the resulting character vector.

--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
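The recipe described in this reply can be sketched roughly as follows. The sample lines in `raw` are made up for illustration (the original post gives no data); with a real file one would presumably start from `readLines("myrawfile")` instead. Because read.table() treats any white space as a separator by default, turning each "/" into a tab is enough to make both kinds of separator split columns.

```r
# Hypothetical input lines: fields separated by a space OR a "/".
# With a real file, something like raw <- readLines("myrawfile") would be used.
raw <- c("a1 b1/c1",
         "a2/b2 c2",
         "a3 b3 c3")

# Change every "/" into a tab; fixed = TRUE treats "/" as a literal string.
tabbed <- gsub("/", "\t", raw, fixed = TRUE)

# Read the modified lines through a text connection. The default sep = ""
# splits on any white space, so spaces and tabs both separate fields.
con <- textConnection(tabbed)
dat <- read.table(con, header = FALSE, stringsAsFactors = FALSE)
close(con)

print(dat)
```

More recent versions of R also accept `read.table(text = tabbed)`, which avoids managing the connection by hand.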
(Ted Harding)
2008-Apr-19 07:52 UTC
[R] multiple separators in sep argument for read.table?
On 19-Apr-08 06:19:09, Johan Jackson wrote:

> Hello,
> Is there any way to add multiple separators in the sep= argument
> in read.table? I would like to be able to create different columns
> if I see a white space OR a "/".
>
> Thanks in advance,
> JJ

As well as Brian Ripley's suggestion for how to do it within R, if you have access to the 'awk' program (as on all Unix/Linux systems and, in principle, installable in Windows) then you can pre-process the file outside of R along the following lines.

First, here is a test file "temp.txt":

  R1C1 R1C2;R1C3
  R2C1,R2C2 R2C3
  R3C1,R3C2;R3C3

where each line has 3 fields, separated by any of " " or "," or ";" and it is desired to obtain a purely comma-separated version of it.

  awk ' BEGIN{FS="[ ]|[;]|[,]";OFS=","};{$1=$1};{print $0} ' < temp.txt > temp2.txt

produces a file temp2.txt with contents

  R1C1,R1C2,R1C3
  R2C1,R2C2,R2C3
  R3C1,R3C2,R3C3

The logic is that the initialisation

  BEGIN{FS="[ ]|[;]|[,]";OFS=","}

sets up the Field Separator variable FS as a regular expression which matches any one of " " ";" "," and the Output Field Separator OFS to be ",". $0 denotes the entire input line, and the "$1=$1" causes the first field to be re-computed (to be equal to itself), which forces the whole input line $0 to be rebuilt with its fields joined by OFS, i.e. by ",".

Hence an 'awk' program to handle the case you describe could be

  awk ' BEGIN{FS="[ ]|[/]";OFS=" "};{$1=$1};{print $0} ' < myrawfile > myfinalfile

It gets slightly more interesting if your "white space" separating two fields might be any number of consecutive spaces or a TAB, say. In that case something like

  awk ' BEGIN{FS="[ ][ ]*|[;]|[,]|[\t]";OFS=","};{$1=$1};{print $0} ' < myrawfile > myfinalfile

might be needed. Here "[ ][ ]*" means "one space followed by zero or more spaces", and "\t" is the notation for TAB.
If I change the test file above to

  R1C1   R1C2;R1C3
  R2C1,R2C2	R2C3
  R3C1,R3C2;R3C3

where the long blank in the first line is 3 consecutive " ", and the long blank in the second line is a single TAB, then the last 'awk' program above generates exactly the same output as before.

Just a thought! I'm always tempted to suggest that people use 'awk' in conjunction with R, not only to deal with the kind of relatively simple substitutions you describe, but also for exploring and cleaning up the sort of mess that people can send you after exporting a CSV file from an Excel spreadsheet, etc. (It would take too long to give examples of this sort of thing.)

With best wishes,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 19-Apr-08 Time: 08:52:56
------------------------------ XFMail ------------------------------
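For readers without awk to hand, the same normalisation (runs of spaces, a TAB, ";" or "," all becoming a single ",") can probably be done inside R as well. This is only a sketch combining the two replies in this thread, not something either poster wrote: the vector `raw` reproduces the modified test file described above, and the regular expression is an assumed R counterpart of the awk FS pattern.

```r
# The modified test file from this message: three spaces in line 1,
# a single TAB in line 2. With a real file: raw <- readLines("myrawfile").
raw <- c("R1C1   R1C2;R1C3",
         "R2C1,R2C2\tR2C3",
         "R3C1,R3C2;R3C3")

# One or more spaces/tabs, or a single ";" or ",", becomes one comma.
csv <- gsub("[ \t]+|[;,]", ",", raw)
print(csv)

# The cleaned lines can then be read as ordinary comma-separated data.
con <- textConnection(csv)
dat <- read.table(con, sep = ",", header = FALSE, stringsAsFactors = FALSE)
close(con)

print(dat)
```

Unlike the awk version, this treats a mixed run of spaces and tabs as one separator; whether that is wanted depends on the data.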