I need to import a large number of simple, space-delimited text files with a few columns of data each. The one quirk is that some rows are missing data and some contain junk text at the end of each line. A typical file might look like:

    a b c d
    1 2 3 x
    4 5 6
    7 8 9 x
    1 2 3 x c c
    4 5 6 x
    7 8 9 x

I'm trying to avoid having to pre-process the text files, as they all sit on an ftp site that I don't manage. My initial approach was just to read the files using a read.table() statement with the arguments flush and fill set to TRUE. For example, to import the above text file I tried:

    read.table(file="ftp://ftp.example.dta", header=T, row.names=NULL, fill=T, flush=T)

However, R throws the error "more columns than column names" and won't import the file.

Interestingly, if I move the extra text "c c" from line 5 to line 6 in the data file, read.table() reads the file just fine, and ignores the "c c". So, my first question is, why does simply moving these data down a row solve this problem?

Next, I decided to try reading the file with the scan() function and it worked perfectly:

    data.frame(scan(file="ftp://ftp.example.dta", what=list(a=0, b=0, c=0, d=""), sep=" ", skip=1, flush=T, fill=T))

I'm new to R, but as I understand it read.table() is based on the scan() function. This makes me wonder if there is an additional argument I can add to read.table() to make it import the file successfully, as scan() was able to do. Any help in this regard would be very much appreciated. I'd also really like to hear folks' perspectives on the merits of scan() versus read.table() (e.g. when is scan() the best option?).

Cheers
Hi Roark,

In my experience, this error comes from a problem reading the headers, or from a problem with the "sep" parameter in read.table(). Try something like

    read.table(..., sep = "\t")

(this is for tab-delimited files). Others might give more ideas.

Cheers,
Tal

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
On Thu, Jan 27, 2011 at 11:23 PM, H Roark <hrbuilder at hotmail.com> wrote:
> [...] This makes me wonder if there is an additional argument I can add to
> read.table() to make it import the file successfully, as scan() was able
> to do.
Read the header into nms and then the data into DF and then put them together:

    # Open the connection explicitly so that read.table() continues
    # after the header line instead of re-reading the file from the top.
    con <- file("myfile.dat", "r")
    nms <- scan(con, what = "", nlines = 1)
    DF <- read.table(con, fill = TRUE)
    close(con)
    DF <- setNames(DF[seq_along(nms)], nms)

or just read it twice: first the one line of the header and then the data:

    nms <- unlist(read.table("myfile.dat", nrows = 1))
    DF <- read.table("myfile.dat", fill = TRUE, skip = 1)
    DF <- setNames(DF[seq_along(nms)], nms)

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
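A minimal, self-contained sketch of the read-it-twice variant, assuming a local copy of the sample file from the question. The temporary file and the helper name read_ragged are illustrative assumptions and do not appear in the original posts.

```r
# Recreate the sample file from the question in a temporary location
# (assumption: a local copy stands in for the ftp file).
sample_lines <- c(
  "a b c d",
  "1 2 3 x",
  "4 5 6",
  "7 8 9 x",
  "1 2 3 x c c",
  "4 5 6 x",
  "7 8 9 x"
)
path <- tempfile(fileext = ".dat")
writeLines(sample_lines, path)

# Hypothetical helper: read the header and the data separately, then keep
# only as many columns as there are header names.
read_ragged <- function(path) {
  # First line only: the column names.
  nms <- scan(path, what = "", nlines = 1, quiet = TRUE)
  # Everything after the header; fill = TRUE pads short rows, and the
  # junk fields end up in extra columns (V5, V6, ...).
  DF <- read.table(path, skip = 1, fill = TRUE, stringsAsFactors = FALSE)
  # Drop the extra columns and attach the header names.
  setNames(DF[seq_along(nms)], nms)
}

DF <- read_ragged(path)
str(DF)
```

For a batch of files, something like lapply(paths, read_ragged) should work the same way, where paths is a character vector you supply. Note that each file is opened twice, which may matter when reading from a remote location.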
On 2011-01-27 20:23, H Roark wrote:
> Interestingly, if I move the extra text "c c" from line 5 to line 6 in the
> data file, read.table() reads the file just fine, and ignores the "c c".
> So, my first question is, why does simply moving these data down a row
> solve this problem?

Note this comment in the Details section of ?read.table:

    "The number of data columns is determined by looking at the first five lines of input ..."

Peter Ehlers
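The five-line rule can be checked directly. The sketch below assumes a local copy of the sample file (the temporary paths and variable names are illustrative, not from the thread). It uses count.fields() to show the number of fields per line, reproduces the error, and then shows that moving "c c" to line 6 puts the six-field row outside the window that read.table() inspects. The last step relies on the fact that read.table() also takes the column count from col.names when that is supplied and is longer.

```r
# Sample file from the question, written to a temporary path (assumption).
sample_lines <- c(
  "a b c d",
  "1 2 3 x",
  "4 5 6",
  "7 8 9 x",
  "1 2 3 x c c",
  "4 5 6 x",
  "7 8 9 x"
)
path <- tempfile(fileext = ".dat")
writeLines(sample_lines, path)

# Fields per line; the header counts as line 1, so the 6-field row is
# line 5 and falls inside the first five lines.
fields <- count.fields(path)
print(fields)  # 4 4 3 4 6 4 4

# Six columns detected but only four header names: capture the error text.
msg <- tryCatch(
  {
    read.table(path, header = TRUE, row.names = NULL, fill = TRUE, flush = TRUE)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# Move the junk from line 5 to line 6: the first five lines now have at
# most four fields, and flush = TRUE discards the extra "c c".
moved_lines <- sample_lines
moved_lines[5] <- "1 2 3 x"
moved_lines[6] <- "4 5 6 x c c"
moved_path <- tempfile(fileext = ".dat")
writeLines(moved_lines, moved_path)
moved <- read.table(moved_path, header = TRUE, row.names = NULL,
                    fill = TRUE, flush = TRUE)
print(dim(moved))

# Alternative: size col.names from the widest data row in the whole file,
# so the result does not depend on where the junk appears.
n <- max(count.fields(path, skip = 1))
wide <- read.table(path, skip = 1, fill = TRUE,
                   col.names = paste0("V", seq_len(n)))
print(dim(wide))
```

Scanning the whole file with count.fields() costs an extra pass, but it removes the dependence on the first five lines.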