Dear R-listers:

I want to import a reasonably big file (15797 rows x 257 columns) into a table. The file is tab delimited, with NA in every empty space. I have reproduced the read.table() calls I used below. I have read the R Data Import/Export FAQ and still couldn't solve my problem (I might have missed it, of course). I'm using R 2.0.1 on a Mac G4, OS X 10.3.7.

I can import the file, but one of the columns "invades" the other: if there is an empty space marked as NA in the first column, it gets the value of the second column. I tried to import four different files (details below) and I think the problem is with the number of columns (with fewer columns it works).

Workarounds:
a) I can separate my file into several files, import them, and then make one object in R.
b) Try to learn basic commands in awk or perl? Any advice on this?

Another question (much less important): I have a binary file in Splus for this object. I exported the object in Splus as it says in the FAQ (data.dump), but data.restore doesn't exist as a function. Is it because I'm using a Mac?

Details of what I did:

## a) Importing a shorter version of my file (58 columns). I get the
## "invading" behaviour and a column of row.names that I don't understand
## where it comes from. (UNIQID should be empty and 10006 should be in
## All.FB.Id.)

> AllFBImpFields <- read.table('AllFBAllFieldsNAShorter.txt', fill=T, header=T,
+                              row.names=paste('a',1:15797, sep=''),
+                              as.is=T, nrows=15797)
> AllFBImpFields[1:2,1:5]
   row.names UNIQID All.FB.Id All.FB.5 All.FB.4
a1      <NA>  10006      <NA>     <NA>     <NA>
a2      <NA>  10007      <NA>     <NA>     <NA>

## b) Importing only 5 cols of the previous file. It works: there is no
## "invasion" and the row.names column is not inserted.

> AllFB5Cols <- read.table('AllFB5Cols.txt', fill=T, header=T,
+                          row.names=paste('a',1:15797, sep=''),
+                          as.is=T, nrows=15797)
> AllFB5Cols[1:2,1:5]
   UNIQID All.FB.Id Symbol       FB.gn CG.name
a1   <NA>     10006    p53 FBgn0039044 CG10873
a2   <NA>     10007  Gr94a FBgn0041225 CG31280

## c) Importing a file with 4 rows and 58 columns: invasion behaviour, plus
## a warning that I don't get in a), although the file is the same for the
## first 4 rows.

> x4rowsAllCol <- read.table('AllFB4rowsAllCols.txt', fill=T, header=T,
+                            row.names=paste('a',1:4, sep=''),
+                            as.is=T, nrows=4)
Warning message:
incomplete final line found by readTableHeader on `AllFB4rowsAllCols.txt'
> x4rowsAllCol[1:2,1:5]
   row.names UNIQID All.FB.Id All.FB.5 All.FB.4
a1        NA  10006        NA       NA       NA
a2        NA  10007        NA       NA       NA

## d) Importing a file with 4 rows and 5 cols: the result is like b), but
## it gives the same warning as c)!

> x4rows5cols <- read.table('AllFB4rows5cols.txt', fill=T, header=T,
+                           row.names=paste('a',1:4, sep=''),
+                           as.is=T, nrows=4)
Warning message:
incomplete final line found by readTableHeader on `AllFB4rows5cols.txt'
> x4rows5cols[1:2,1:5]
   UNIQID All.FB.Id All.FB.5 All.FB.4 All.FB.3
a1     NA     10006       NA       NA       NA
a2     NA     10007       NA       NA       NA
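[A minimal sketch of workaround (a) done entirely within R, without splitting the source file: read.table() can skip columns via colClasses = "NULL", so a wide file can be read in column chunks and recombined with cbind(). The file name and the 130/127 split are illustrative, and sep = '\t' is assumed since the file is tab delimited.]

    ## read columns 1-130 only ("NULL" drops a column; NA means default conversion)
    left  <- read.table('AllFBAllFields.txt', header = TRUE, sep = '\t',
                        colClasses = c(rep(NA, 130), rep("NULL", 127)))
    ## read columns 131-257 only
    right <- read.table('AllFBAllFields.txt', header = TRUE, sep = '\t',
                        colClasses = c(rep("NULL", 130), rep(NA, 127)))
    AllFB <- cbind(left, right)   # recombine into one 15797 x 257 data frame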
On Wed, 2005-01-19 at 04:25 +0000, Tiago R Magalhaes wrote:

> Dear R-listers:
>
> I want to import a reasonably big file into a table (15797 x 257
> columns). The file is tab delimited with NA in every empty space.

Tiago,

Have you tried to use read.table() explicitly defining the field
delimiting character as a tab, to see if that changes anything?

Try the following:

AllFBImpFields <- read.table('AllFBAllFieldsNAShorter.txt',
                             header = TRUE,
                             row.names = paste('a', 1:15797, sep = ''),
                             as.is = TRUE,
                             sep = "\t")

I added the 'sep = "\t"' argument at the end.

Also, leave out the 'fill = TRUE', which can cause problems. You do not
need this unless your source file has a varying number of fields per
line.

Note that you do not need to specify the 'nrows' argument unless you
want something less than all of the rows. Using the combination of
'skip' and 'nrows', you can read a subset of rows from the middle of
the input file.

See if that helps. Usually when there are column alignment problems, it
is because the rows are not being consistently parsed into fields, which
is frequently the result of not having the proper delimiting character
specified.

The last thought is to be sure that a '#' is not in your data file. This
is interpreted as a comment character by default, which means that
anything after it on a row will be ignored.

HTH,

Marc Schwartz
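[A short sketch of the 'skip' + 'nrows' combination described above, together with comment.char = "" to neutralize the '#' issue from the last paragraph; the row numbers are arbitrary and the file name is carried over from the example above.]

    ## read the header line separately, then 100 rows from the middle
    hdr <- scan('AllFBAllFieldsNAShorter.txt', what = "", sep = "\t", nlines = 1)
    mid <- read.table('AllFBAllFieldsNAShorter.txt', sep = "\t", header = FALSE,
                      skip = 5001, nrows = 100,         # data rows 5001-5100 (1 header + 5000 rows skipped)
                      comment.char = "", as.is = TRUE)  # treat '#' as data, not a comment
    names(mid) <- hdr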
On Wed, 19 Jan 2005, Tiago R Magalhaes wrote:

> another question (much less important): I have a binary file in Splus
> for this object. I exported the object in Splus as it says in the FAQ
> (data.dump).

Whose FAQ? data.dump is not mentioned in the R FAQ.

> But data.restore doesn't exist as a function. Is it because I'm using
> a Mac?

It is in package foreign: please consult the `R Data Import/Export
Manual'. There are details you need to follow, including loading
package foreign.

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
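[For concreteness, a minimal sketch of the steps described above, with a hypothetical file name; note that data.restore() re-creates the dumped objects in the workspace rather than returning them as a value.]

    library(foreign)              # provides data.restore()
    data.restore('dumpdata.sdd')  # reads a file written by data.dump() in S-PLUS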
Thanks very much Marc and Prof Ripley.

a) Using sep='\t' with read.table() helps somewhat, but there is still a
problem: I cannot get all the rows:

df <- read.table('file.txt', fill=T, header=T, sep='\t')
dim(df)   # 9543 195

while with the shorter file (11 cols) I get all the rows:

dim(df)   # 15797 11

I have looked at row 9544, where the file seems to stop being read, but I
cannot see an obvious reason for this in any of the columns. Any ideas
why? Maybe there is one column that is stopping the reading process, and
that column is not one of the 11 present in the smaller file.

b) fill=T is necessary: without fill=T I get an error:
"line 1892 did not have 195 elements"

c) Help page for read.table: I reread the help file for read.table and I
would suggest changing it. From what I think I am reading, the '\t'
should not be needed for my file to work, but it actually is. From the
help page:

    If 'sep = ""' (the default for 'read.table') the separator is
    "white space", that is one or more spaces, tabs or newlines.

d) I incorrectly mentioned the FAQ in relation to data.restore. Where I
actually saw data.restore mentioned was the `R Data Import/Export
Manual', which I read (even more than once...), failing to read the
first paragraph of the section, where it's stated that the foreign
package is needed.

It works! (with source):

in Splus 6.1, Windows 2000:   dump('file')
in R 2.0.1, Mac OS X 10.3.7:  source('file')

I get a list, where the first element is the data.frame I want; the
column names have 'value' added to them.
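[A hedged diagnostic sketch for the remaining early stop at row 9544, under the assumption that a stray quote character (' or ") or a '#' in the data is swallowing lines, a common cause of read.table() reading fewer rows than the file holds; count.fields() reports how many fields each raw line parses into. The file name and column count mirror the example above.]

    ## how many tab-separated fields does each line of the file have?
    n <- count.fields('file.txt', sep = '\t', quote = '', comment.char = '')
    table(n)          # ideally a single value: the true column count
    which(n != 195)   # lines that parse into the wrong number of fields

    ## re-read with quote and comment handling disabled
    df <- read.table('file.txt', sep = '\t', header = TRUE, fill = TRUE,
                     quote = '', comment.char = '', as.is = TRUE)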