marc_schwartz at comcast.net
2007-May-15 07:06 UTC
[Rd] read.table() can't read in this table (But Splus can) (PR#9687)
On Mon, 2007-05-14 at 23:41 +0200, vax9000 at gmail.com wrote:
> Full_Name: vax, 9000
> Version: 2.4.0, 2.2.1
> OS: 2.4.0: Mac OS X; 2.2.1: Linux
> Submission from: (NULL) (192.35.79.70)
>
>
> To reproduce this bug, first go to the website "http://llmpp.nih.gov/DLBCL/" and
> download the 14.8M data set "Web Figure 1 Data file". The direct link is
> "http://llmpp.nih.gov/DLBCL/NEJM_Web_Fig1data". Save it as "datafile.txt"
>
> Then, start R, type in command "x <- read.table("datafile.txt", header=TRUE,
> sep="\t")". The data has 7400 lines, but not all lines could be read in by R.
>
> Easier test data set:
> Use the command "head -n 100 datafile.txt > shortdatafile.txt" to extract the
> first 100 lines. The R command "x <- read.table("datafile.txt", header=TRUE,
> sep="\t")" could not read in even this 100 lines of data.
>
> But Splus can, with the same command. What is wrong?

Using R version 2.5.0 Patched:

> DF <- read.table("http://llmpp.nih.gov/DLBCL/NEJM_Web_Fig1data", header = TRUE, sep = "\t")
Warning message:
number of items read is not a multiple of the number of columns

So I tried it with 'fill = TRUE' and that seems to work, which suggests
that perhaps something is going on with the data file structure:

DF <- read.table("http://llmpp.nih.gov/DLBCL/NEJM_Web_Fig1data", header = TRUE, sep = "\t", fill = TRUE)

> str(DF)
'data.frame':   4734 obs. of  295 variables:
 $ UNIQID                            : int  27481 17013 24751 27498 27486 30984 17293 28329 27459 27482 ...
 $ NAME                              : Factor w/ 4040 levels "||*AA037178|Hs.179661|FK506 binding protein 1A (12kD)",..: 3444 3445 3446 3444 3445 657 1788 3121 3119 3119 ...
 $ MLC94.46_LYM009_de.novo.untreated : num  0.234 0.452 0.405 0.115 0.249 ...
 $ MLC96.45_LYM186_de.novo.untreated : num  -0.1725 -0.0387 -0.0413 -0.0242 -0.1028 ...
 $ MLC91.27_LYM427_de.novo.untreated : num  0.200 0.175 0.195 0.223 0.179 ...
 $ MLC96.84_LYM225_transformed       : num  -0.213 -0.325 -0.200 -0.199 -0.155 ...
 $ MLC95.43_LYM095_de.novo.untreated : num  -0.1197 0.0038 -0.0213 -0.0705 -0.0755 ...
 $ MLC91.28_LYM428_de.novo.untreated : num  -0.3729 0.0047 -0.2220 -0.3373 -0.2808 ...
 $ MLC94.50_LYM004_de.novo.untreated : num  -0.195 -0.224 -0.126 -0.161 -0.199 ...
 $ MLC95.46_LYM098_de.novo.untreated : num  0.489 0.611 0.577 0.661 0.519 ...
 $ MLC95.62_LYM114_de.novo.untreated : num  0.390 0.657 0.747 0.723 0.731 ...
 $ MLC95.85_LYM137_de.novo.untreated : num  -0.277 -0.564 -0.297 -0.140 -0.513 ...
 ..

I would update your version of R and then re-try this.

HTH,

Marc Schwartz
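One way to probe the file structure suspected in this message is count.fields(), which reports how many fields R sees on each line. The following is only a sketch under assumptions: the toy file, its contents, and the names `toy`, `path`, `n_default` and `n_raw` are made up for illustration and are not taken from the NIH data set.

```r
# Hypothetical toy file: tab-separated, with apostrophes inside one column.
toy <- c("id\tname\tvalue",
         "1\tgene A 3' end\t0.1",
         "2\tgene B\t0.2",
         "3\tgene C 5' end\t0.3",
         "4\tgene D\t0.4")
path <- tempfile(fileext = ".txt")
writeLines(toy, path)

# Fields per line with the default quote set (both " and ' act as quotes).
n_default <- count.fields(path, sep = "\t")

# Fields per line with quote handling switched off.
n_raw <- count.fields(path, sep = "\t", quote = "")

print(n_default)
print(n_raw)
```

If the two vectors differ, or the first one contains NA entries, a quote character is probably being treated as the start of a quoted string that runs across several lines.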
Liaw, Andy
2007-May-15 13:11 UTC
[Rd] read.table() can't read in this table (But Splus can)
It's the quoting character(s). The following seems to read the file in
correctly:

R> DF <- read.table("http://llmpp.nih.gov/DLBCL/NEJM_Web_Fig1data",
+                  header = TRUE, sep = "\t", quote="")
R> str(DF)
'data.frame':   7399 obs. of  295 variables:
[...]

If I have to guess, it's the "3-prime" or "5-prime" that occurs commonly
in biology... I don't think Mr. 9000 Vax can blame R for this.

Best,
Andy
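The quoting effect Andy describes can be reproduced on a small scale. This is a minimal sketch, assuming a made-up tab-separated file whose text column contains apostrophes in the style of 3' and 5'; the rows and the names `toy`, `path`, `df_default` and `df_raw` are hypothetical and not part of the original report.

```r
# Hypothetical toy data mimicking gene descriptions that contain 3' and 5'.
toy <- c("id\tname\tvalue",
         "1\tprobe A 3' end\t0.1",
         "2\tprobe B\t0.2",
         "3\tprobe C 5' end\t0.3",
         "4\tprobe D\t0.4")
path <- tempfile(fileext = ".txt")
writeLines(toy, path)

# Default quote = "\"'": an apostrophe opens a quoted string that runs on
# until the next apostrophe, so rows can be merged or the read can fail.
df_default <- tryCatch(
  suppressWarnings(read.table(path, header = TRUE, sep = "\t")),
  error = function(e) NULL
)

# quote = "": apostrophes are ordinary characters, one row per data line.
df_raw <- read.table(path, header = TRUE, sep = "\t", quote = "")

# Compare how many rows each call produced.
print(if (is.null(df_default)) NA else nrow(df_default))
print(nrow(df_raw))
```

For tab-separated files, read.delim() is a related option: its default is quote = "\"", so apostrophes are not treated as quote characters there, although double quotes still are.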