Pradeep Bisht
2016-Jan-17 15:31 UTC
[R] Reading a tab delimited file of varying length using read.table
Hello Experts,

As a SAS developer, I am finding it difficult to perform some data cleaning tasks in R that are quite easy to perform in SAS.

I have been trying to read a .dat file and, after a lot of attempts, have failed to find a solution. Maybe R doesn't have the functionality right now, or I am not looking in the right place. Here is my code:

    f5=read.table("http://data.princeton.edu/wws509/datasets/divorce.dat",
                  header=T,
                  sep="\t",
                  colClasses = c("numeric", "character", "character",
                                 "character", "double", "character"))

The error I get is this:

    Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
      scan() expected 'a real', got '912-15yearsNoNo10.546No'

Also, does read.table always call scan in the background to do its job? If so, why use read.table in the first place?

Pradeep
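On the closing question: the error message above is itself raised by scan() from inside the read.table call, so read.table does hand the actual reading to scan(). A minimal sketch of what read.table adds on top, assuming a tiny made-up two-column table (the names txt, raw and df are illustrative only):

```r
# A made-up, whitespace-separated table with a header line.
txt <- "id grp\n1 a\n2 b"

# scan() needs the column types spelled out in `what`, has to be told to
# skip the header, and returns a plain list of vectors.
raw <- scan(text = txt, what = list(id = 0, grp = ""), skip = 1, quiet = TRUE)

# read.table() uses the header for names, guesses the column types and
# returns a data frame.
df <- read.table(text = txt, header = TRUE)

str(raw)
str(df)
```

In other words, read.table is the convenience layer for rectangular data, while scan() is the lower-level reader underneath it.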
Rolf Fankhauser
2016-Jan-17 21:42 UTC
[R] Reading a tab delimited file of varying length using read.table
Hello Pradeep,

I downloaded divorce.dat but could not find any tabs between the columns. You defined tab as the separator, so your columns would have to be separated by tabs. Since they are not, read.table reads the whole first line as a single field and wants to save the result as numeric, because you defined the first column as numeric. That's my interpretation.

So, use a tab, comma or semicolon as the delimiter in the file; then it should work.

Rolf
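This interpretation can be checked locally with count.fields(), which reports how many fields each line has for a given separator. A minimal sketch, assuming three sample rows whose padding is rebuilt with sprintf (the exact spacing of the real file is an assumption, and sample_lines is an illustrative name):

```r
# Sample rows; the column padding is an assumption made for illustration.
fmt <- "%9s%13s%10s%8s%10s%6s"
sample_lines <- c(
  sprintf(fmt, "id", "heduc", "heblack", "mixed", "years", "div"),
  sprintf(fmt, "9", "12-15 years", "No", "No", "10.546", "No"),
  sprintf(fmt, "11", "< 12 years", "No", "No", "34.943", "No")
)

# There are no tabs in these lines, so with sep = "\t" each whole line
# counts as a single field.
con <- textConnection(sample_lines)
tab_fields <- count.fields(con, sep = "\t")
close(con)

print(tab_fields)
```

One field per line, combined with a numeric first column in colClasses, is the situation that makes scan() complain that it expected 'a real'.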
Ben Tupper
2016-Jan-17 21:46 UTC
[R] Reading a tab delimited file of varying length using read.table
Hi Pradeep,

Any software would be challenged to determine the boundaries between your columns.

    ff <- 'http://data.princeton.edu/wws509/datasets/divorce.dat'
    txt <- readLines(ff)
    head(txt)
    # [1] " id heduc heblack mixed years div "   " 9 12-15 years No No 10.546 No "
    # [3] " 11 < 12 years No No 34.943 No "      " 13 < 12 years No No 2.834 Yes "
    # [5] " 15 < 12 years No No 17.532 Yes "     " 33 12-15 years No No 1.418 No "

You don't have tab delimiters but instead have space delimiters (well, sort of). Your second column has either one ("12-15 years") or two ("< 12 years") spaces embedded in its values. That will mess up any scheme that uses spaces to delineate the columns.

Perhaps you can read this as fixed width (see ?read.fwf), but you'll have to fiddle with the width specifications.

Cheers,
Ben

Ben Tupper
Bigelow Laboratory for Ocean Sciences
60 Bigelow Drive, P.O. Box 380
East Boothbay, Maine 04544
http://www.bigelow.org
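The effect of the embedded spaces can be made visible with count.fields() and whitespace as the separator. A minimal sketch, assuming three sample rows with padding rebuilt by sprintf (the exact spacing of the real file is an assumption, and sample_lines is an illustrative name):

```r
# Sample rows; the column padding is an assumption made for illustration.
fmt <- "%9s%13s%10s%8s%10s%6s"
sample_lines <- c(
  sprintf(fmt, "id", "heduc", "heblack", "mixed", "years", "div"),
  sprintf(fmt, "9", "12-15 years", "No", "No", "10.546", "No"),
  sprintf(fmt, "11", "< 12 years", "No", "No", "34.943", "No")
)

# sep = "" means "any run of white space". The header has 6 fields, but
# "12-15 years" adds one extra field and "< 12 years" adds two.
con <- textConnection(sample_lines)
ws_fields <- count.fields(con, sep = "")
close(con)

print(ws_fields)  # 6 7 8
```

Because the number of fields changes from row to row, no whitespace-based separator can recover the six intended columns.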
Uwe Ligges
2016-Jan-17 21:48 UTC
[R] Reading a tab delimited file of varying length using read.table
This is not a tab-delimited file (as you apparently assume, given the code) but a fixed-width format, hence I'd try:

    url <- "http://data.princeton.edu/wws509/datasets/divorce.dat"
    widths <- c(9, 13, 10, 8, 10, 6)
    f5 <- read.fwf(url, widths = widths, skip = 1, strip.white = TRUE)

    names(f5) <- as.character(unlist(read.fwf(url, widths = widths,
                                              strip.white = TRUE, n = 1)))

Not sure why reading it simply with header=TRUE does not work, but I have no time to investigate this now.

Best,
Uwe Ligges
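The same two-pass idea can be tried without network access. A minimal sketch, assuming a small temporary file whose rows are padded to the widths above with sprintf (the padding, path and the three sample rows are illustrative assumptions, not the real file):

```r
widths <- c(9, 13, 10, 8, 10, 6)
fmt <- paste0("%", widths, "s", collapse = "")  # "%9s%13s%10s%8s%10s%6s"

# Write a tiny fixed-width file; its layout is an assumption for illustration.
path <- tempfile(fileext = ".dat")
writeLines(c(
  sprintf(fmt, "id", "heduc", "heblack", "mixed", "years", "div"),
  sprintf(fmt, "9", "12-15 years", "No", "No", "10.546", "No"),
  sprintf(fmt, "11", "< 12 years", "No", "No", "34.943", "No"),
  sprintf(fmt, "13", "< 12 years", "No", "No", "2.834", "Yes")
), path)

# Pass 1: read the data rows, skipping the header line.
body <- read.fwf(path, widths = widths, skip = 1, strip.white = TRUE)

# Pass 2: read only the first line with the same widths and use it as names.
hdr <- read.fwf(path, widths = widths, n = 1, strip.white = TRUE)
names(body) <- as.character(unlist(hdr))

str(body)
unlink(path)
```

Reading the header with the same widths as the body is what keeps the names aligned with the columns, regardless of the blanks inside the values.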
Pradeep Bisht
2016-Jan-17 22:52 UTC
[R] Reading a tab delimited file of varying length using read.table
A big thanks to everyone for helping me solve this problem. My bad: I assumed the file was delimited by tabs, which it was not. It's a fixed-width file, and the code that Uwe gave is just perfect. It was clever to skip the first row, since the delimiter cannot be specified in this case. I added a few more things to it and got the desired solution. Here is the code:

    url <- "http://data.princeton.edu/wws509/datasets/divorce.dat"
    widths <- c(9, 13, 10, 8, 10, 6)
    f5 <- read.fwf(url, widths = widths, skip = 1, nrow=10, strip.white = TRUE,
                   col.names=c("id","heduc","heblack","mixed","years","div"),
                   colClasses = c("numeric", "character", "character",
                                  "character", "double", "character"))

Regards,
Pradeep Singh
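One detail worth noting about the row limit: read.fwf's own argument for the number of records is n. nrow is not among its formal arguments, so, judging from the argument list, it is passed on through ... to read.table, where it partially matches nrows. A minimal sketch using n instead, on a small temporary file (the sprintf padding, the path and the sample rows are illustrative assumptions, and "numeric" is used for both numeric columns):

```r
widths <- c(9, 13, 10, 8, 10, 6)
fmt <- paste0("%", widths, "s", collapse = "")

# Tiny fixed-width file; its layout is an assumption for illustration.
path <- tempfile(fileext = ".dat")
writeLines(c(
  sprintf(fmt, "id", "heduc", "heblack", "mixed", "years", "div"),
  sprintf(fmt, "9", "12-15 years", "No", "No", "10.546", "No"),
  sprintf(fmt, "11", "< 12 years", "No", "No", "34.943", "No"),
  sprintf(fmt, "13", "< 12 years", "No", "No", "2.834", "Yes")
), path)

# Skip the header, read only the first two records, and supply the
# column names and classes explicitly.
f5 <- read.fwf(path, widths = widths, skip = 1, n = 2, strip.white = TRUE,
               col.names = c("id", "heduc", "heblack", "mixed", "years", "div"),
               colClasses = c("numeric", "character", "character",
                              "character", "numeric", "character"))

str(f5)
unlink(path)
```

With n, read.fwf stops reading the input after the requested records instead of converting the whole file first.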
Rolf Turner
2016-Jan-17 23:01 UTC
[R] Reading a tab delimited file of varying length using read.table
On 18/01/16 10:48, Uwe Ligges wrote:
> Not sure why reading it simply with header=TRUE does not work, but no
> time to investigate this now.

Dear Uwe,

I have fiddled around a bit, and the situation seems to me to be of the nature of a bug in read.fwf. It would seem that in order for header=TRUE to work, the entries of the header need to be separated by the sep delimiter, which defaults to "\t". In the case in question the entries are separated by blanks, so presumably the header gets read in as a single entity rather than as 6, leading to a mismatch between the length of the header and the number of columns. It seems that the specified widths get ignored when the header line is dealt with.

It also seems that if one specifies sep="" then the header gets read correctly, but then strings of blanks get interpreted as field separators throughout, and blanks within the fields result in the wrong number of columns.

I think that the code of read.fwf is easy enough to fix; a slight adjustment will make the header get treated the same way as the body of the file. I don't see any problems or drawbacks with so doing, and experimenting with my modified function resulted in the divorce data being read in with header=TRUE with no problems.

If this mod is made, I see no reason to keep the "sep" argument in read.fwf, except maybe for backward compatibility, and I don't think there would be any such issues since it never worked properly anyhow.

cheers,

Rolf

P.S. I can send you my modified version of read.fwf off-list if this would be of any use to you.

R.
--
Technical Editor ANZJS
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276
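The header behaviour described here can be reproduced on a small scale. As far as I recall, the help page for read.fwf says of header that the names, if present, must be delimited by sep, which would make this documented behaviour rather than a bug; treat that reading as an assumption. A minimal sketch with two temporary files, one with a blank-padded header and one with a tab-separated header (the sprintf padding, file names and sample rows are illustrative assumptions):

```r
widths <- c(9, 13, 10, 8, 10, 6)
fmt <- paste0("%", widths, "s", collapse = "")
cols <- c("id", "heduc", "heblack", "mixed", "years", "div")
body_lines <- c(
  sprintf(fmt, "9", "12-15 years", "No", "No", "10.546", "No"),
  sprintf(fmt, "11", "< 12 years", "No", "No", "34.943", "No")
)

# Case 1: header padded with blanks, like the body. The error message is
# captured (if an error occurs) instead of stopping the script.
blank_hdr <- tempfile(fileext = ".dat")
writeLines(c(do.call(sprintf, c(list(fmt), as.list(cols))), body_lines),
           blank_hdr)
res_blank <- tryCatch(
  read.fwf(blank_hdr, widths = widths, header = TRUE, strip.white = TRUE),
  error = function(e) conditionMessage(e)
)

# Case 2: header names separated by sep, which is a tab by default.
tab_hdr <- tempfile(fileext = ".dat")
writeLines(c(paste(cols, collapse = "\t"), body_lines), tab_hdr)
res_tab <- read.fwf(tab_hdr, widths = widths, header = TRUE,
                    strip.white = TRUE)

print(res_blank)
print(names(res_tab))
unlink(c(blank_hdr, tab_hdr))
```

If the header line cannot be changed, reading it separately with the same widths, or passing col.names and skipping the first line, avoids the issue without modifying read.fwf.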