thr3ads.net - R help - [R] Reading a tab delimted file of varying length using read.table [Jan 2016]

Pradeep Bisht

2016-Jan-17 15:31 UTC

[R] Reading a tab delimted file of varying length using read.table

Hello Experts,

Being a SAS developer, I am finding it difficult to perform some data
cleaning tasks in R that are quite easy to perform in SAS.

I have been trying to read a .dat file and, after a lot of attempts, have
failed to find a solution. Maybe R doesn't have the functionality right
now, or I am not looking in the right place. Here is my code:

f5 = read.table("http://data.princeton.edu/wws509/datasets/divorce.dat",
                header = TRUE,
                sep = "\t",
                colClasses = c("numeric", "character", "character",
                               "character", "double", "character"))
The error I get is this:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  scan() expected 'a real', got '912-15yearsNoNo10.546No'

Also, does read.table always call scan in the background to do its job? If so,
why use read.table in the first place?

Pradeep

Rolf Fankhauser

2016-Jan-17 21:42 UTC

Hello Pradeep

I downloaded divorce.dat but I could not find tabs between the columns.
You defined tab as separator, so your columns should be separated by tabs.
Therefore read.table reads the whole first line and wants to save the 
result as numeric because you defined the first column as numeric.

That's my interpretation.
So, use a tab, comma or semicolon as the delimiter in the file and it should work.

Rolf
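
A quick way to check this interpretation is to look for tab characters directly. The sketch below is only an illustration: `txt` is a stand-in for `readLines()` on the URL, filled with three lines copied from the `head(txt)` output quoted elsewhere in this thread, so nothing is downloaded.

```r
# Three lines copied from the head(txt) output in this thread
# (assumption: they are representative of the whole file).
txt <- c(
  "       id        heduc   heblack   mixed     years   div  ",
  "       9   12-15 years        No      No    10.546    No  ",
  "      11    < 12 years        No      No    34.943    No  "
)

# Does any line contain a tab character at all?
has_tab <- grepl("\t", txt, fixed = TRUE)
any(has_tab)

# How many fields does each line have when split on tabs?
con <- textConnection(txt)
n_tab <- count.fields(con, sep = "\t")
close(con)
n_tab
```

If every line comes back as a single field, read.table with sep="\t" has no choice but to treat the whole line as the first column, which is consistent with the scan() error in the original post.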

Ben Tupper

2016-Jan-17 21:46 UTC

Hi Pradeep,

Any software would be challenged to determine the boundaries between your
columns.

ff <- 'http://data.princeton.edu/wws509/datasets/divorce.dat'
txt <- readLines(ff)
head(txt)
# [1] "       id        heduc   heblack   mixed     years   div  "
# [2] "       9   12-15 years        No      No    10.546    No  "
# [3] "      11    < 12 years        No      No    34.943    No  "
# [4] "      13    < 12 years        No      No     2.834   Yes  "
# [5] "      15    < 12 years        No      No    17.532   Yes  "
# [6] "      33   12-15 years        No      No     1.418    No  "

You don't have tab delimiters but instead have space delimiters (well sort
of).  Your second column has either one ("12-15 years") or two
("< 12 years") spaces embedded in the values.  That will mess up
any scheme using spaces to delineate the columns.

Perhaps you can read this as fixed width - see ?read.fwf - but you'll have
to fiddle with the width specifications.
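
One possible way to cut down on the fiddling is to derive the widths from the header line. This is a sketch under an assumption, not a general method: it assumes every value is right-aligned so that it ends no later than its header label, which appears to hold for the lines shown above. The `header` and `row2` strings are copied from that output.

```r
header <- "       id        heduc   heblack   mixed     years   div  "
row2   <- "      11    < 12 years        No      No    34.943    No  "

# Splitting on whitespace gives the wrong number of fields for the data row.
con <- textConnection(c(header, row2))
n_ws <- count.fields(con, sep = "")
close(con)
n_ws

# Find where each header label ends.
m    <- gregexpr("[^ ]+", header)[[1]]
ends <- as.integer(m) + attr(m, "match.length") - 1L

# Assumed layout: each field runs from just after the previous label's end
# up to and including this label's end.
widths <- diff(c(0L, ends))
widths

# Cut the data row with those widths and trim the padding.
last   <- cumsum(widths)
first  <- last - widths + 1L
fields <- trimws(substring(row2, first, last))
fields
```

The resulting `widths` vector can then be passed to read.fwf. If a column in the real file is left-aligned, or a value is wider than its label, the boundaries would need adjusting by hand.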

Cheers,
Ben

Ben Tupper
Bigelow Laboratory for Ocean Sciences
60 Bigelow Drive, P.O. Box 380
East Boothbay, Maine 04544
http://www.bigelow.org

Uwe Ligges

2016-Jan-17 21:48 UTC

This is not a tab delimited file (as you apparently assume given the 
code), but a fixed width format, hence I'd try:

url <- "http://data.princeton.edu/wws509/datasets/divorce.dat"
widths <- c(9, 13, 10, 8, 10, 6)
f5 <- read.fwf(url, widths = widths, skip = 1, strip.white = TRUE)

names(f5) <- as.character(unlist(read.fwf(url, widths = widths, 
strip.white=TRUE, n=1)))

Not sure why reading it simply with header=TRUE does not work, but no 
time to investigate this now.

Best,
Uwe Ligges



Pradeep Bisht

2016-Jan-17 22:52 UTC

A big thanks to everyone for helping me solve this problem.
My bad: I assumed the file was delimited by tabs, which it was not. It's a
fixed-width file, and the code that Uwe gave is just perfect.
It was clever to skip the first row, since the delimiter cannot be
specified in this case. I added a few more things to it and got the desired
solution.
Here is the code:

url <- "http://data.princeton.edu/wws509/datasets/divorce.dat"
widths <- c(9, 13, 10, 8, 10, 6)
f5 <- read.fwf(url, widths = widths,
               skip = 1,            # skip the header line; names are given below
               nrows = 10,          # passed on to read.table: first 10 data rows only
               strip.white = TRUE,
               col.names = c("id", "heduc", "heblack", "mixed", "years", "div"),
               colClasses = c("numeric", "character", "character",
                              "character", "double", "character"))

Regards
Pradeep Singh

Rolf Turner

2016-Jan-17 23:01 UTC

Dear Uwe,

I have fiddled around a bit and the situation seems to me to be of the 
nature of a bug in read.fwf.  It would seem that in order for 
header=TRUE to work, the entries of the header need to be separated by
the sep delimiter which defaults to "\t".  In the case in question the
entries are separated by blanks, so presumably the header gets read in 
as a single entity, rather than 6 such, leading to a mismatch between 
the length of the header and the number of columns.

It seems that the specified widths get ignored when the header line is 
dealt with.

It also seems that if one specifies sep="" then the header gets read 
correctly but then strings of blanks get interpreted as field separators 
throughout and then blanks within the fields result in the
wrong number of columns.

I think that the code of read.fwf is easy enough to fix; a slight 
adjustment will make the header get treated the same way as the body of 
the file.

I don't see any problems/drawbacks with so-doing, and experimenting with 
my modified function resulted in the divorce data being read in with 
header=TRUE with no problems.
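
For readers who want to see the behaviour described above, here is a small sketch. It is not the modified read.fwf mentioned in this message (that code is not shown in the thread); it only reproduces the header=TRUE failure on three lines copied from the file and then applies a workaround that cuts the header with the same widths as the body. The names `txt`, `res`, `hdr` and `f5` are illustrative.

```r
txt <- c(
  "       id        heduc   heblack   mixed     years   div  ",
  "       9   12-15 years        No      No    10.546    No  ",
  "      11    < 12 years        No      No    34.943    No  "
)
widths <- c(9, 13, 10, 8, 10, 6)

# header = TRUE with the default sep = "\t": the header line contains no tabs,
# so it is read as one field while the body has six. Capture the error message.
con <- textConnection(txt)
res <- tryCatch(read.fwf(con, widths = widths, header = TRUE),
                error = function(e) conditionMessage(e))
close(con)
res

# Workaround: split the header line using the same widths as the body,
# then skip it when reading the data.
last  <- cumsum(widths)
first <- last - widths + 1
hdr   <- trimws(substring(txt[1], first, last))

con <- textConnection(txt)
f5 <- read.fwf(con, widths = widths, skip = 1,
               col.names = hdr, strip.white = TRUE)
close(con)
f5
```

This is essentially the same idea as reading the first line separately with read.fwf and n=1, just without a second pass over the file.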

If this mod is made, I see no reason to keep the "sep" argument in 
read.fwf --- except maybe for backward compatibility issues, and I don't 
think there would be any since it never worked properly anyhow.

cheers,

Rolf

P. S. I can send you my modified version of read.fwf off-list if this 
would be of any use to you.

R.

-- 
Technical Editor ANZJS
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276
