thr3ads.net - R help - [R] Failing on reading a "slightly big" dataset [Jul 2004]

Ajay Shah

2004-Jul-05 10:14 UTC

[R] Failing on reading a "slightly big" dataset

I have a file with 4 columns per line, all pipe delimited.

$ wc -l cmie_firm_data.text 
89325 cmie_firm_data.text
$ ls -al cmie_firm_data.text 
-rw-r--r--    1 ajayshah ajayshah  4415637 Jul  5 15:25 cmie_firm_data.text
$ awk -F\| '(NF != 4)' cmie_firm_data.text 
$ head cmie_firm_data.text 
All figures are for the year 20030331|||
Company|GVA Less Interest (Rs. thousand)|Interest (Rs. thousand)|GVA (Rs. thousand)
'R' INVEST PVT. LTD.|-510.45|0.18|-510.27
20 MICRONS LTD.|60700|41200|101900
20TH CENTURY FOX CORPN. (INDIA) PVT. LTD.|50|0.33|50.33
21ST CENTURY AUTOMOTIVE INDIA LTD.|201.14|0.19|201.33
21ST CENTURY ENTERTAINMENT PVT. LTD.|-6.10|0|-6.10
21ST CENTURY EQUIPMENTS PVT. LTD.|-1599.53|1262.76|-336.77
21ST CENTURY INFRASTRUCTURE (INDIA) PVT. LTD.|140.48|1.74|142.22
21ST CENTURY PEST CONTROL SERVICES LTD.|50.21|7.13|57.34

When I try to read this into R, I get a mysterious error, and then it
reads only 38,244 observations. Any idea what might be going wrong?
In case it matters, I'm confident this is a Unix text file; no DOSisms
of CR-LF here.

$ R --vanilla < picture.R 

R : Copyright 2004, The R Foundation for Statistical Computing
Version 1.9.1  (2004-06-21), ISBN 3-900051-00-3
> firms <- read.table("cmie_firm_data.text", sep="|", skip=2,
+                     col.names=c("name", "gva.less.interest",
+                       "interest", "gva"))
Warning message: 
number of items read is not a multiple of the number of columns 
> summary(firms)
                                            name       gva.less.interest   
 20 MICRONS LTD.                              :    1   Min.   :-1.049e+07  
 20TH CENTURY FOX CORPN. (INDIA) PVT. LTD.    :    1   1st Qu.: 1.720e+01  
 21ST CENTURY AUTOMOTIVE INDIA LTD.           :    1   Median : 5.664e+02  
 21ST CENTURY ENTERTAINMENT PVT. LTD.         :    1   Mean   : 4.858e+04  
 21ST CENTURY EQUIPMENTS PVT. LTD.            :    1   3rd Qu.: 2.587e+03  
 21ST CENTURY INFRASTRUCTURE (INDIA) PVT. LTD.:    1   Max.   : 1.968e+08  
 (Other)                                      :38244   NA's   : 1.000e+00  
    interest              gva            
 Min.   :0.000e+00   Min.   :-2.349e+06  
 1st Qu.:5.500e-01   1st Qu.: 4.909e+01  
 Median :4.462e+01   Median : 7.711e+02  
 Mean   :6.301e+03   Mean   : 4.565e+04  
 3rd Qu.:5.788e+02   3rd Qu.: 3.436e+03  
 Max.   :9.558e+06   Max.   : 2.004e+08  
 NA's   :2.530e+02   NA's   : 2.530e+02  

Sniffing, I find that the last observation in the data frame `firms'
is wonky:
> print(firms[38249,])
                          name gva.less.interest interest     gva
38249 YOUNG POLYMERS PVT. LTD.           2542.08   652.71 3194.79
> system("grep -n '^YOUNG POLY' cmie*")
88904:YOUNG POLYMERS PVT. LTD.|2542.08|652.71|3194.79

where we see that YOUNG POLYMERS was observation 88,904 in the
file. How did it become #38249 to R?

And,
> print(firms[38250,])
makes him go nuts --

38250 YOUNG WOMENS CHRISTIAN ASSN. OF INDIA|9477.71|24.82|9502.53\nYOUNGMAN WOOL
LEN MILLS LTD.|5395.08|6316.75|11711.83\nYOUNGSTAR CONSTRUCTION PVT. LTD.|850.71
|128.07|978.78\nYOUR INVESTMENT (INDIA) LTD.|90.85|0|90.85\nYOURCHOICE CHIT FUND
 PVT. LTD.|0|0|0\nYOUTH FORUM TOWERS & CONSTRUCTION PVT. LTD.|289.79|1.75|291.54
\nYOUTH PROMOTERS PVT. LTD.|104.87|30.23|135.10\nYU BO INVST. CO. PVT. LTD.|708.
08|5209.60|5917.68\nYU TECHNOLOGIES PVT. LTD.|923.79|46.69|970.48\nYU-MEN TRADEL
INK PVT. LTD.|-321.47|14.35|-307.12\nYUCCA AGENCIES PVT. LTD.|1243.49|464.33|170
7.82\nYUCON EXPORTS PVT. LTD.|503.30|326.49|829.79\nYUCON MARKETING & INVSTS. PV
T. LTD.|-4.73|0.20|-4.53\nYUG MARKETING PVT. LTD.|2696.58|304.42|3001\nYUG TRADE
RS PVT. LTD.|-12.25|0.15|-12.10\nYUGAL CHIT FUND & TRADING CO. PVT. LTD.|-7.52|0
.20|-7.32\nYUGAL KISHORE FABRICS & GARMENTS PVT. LTD.|914.82|1298.56|2213.38\nYU
GANTAR ENGINEERS PVT. LTD.|193.56|0.69|194.25\nYUGANTAR INVESTMENTS LTD.|44.38|0
.06|44.44\nYUGANTAR TRADING PVT. LTD.|-4.81|0|-4.81\nYUGO INTRACO PVT. LTD.|1588
.49|29.16|1617.65\nYUGSUDO (INDIA) ENGG. SERVICE

(deleted).

The file is fine:

$ grep -n 'YOUNG WOM' cmie_firm_data.text 
88905:YOUNG WOMEN'S CHRISTIAN ASSN. OF INDIA|9477.71|24.82|9502.53

Any idea what might be going on?

My machine is linux 2.4.17 #2 (got new Debian packages for 1.9.1 today).

-- 
Ajay Shah                                                   Consultant
ajayshah at mayin.org                      Department of Economic Affairs
http://www.mayin.org/ajayshah           Ministry of Finance, New Delhi

Prof Brian Ripley

2004-Jul-05 10:25 UTC

[R] Failing on reading a "slightly big" dataset

You are asking read.table to interpret both quote and comment characters
in your file.  You do seem to have quotes -- are they always matched?

Please read through the Data Import/Export manual and check out all the 
options.
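
A quick way to check whether quotes are matched is to count them line by line. By default read.table uses quote = "\"'" and comment.char = "#", so an unpaired apostrophe opens a quoted string that can swallow the lines that follow it. The sketch below is only an illustration: the miniature file is made up from rows shown in the post, and the helper name odd_quote_lines is hypothetical.

```r
# Made-up miniature of the poster's file, written to a temporary path.
lines <- c(
  "All figures are for the year 20030331|||",
  "Company|GVA Less Interest (Rs. thousand)|Interest (Rs. thousand)|GVA (Rs. thousand)",
  "'R' INVEST PVT. LTD.|-510.45|0.18|-510.27",
  "20 MICRONS LTD.|60700|41200|101900",
  "YOUNG POLYMERS PVT. LTD.|2542.08|652.71|3194.79",
  "YOUNG WOMEN'S CHRISTIAN ASSN. OF INDIA|9477.71|24.82|9502.53"
)
path <- tempfile(fileext = ".text")
writeLines(lines, path)

# Hypothetical helper: line numbers holding an odd number of ' or " characters.
odd_quote_lines <- function(file) {
  raw <- readLines(file)
  count_char <- function(ch) {
    lengths(regmatches(raw, gregexpr(ch, raw, fixed = TRUE)))
  }
  which(count_char("'") %% 2 == 1 | count_char("\"") %% 2 == 1)
}

odd <- odd_quote_lines(path)

# Lines containing the default comment character, which would be truncated.
hash <- grep("#", readLines(path), fixed = TRUE)

print(odd)
print(hash)
```

In this miniature, the 'R' INVEST row has a matched pair and is not flagged, while the YOUNG WOMEN'S row has a lone apostrophe and is. Run against the real file, the same check would list every line that could start a runaway quoted string.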

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

David Khabie-Zeitoune

2004-Jul-05 10:26 UTC

[R] Failing on reading a "slightly big" dataset

Try specifying quote=NULL as an argument to read.table. It could be that
one of your fields has a quote symbol in it.
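
As a minimal sketch of that suggestion: the call below switches off both quote and comment interpretation. It uses quote = "", which is the spelling given in the read.table help page, rather than quote=NULL. The miniature file is made up from rows shown in the original post; only the read.table arguments matter here.

```r
# Made-up miniature of the poster's file, written to a temporary path.
lines <- c(
  "All figures are for the year 20030331|||",
  "Company|GVA Less Interest (Rs. thousand)|Interest (Rs. thousand)|GVA (Rs. thousand)",
  "'R' INVEST PVT. LTD.|-510.45|0.18|-510.27",
  "20 MICRONS LTD.|60700|41200|101900",
  "YOUNG POLYMERS PVT. LTD.|2542.08|652.71|3194.79",
  "YOUNG WOMEN'S CHRISTIAN ASSN. OF INDIA|9477.71|24.82|9502.53",
  "YOUNGMAN WOOLLEN MILLS LTD.|5395.08|6316.75|11711.83"
)
path <- tempfile(fileext = ".text")
writeLines(lines, path)

# Treat ' and " as ordinary characters and do not treat # as a comment.
firms <- read.table(path, sep = "|", skip = 2,
                    quote = "", comment.char = "",
                    col.names = c("name", "gva.less.interest",
                                  "interest", "gva"),
                    stringsAsFactors = FALSE)

# One row per data line: total lines minus the two skipped header lines.
print(nrow(firms))
print(firms$name[4])
```

With quoting off, the apostrophes stay in the company names and every data line becomes its own row. On the full file one would then expect nrow(firms) to equal the wc -l count minus the two skipped lines.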
