I have a file with 4 columns per line, all pipe delimited.

$ wc -l cmie_firm_data.text
89325 cmie_firm_data.text
$ ls -al cmie_firm_data.text
-rw-r--r-- 1 ajayshah ajayshah 4415637 Jul 5 15:25 cmie_firm_data.text
$ awk -F\| '(NF != 4)' cmie_firm_data.text
$ head cmie_firm_data.text
All figures are for the year 20030331|||
Company|GVA Less Interest (Rs. thousand)|Interest (Rs. thousand)|GVA (Rs. thousand)
'R' INVEST PVT. LTD.|-510.45|0.18|-510.27
20 MICRONS LTD.|60700|41200|101900
20TH CENTURY FOX CORPN. (INDIA) PVT. LTD.|50|0.33|50.33
21ST CENTURY AUTOMOTIVE INDIA LTD.|201.14|0.19|201.33
21ST CENTURY ENTERTAINMENT PVT. LTD.|-6.10|0|-6.10
21ST CENTURY EQUIPMENTS PVT. LTD.|-1599.53|1262.76|-336.77
21ST CENTURY INFRASTRUCTURE (INDIA) PVT. LTD.|140.48|1.74|142.22
21ST CENTURY PEST CONTROL SERVICES LTD.|50.21|7.13|57.34

When I try to read this into R, I get a mysterious error, and then it
reads only 38,244 observations. Any idea what might be going wrong?

In case it matters, I'm confident this is a Unix text file; no DOSisms
of CR-LF here.

$ R --vanilla < picture.R

R : Copyright 2004, The R Foundation for Statistical Computing
Version 1.9.1 (2004-06-21), ISBN 3-900051-00-3

> firms <- read.table("cmie_firm_data.text", sep="|", skip=2,
+                     col.names=c("name", "gva.less.interest",
+                                 "interest", "gva"))
Warning message:
number of items read is not a multiple of the number of columns
> summary(firms)
                                            name       gva.less.interest
 20 MICRONS LTD.                              :    1   Min.   :-1.049e+07
 20TH CENTURY FOX CORPN. (INDIA) PVT. LTD.    :    1   1st Qu.: 1.720e+01
 21ST CENTURY AUTOMOTIVE INDIA LTD.           :    1   Median : 5.664e+02
 21ST CENTURY ENTERTAINMENT PVT. LTD.         :    1   Mean   : 4.858e+04
 21ST CENTURY EQUIPMENTS PVT. LTD.            :    1   3rd Qu.: 2.587e+03
 21ST CENTURY INFRASTRUCTURE (INDIA) PVT. LTD.:    1   Max.   : 1.968e+08
 (Other)                                      :38244   NA's   : 1.000e+00
    interest              gva
 Min.   :0.000e+00   Min.   :-2.349e+06
 1st Qu.:5.500e-01   1st Qu.: 4.909e+01
 Median :4.462e+01   Median : 7.711e+02
 Mean   :6.301e+03   Mean   : 4.565e+04
 3rd Qu.:5.788e+02   3rd Qu.: 3.436e+03
 Max.   :9.558e+06   Max.   : 2.004e+08
 NA's   :2.530e+02   NA's   : 2.530e+02

Sniffing, I find that the last observation in the data frame `firms'
is wonky:

> print(firms[38249,])
                          name gva.less.interest interest     gva
38249 YOUNG POLYMERS PVT. LTD.           2542.08   652.71 3194.79
> system("grep -n '^YOUNG POLY' cmie*")
88904:YOUNG POLYMERS PVT. LTD.|2542.08|652.71|3194.79

where we see that YOUNG POLYMERS was observation 88,904 in the file.
How did it become #38249 to R?

And,
> print(firms[38250,])
makes him go nuts --

38250 YOUNG WOMENS CHRISTIAN ASSN. OF INDIA|9477.71|24.82|9502.53\nYOUNGMAN WOOL
LEN MILLS LTD.|5395.08|6316.75|11711.83\nYOUNGSTAR CONSTRUCTION PVT. LTD.|850.71
|128.07|978.78\nYOUR INVESTMENT (INDIA) LTD.|90.85|0|90.85\nYOURCHOICE CHIT FUND
 PVT. LTD.|0|0|0\nYOUTH FORUM TOWERS & CONSTRUCTION PVT. LTD.|289.79|1.75|291.54
\nYOUTH PROMOTERS PVT. LTD.|104.87|30.23|135.10\nYU BO INVST. CO. PVT. LTD.|708.
08|5209.60|5917.68\nYU TECHNOLOGIES PVT. LTD.|923.79|46.69|970.48\nYU-MEN TRADEL
INK PVT. LTD.|-321.47|14.35|-307.12\nYUCCA AGENCIES PVT. LTD.|1243.49|464.33|170
7.82\nYUCON EXPORTS PVT. LTD.|503.30|326.49|829.79\nYUCON MARKETING & INVSTS. PV
T. LTD.|-4.73|0.20|-4.53\nYUG MARKETING PVT. LTD.|2696.58|304.42|3001\nYUG TRADE
RS PVT. LTD.|-12.25|0.15|-12.10\nYUGAL CHIT FUND & TRADING CO. PVT. LTD.|-7.52|0
.20|-7.32\nYUGAL KISHORE FABRICS & GARMENTS PVT. LTD.|914.82|1298.56|2213.38\nYU
GANTAR ENGINEERS PVT. LTD.|193.56|0.69|194.25\nYUGANTAR INVESTMENTS LTD.|44.38|0
.06|44.44\nYUGANTAR TRADING PVT. LTD.|-4.81|0|-4.81\nYUGO INTRACO PVT. LTD.|1588
.49|29.16|1617.65\nYUGSUDO (INDIA) ENGG. SERVICE
(deleted).

The file is fine:

$ grep -n 'YOUNG WOM' cmie_firm_data.text
88905:YOUNG WOMEN'S CHRISTIAN ASSN. OF INDIA|9477.71|24.82|9502.53

Any idea what might be going on? My machine is linux 2.4.17 #2 (got
new Debian packages for 1.9.1 today).

-- 
Ajay Shah                                                   Consultant
ajayshah at mayin.org                      Department of Economic Affairs
http://www.mayin.org/ajayshah              Ministry of Finance, New Delhi
You are asking read.table to interpret both quote and comment
characters in your file. You do seem to have quotes -- are they always
matched?

Please read through the Data Import/Export manual and check out all
the options.

On Mon, 5 Jul 2004, Ajay Shah wrote:

> I have a file with 4 columns per line, all pipe delimited.
>
> $ wc -l cmie_firm_data.text
> 89325 cmie_firm_data.text
> $ ls -al cmie_firm_data.text
> -rw-r--r-- 1 ajayshah ajayshah 4415637 Jul 5 15:25 cmie_firm_data.text
> $ awk -F\| '(NF != 4)' cmie_firm_data.text
> $ head cmie_firm_data.text
> All figures are for the year 20030331|||
> Company|GVA Less Interest (Rs. thousand)|Interest (Rs. thousand)|GVA (Rs. thousand)
> 'R' INVEST PVT. LTD.|-510.45|0.18|-510.27
> 20 MICRONS LTD.|60700|41200|101900
> 20TH CENTURY FOX CORPN. (INDIA) PVT. LTD.|50|0.33|50.33
> 21ST CENTURY AUTOMOTIVE INDIA LTD.|201.14|0.19|201.33
> 21ST CENTURY ENTERTAINMENT PVT. LTD.|-6.10|0|-6.10
> 21ST CENTURY EQUIPMENTS PVT. LTD.|-1599.53|1262.76|-336.77
> 21ST CENTURY INFRASTRUCTURE (INDIA) PVT. LTD.|140.48|1.74|142.22
> 21ST CENTURY PEST CONTROL SERVICES LTD.|50.21|7.13|57.34
>
> When I try to read this into R, I get a mysterious error, and then it
> reads only 38,244 observations. Any idea what might be going wrong?

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
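
One rough way to check the "are they always matched?" question from the shell is sketched below. It is only a heuristic: it flags any line holding an odd number of single or double quotes, and any line containing '#', which is read.table's default comment character. The sample file is made up for illustration; two rows are copied from the post, and the 'UNIT #4 TRADERS LTD.' row and the variable names (sample, odd_quotes, hash_lines) are hypothetical.

```shell
# Build a tiny pipe-delimited sample. Rows 1-3 are taken from the post;
# row 4 is a hypothetical row added only to show the '#' check.
sample=$(mktemp)
cat > "$sample" <<'EOF'
'R' INVEST PVT. LTD.|-510.45|0.18|-510.27
20 MICRONS LTD.|60700|41200|101900
YOUNG WOMEN'S CHRISTIAN ASSN. OF INDIA|9477.71|24.82|9502.53
UNIT #4 TRADERS LTD.|1|2|3
EOF

# gsub() returns the number of substitutions, so it doubles as a counter.
# Print every line whose count of ' or " characters is odd.
odd_quotes=$(awk -v q="'" '{
    line = $0
    s = gsub(q, "", line)      # number of single quotes on this line
    d = gsub(/"/, "", line)    # number of double quotes on this line
    if (s % 2 || d % 2) print NR ": " $0
}' "$sample")

# Lines containing the default comment character.
hash_lines=$(grep -n '#' "$sample")

printf 'Odd number of quotes:\n%s\n' "$odd_quotes"
printf 'Contains #:\n%s\n' "$hash_lines"

rm -f "$sample"
```

On the real data one would point awk and grep at cmie_firm_data.text instead of the temporary sample. A line such as the YOUNG WOMEN'S one, with a lone apostrophe, is the kind of row this check is meant to surface.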
David Khabie-Zeitoune
2004-Jul-05 10:26 UTC
[R] Failing on reading a "slightly big" dataset
Try specifying quote=NULL as an argument to read.table. It could be that one of your fields has a quote symbol in it.

-----Original Message-----
From: Ajay Shah [mailto:ajayshah at mayin.org]
Sent: 05 July 2004 11:15
To: r-help
Subject: [R] Failing on reading a "slightly big" dataset