thr3ads.net - R help - [R] Loading large .pxt and .asc datasets causes issues. [Feb 2016]

Torvon

2016-Feb-23 19:13 UTC

[R] Loading large .pxt and .asc datasets causes issues.

Hi,

I want to load a dataset into R. This dataset is available in two formats:
.XPT and .ASC. The dataset is available at
http://www.cdc.gov/brfss/annual_data/annual_2006.htm.

They are about 40 MB zipped, and about 500 MB unzipped.

I can get the .xpt data to load, using:
> library(Hmisc)
> data <- sasxport.get("CDBRFS06.XPT")
The data look fine, no error messages. However, the data contain only 302
columns, which is fewer than they should have (according to the
documentation). They do not contain my variables of interest, so either the
documentation or the data file is wrong, and I want to make sure it's not
the data file.

Hence I wanted to see if I get the same results loading the .ASC file.
However, multiple ways to do so have failed.
> library(adehabitat)
> import.asc("CDBRFS06.asc")
Results in:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
  scan() expected 'a real', got '1191.8808943.38209868648.960119'
> library(SDMTools)
> read.asc("CDBRFS06.asc")
Results in:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
  scan() expected 'a real', got '1191.8808943.38209868648.960119'
In addition: Warning messages:
1: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
  number of items read is not a multiple of the number of columns
2: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
  number of items read is not a multiple of the number of columns
3: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
  number of items read is not a multiple of the number of columns
4: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
  number of items read is not a multiple of the number of columns
5: In scan(file, nmax = nl * nc, skip = 6, quiet = TRUE) :
  NAs introduced by coercion to integer range

Thank you for your help.
   Eiko

Jan van der Laan

2016-Feb-23 21:07 UTC

First, the file does contain 302 columns; the variable layout 
(http://www.cdc.gov/brfss/annual_data/2006/varlayout_table_06.htm) 
contains 302 columns. So, reading the SAS file probably works correctly.
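
One quick way to confirm whether particular variables made it into the imported data frame is to compare names after normalising them, since an import step may change case or swap characters such as underscores for periods. The sketch below is only illustrative: `find_vars` is a hypothetical helper, and the small `dat` data frame (with made-up names) stands in for whatever `sasxport.get()` returned.

```r
# Hypothetical helper: look up wanted variable names in a data frame,
# ignoring case and treating "_" and "." as the same character.
find_vars <- function(df, wanted) {
  norm <- function(x) tolower(gsub("[._]", ".", x))
  hits <- match(norm(wanted), norm(names(df)))
  data.frame(wanted = wanted,
             found_as = names(df)[hits],
             stringsAsFactors = FALSE)
}

# Stand-in for the imported data (column names are assumptions)
dat <- data.frame(1, 2, 3)
names(dat) <- c(".state", "genhlth", "adpleasr")

ncol(dat)                                             # number of columns read
find_vars(dat, c("_STATE", "ADPLEASR", "NOTTHERE"))   # NA marks a missing variable
```

Running the same lookup against the real data frame with the variable names from the documentation would show which names are absent rather than merely spelled differently.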

Second, the read.asc function you use is for reading geographic raster 
files, not fixed width files.
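
The `skip = 6` in the last warning of the original post is consistent with a reader that expects an ESRI ASCII grid, which begins with a short header (`ncols`, `nrows`, and so on) followed by whitespace-separated numbers. A fixed width survey file has no such header; each line is one record and fields are identified by character position. As a rough sketch, assuming that header convention, a hypothetical helper such as `looks_like_ascii_grid` could peek at the first line before choosing a reader:

```r
# Hypothetical helper: TRUE if the first line looks like an ESRI ASCII
# grid header ("ncols <number>"), FALSE otherwise.
looks_like_ascii_grid <- function(path) {
  first <- readLines(path, n = 1, warn = FALSE)
  length(first) == 1 &&
    grepl("^[[:space:]]*ncols[[:space:]]+[0-9]+", first, ignore.case = TRUE)
}

# Two tiny example files written to a temporary location
grid_file <- tempfile(fileext = ".asc")
writeLines(c("ncols 2", "nrows 2", "xllcorner 0", "yllcorner 0",
             "cellsize 1", "NODATA_value -9999", "1 2", "3 4"), grid_file)

fwf_file <- tempfile(fileext = ".asc")
writeLines(c("0111012006", "0212022006"), fwf_file)

looks_like_ascii_grid(grid_file)   # raster-style file
looks_like_ascii_grid(fwf_file)    # fixed width records
```

The `.asc` extension alone does not say which of the two kinds of file you have, which is why the check looks at the content.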

Below, I show how you could read the file using the LaF package (sorry 
for the long dump of variable files; copy-pasted them from the page 
linked to above):

columns <- "StartingColumn  VariableName    FieldLength
1    _STATE    2
3    _GEOSTR    2
5    _DENSTR2    1
6    PRECALL    1
7    REPNUM    5
12    REPDEPTH    2
14    FMONTH    2
16    IDATE    8
16    IMONTH    2
18    IDAY    2
20    IYEAR    4
24    INTVID    3
27    DISPCODE    3
30    SEQNO    10
30    _PSU    10
40    NATTMPTS    2
42    NRECSEL    6
48    NRECSTR    9
57    CTELENUM    1
58    CELLFON1    1
59    PVTRESID    1
60    NUMADULT    2
62    NUMMEN    2
64    NUMWOMEN    2
73    GENHLTH    1
74    PHYSHLTH    2
76    MENTHLTH    2
78    POORHLTH    2
80    HLTHPLAN    1
81    PERSDOC2    1
82    MEDCOST    1
83    CHECKUP    1
84    EXERANY2    1
85    DIABETE2    1
86    LASTDEN3    1
87    RMVTETH3    1
88    DENCLEAN    1
89    CVDINFR3    1
90    CVDCRHD3    1
91    CVDSTRK3    1
92    ASTHMA2    1
93    ASTHNOW    1
94    QLACTLM2    1
95    USEEQUIP    1
96    SMOKE100    1
97    SMOKDAY2    1
98    STOPSMK2    1
99    AGE    2
101    HISPANC2    1
102    MRACE    6
108    ORACE2    1
109    MARITAL    1
110    CHILDREN    2
112    EDUCA    1
113    EMPLOY    1
114    INCOME2    2
116    WEIGHT2    4
120    HEIGHT3    4
124    CTYCODE    3
132    NUMHHOL2    1
133    NUMPHON2    1
134    TELSERV2    1
135    SEX    1
136    PREGNANT    1
137    VETERAN    1
138    DRNKANY4    1
139    ALCDAY4    3
142    AVEDRNK2    2
144    DRNK3GE5    2
146    MAXDRNKS    2
148    FLUSHOT3    1
149    FLUSPRY2    1
162    PNEUVAC3    1
163    HEPBVAC    1
164    HEPBRSN    1
165    FALL3MN2    2
167    FALLINJ2    2
169    SEATBELT    1
170    DRINKDRI    2
172    HADMAM    1
173    HOWLONG    1
174    PROFEXAM    1
175    LENGEXAM    1
176    HADPAP2    1
177    LASTPAP2    1
178    HADHYST2    1
179    PSATEST    1
180    PSATIME    1
181    DIGRECEX    1
182    DRETIME    1
183    PROSTATE    1
184    BLDSTOOL    1
185    LSTBLDS2    1
186    HADSIGM3    1
187    LASTSIG2    1
188    HIVTST5    1
189    HIVTSTD2    6
195    WHRTST7    2
197    HIVRDTST    1
198    EMTSUPRT    1
199    LSATISFY    1
200    RCSBIRTH    6
206    RCSGENDR    1
207    RCHISLAT    1
208    RCSRACE    6
214    RCSBRACE    1
215    RCSRELN1    1
216    DRHPCH    1
217    HAVHPCH    1
218    CIFLUSH2    1
219    RCVFVCH2    6
225    RNOFVCH2    2
227    CASTHDX2    1
228    CASTHNO2    1
229    DIABAGE2    2
231    INSULIN    1
232    DIABPILL    1
233    BLDSUGAR    3
236    FEETCHK2    3
239    FEETSORE    1
240    DOCTDIAB    2
242    CHKHEMO3    2
244    FEETCHK    2
246    EYEEXAM    1
247    DIABEYE    1
248    DIABEDU    1
249    VIDFCLT2    1
250    VIREDIF2    1
251    VIPRFVS2    1
252    VINOCRE2    2
254    VIEYEXM2    1
255    VIINSUR2    1
256    VICTRCT2    1
257    VIGLUMA2    1
258    VIMACDG2    1
259    VIATWRK2    1
260    PAINACT2    2
262    QLMENTL2    2
264    QLSTRES2    2
266    QLREST2    2
268    QLHLTH2    2
270    ASTHMAGE    2
272    ASATTACK    1
273    ASERVIST    2
275    ASDRVIST    2
277    ASRCHKUP    2
279    ASACTLIM    3
282    ASYMPTOM    1
283    ASNOSLEP    1
284    ASTHMED2    1
285    ASINHALR    1
286    BRTHCNT3    1
287    TYPCNTR4    2
289    NOBCUSE2    2
291    FPCHLDFT    1
292    FPCHLDHS    1
293    VITAMINS    1
294    MULTIVIT    1
295    FOLICACD    1
296    TAKEVIT    3
299    RECOMMEN    1
300    HOUSESMK    1
301    INDOORS    1
302    SMKPUBLC    1
303    SMKWORK    1
304    IAQHTSRC    1
305    IAQGASAP    1
306    IAQHTDYS    3
309    IAQCODTR    1
310    IAQMOLD    1
311    HEWTRSRC    1
312    HEWTRDRK    1
313    HECHMHOM    3
316    HECHMYRD    3
319    RRCLASS2    1
320    RRCOGNT2    1
321    RRATWORK    1
322    RRHCARE2    1
323    RRPHYSM1    1
324    RREMTSM1    1
325    ADPLEASR    2
327    ADDOWN    2
329    ADSLEEP    2
331    ADENERGY    2
333    ADEAT    2
335    ADFAIL    2
337    ADTHINK    2
339    ADMOVE    2
341    ADANXEV    1
342    ADDEPEV    1
343    SVSAFE    1
344    SVSEXTCH    1
345    SVNOTCH    1
346    SVEHDSE1    1
347    SVHDSX12    1
348    SVEANOS1    1
349    SVNOSX12    1
350    SVRELAT2    2
352    SVGENDER    1
353    IPVSAFE    1
354    IPVTHRAT    1
355    IPVPHYV1    1
356    IPVPHHRT    1
357    IPVUWSEX    1
358    IPVPVL12    1
359    IPVSXINJ    1
360    IPVRELT1    2
362    GPWELPRD    1
363    GPVACPLN    1
364    GP3DYWTR    1
365    GP3DYFOD    1
366    GP3DYPRS    1
367    GPBATRAD    1
368    GPFLSLIT    1
369    GPMNDEVC    1
370    GPNOTEVC    2
372    GPEMRCOM    1
373    GPEMRINF    1
741    QSTVER    1
742    QSTLANG    2
800    _STSTR    5
805    _STRWT    10
815    _RAW    10
825    _WT2    10
835    _POSTSTR    10
845    _FINALWT    10
935    _REGION    2
937    _AGEG_    2
939    _SEXG_    1
940    _RACEG3_    1
941    _RACEG4_    1
942    _IMPAGE    2
944    _IMPNPH    1
945    _ITSCF1    10
955    _ITSCF2    10
965    _ITSPOST    10
975    _ITSFINL    10
993    MSCODE    1
994    CRACEORG    6
1000    CRACEASC    6
1006    _CRACE    2
1008    _CSEXG_    1
1009    _CRACEG_    1
1010    _CAGEG_    3
1033    _RAWCH    10
1063    _WT2CH    10
1093    _POSTCH    10
1123    _CHILDWT    10
1133    _RAWHH    10
1143    _WT2HH    10
1153    _POSTHH    10
1163    _HOUSEWT    10
1173    _RFHLTH    1
1174    _TOTINDA    1
1175    _EXTETH2    1
1176    _ALTETH2    1
1177    _DENVST1    1
1178    _LTASTHM    1
1179    _CASTHMA    1
1180    _ASTHMST    1
1181    _SMOKER3    1
1182    _RFSMOK3    1
1183    MRACEORG    6
1189    MRACEASC    6
1195    _PRACE    2
1197    _MRACE    2
1199    _RACEG2    1
1200    _RACEGR2    1
1201    _RACE_G    1
1202    _CNRACE    1
1203    _CNRACEC    1
1204    RACE2    1
1205    _AGEG5YR    2
1207    _AGE65YR    1
1208    _AGE_G    1
1209    HTIN3    3
1212    HTM3    3
1215    WTKG2    5
1220    _BMI4    4
1224    _BMI4CAT    1
1225    _RFBMI4    1
1226    _CHLDCNT    1
1227    _EDUCAG    1
1228    _INCOMG    1
1229    DROCDY2_    3
1232    _RFBING4    1
1233    _DRNKDY3    4
1237    _DRNKMO3    4
1241    _RFDRHV3    1
1242    _RFDRMN3    1
1243    _RFDRWM3    1
1244    _FLSHOT3    1
1245    _PNEUMO2    1
1246    _RFSEAT2    1
1247    _RFSEAT3    1
1248    _RFMAM2Y    1
1249    _MAM502Y    1
1250    _RFPAP32    1
1251    _RFPSA2Y    1
1252    _RFBLDST    1
1253    _RFSIGM2    1
1254    _AIDTST2    1"
columns <- read.table(textConnection(columns), header=TRUE, 
stringsAsFactors = FALSE)

library(LaF)

laf <- laf_open_fwf(filename = "CDBRFS06.ASC",
   column_names = columns$VariableName,
   column_widths = columns$FieldLength,
   column_types = rep("character", nrow(columns)))

# You now have a connection to the file; you can index this connection
# as you would a data.frame
# read all data
data <- laf[,]
# read the first 5 columns
data <- laf[, 1:5]
# read a random sample of rows
data <- laf[sample(nrow(laf), 10), ]
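
One thing that may be worth checking before relying on widths alone: the layout above lists start positions that are not contiguous (NUMWOMEN occupies columns 64-65 but GENHLTH starts at 73) and some that overlap (IDATE and IMONTH both start at 16). Readers that take only a vector of widths generally assume each field begins where the previous one ended, so values could end up shifted. A minimal base R sketch, using two made-up records and a small excerpt of the layout, flags such rows and extracts fields by explicit start position with `substring()`. The helper name `read_by_position` is hypothetical, and this is a way to check alignment, not a replacement for the code above.

```r
# Excerpt of the layout: start position, name, width
layout <- data.frame(
  start = c(1, 3, 16, 16, 18, 20),
  name  = c("_STATE", "_GEOSTR", "IDATE", "IMONTH", "IDAY", "IYEAR"),
  width = c(2, 2, 8, 2, 2, 4),
  stringsAsFactors = FALSE
)

# Where each field would start if fields were simply laid end to end
contiguous_start <- cumsum(c(1, head(layout$width, -1)))
layout$shifted <- layout$start != contiguous_start

# Hypothetical helper: cut each line at the documented positions, so
# gaps and overlapping fields are handled explicitly.
read_by_position <- function(lines, layout) {
  fields <- lapply(seq_len(nrow(layout)), function(i) {
    substring(lines, layout$start[i], layout$start[i] + layout$width[i] - 1)
  })
  out <- data.frame(fields, stringsAsFactors = FALSE)
  names(out) <- layout$name
  out
}

# Made-up records: state, stratum, 11 filler characters, then the date
lines <- c(paste0("0101", strrep("x", 11), "02142006"),
           paste0("0203", strrep("x", 11), "11302006"))

dat <- read_by_position(lines, layout)
layout[layout$shifted, ]   # rows that a widths-only reader would misplace
dat
```

Comparing a few values read this way (for example IYEAR, which should be a plausible year) against the same columns from the width-based reader is a cheap sanity check.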


HTH,

Jan


Federman, Douglas

2016-Feb-23 21:39 UTC

You might want to look at Anthony Damico's work at

http://www.asdfree.com/search/label/behavioral%20risk%20factor%20surveillance%20system%20%28brfss%29


Anthony Damico

2016-Feb-24 03:02 UTC

hi eiko, LaF is incompatible with survey data, that road is a dead-end.
this code below will painlessly load brfss into R. review the link douglas
sent for analysis examples, and change `years.to.download <- 1984:2014` to
`years.to.download <- 2006` if you just want a single year of microdata.  glhf


# install.packages( c( "MonetDB.R" , "MonetDBLite" , "survey" , "SAScii" , "descr" , "downloader" , "digest" ) ,
#     repos = c( "http://dev.monetdb.org/Assets/R/" , "http://cran.rstudio.com/" ) )

# setInternet2( FALSE )                        # # only windows users need this line
# options( encoding = "windows-1252" )        # # only macintosh and *nix users need this line
library(downloader)
# setwd( "C:/My Directory/BRFSS/" )
years.to.download <- 1984:2014
source_url( "https://raw.githubusercontent.com/ajdamico/asdfree/master/Behavioral%20Risk%20Factor%20Surveillance%20System/download%20all%20microdata.R" , prompt = FALSE , echo = TRUE )




