Hi,

I want to load a dataset into R. It is available in two formats, .XPT and .ASC, at http://www.cdc.gov/brfss/annual_data/annual_2006.htm. The files are about 40 MB zipped and about 500 MB unzipped.

I can get the .XPT data to load, using:

> library(Hmisc)
> data <- sasxport.get("CDBRFS06.XPT")

The data look fine, with no error messages. However, the data contain only 302 columns, which is fewer than they should have (according to the documentation). They do not contain my variables of interest, so either the documentation or the data file is wrong, and I want to make sure it's not the data file.

Hence I wanted to see whether I get the same results loading the .ASC file. However, several ways of doing so have failed.

> library(adehabitat)
> import.asc("CDBRFS06.asc")

Results in:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
  scan() expected 'a real', got '1191.8808943.38209868648.960119'

> library(SDMTools)
> read.asc("CDBRFS06.asc")

Results in:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
  scan() expected 'a real', got '1191.8808943.38209868648.960119'
In addition: Warning messages:
1: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
  number of items read is not a multiple of the number of columns
2: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
  number of items read is not a multiple of the number of columns
3: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
  number of items read is not a multiple of the number of columns
4: In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
  number of items read is not a multiple of the number of columns
5: In scan(file, nmax = nl * nc, skip = 6, quiet = TRUE) :
  NAs introduced by coercion to integer range

Thank you for your help.
Eiko
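Before picking a reader for an .asc file, it can help to look at its first few lines, since the extension is used for more than one kind of file. The sketch below is only a heuristic, assuming that ASCII raster grids start with a short keyword header such as "ncols 10"; the function name looks_like_ascii_grid and both sample files are made up for illustration.

```r
# Heuristic: does the file start like an ASCII raster grid?
# Assumption: such grids open with header lines like "ncols 2", "nrows 2", ...
looks_like_ascii_grid <- function(path, n = 6) {
  header <- readLines(path, n = n, warn = FALSE)
  any(grepl("^[[:space:]]*ncols[[:space:]]+[0-9]+", header, ignore.case = TRUE))
}

# Two tiny made-up files standing in for real data
grid_file <- tempfile(fileext = ".asc")
writeLines(c("ncols 2", "nrows 2", "xllcorner 0", "yllcorner 0",
             "cellsize 1", "NODATA_value -9999", "1 2", "3 4"), grid_file)

fwf_file <- tempfile(fileext = ".asc")
writeLines(c("010215xxx1", "361231yyy2"), fwf_file)

print(looks_like_ascii_grid(grid_file))  # TRUE
print(looks_like_ascii_grid(fwf_file))   # FALSE
```

If the check comes back FALSE and every line has the same length, the file is more likely a fixed-width text file that needs a column layout to be read.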
Jan van der Laan
2016-Feb-23 21:07 UTC
[R] Loading large .pxt and .asc datasets causes issues.
First, the file does contain 302 columns; the variable layout (http://www.cdc.gov/brfss/annual_data/2006/varlayout_table_06.htm) contains 302 columns. So reading the SAS file probably works correctly.

Second, the read.asc function you use is for reading geographic raster files, not fixed-width files. Below, I show how you could read the file using the LaF package (sorry for the long dump of variables; I copy-pasted them from the page linked to above):

columns <- "StartingColumn VariableName FieldLength
1 _STATE 2
3 _GEOSTR 2
5 _DENSTR2 1
6 PRECALL 1
7 REPNUM 5
12 REPDEPTH 2
14 FMONTH 2
16 IDATE 8
16 IMONTH 2
18 IDAY 2
20 IYEAR 4
24 INTVID 3
27 DISPCODE 3
30 SEQNO 10
30 _PSU 10
40 NATTMPTS 2
42 NRECSEL 6
48 NRECSTR 9
57 CTELENUM 1
58 CELLFON1 1
59 PVTRESID 1
60 NUMADULT 2
62 NUMMEN 2
64 NUMWOMEN 2
73 GENHLTH 1
74 PHYSHLTH 2
76 MENTHLTH 2
78 POORHLTH 2
80 HLTHPLAN 1
81 PERSDOC2 1
82 MEDCOST 1
83 CHECKUP 1
84 EXERANY2 1
85 DIABETE2 1
86 LASTDEN3 1
87 RMVTETH3 1
88 DENCLEAN 1
89 CVDINFR3 1
90 CVDCRHD3 1
91 CVDSTRK3 1
92 ASTHMA2 1
93 ASTHNOW 1
94 QLACTLM2 1
95 USEEQUIP 1
96 SMOKE100 1
97 SMOKDAY2 1
98 STOPSMK2 1
99 AGE 2
101 HISPANC2 1
102 MRACE 6
108 ORACE2 1
109 MARITAL 1
110 CHILDREN 2
112 EDUCA 1
113 EMPLOY 1
114 INCOME2 2
116 WEIGHT2 4
120 HEIGHT3 4
124 CTYCODE 3
132 NUMHHOL2 1
133 NUMPHON2 1
134 TELSERV2 1
135 SEX 1
136 PREGNANT 1
137 VETERAN 1
138 DRNKANY4 1
139 ALCDAY4 3
142 AVEDRNK2 2
144 DRNK3GE5 2
146 MAXDRNKS 2
148 FLUSHOT3 1
149 FLUSPRY2 1
162 PNEUVAC3 1
163 HEPBVAC 1
164 HEPBRSN 1
165 FALL3MN2 2
167 FALLINJ2 2
169 SEATBELT 1
170 DRINKDRI 2
172 HADMAM 1
173 HOWLONG 1
174 PROFEXAM 1
175 LENGEXAM 1
176 HADPAP2 1
177 LASTPAP2 1
178 HADHYST2 1
179 PSATEST 1
180 PSATIME 1
181 DIGRECEX 1
182 DRETIME 1
183 PROSTATE 1
184 BLDSTOOL 1
185 LSTBLDS2 1
186 HADSIGM3 1
187 LASTSIG2 1
188 HIVTST5 1
189 HIVTSTD2 6
195 WHRTST7 2
197 HIVRDTST 1
198 EMTSUPRT 1
199 LSATISFY 1
200 RCSBIRTH 6
206 RCSGENDR 1
207 RCHISLAT 1
208 RCSRACE 6
214 RCSBRACE 1
215 RCSRELN1 1
216 DRHPCH 1
217 HAVHPCH 1
218 CIFLUSH2 1
219 RCVFVCH2 6
225 RNOFVCH2 2
227 CASTHDX2 1
228 CASTHNO2 1
229 DIABAGE2 2
231 INSULIN 1
232 DIABPILL 1
233 BLDSUGAR 3
236 FEETCHK2 3
239 FEETSORE 1
240 DOCTDIAB 2
242 CHKHEMO3 2
244 FEETCHK 2
246 EYEEXAM 1
247 DIABEYE 1
248 DIABEDU 1
249 VIDFCLT2 1
250 VIREDIF2 1
251 VIPRFVS2 1
252 VINOCRE2 2
254 VIEYEXM2 1
255 VIINSUR2 1
256 VICTRCT2 1
257 VIGLUMA2 1
258 VIMACDG2 1
259 VIATWRK2 1
260 PAINACT2 2
262 QLMENTL2 2
264 QLSTRES2 2
266 QLREST2 2
268 QLHLTH2 2
270 ASTHMAGE 2
272 ASATTACK 1
273 ASERVIST 2
275 ASDRVIST 2
277 ASRCHKUP 2
279 ASACTLIM 3
282 ASYMPTOM 1
283 ASNOSLEP 1
284 ASTHMED2 1
285 ASINHALR 1
286 BRTHCNT3 1
287 TYPCNTR4 2
289 NOBCUSE2 2
291 FPCHLDFT 1
292 FPCHLDHS 1
293 VITAMINS 1
294 MULTIVIT 1
295 FOLICACD 1
296 TAKEVIT 3
299 RECOMMEN 1
300 HOUSESMK 1
301 INDOORS 1
302 SMKPUBLC 1
303 SMKWORK 1
304 IAQHTSRC 1
305 IAQGASAP 1
306 IAQHTDYS 3
309 IAQCODTR 1
310 IAQMOLD 1
311 HEWTRSRC 1
312 HEWTRDRK 1
313 HECHMHOM 3
316 HECHMYRD 3
319 RRCLASS2 1
320 RRCOGNT2 1
321 RRATWORK 1
322 RRHCARE2 1
323 RRPHYSM1 1
324 RREMTSM1 1
325 ADPLEASR 2
327 ADDOWN 2
329 ADSLEEP 2
331 ADENERGY 2
333 ADEAT 2
335 ADFAIL 2
337 ADTHINK 2
339 ADMOVE 2
341 ADANXEV 1
342 ADDEPEV 1
343 SVSAFE 1
344 SVSEXTCH 1
345 SVNOTCH 1
346 SVEHDSE1 1
347 SVHDSX12 1
348 SVEANOS1 1
349 SVNOSX12 1
350 SVRELAT2 2
352 SVGENDER 1
353 IPVSAFE 1
354 IPVTHRAT 1
355 IPVPHYV1 1
356 IPVPHHRT 1
357 IPVUWSEX 1
358 IPVPVL12 1
359 IPVSXINJ 1
360 IPVRELT1 2
362 GPWELPRD 1
363 GPVACPLN 1
364 GP3DYWTR 1
365 GP3DYFOD 1
366 GP3DYPRS 1
367 GPBATRAD 1
368 GPFLSLIT 1
369 GPMNDEVC 1
370 GPNOTEVC 2
372 GPEMRCOM 1
373 GPEMRINF 1
741 QSTVER 1
742 QSTLANG 2
800 _STSTR 5
805 _STRWT 10
815 _RAW 10
825 _WT2 10
835 _POSTSTR 10
845 _FINALWT 10
935 _REGION 2
937 _AGEG_ 2
939 _SEXG_ 1
940 _RACEG3_ 1
941 _RACEG4_ 1
942 _IMPAGE 2
944 _IMPNPH 1
945 _ITSCF1 10
955 _ITSCF2 10
965 _ITSPOST 10
975 _ITSFINL 10
993 MSCODE 1
994 CRACEORG 6
1000 CRACEASC 6
1006 _CRACE 2
1008 _CSEXG_ 1
1009 _CRACEG_ 1
1010 _CAGEG_ 3
1033 _RAWCH 10
1063 _WT2CH 10
1093 _POSTCH 10
1123 _CHILDWT 10
1133 _RAWHH 10
1143 _WT2HH 10
1153 _POSTHH 10
1163 _HOUSEWT 10
1173 _RFHLTH 1
1174 _TOTINDA 1
1175 _EXTETH2 1
1176 _ALTETH2 1
1177 _DENVST1 1
1178 _LTASTHM 1
1179 _CASTHMA 1
1180 _ASTHMST 1
1181 _SMOKER3 1
1182 _RFSMOK3 1
1183 MRACEORG 6
1189 MRACEASC 6
1195 _PRACE 2
1197 _MRACE 2
1199 _RACEG2 1
1200 _RACEGR2 1
1201 _RACE_G 1
1202 _CNRACE 1
1203 _CNRACEC 1
1204 RACE2 1
1205 _AGEG5YR 2
1207 _AGE65YR 1
1208 _AGE_G 1
1209 HTIN3 3
1212 HTM3 3
1215 WTKG2 5
1220 _BMI4 4
1224 _BMI4CAT 1
1225 _RFBMI4 1
1226 _CHLDCNT 1
1227 _EDUCAG 1
1228 _INCOMG 1
1229 DROCDY2_ 3
1232 _RFBING4 1
1233 _DRNKDY3 4
1237 _DRNKMO3 4
1241 _RFDRHV3 1
1242 _RFDRMN3 1
1243 _RFDRWM3 1
1244 _FLSHOT3 1
1245 _PNEUMO2 1
1246 _RFSEAT2 1
1247 _RFSEAT3 1
1248 _RFMAM2Y 1
1249 _MAM502Y 1
1250 _RFPAP32 1
1251 _RFPSA2Y 1
1252 _RFBLDST 1
1253 _RFSIGM2 1
1254 _AIDTST2 1"

columns <- read.table(textConnection(columns), header=TRUE,
  stringsAsFactors = FALSE)

library(LaF)
laf <- laf_open_fwf(filename = "CDBRFS06.ASC",
  column_names = columns$VariableName,
  column_widths = columns$FieldLength,
  column_types = rep("character", nrow(columns)))

# You now have a connection to the file; you can index this connection
# as you would a data.frame

# read all data
data <- laf[,]

# read the first 5 columns
data <- laf[, 1:5]

# read a random sample of rows
data <- laf[sample(nrow(laf), 10), ]

HTH,
Jan
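One thing that may be worth checking in the layout above: the starting columns are not always contiguous (NUMWOMEN ends at column 65 while GENHLTH starts at 73), and some fields overlap (IDATE starts at the same column as IMONTH and spans IDAY and IYEAR; SEQNO and _PSU share a start). A reader that is given only field lengths would presumably read the fields back to back, so gaps and overlaps may need to be handled explicitly. The following is a minimal sketch of one way to do that, using base R's read.fwf rather than LaF, on a tiny made-up file. The helper fwf_spec, the toy layout and the file contents are all hypothetical.

```r
# Toy layout with one overlap (IDATE vs IMONTH/IDAY) and one gap (columns 7-9)
layout <- data.frame(
  start = c(1, 3, 3, 5, 10),
  name  = c("STATE", "IDATE", "IMONTH", "IDAY", "SEX"),
  width = c(2, 4, 2, 2, 1),
  stringsAsFactors = FALSE
)

# Turn start positions and widths into a widths vector for read.fwf():
# gaps become negative widths (skipped characters), and a field that
# overlaps one already kept is dropped (the wider field wins on ties).
fwf_spec <- function(layout) {
  layout <- layout[order(layout$start, -layout$width), ]
  widths <- numeric(0)
  keep <- logical(nrow(layout))
  pos <- 1  # next character position not yet consumed
  for (i in seq_len(nrow(layout))) {
    s <- layout$start[i]
    if (s < pos) next                             # overlap: skip this field
    if (s > pos) widths <- c(widths, -(s - pos))  # gap: skip s - pos characters
    widths <- c(widths, layout$width[i])
    keep[i] <- TRUE
    pos <- s + layout$width[i]
  }
  list(widths = widths, names = layout$name[keep])
}

spec <- fwf_spec(layout)

# Made-up two-record file: STATE(2) IDATE(4) filler(3) SEX(1)
tmp <- tempfile(fileext = ".asc")
writeLines(c("010215xxx1", "361231yyy2"), tmp)

dat <- read.fwf(tmp, widths = spec$widths, col.names = spec$names,
                colClasses = "character")
print(dat)
```

Reading everything as character keeps leading zeros in codes such as "01". For the real file the same idea would apply to the full layout, although read.fwf can be slow on a file of this size, so the computed widths (with filler columns in place of negative widths) might be passed to a faster reader instead.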
Federman, Douglas
2016-Feb-23 21:39 UTC
[R] Loading large .pxt and .asc datasets causes issues.
You might want to look at Anthony Damico's work at
http://www.asdfree.com/search/label/behavioral%20risk%20factor%20surveillance%20system%20%28brfss%29

--
Better name for the general practitioner might be multispecialist.
~Martin H. Fischer (1879-1962)
Anthony Damico
2016-Feb-24 03:02 UTC
[R] Loading large .pxt and .asc datasets causes issues.
hi eiko, LaF is incompatible with survey data, that road is a dead-end. the code below will painlessly load brfss into R. review the link douglas sent for analysis examples, and change `years.to.download <- ` to 2006 only if you just want a single year of microdata. glhf

# install.packages( c("MonetDB.R", "MonetDBLite" , "survey" , "SAScii" , "descr" , "downloader" , "digest" ) , repos=c("http://dev.monetdb.org/Assets/R/", "http://cran.rstudio.com/"))

# setInternet2( FALSE )                   # # only windows users need this line
# options( encoding = "windows-1252" )    # # only macintosh and *nix users need this line
library(downloader)
# setwd( "C:/My Directory/BRFSS/" )
years.to.download <- 1984:2014
source_url( "https://raw.githubusercontent.com/ajdamico/asdfree/master/Behavioral%20Risk%20Factor%20Surveillance%20System/download%20all%20microdata.R" , prompt = FALSE , echo = TRUE )
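Whichever route is used to load the file, the original question (are the variables of interest actually present?) can be checked directly against the column names. The sketch below uses made-up names; it compares case-insensitively on the assumption that an importer may change the case of variable names (sasxport.get, as far as I know, lower-cases them by default). Importers may also alter other characters, such as leading underscores, which this simple comparison does not account for.

```r
# Hypothetical stand-ins: names of the loaded data and the variables wanted
loaded_names <- c("genhlth", "sex", "addown")
wanted <- c("GENHLTH", "ADDOWN", "SMOKE100")

# Variables that are wanted but not found, ignoring case
missing_vars <- wanted[!(tolower(wanted) %in% tolower(loaded_names))]
print(missing_vars)  # "SMOKE100"
```

With real data, loaded_names would be names(data), and wanted could be taken from the VariableName column of the published layout.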