RHelpPlease
2012-Mar-10 00:04 UTC
[R] Subsetting a data.frame -> Read in with FWF format from .DAT file
Hi there,

I am having trouble subsetting a data frame by a conditional via one column (of many).

I read the file into R through "read.fwf," where I specified column widths. Original data is .DAT. I then utilized "names" function to read in column headings.

For one column, PRVDR_NUM, I wish to further amend the entire data set, but only have PRVDR_NUM == 050108. This is where I'm having trouble.

I've tried code like this:

newinpatient <- subset(oldinpatient, oldinpatient$PRVDR_NUM == 050108)
#OR
newinpatient <- oldinpatient[oldinpatient$PRVDR_NUM == 050108, ]
#OR
providernum <- data.frame(newdim(PRVDR_NUM = c(050108))
newinpatient <- merge(providernum, oldinpatient)

With checking "class" at one point, I gathered that R interprets PRVDR_NUM as a factor, not a number .. so I've understood a potential reason why I would have errors (with code above). So, I then tried something like this:

newPRVDR_NUM <- format(as.numeric(levels(oldinpatient$PRVDR_NUM)[oldinpatient$PRVDR_NUM]))
numericprvdr <- data.frame(oldinpatient, newPRVDR_NUM)
bestprvdr <- numericprvdr[,-2]

I thought that with converting PRVDR_NUM to numeric, then one of the three options above would be satisfied. But, that has not worked either. (I did confirm that the factor -> numeric worked, which it did)

Though R reads the three options (above) with no errors, upon performing a "dim" check I receive the output: 0 93. The columns are correct, but rows (obviously) are not. (I did confirm that the desired value exists multiple times in the noted column, so 0 is definitely incorrect)

As well, I would like to work with PRVDR_NUM as a variable alone, but I've found that with any of these variables/column names, I have to use "allinpatient$PRVDR_NUM." R does not recognize PRVDR_NUM alone. Why?

More and more I think my problem is more foundational, meaning using the read.fwf function in the first place? Not using the read.fwf function correctly?
Again, I've made enough progress with other variables & data sets of this type I've been fine so far, but now & future I need to repeat this code enough times where help in better understanding my errors & a more elegant/efficient solution would be greatly appreciated.

Also note that R does not read all 93 columns as factors. Why would R interpret this six-wide column as a factor, but the nine-wide column next door as numeric?

Your help is most appreciated!

--
View this message in context: http://r.789695.n4.nabble.com/Subsetting-a-data-frame-Read-in-with-FWF-format-from-DAT-file-tp4461051p4461051.html
Sent from the R help mailing list archive at Nabble.com.
R. Michael Weylandt
2012-Mar-10 05:09 UTC
[R] Subsetting a data.frame -> Read in with FWF format from .DAT file
Inline.

On Fri, Mar 9, 2012 at 7:04 PM, RHelpPlease <rrumple at trghcsolutions.com> wrote:
> Hi there,
> I am having trouble subsetting a data frame by a conditional via one column
> (of many).
>
> I read the file into R through "read.fwf," where I specified column widths.
> Original data is .DAT. I then utilized "names" function to read in column
> headings.

The easiest way for us to do diagnostics is to see your data, and the easiest way for us to see your data is for you to use

dput(head(oldinpatient, 30))

so we can get a plain text (email friendly) version of it.

>
> For one column, PRVDR_NUM, I wish to further amend the entire data set, but
> only have PRVDR_NUM == 050108. This is where I'm having trouble.
>
> I've tried code like this:
>
> newinpatient <- subset(oldinpatient, oldinpatient$PRVDR_NUM == 050108)
> #OR
> newinpatient <- oldinpatient[oldinpatient$PRVDR_NUM == 050108, ]
> #OR

The two above this have a chance of working and, once we figure out what's going on, are good R idioms that should stay in your vocabulary (though strictly speaking, the second "oldinpatient" in the first is unnecessary due to some evaluation tricks). The two below are no good, so don't try that anymore.

> providernum <- data.frame(newdim(PRVDR_NUM = c(050108))
> newinpatient <- merge(providernum, oldinpatient)
>
> With checking "class" at one point, I gathered that R interprets PRVDR_NUM
> as a factor, not a number .. so I've understood a potential reason why I
> would have errors (with code above). So, I then tried something like this:

Yes, it's a terrible legacy....
most I/O functions let you set the option stringsAsFactors = FALSE to avoid this.

>
> newPRVDR_NUM <- format(as.numeric(levels(oldinpatient$PRVDR_NUM)
> [oldinpatient$PRVDR_NUM]))

This is almost right, though format() converts things back to character (undoing the as.numeric()). I find this idiom a little clearer (though, admittedly, still strange):

as.numeric(as.character(oldinpatient$PRVDR_NUM))

> numericprvdr <- data.frame(oldinpatient, newPRVDR_NUM)
> bestprvdr <- numericprvdr[,-2]
>
> I thought that with converting PRVDR_NUM to numeric, then one of the three
> options above would be satisfied. But, that has not worked either. (I did
> confirm that the factor -> numeric worked, which it did)
>

If it did work, these lines wouldn't: I think your earlier attempts would have worked after conversion to numeric, but the format() gets you back in trouble.

> Though R reads the three options (above) with no errors, upon performing a
> "dim" check I receive the output: 0 93. The columns are correct, but rows
> (obviously) are not. (I did confirm that the desired value exists multiple
> times in the noted column, so 0 is definitely incorrect)
>
> As well, I would like to work with PRVDR_NUM as a variable alone, but I've
> found that with any of these variables/column names, I have to use
> "allinpatient$PRVDR_NUM." R does not recognize PRVDR_NUM alone. Why?

Different question: the short answer is that, unlike SAS/SPSS, R can take multiple data sets on at the same time, so you have to direct it to which one you want. If you want to save keystrokes in a line where you refer to a data set multiple times, you can use with(), e.g.,

DATS <- data.frame(x = 1:5, y = 1:5, z = 11:15)
DATS$x + DATS$y + DATS$z
with(DATS, x + y + z) # same

>
> More and more I think my problem is more foundational, meaning using the
> read.fwf function in the first place? Not using the read.fwf function
> correctly?
> Again, I've made enough progress with other variables & data
> sets of this type I've been fine so far, but now & future I need to repeat
> this code enough times where help in better understanding my errors & a more
> elegant/efficient solution would be greatly appreciated.

I think you're fine with the read.fwf() function, though if .DAT is a common file format, someone else might have done the heavy lifting for you already. The definitive place to read all this is the R Data Import/Export manual (http://cran.r-project.org/doc/manuals/R-data.html), but it's not the easiest read.

>
> Also note that R does not read all 93 columns as factors. Why would R
> interpret this six-wide column as a factor, but the nine-wide column next
> door as numeric?

It has to do with what appear to be strings and what appear to be numbers (and that line is not where you may think): anything that is not totally unambiguously numeric becomes a string and, by default, strings become factors. Hence, many factors.

Michael

PS -- Thanks for showing what you've tried.
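
The points above can be pulled together in a small, self-contained sketch. Everything in the sample is an assumption made for illustration: the three data lines, the widths c(6, 9), and the second column name AMOUNT are invented, since the real file was never posted. The sketch also shows one likely (not confirmed) reason for the 0-row result that the reply does not spell out: the bare literal 050108 is the number 50108, so when it is compared with a factor or character column it is matched as the string "50108", which never equals "050108". Quoting the value avoids that. Note too that since R 4.0.0 the stringsAsFactors default in read.table() and friends is FALSE, so current versions give character columns rather than factors here.

```r
# Hypothetical fixed-width sample: a 6-wide provider number followed by a
# 9-wide amount. The second row contains a letter, which is the kind of
# value that stops a column from being read as numeric.
fwf_lines <- c("050108000012345",
               "05A108000067890",
               "050108000000042")
tmp <- tempfile(fileext = ".DAT")
writeLines(fwf_lines, tmp)

# Read every column as character so leading zeros survive and nothing is
# converted to a factor; convert individual columns afterwards.
oldinpatient <- read.fwf(tmp, widths = c(6, 9),
                         col.names = c("PRVDR_NUM", "AMOUNT"),
                         colClasses = "character")
oldinpatient$AMOUNT <- as.numeric(oldinpatient$AMOUNT)

# Compare against a quoted string, not the bare number 050108.
# Inside subset() the column can be named without the data frame prefix.
newinpatient <- subset(oldinpatient, PRVDR_NUM == "050108")
dim(newinpatient)

# Why the original comparison matched nothing: for a factor, == compares
# the level labels as strings, and the number 50108 becomes "50108".
f <- factor(oldinpatient$PRVDR_NUM)
n_unquoted <- sum(f == 050108)
n_quoted   <- sum(f == "050108")

unlink(tmp)
```

With this sample, dim(newinpatient) gives 2 rows and 2 columns, n_unquoted is 0 and n_quoted is 2. Keeping identifier-like columns such as provider numbers as character, rather than converting them to numeric, also preserves the leading zero when the data are written back out.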