Jeff Johnson
2014-Jan-22 23:58 UTC
[R] subset and na.rm not really suppressing <NA> values
I have a dataset "mydf" with a field EMAIL_ADDRESS. When importing, I
specified:
mydf <- read.csv(file = extract, header = TRUE, stringsAsFactors = FALSE,
na.strings=c("NA",""))
I've also tried setting na.strings = c("NA","","<NA>") but I don't know if
it's appropriate to put <NA> there.
I'm running
a <- subset(mydf, VALID_EMAIL == FALSE, na.rm = TRUE, select = EMAIL_ADDRESS)
dput(head(a,5))
structure(list(EMAIL_ADDRESS = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_)), .Names = "EMAIL_ADDRESS",
row.names = c(17L, 22L, 23L, 24L, 30L), class = "data.frame")
The results show a lot of <NA> values on screen and in the dput statement.
I don't quite understand why it is doing that. I would have expected it to
exclude those since I had the na.rm = TRUE statement. Do you have any
suggestions?
Thanks!
--
Jeff
Jeff Newmiller
2014-Jan-23 02:59 UTC
[R] subset and na.rm not really suppressing <NA> values
I don't think na.rm is a valid parameter for the subset function. I would
normally use the is.na function to test for NA values. I also don't know
where your VALID_EMAIL variable is coming from.
a <- subset(mydf, !is.na(EMAIL_ADDRESS))
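To keep the original VALID_EMAIL filter and also drop the missing addresses, the two conditions can be combined. The following is only a sketch: the toy data frame is invented, and it assumes VALID_EMAIL is a logical column of mydf, which the original post does not confirm.

```r
# Toy stand-in for the poster's mydf; all values here are invented.
mydf <- data.frame(
  EMAIL_ADDRESS = c("a@example.com", NA, "not-an-email", NA),
  VALID_EMAIL   = c(TRUE, FALSE, FALSE, NA),
  stringsAsFactors = FALSE
)

# The original call: na.rm = TRUE has no effect on which rows are kept,
# so rows with a missing EMAIL_ADDRESS still come through.
a_bad <- subset(mydf, VALID_EMAIL == FALSE, na.rm = TRUE,
                select = EMAIL_ADDRESS)

# Test for missing addresses explicitly inside the row condition.
a <- subset(mydf, VALID_EMAIL == FALSE & !is.na(EMAIL_ADDRESS),
            select = EMAIL_ADDRESS)

print(a_bad)
print(a)
```

Note that subset() already drops rows where the condition itself evaluates to NA (here, the row whose VALID_EMAIL is NA), but it does not look at NA values in the selected column unless asked to.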
The na.strings argument to read.csv and friends is used to help recognise
strings in the input that should be treated as NA. If you don't see "<NA>"
in your input file then it will have no effect on the data import.
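A small illustration of the na.strings point, using an invented inline CSV rather than the poster's file (the ID column and all values are assumptions made for the example). The "<NA>" shown on screen is just how R prints a missing character value in a data frame; adding "<NA>" to na.strings only matters if that literal text appears in the input.

```r
# Invented CSV text: one real address, one empty field, one literal NA,
# and one field that literally contains the characters <NA>.
csv_text <- paste(
  "ID,EMAIL_ADDRESS",
  "1,a@example.com",
  "2,",
  "3,NA",
  "4,<NA>",
  sep = "\n"
)

# As in the original post: empty strings and "NA" become missing.
d1 <- read.csv(text = csv_text, stringsAsFactors = FALSE,
               na.strings = c("NA", ""))

# With "<NA>" added: the literal <NA> field becomes missing as well.
d2 <- read.csv(text = csv_text, stringsAsFactors = FALSE,
               na.strings = c("NA", "", "<NA>"))

print(is.na(d1$EMAIL_ADDRESS))
print(is.na(d2$EMAIL_ADDRESS))
```

The two results differ only in the fourth row, the one that actually contains the text <NA>.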
peter dalgaard
2014-Jan-24 10:16 UTC
[R] subset and na.rm not really suppressing <NA> values
subset.data.frame() does not have an na.rm argument!

-pd
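One way to check this for yourself is to inspect the method's formal arguments. This is a quick sketch; getS3method() comes from the utils package, which is attached by default in a normal R session.

```r
# Look up the method that subset() dispatches to for data frames.
method <- getS3method("subset", "data.frame")

# List its formal argument names.
arg_names <- names(formals(method))
print(arg_names)

# na.rm is not among them, but `...` is, so an unknown named argument
# is accepted without an error and then ignored.
print("na.rm" %in% arg_names)
print("..." %in% arg_names)
```

This would explain why the original call ran without complaint even though na.rm did nothing.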