Paul Miller
2012-Jan-24 16:54 UTC
[R] Checking for invalid dates: Code works but needs improvement
Hello Everyone,
Still new to R. Wrote some code that finds and prints invalid dates (see below).
This code works but I suspect it's not very good. If someone could show me a
better way, I'd greatly appreciate it.
Here is some information about what I'm trying to accomplish. My sense is
that the R date functions are best at identifying invalid dates when fed
character data in their default format. So my code converts the input dates to
character, breaks them apart using strsplit, and then reformats them. It then
identifies which dates are "missing" in the sense that the month or
year are unknown and prints out any remaining invalid date values.
As I see it, the code has at least 4 shortcomings.
1. It's too long. My understanding is that skilled programmers can usually
or often complete tasks like this in a few lines.
2. It's not vectorized. I started out trying to do something that was
vectorized but ran into problems with the strsplit function. I looked at the
help file and it appears this function will only accept a single character
vector.
3. It prints out the incorrect dates but doesn't indicate which date
variable they belong to. I tried various things with paste but never came up
with anything that worked. Ideally, I'd like to get something that looks
roughly like:
Error: Invalid date values in birthDT
"21931-11-23"
"1933-06-31"
Error: Invalid date values in diagnosisDT
"2010-02-30"
4. There's no way to specify names for input and output data. I imagine it
would be fairly easy to specify these in the arguments to a function, but I am
not sure how to incorporate that into a for loop.
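
A note on point 2: strsplit does accept a whole character vector and returns a list with one element per input string, so the split itself needs no loop. A minimal sketch, assuming every entry has exactly three "/"-separated parts; the vector `s` and the names `pieces` and `parts` are illustrative and not part of the original code:

```r
# Illustrative input in the same mm/dd/yyyy layout as the test data
s <- c("11/23/21931", "06/20/1940", "un/17/2011", "11/20/un")

# strsplit() is vectorized over its first argument: one list element per string
pieces <- strsplit(s, "/", fixed = TRUE)

# Bind the pieces into a character matrix with one row per date
parts <- do.call(rbind, pieces)
colnames(parts) <- c("Month", "Day", "Year")

parts[, "Year"]
```

Working from a matrix like this, the month, day, and year columns can each be handled as ordinary vectors.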
Thanks,
Paul
##########################################
#### Code for detecting invalid dates ####
##########################################
#### Test Data ####
connection <- textConnection("
1 11/23/21931 05/23/2009 un/17/2011
2 06/20/1940 02/30/2010 03/17/2011
3 06/17/1935 12/20/2008 07/un/2011
4 05/31/1937 01/18/2007 04/30/2011
5 06/31/1933 05/16/2009 11/20/un
")
TestDates <- data.frame(scan(connection,
list(Patient=0, birthDT="", diagnosisDT="",
metastaticDT="")))
close(connection)
TestDates
class(TestDates$birthDT)
class(TestDates$diagnosisDT)
class(TestDates$metastaticDT)
#### List of Date Variables ####
DateNames <- c("birthDT", "diagnosisDT",
"metastaticDT")
#### Read Dates ####
for (i in seq(TestDates[DateNames])){
  # Split each mm/dd/yyyy string into its month, day, and year parts
  TestDates[DateNames][[i]] <- as.character(TestDates[DateNames][[i]])
  TestDates$ParsedDT <- strsplit(TestDates[DateNames][[i]], "/")
  TestDates$Month <- sapply(TestDates$ParsedDT, function(x) x[1])
  TestDates$Day <- sapply(TestDates$ParsedDT, function(x) x[2])
  TestDates$Year <- sapply(TestDates$ParsedDT, function(x) x[3])
  # An unknown day is set to the 15th; an unknown month or year makes the date missing
  TestDates$Day[TestDates$Day == "un"] <- "15"
  TestDates[DateNames][[i]] <- with(TestDates, paste(Year, Month, Day, sep = "-"))
  is.na(TestDates[DateNames][[i]][TestDates$Month == "un"]) <- TRUE
  is.na(TestDates[DateNames][[i]][TestDates$Year == "un"]) <- TRUE
  # A date is invalid if it is not missing but still fails to parse
  TestDates$Date <- as.Date(TestDates[DateNames][[i]], format = "%Y-%m-%d")
  TestDates$Invalid <- ifelse(is.na(TestDates$Date) &
                                !is.na(TestDates[DateNames][[i]]), 1, 0)
  # Convert the column only if every non-missing value parsed; otherwise print the bad ones
  if (sum(TestDates$Invalid) == 0) {
    TestDates[DateNames][[i]] <- TestDates$Date
  } else {
    print(TestDates[DateNames][[i]][TestDates$Invalid == 1])
  }
  TestDates <- subset(TestDates, select = -c(ParsedDT, Month, Day, Year, Date, Invalid))
}
TestDates
class(TestDates$birthDT)
class(TestDates$diagnosisDT)
class(TestDates$metastaticDT)
Rui Barradas
2012-Jan-24 17:54 UTC
[R] Checking for invalid dates: Code works but needs improvement
Hello,
Point 3 is very simple: instead of 'print', use 'cat'.
Unlike 'print', it allows for several arguments and (very) simple
formatting.
{ cat("Error: Invalid date values in", DateNames[[i]],
"\n",
TestDates[DateNames][[i]][TestDates$Invalid==1], "\n")
}
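
To get closer to the layout requested in point 3 (one quoted value per line), the values can be wrapped in double quotes and passed to cat with sep = "\n". This is only a sketch; the helper name `report_invalid` is hypothetical:

```r
# Hypothetical helper: print the offending values under a header naming the column
report_invalid <- function(varname, values) {
  cat("Error: Invalid date values in ", varname, "\n", sep = "")
  # paste0() adds the double quotes; sep = "\n" puts one value per line
  cat(paste0('"', values, '"'), sep = "\n")
  cat("\n")
}

report_invalid("birthDT", c("21931-11-23", "1933-06-31"))
```

Inside the loop, the call would take DateNames[[i]] as the first argument and the invalid values as the second.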
Rui Barradas
Paul Miller
2012-Jan-26 15:54 UTC
[R] Checking for invalid dates: Code works but needs improvement
Sorry, sent this earlier but forgot to add an informative subject line. Am
resending, in the hopes of getting further replies. My apologies. Hope this is
OK.
Paul
Hi Rui,
Thanks for your reply to my post. My code still has various shortcomings but at
least now it is fully functional.
It may be that, as I transition to using R, I'll have to live with some less
than ideal code, at least at the outset. I'll just have to write and
re-write my code as I improve.
Appreciate your help.
Paul
Gabor Grothendieck
2012-Jan-26 18:07 UTC
[R] Checking for invalid dates: Code works but needs improvement
If s is a vector of character strings representing dates, then bad is a
logical vector which is TRUE for the bad ones and FALSE for the good ones
(adjust as needed if a different date range is valid). So s[bad] is the bad
inputs, and the output d is a "Date" vector with NAs for the bad ones:
x <- gsub("un", 15, s)
d <- as.Date(x, "%m/%d/%Y")
bad <- is.na(d) | d < as.Date("1900-01-01") | d > Sys.Date()
d[bad] <- NA
--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
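
The same check can be applied to several columns at once while keeping track of which column each bad value came from. A sketch under assumptions: the wrapper name `check_dates`, its `earliest` argument, and the small data frame are illustrative, and the 1900-to-today range is carried over from the reply above.

```r
# Hypothetical wrapper around the approach above: returns the bad inputs per column
check_dates <- function(df, cols, earliest = as.Date("1900-01-01")) {
  bad_values <- lapply(df[cols], function(s) {
    s <- as.character(s)
    x <- gsub("un", "15", s, fixed = TRUE)
    d <- as.Date(x, "%m/%d/%Y")
    bad <- is.na(d) | d < earliest | d > Sys.Date()
    s[bad]
  })
  # Keep only the columns that actually contain bad values
  bad_values[lengths(bad_values) > 0]
}

dates <- data.frame(
  birthDT     = c("06/20/1940", "06/31/1933"),
  diagnosisDT = c("05/23/2009", "12/20/2008"),
  stringsAsFactors = FALSE
)
check_dates(dates, c("birthDT", "diagnosisDT"))
```

Because the result is a named list, the names can be used directly in any message that reports the invalid values.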
Rui Barradas
2012-Jan-27 04:18 UTC
[R] Checking for invalid dates: Code works but needs improvement
Hello, again.
I now have a more complete answer to your points.
> 1. It's too long. My understanding is that skilled programmers can usually
> or often complete tasks like this in a few lines.
It's not much shorter, but it's more readable. (The programmer is always
suspect.)
> 2. It's not vectorized. I started out trying to do something that was
> vectorized but ran into problems with the strsplit function. I looked at
> the help file and it appears this function will only accept a single
> character vector.
All but one of the instructions are vectorized, and the one that is not only
loops over a few column names. Use 'unlist' on the output of 'strsplit' to
get a vector.
> 4. There's no way to specify names for input and output data. I imagine
> this would be fairly easy to specify this in the arguments to a function
> but am not sure how to incorporate it into a for loop.
You can now specify any matrix or data.frame, but it will only process the
columns with dates. (This is not true, it will process anything with a '/' in
it. Pay attention.)
Near the beginning of your code, include the following:
> TestDates <- data.frame(scan(connection,
> list(Patient=0, birthDT="", diagnosisDT="",
> metastaticDT="")))
>
> close(connection)
TDSaved <- TestDates # to avoid reopening the connection
And then, after all of it,
fun <- function(Dat){
  f <- function(jj, DF){
    x <- as.character(DF[, jj])
    x <- unlist(strsplit(x, "/"))
    n <- length(x)
    M <- x[seq(1, n, 3)]
    D <- x[seq(2, n, 3)]
    Y <- x[seq(3, n, 3)]
    D[D == "un"] <- "15"
    Y <- ifelse(nchar(Y) > 4 | Y < 1900, NA, Y)
    x <- as.Date(paste(Y, M, D, sep="-"), format="%Y-%m-%d")
    if(any(is.na(x)))
      cat("Warning: Invalid date values in", jj, "\n",
          as.character(DF[is.na(x), jj]), "\n")
    x
  }
  colinx <- colnames(as.data.frame(Dat))
  Dat <- data.frame(sapply(colinx, function(j) f(j, Dat)))
  for(i in colinx) class(Dat[[i]]) <- "Date"
  Dat
}
TD <- TDSaved
TD[, DateNames] <- fun(TD[, DateNames])
TD
Had fun writing it. Good luck.
Rui Barradas
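
Regarding the caveat that anything with a '/' in it gets processed: one possible guard is to select only the columns whose entries all look like dates before passing them on. A sketch, assuming the mm/dd/yyyy layout with "un" as the placeholder for an unknown part; the helper name `looks_like_date` and the example data frame are hypothetical:

```r
# Hypothetical helper: TRUE for columns whose entries all look like mm/dd/yyyy,
# where any part may be the placeholder "un"
looks_like_date <- function(col) {
  pattern <- "^([0-9]{1,2}|un)/([0-9]{1,2}|un)/([0-9]+|un)$"
  all(grepl(pattern, as.character(col)))
}

df <- data.frame(
  Patient = c(1, 2),
  birthDT = c("11/23/21931", "06/20/1940"),
  ratio   = c("1/2", "3/4"),
  stringsAsFactors = FALSE
)

# Only birthDT has three "/"-separated parts in every row
date_cols <- names(df)[vapply(df, looks_like_date, logical(1))]
date_cols
```

The pattern deliberately accepts years of any length, so that values such as "11/23/21931" are still selected and can then be reported as invalid.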
Paul Miller
2012-Jan-30 13:44 UTC
[R] Checking for invalid dates: Code works but needs improvement
Hi Rui, Marc, and Gabor,
Thanks for your replies to my question. All were helpful and it was interesting
to see how different people approach various aspects of the same problem.
Spent some time this weekend looking at Rui's solution, which is certainly
much clearer than my own. Managed to figure out pretty much all the details of
how it works. Also managed to tweak it slightly in order to make it do exactly
what I wanted. (See revised code below.)
Still have a few questions though. The first concerns the insertion of the
code "Y > 2012" to set year values beyond 2012 to NA (on line 10 of
the function below). When I add this (or use it in place of
"nchar(Y) > 4"), the code successfully finds the problem date "05/16/2015".
After that though, it produces the following error message:
Error in if (any(is.na(x) & M != "un" & Y != "un")) cat("Warning: Invalid date values in", :
missing value where TRUE/FALSE needed
Why is this happening? If the code correctly handles the date
"06/20/1840" without producing an error, why can't it do likewise
with "05/16/2015"?
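
One way to see what is going on, offered as a sketch rather than a full diagnosis: Y is a character vector, so Y > 2012 is evaluated as a string comparison (2012 is coerced to "2012"), and once some element of Y has been set to NA, the test Y != "un" is NA for that element too. If no element of the combined condition is TRUE, any() returns NA, which if() rejects. A column that still contains at least one TRUE passes, which may be why "06/20/1840" caused no error. The values below are illustrative:

```r
# Comparing character with numeric coerces the number to a string first,
# so the comparison is alphabetical rather than numerical
"10" < 9                             # TRUE: "10" sorts before "9"

# An NA in the condition propagates through & and any()
Y <- c("2011", NA)
cond <- c(FALSE, TRUE) & Y != "un"   # FALSE, NA
any(cond)                            # NA, which if() cannot use

# Converting the years to numbers and guarding against NA avoids both issues
Y_chr <- c("2011", "2015", "un")
Y_num <- suppressWarnings(as.numeric(Y_chr))   # "un" becomes NA
too_late <- !is.na(Y_num) & Y_num > 2012
any(too_late, na.rm = TRUE)
```

The same idea applies to the lower bound: compare the numeric years against 1900 rather than the character ones.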
The second question is why it's necessary to put "x" on line 15
following "cat("Warning ...)". I know that I don't get any
date columns if I don't include this but am not sure why.
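
On the second question: an R function returns the value of the last expression it evaluates. cat() returns NULL invisibly, and an if() with no else also yields NULL when its condition is FALSE, so without the trailing x the inner function would hand back NULL instead of the dates. A small illustration with hypothetical function names:

```r
# Without a final expression, the value of the if() statement is returned
no_return <- function(x) {
  if (any(is.na(x))) cat("Warning: missing values\n")
}

# Ending with x makes the function return the vector itself
with_return <- function(x) {
  if (any(is.na(x))) cat("Warning: missing values\n")
  x
}

is.null(no_return(c(1, NA)))   # TRUE
with_return(c(1, NA))          # the original vector
```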
The third question is whether it's possible to change the class of the date
variables without using a for loop. I played around with this a little but
didn't find a vectorized alternative. It may be that this is not really
important. It's just that I've read in several places that for loops
should be avoided wherever possible.
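
On the third question, one option is to build the columns with lapply instead of sapply. sapply simplifies its result to a matrix, which drops the "Date" class and is why the class has to be restored afterwards; lapply returns a list whose elements keep their class. A sketch with an illustrative data frame (the names `raw` and `parsed` are hypothetical):

```r
raw <- data.frame(
  birthDT     = c("06/20/1940", "06/17/1935"),
  diagnosisDT = c("05/23/2009", "12/20/2008"),
  stringsAsFactors = FALSE
)

# lapply() returns a list of Date vectors; assigning into parsed[] keeps the
# data frame structure and each column's "Date" class
parsed <- raw
parsed[] <- lapply(raw, as.Date, format = "%m/%d/%Y")

sapply(parsed, class)
```

That said, a for loop over a handful of column names costs very little; the usual advice applies mainly to loops over individual elements of long vectors.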
Thanks,
Paul
##########################################
#### Code for detecting invalid dates ####
##########################################
#### Test Data ####
connection <- textConnection("
1 11/23/21931 05/23/2009 un/17/2011
2 06/20/1840 02/30/2010 03/17/2011
3 06/17/1935 12/20/2008 07/un/2011
4 05/31/1937 01/18/2007 04/30/2011
5 06/31/1933 05/16/2015 11/20/un
")
TestDates <- data.frame(scan(connection,
list(Patient=0, birthDT="", diagnosisDT="",
metastaticDT="")))
close(connection)
#### Input Data ####
TDSaved <- TestDates
#### List of Date Variables ####
DateNames <- c("birthDT", "diagnosisDT",
"metastaticDT")
#### Date Function ####
fun <- function(Dat){
  f <- function(jj, DF){
    x <- as.character(DF[, jj])
    x <- unlist(strsplit(x, "/"))
    n <- length(x)
    M <- x[seq(1, n, 3)]
    D <- x[seq(2, n, 3)]
    Y <- x[seq(3, n, 3)]
    D[D == "un"] <- "15"
    Y <- ifelse(nchar(Y) > 4 | Y > 2012 | Y < 1900, NA, Y)
    x <- as.Date(paste(Y, M, D, sep="-"), format="%Y-%m-%d")
    if(any(is.na(x) & M != "un" & Y != "un"))
      cat("Warning: Invalid date values in", jj, "\n",
          as.character(DF[is.na(x), jj]), "\n")
    x
  }
  Dat <- data.frame(sapply(names(Dat), function(j) f(j, Dat)))
  for(i in names(Dat)) class(Dat[[i]]) <- "Date"
  Dat
}
#### Output Data ####
TD <- TDSaved
#### Read Dates ####
TD[, DateNames] <- fun(TD[, DateNames])
TD
Paul Miller
2012-Feb-04 14:06 UTC
[R] Checking for invalid dates: Code works but needs improvement
Hi David and Rui,
Sorry to be so slow in replying.
Thank you both for pointing out that the problem with my code was that I was
using comparison operators on mixed data types. This is something I'll have
to be more careful about in the future.
In an earlier email, David talked about how R can seem uncooperative or even
"unfair" when you're just starting out. I too have had this experience, but
it seems less "unfair" each time I use it. This time, I was able to write
inelegant but functional code to solve my problem. Last time, I wasn't able
to solve a much simpler problem at all. So I guess that's a kind of progress.
At this point, I have serviceable code for checking my dates. I can improve
this when I begin to develop some real skill as an R programmer, but it will
do nicely for now.
Thanks everyone for your help with this.
Paul