Josip Dasovic
2008-Dec-10 20:41 UTC
[R] Confusion with Converting Factors to Dates using as.date
Dear R-Helpers:

I'm having a problem getting dates into the correct format. I have a data frame, which is based on a .csv file that I imported into R via read.table.

R has converted my date variables to factors; when I use the as.Date command, most of the values are converted "correctly" (and by this I guess I mean converted "as I wish them to be") but some have not been.

Here's what I have:

    str(pk.df)
    'data.frame': 206 obs. of 134 variables:
     $ uniqid  : int 010 015 120 130 210 245 320 330 415 ...
     $ st_date : Factor w/ 154 levels "01/01/48","01/01/51",..: 46 27 NA 12 118 NA 63 127 NA NA ...
     ...

I then convert them to a date class using

    st_date.new<-as.Date(st_date, "%m/%d/%y")

This _seems_ to work...

    str(st_date.new)
    Class 'Date' num [1:206] 8150 8466 NA 33982 10149 ...

But notice the 4th observation; I would like it to be 1963, not 2063.

    st_date.new[1:10]
     [1] "1992-04-25" "1993-03-07" NA "2063-01-15" "1997-10-15"
     [6] NA "1991-05-31" "1994-11-20" NA NA

    st_date[1:10]
     [1] 04/25/92 03/07/93 <NA> 01/15/63 10/15/97 <NA> 05/31/91
     [8] 11/20/94 <NA> <NA>
    154 Levels: 01/01/48 01/01/51 01/01/52 01/01/59 01/01/63 ... 12/31/96

I thought that the problem might be that I was converting a factor, so I first converted the variable to a character type (although I understand that this is done automatically) and then to date class, but I still had the same problem. Does anybody know how I can solve this and why I am getting this behavior?

One more tidbit: the earliest date for which the date conversion is "correct" is 1969-04-15, while the most recent date for which the century is "incorrect" is 1967-11-05.

Thanks,
Josip

Research Associate
Human Security Report Project
School for International Studies
Simon Fraser University
Suite 7200--515 W. Hastings St.
Vancouver, BC V6B 5K3
Canada
Peter Dalgaard
2008-Dec-10 21:16 UTC
[R] Confusion with Converting Factors to Dates using as.date
Josip Dasovic wrote:
> But notice the 4th observation; I would like it to be 1963, not 2063.
> [...]
> One more tidbit: the earliest date for which the date conversion is
> "correct" is 1969-04-15, while the most recent date for which the
> century is "incorrect" is 1967-11-05.

Well, to quote ?strptime:

    '%y' Year without century (00-99). If you use this on input, which
         century you get is system-specific. So don't! Often values up
         to 68 (or 69) are prefixed by 20 and 69 (or 70) to 99 by 19.
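The rule quoted from ?strptime can be written out explicitly, which also shows why 1969-04-15 came out "right" and 1967-11-05 came out "wrong" in the original post. What follows is only a sketch: `expand_year` and its `pivot` argument are hypothetical names introduced here, not part of base R, and the default of 68 is the "often" case from the help page rather than a guarantee for any particular system.

```r
# Hypothetical helper (not part of base R): expand a two-digit year with an
# explicit pivot instead of relying on the platform's default for %y.
# Years <= pivot are read as 20xx, years > pivot as 19xx.
expand_year <- function(yy, pivot = 68) {
  yy <- as.integer(yy)
  ifelse(yy <= pivot, 2000L + yy, 1900L + yy)
}

# With the commonly described pivot of 68:
expand_year(c("63", "67", "69", "92"))
# 2063 2067 1969 1992

# With a pivot chosen for data known to end in the 2000s (an assumption):
expand_year(c("63", "67", "69", "92"), pivot = 8)
# 1963 1967 1969 1992
```

Making the pivot an argument keeps the century decision visible in the code instead of leaving it to the C library underneath R.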
--
Peter Dalgaard
Dept. of Biostatistics, University of Copenhagen
Øster Farimagsgade 5, Entr.B, PO Box 2099, 1014 Cph. K, Denmark
(p.dalgaard at biostat.ku.dk)
Ph: (+45) 35327918  FAX: (+45) 35327907
Marc Schwartz
2008-Dec-10 21:25 UTC
[R] Confusion with Converting Factors to Dates using as.date
on 12/10/2008 02:41 PM Josip Dasovic wrote:
> But notice the 4th observation; I would like it to be 1963, not 2063.

This is the consequence of using a two digit year rather than a four digit year, which BTW, was one of the Y2K issues raised a decade ago...

As per ?strptime:

    %y Year without century (00-99). If you use this on input, which
       century you get is system-specific. So don't! Often values up
       to 68 (or 69) are prefixed by 20 and 69 (or 70) to 99 by 19.

If you know that all of your dates are going to be before 2000, you can do the following, by using a regex to convert the two digit year to a four digit year and then use as.Date() with '%Y':

    > st_date <- "01/15/63"
    > sub("([0-9]{2})$", "19\\1", st_date)
    [1] "01/15/1963"
    > as.Date(sub("([0-9]{2})$", "19\\1", st_date), format = "%m/%d/%Y")
    [1] "1963-01-15"

The better option is to ensure that the source of your data outputs or exports dates with a four digit year, before importing into R.

See ?sub and ?regex

HTH,

Marc Schwartz
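If the column holds dates from both centuries, the same regex idea can be applied per value. This is a rough sketch rather than a recommended recipe: `parse_mdy` and `pivot` are hypothetical names, the pivot of 8 assumes no date is later than 2008, and the value "02/01/05" in the usage lines is made up for illustration. Missing values and factor input are handled the way the original `st_date` column would need.

```r
# Hypothetical helper (not part of base R): convert "mm/dd/yy" values to Date
# with an explicit century. Two-digit years greater than `pivot` are read as
# 19xx, all others as 20xx.
parse_mdy <- function(x, pivot = 8) {
  x  <- as.character(x)                  # factors become their labels
  yy <- as.integer(sub(".*/", "", x))    # the two digits after the last "/"
  x4 <- ifelse(yy > pivot,
               sub("([0-9]{2})$", "19\\1", x),
               sub("([0-9]{2})$", "20\\1", x))
  as.Date(x4, format = "%m/%d/%Y")       # NA input stays NA
}

# First four values are from the original post; the last one is invented.
st_date <- factor(c("04/25/92", "03/07/93", NA, "01/15/63", "02/01/05"))
parse_mdy(st_date)
# "1992-04-25" "1993-03-07" NA "1963-01-15" "2005-02-01"
```

Because the four-digit year is built before as.Date() is called, the result does not depend on how the platform interprets '%y'.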