thr3ads.net - R help - [R] Problem with ddply in the plyr-package: surprising output of a date-column [Apr 2011]

Christoph Jäckel

2011-Apr-25 17:19 UTC

[R] Problem with ddply in the plyr-package: surprising output of a date-column

Hi all,

I have a problem with the plyr package, more precisely with the ddply
function, and would be very grateful for any help. I hope the example
below is precise enough for someone to identify the problem. In this
step I want to identify observations that are identical in terms of
certain identifiers (ID1, ID2, ID3) and save just those observations
in a separate data.frame, without deleting any rows or manipulating
any data. However, I get the warning message below and the column with
dates is messed up. Interestingly, the Value column is handled
correctly (its type is factor here, but converting it with as.integer
makes no difference). Any idea what I am doing wrong?

library(plyr)

df <- data.frame(cbind(ID1=c(1,2,2,3,3,4,4),
                       ID2=c('a','b','b','c','d','e','e'),
                       ID3=c("v1","v1","v1","v1","v2","v1","v1"),
                       Date=c("1985-05-1","1985-05-2","1985-05-3","1985-05-4",
                              "1985-05-5","1985-05-6","1985-05-7"),
                       Value=c(1,2,3,4,5,6,7)))
df[,1] <- as.character(df[,1])
df[,2] <- as.character(df[,2])
df$Date <- strptime(df$Date,"%Y-%m-%d")

# Apparently two groups of observations share the same IDs: ID1=2 and ID1=4
ddply(df,.(ID1,ID2,ID3),nrow)
# I want to save those rows in a separate data.frame, so the desired output is:
df[c(2:3,6:7),]

# My idea: write a custom function that only returns groups with multiple rows.
# It seems to work, except that the Date column doesn't make any sense anymore.
# Warning message: In output[[var]][rng] <- df[[var]]: number of items
# to replace is not a multiple of replacement length
ddply(df,.(ID1,ID2,ID3),function(df) if(nrow(df)<=1){NULL}else{df})

# Notice that it works perfectly if only one group has multiple rows
ddply(df[1:6,],.(ID1,ID2,ID3),function(df) if(nrow(df)<=1){NULL}else{df})

Thanks in advance,

Christoph

Christoph Jäckel (Dipl.-Kfm.)
Research Assistant
Chair for Financial Management and Capital Markets | Lehrstuhls für Finanzmanagement und Kapitalmärkte
TUM School of Management | Technische Universität München
Arcisstr. 21 | D-80333 München | Germany

Brian Diggs

2011-Apr-25 18:05 UTC

Works for me:

 > df[c(2:3,6:7),]
   ID1 ID2 ID3      Date Value
2   2   b  v1 1985-05-2     2
3   2   b  v1 1985-05-3     3
6   4   e  v1 1985-05-6     6
7   4   e  v1 1985-05-7     7
 > ddply(df,.(ID1,ID2,ID3),function(df) if(nrow(df)<=1){NULL}else{df})
   ID1 ID2 ID3      Date Value
1   2   b  v1 1985-05-2     2
2   2   b  v1 1985-05-3     3
3   4   e  v1 1985-05-6     6
4   4   e  v1 1985-05-7     7
 > sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] plyr_1.5.2

loaded via a namespace (and not attached):
[1] tools_2.13.0

A couple of things: there was just an update of plyr to 1.5.2; maybe
that fixes what you are seeing? Also, your df consists of only factors.
cbind-ing the data before turning it into a data.frame makes it a
character matrix, which gets converted to factors.
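
The coercion described above can be checked directly. The following is a minimal sketch, not part of the original post; the names m, bad and good are chosen here for illustration. Whether the columns of bad come out as factor or character depends on the stringsAsFactors default (TRUE before R 4.0.0, FALSE since), but in neither case are they numeric.

```r
# cbind() on vectors of mixed type returns a character matrix
m <- cbind(ID1 = c(1, 2, 2), ID2 = c("a", "b", "b"), Value = c(1, 2, 3))
is.character(m)   # the numbers have been coerced to strings

# Wrapping the matrix in data.frame() yields factor or character
# columns, depending on the stringsAsFactors default of the R version
bad <- data.frame(m)
col_classes_bad <- sapply(bad, class)

# Passing the vectors directly lets each column keep its own type
good <- data.frame(ID1 = c(1, 2, 2), ID2 = c("a", "b", "b"),
                   Value = c(1, 2, 3))
col_classes_good <- sapply(good, class)
```

Comparing col_classes_bad with col_classes_good shows which columns lost their numeric type.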

 > str(df)
'data.frame':   7 obs. of  5 variables:
  $ ID1  : Factor w/ 4 levels "1","2","3","4": 1 2 2 3 3 4 4
  $ ID2  : Factor w/ 5 levels "a","b","c","d",..: 1 2 2 3 4 5 5
  $ ID3  : Factor w/ 2 levels "v1","v2": 1 1 1 1 2 1 1
  $ Date : Factor w/ 7 levels "1985-05-1","1985-05-2",..: 1 2 3 4 5 6 7
  $ Value: Factor w/ 7 levels "1","2","3","4",..: 1 2 3 4 5 6 7

Maybe that has something to do with the odd "dates" since they are not
really dates at all, just string representations of factor levels. 
Compare with:

DF <- data.frame(ID1=c(1,2,2,3,3,4,4),
	ID2=c('a','b','b','c','d','e','e'),
	ID3=c("v1","v1","v1","v1","v2","v1","v1"),
	Date=as.Date(c("1985-05-1","1985-05-2","1985-05-3",
		"1985-05-4","1985-05-5","1985-05-6","1985-05-7")),
	Value=c(1,2,3,4,5,6,7))
str(DF)
#'data.frame':   7 obs. of  5 variables:
# $ ID1  : num  1 2 2 3 3 4 4
# $ ID2  : Factor w/ 5 levels "a","b","c","d",..: 1 2 2 3 4 5 5
# $ ID3  : Factor w/ 2 levels "v1","v2": 1 1 1 1 2 1 1
# $ Date : Date, format: "1985-05-01" "1985-05-02" ...
# $ Value: num  1 2 3 4 5 6 7

This version also works for me.

ddply(DF,.(ID1,ID2,ID3),function(df) if(nrow(df)<=1){NULL}else{df})
#  ID1 ID2 ID3       Date Value
#1   2   b  v1 1985-05-02     2
#2   2   b  v1 1985-05-03     3
#3   4   e  v1 1985-05-06     6
#4   4   e  v1 1985-05-07     7
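
For the underlying task (keeping every row whose ID combination occurs more than once), a base-R alternative that avoids ddply altogether may also be worth considering. This is only a sketch under the same example data, not something from the original exchange; ids, in_dup_group and dups are names made up here.

```r
DF <- data.frame(ID1 = c(1, 2, 2, 3, 3, 4, 4),
                 ID2 = c("a", "b", "b", "c", "d", "e", "e"),
                 ID3 = c("v1", "v1", "v1", "v1", "v2", "v1", "v1"),
                 Date = as.Date(c("1985-05-1", "1985-05-2", "1985-05-3",
                                  "1985-05-4", "1985-05-5", "1985-05-6",
                                  "1985-05-7")),
                 Value = c(1, 2, 3, 4, 5, 6, 7))

ids <- c("ID1", "ID2", "ID3")

# duplicated() flags only the second and later occurrences of a key;
# scanning from both ends flags every member of a repeated group
in_dup_group <- duplicated(DF[ids]) | duplicated(DF[ids], fromLast = TRUE)

# Plain row subsetting, so the Date column keeps its class
dups <- DF[in_dup_group, ]
```

Because the rows are selected by ordinary indexing rather than reassembled group by group, the column classes of DF carry over to dups unchanged.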

-- 
Brian S. Diggs, PhD
Senior Research Associate, Department of Surgery
Oregon Health & Science University

Peter Ehlers

2011-Apr-25 18:11 UTC

I would characterize your problem as:
a) using strptime - this is what gives ddply() fits;
b) not using str() to check whether R agrees with
    you with respect to your data;
c) using cbind() inside data.frame(). This isn't
    wrong, but is rarely (in my experience) useful.

If you use as.Date (or even nothing) on your Date
variable, you'll find that ddply does what you want.
To see why it doesn't work with strptime, check
str(df) and then ?POSIXlt. strptime returns POSIXlt
objects, which are stored as lists of components
rather than as plain vectors, so you've converted
your Date values to lists.
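
The difference in storage can be inspected with unclass(). The snippet below is a small sketch with made-up variable names (x, lt, d, ct); the tz = "UTC" argument is an addition here so that the result does not depend on the local time zone.

```r
x <- c("1985-05-1", "1985-05-2", "1985-05-3")

# strptime() returns POSIXlt: internally a list of components
lt <- strptime(x, "%Y-%m-%d", tz = "UTC")
class(lt)
is.list(unclass(lt))
names(unclass(lt))      # sec, min, hour, mday, mon, year, ...

# Date and POSIXct store one number per element instead
d  <- as.Date(x)
ct <- as.POSIXct(lt)
is.numeric(unclass(d))
is.numeric(unclass(ct))
```

If the time of day matters, converting the strptime result with as.POSIXct() before putting it in the data.frame keeps the column an ordinary vector; for pure dates, as.Date() is enough.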

My comment about cbind() is to warn you that your
Value variable, as you have constructed it, is
a factor.

Peter Ehlers
