Hello Group, I have a huge time series dataset with sample below. I am basically trying to read it into a zoo object with columns 1:6 to index. Zoo issues a warning that some of the rows have duplicated index. dput(z) structure(list(TrdTimestamp = structure(list(sec = c(19, 19, 18, 10, 12, 43, 41, 59, 40, 29), min = c(58L, 57L, 39L, 37L, 4L, 5L, 26L, 45L, 24L, 16L), hour = c(11L, 12L, 10L, 12L, 14L, 14L, 15L, 14L, 11L, 11L), mday = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), mon = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), year = c(109L, 109L, 109L, 109L, 109L, 109L, 109L, 109L, 109L, 109L), wday = c(4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), yday = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), isdst = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("sec", "min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"), class = c("POSIXt", "POSIXlt")), Ticker = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "NIFTY", class = "factor"), InstTyp = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("FUTIDX", "OPTIDX" ), class = "factor"), ExpDt = c(20090129L, 20090129L, 20090129L, 20090129L, 20090129L, 20090129L, 20090129L, 20090129L, 20090129L, 20090129L), OptTyp = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("CE", "FF", "PE"), class = "factor"), Strike = c(2700L, 2700L, 2700L, 2700L, 2700L, 2700L, 2700L, 2700L, 2700L, 2700L), TrdPrice = c(347.4, 340, 334.95, 335.5, 349.95, 353, 380, 378.1, 340.25, 339), TrdQty = c(50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L)), .Names = c("TrdTimestamp", "Ticker", "InstTyp", "ExpDt", "OptTyp", "Strike", "TrdPrice", "TrdQty"), row.names = c(NA, 10L), class = "data.frame") Z should ideally have a unique index based on columns 1:6 ... but looks like some are duplicated. I want to get the count against each unique set .. i.e. z[,1:6] and the number of rows with that combination? I can put the unique index sets into a different object and do a rowcount using apply but was wondering if there is an easier way already? Thank you.
On Tue, Nov 23, 2010 at 11:04 AM, Santosh Srinivas <santosh.srinivas at gmail.com> wrote:> Hello Group, > > I have a huge time series dataset with sample below. I am basically trying > to read it into a zoo object with columns 1:6 to index. Zoo issues a warning > that some of the rows have duplicated index. > > dput(z) > structure(list(TrdTimestamp = structure(list(sec = c(19, 19, > 18, 10, 12, 43, 41, 59, 40, 29), min = c(58L, 57L, 39L, 37L, > 4L, 5L, 26L, 45L, 24L, 16L), hour = c(11L, 12L, 10L, 12L, 14L, > 14L, 15L, 14L, 11L, 11L), mday = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, > 1L, 1L, 1L), mon = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), > ? ?year = c(109L, 109L, 109L, 109L, 109L, 109L, 109L, 109L, > ? ?109L, 109L), wday = c(4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, > ? ?4L), yday = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), isdst = c(0L, > ? ?0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("sec", "min", > "hour", "mday", "mon", "year", "wday", "yday", "isdst"), class = c("POSIXt", > > "POSIXlt")), Ticker = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, > 1L, 1L, 1L), .Label = "NIFTY", class = "factor"), InstTyp = structure(c(2L, > 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("FUTIDX", "OPTIDX" > ), class = "factor"), ExpDt = c(20090129L, 20090129L, 20090129L, > 20090129L, 20090129L, 20090129L, 20090129L, 20090129L, 20090129L, > 20090129L), OptTyp = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, > 1L, 1L, 1L), .Label = c("CE", "FF", "PE"), class = "factor"), > ? ?Strike = c(2700L, 2700L, 2700L, 2700L, 2700L, 2700L, 2700L, > ? ?2700L, 2700L, 2700L), TrdPrice = c(347.4, 340, 334.95, 335.5, > ? ?349.95, 353, 380, 378.1, 340.25, 339), TrdQty = c(50L, 50L, > ? ?50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L)), .Names = c("TrdTimestamp", > "Ticker", "InstTyp", "ExpDt", "OptTyp", "Strike", "TrdPrice", > "TrdQty"), row.names = c(NA, 10L), class = "data.frame") > > > > Z should ideally have a unique index based on columns 1:6 ... but looks like > some are duplicated. > I want to get the count against each unique set .. i.e. z[,1:6] and the > number of rows with that combination? > > I can put the unique index sets into a different object and do a rowcount > using apply but was wondering if there is an easier way already? >Data frames should not hold POSIXlt objects and its very confusing naming the data frame z since that makes one think its a zoo object. Without your code its hard to know what the problem is but perhaps you were reading the series in with Date index in which case all the rows would have the same date. Specify tz = "" to use POSIXct. Also note that you cannot have a mixture of character and numeric variables in a zoo object. A 2d zoo object is a matrix with an index. Assuming that DF is a data frame in long format with multiple stocks identified by the second column we can split it according to the stock like this and there are no duplicates, at least in the data above. library(zoo) DF <- z rm(z) DF2 <- transform(DF, TrdTimestamp = as.POSIXct(TrdTimestamp)) z <- read.zoo(DF2[c(1:2, 4, 6:8)], split = 2, tz = "") -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com
Thanks. Sorry about the naming (just cobbled the example). The rows are duplicate in the actual data. I am looking for a way to find the number of duplicate rows for each unique rowindex combination which is based on columns 1:6. -----Original Message----- From: Gabor Grothendieck [mailto:ggrothendieck at gmail.com] Sent: 23 November 2010 21:54 To: Santosh Srinivas Cc: r-help at r-project.org Subject: Re: [R] Finding the rows with the duplicated index On Tue, Nov 23, 2010 at 11:04 AM, Santosh Srinivas <santosh.srinivas at gmail.com> wrote:> Hello Group, > > I have a huge time series dataset with sample below. I am basically trying > to read it into a zoo object with columns 1:6 to index. Zoo issues awarning> that some of the rows have duplicated index. > > dput(z) > structure(list(TrdTimestamp = structure(list(sec = c(19, 19, > 18, 10, 12, 43, 41, 59, 40, 29), min = c(58L, 57L, 39L, 37L, > 4L, 5L, 26L, 45L, 24L, 16L), hour = c(11L, 12L, 10L, 12L, 14L, > 14L, 15L, 14L, 11L, 11L), mday = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, > 1L, 1L, 1L), mon = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), > ? ?year = c(109L, 109L, 109L, 109L, 109L, 109L, 109L, 109L, > ? ?109L, 109L), wday = c(4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, > ? ?4L), yday = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), isdst = c(0L, > ? ?0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("sec", "min", > "hour", "mday", "mon", "year", "wday", "yday", "isdst"), class c("POSIXt", > > "POSIXlt")), Ticker = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, > 1L, 1L, 1L), .Label = "NIFTY", class = "factor"), InstTyp structure(c(2L, > 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("FUTIDX", "OPTIDX" > ), class = "factor"), ExpDt = c(20090129L, 20090129L, 20090129L, > 20090129L, 20090129L, 20090129L, 20090129L, 20090129L, 20090129L, > 20090129L), OptTyp = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, > 1L, 1L, 1L), .Label = c("CE", "FF", "PE"), class = "factor"), > ? ?Strike = c(2700L, 2700L, 2700L, 2700L, 2700L, 2700L, 2700L, > ? ?2700L, 2700L, 2700L), TrdPrice = c(347.4, 340, 334.95, 335.5, > ? ?349.95, 353, 380, 378.1, 340.25, 339), TrdQty = c(50L, 50L, > ? ?50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L)), .Names = c("TrdTimestamp", > "Ticker", "InstTyp", "ExpDt", "OptTyp", "Strike", "TrdPrice", > "TrdQty"), row.names = c(NA, 10L), class = "data.frame") > > > > Z should ideally have a unique index based on columns 1:6 ... but lookslike> some are duplicated. > I want to get the count against each unique set .. i.e. z[,1:6] and the > number of rows with that combination? > > I can put the unique index sets into a different object and do a rowcount > using apply but was wondering if there is an easier way already? >Data frames should not hold POSIXlt objects and its very confusing naming the data frame z since that makes one think its a zoo object. Without your code its hard to know what the problem is but perhaps you were reading the series in with Date index in which case all the rows would have the same date. Specify tz = "" to use POSIXct. Also note that you cannot have a mixture of character and numeric variables in a zoo object. A 2d zoo object is a matrix with an index. Assuming that DF is a data frame in long format with multiple stocks identified by the second column we can split it according to the stock like this and there are no duplicates, at least in the data above. library(zoo) DF <- z rm(z) DF2 <- transform(DF, TrdTimestamp = as.POSIXct(TrdTimestamp)) z <- read.zoo(DF2[c(1:2, 4, 6:8)], split = 2, tz = "") -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com
On Tue, Nov 23, 2010 at 11:49 AM, Santosh Srinivas <santosh.srinivas at gmail.com> wrote:> Thanks. Sorry about the naming (just cobbled the example). > The rows are duplicate in the actual data. I am looking for a way to find > the number of duplicate rows for each unique rowindex combination which is > based on columns 1:6. > >Try this: Ag <- aggregate(TrdPrice ~., DF2[1:7], length) names(Ag)[7] <- "length" where DF2 was defined previously. Note that when you read in the data into zoo you can aggregate it at the same time using the aggregate argument to read.zoo. For example, note the aggregate argument in the following: z <- read.zoo(DF2[c(1:2, 4, 6:8)], split = 2, tz = "", aggregate function(x) tail(x, 1)) -- Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com