thr3ads.net - R help - [R] Finding the rows with the duplicated index [Nov 2010]

If this information is useful, please help other people find it:
Share via:

Santosh Srinivas

2010-Nov-23 16:04 UTC

[R] Finding the rows with the duplicated index

Hello Group,

I have a huge time series dataset with sample below. I am basically trying
to read it into a zoo object with columns 1:6 to index. Zoo issues a warning
that some of the rows have duplicated index.

dput(z)
structure(list(TrdTimestamp = structure(list(sec = c(19, 19, 
18, 10, 12, 43, 41, 59, 40, 29), min = c(58L, 57L, 39L, 37L, 
4L, 5L, 26L, 45L, 24L, 16L), hour = c(11L, 12L, 10L, 12L, 14L, 
14L, 15L, 14L, 11L, 11L), mday = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L), mon = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), 
    year = c(109L, 109L, 109L, 109L, 109L, 109L, 109L, 109L, 
    109L, 109L), wday = c(4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
    4L), yday = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), isdst = c(0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("sec",
"min",
"hour", "mday", "mon", "year",
"wday", "yday", "isdst"), class =
c("POSIXt",

"POSIXlt")), Ticker = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L), .Label = "NIFTY", class = "factor"), InstTyp =
structure(c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("FUTIDX",
"OPTIDX"
), class = "factor"), ExpDt = c(20090129L, 20090129L, 20090129L, 
20090129L, 20090129L, 20090129L, 20090129L, 20090129L, 20090129L, 
20090129L), OptTyp = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L), .Label = c("CE", "FF", "PE"), class =
"factor"),
    Strike = c(2700L, 2700L, 2700L, 2700L, 2700L, 2700L, 2700L, 
    2700L, 2700L, 2700L), TrdPrice = c(347.4, 340, 334.95, 335.5, 
    349.95, 353, 380, 378.1, 340.25, 339), TrdQty = c(50L, 50L, 
    50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L)), .Names =
c("TrdTimestamp",
"Ticker", "InstTyp", "ExpDt", "OptTyp",
"Strike", "TrdPrice",
"TrdQty"), row.names = c(NA, 10L), class = "data.frame")



Z should ideally have a unique index based on columns 1:6 ... but looks like
some are duplicated.
I want to get the count against each unique set .. i.e. z[,1:6] and the
number of rows with that combination?

I can put the unique index sets into a different object and do a rowcount
using apply but was wondering if there is an easier way already?

Thank you.

Gabor Grothendieck

2010-Nov-23 16:23 UTC

head link

[R] Finding the rows with the duplicated index

On Tue, Nov 23, 2010 at 11:04 AM, Santosh Srinivas
<santosh.srinivas at gmail.com> wrote:> Hello Group,
>
> I have a huge time series dataset with sample below. I am basically trying
> to read it into a zoo object with columns 1:6 to index. Zoo issues a
warning
> that some of the rows have duplicated index.
>
> dput(z)
> structure(list(TrdTimestamp = structure(list(sec = c(19, 19,
> 18, 10, 12, 43, 41, 59, 40, 29), min = c(58L, 57L, 39L, 37L,
> 4L, 5L, 26L, 45L, 24L, 16L), hour = c(11L, 12L, 10L, 12L, 14L,
> 14L, 15L, 14L, 11L, 11L), mday = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
> 1L, 1L, 1L), mon = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
> ? ?year = c(109L, 109L, 109L, 109L, 109L, 109L, 109L, 109L,
> ? ?109L, 109L), wday = c(4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
> ? ?4L), yday = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), isdst = c(0L,
> ? ?0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("sec",
"min",
> "hour", "mday", "mon", "year",
"wday", "yday", "isdst"), class =
c("POSIXt",
>
> "POSIXlt")), Ticker = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
> 1L, 1L, 1L), .Label = "NIFTY", class = "factor"),
InstTyp = structure(c(2L,
> 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("FUTIDX",
"OPTIDX"
> ), class = "factor"), ExpDt = c(20090129L, 20090129L, 20090129L,
> 20090129L, 20090129L, 20090129L, 20090129L, 20090129L, 20090129L,
> 20090129L), OptTyp = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
> 1L, 1L, 1L), .Label = c("CE", "FF", "PE"),
class = "factor"),
> ? ?Strike = c(2700L, 2700L, 2700L, 2700L, 2700L, 2700L, 2700L,
> ? ?2700L, 2700L, 2700L), TrdPrice = c(347.4, 340, 334.95, 335.5,
> ? ?349.95, 353, 380, 378.1, 340.25, 339), TrdQty = c(50L, 50L,
> ? ?50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L)), .Names =
c("TrdTimestamp",
> "Ticker", "InstTyp", "ExpDt",
"OptTyp", "Strike", "TrdPrice",
> "TrdQty"), row.names = c(NA, 10L), class =
"data.frame")
>
>
>
> Z should ideally have a unique index based on columns 1:6 ... but looks
like
> some are duplicated.
> I want to get the count against each unique set .. i.e. z[,1:6] and the
> number of rows with that combination?
>
> I can put the unique index sets into a different object and do a rowcount
> using apply but was wondering if there is an easier way already?
>
Data frames should not hold POSIXlt objects and its very confusing
naming the data frame z since that makes one think its a zoo object.
Without your code its hard to know what the problem is but perhaps you
were reading the series in with Date index in which case all the rows
would have the same date.  Specify tz = "" to use POSIXct.  Also note
that you cannot have a mixture of character and numeric variables in a
zoo object.  A 2d zoo object is a matrix with an index.  Assuming that
DF is a data frame in long format with multiple stocks identified by
the second column we can split it according to the stock like this and
there are no duplicates, at least in the data above.

library(zoo)
DF <- z
rm(z)
DF2 <- transform(DF, TrdTimestamp = as.POSIXct(TrdTimestamp))
z <- read.zoo(DF2[c(1:2, 4, 6:8)], split = 2, tz = "")

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

Santosh Srinivas

2010-Nov-23 16:49 UTC

head link

[R] Finding the rows with the duplicated index

Thanks. Sorry about the naming (just cobbled the example).
The rows are duplicate in the actual data. I am looking for a way to find
the number of duplicate rows for each unique rowindex combination which is
based on columns 1:6.


-----Original Message-----
From: Gabor Grothendieck [mailto:ggrothendieck at gmail.com] 
Sent: 23 November 2010 21:54
To: Santosh Srinivas
Cc: r-help at r-project.org
Subject: Re: [R] Finding the rows with the duplicated index

On Tue, Nov 23, 2010 at 11:04 AM, Santosh Srinivas
<santosh.srinivas at gmail.com> wrote:> Hello Group,
>
> I have a huge time series dataset with sample below. I am basically trying
> to read it into a zoo object with columns 1:6 to index. Zoo issues a
warning> that some of the rows have duplicated index.
>
> dput(z)
> structure(list(TrdTimestamp = structure(list(sec = c(19, 19,
> 18, 10, 12, 43, 41, 59, 40, 29), min = c(58L, 57L, 39L, 37L,
> 4L, 5L, 26L, 45L, 24L, 16L), hour = c(11L, 12L, 10L, 12L, 14L,
> 14L, 15L, 14L, 11L, 11L), mday = c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
> 1L, 1L, 1L), mon = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
> ? ?year = c(109L, 109L, 109L, 109L, 109L, 109L, 109L, 109L,
> ? ?109L, 109L), wday = c(4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
> ? ?4L), yday = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), isdst = c(0L,
> ? ?0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("sec",
"min",
> "hour", "mday", "mon", "year",
"wday", "yday", "isdst"), class
c("POSIXt",
>
> "POSIXlt")), Ticker = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
> 1L, 1L, 1L), .Label = "NIFTY", class = "factor"),
InstTyp structure(c(2L,
> 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("FUTIDX",
"OPTIDX"
> ), class = "factor"), ExpDt = c(20090129L, 20090129L, 20090129L,
> 20090129L, 20090129L, 20090129L, 20090129L, 20090129L, 20090129L,
> 20090129L), OptTyp = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
> 1L, 1L, 1L), .Label = c("CE", "FF", "PE"),
class = "factor"),
> ? ?Strike = c(2700L, 2700L, 2700L, 2700L, 2700L, 2700L, 2700L,
> ? ?2700L, 2700L, 2700L), TrdPrice = c(347.4, 340, 334.95, 335.5,
> ? ?349.95, 353, 380, 378.1, 340.25, 339), TrdQty = c(50L, 50L,
> ? ?50L, 50L, 50L, 50L, 50L, 50L, 50L, 50L)), .Names =
c("TrdTimestamp",
> "Ticker", "InstTyp", "ExpDt",
"OptTyp", "Strike", "TrdPrice",
> "TrdQty"), row.names = c(NA, 10L), class =
"data.frame")
>
>
>
> Z should ideally have a unique index based on columns 1:6 ... but looks
like> some are duplicated.
> I want to get the count against each unique set .. i.e. z[,1:6] and the
> number of rows with that combination?
>
> I can put the unique index sets into a different object and do a rowcount
> using apply but was wondering if there is an easier way already?
>
Data frames should not hold POSIXlt objects and its very confusing
naming the data frame z since that makes one think its a zoo object.
Without your code its hard to know what the problem is but perhaps you
were reading the series in with Date index in which case all the rows
would have the same date.  Specify tz = "" to use POSIXct.  Also note
that you cannot have a mixture of character and numeric variables in a
zoo object.  A 2d zoo object is a matrix with an index.  Assuming that
DF is a data frame in long format with multiple stocks identified by
the second column we can split it according to the stock like this and
there are no duplicates, at least in the data above.

library(zoo)
DF <- z
rm(z)
DF2 <- transform(DF, TrdTimestamp = as.POSIXct(TrdTimestamp))
z <- read.zoo(DF2[c(1:2, 4, 6:8)], split = 2, tz = "")

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

Gabor Grothendieck

2010-Nov-23 17:04 UTC

head link

[R] Finding the rows with the duplicated index

On Tue, Nov 23, 2010 at 11:49 AM, Santosh Srinivas
<santosh.srinivas at gmail.com> wrote:> Thanks. Sorry about the naming (just cobbled the example).
> The rows are duplicate in the actual data. I am looking for a way to find
> the number of duplicate rows for each unique rowindex combination which is
> based on columns 1:6.
>
>
Try this:

   Ag <- aggregate(TrdPrice ~., DF2[1:7], length)
   names(Ag)[7] <- "length"

where DF2 was defined previously.

Note that when you read in the data into zoo you can aggregate it at
the same time using the aggregate argument to read.zoo.  For
example, note the aggregate argument in the following:

   z <- read.zoo(DF2[c(1:2, 4, 6:8)], split = 2, tz = "", aggregate
function(x) tail(x, 1))

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

R help - Nov 2010 - Finding the rows with the duplicated index

[R] Finding the rows with the duplicated index

[R] Finding the rows with the duplicated index

[R] Finding the rows with the duplicated index

[R] Finding the rows with the duplicated index