thr3ads.net - R help - [R] Large dataset importing, columns merging and splitting [Jan 2010]

If this information is useful, please help other people find it:
Share via:

Manta

2010-Jan-26 15:48 UTC

[R] Large dataset importing, columns merging and splitting

Dear All,
I have a large data set that looks like this:

CVX 20070201 9 30 51 73.25 81400 0
CVX 20070201 9 30 51 73.25 100 0
CVX 20070201 9 30 51 73.25 100 0
CVX 20070201 9 30 51 73.25 300 0

First, I would like to import it by merging column 3 4 and 5, since that is
the timestamp. Then, I would like to aggregate the data by splitting them in
bins of 5 minutes size, therefore from 93000 up to 93459 etc, givin as
output the average price and volume in the 5 minutes bin.

Hope this helps,
Best,

Marco
-- 
View this message in context:
http://n4.nabble.com/Large-dataset-importing-columns-merging-and-splitting-tp1294668p1294668.html
Sent from the R help mailing list archive at Nabble.com.

Gabor Grothendieck

2010-Jan-26 16:23 UTC

head link

[R] Large dataset importing, columns merging and splitting

Try this using the development version of read.zoo in zoo (which we
source from the R-Forge on the fly).

We use "NULL" in colClasses for those columns we don't need but in
col.names we still have to include dummy names for
them.  Of what is left the index is the first three columns (1:3)
which we convert to chron class times in FUN and then truncate to 5
seconds in FUN2.  Finally we use aggregate = mean to average over the
5 second intervals.

Lines <- "CVX 20070201 9 30 51 73.25 81400 0
CVX 20070201 9 30 51 73.25 100 0
CVX 20070201 9 30 51 73.25 100 0
CVX 20070201 9 30 51 73.25 300 0
CVX 20070201 9 30 51 73.25 81400 0
CVX 20070201 9 40 51 73.25 100 0
CVX 20070201 9 40 52 73.25 100 0
CVX 20070201 9 40 53 73.25 300 0"


library(zoo)
source("http://r-forge.r-project.org/plugins/scmsvn/viewcvs.php/*checkout*/pkg/zoo/R/read.zoo.R?rev=611&root=zoo")
library(chron)

z <- read.zoo(textConnection(Lines),
	colClasses = c("NULL", "NULL", "numeric",
"numeric", "numeric", "numeric",
		"numeric", "NULL"),
	col.names = c("V1", "V2", "V3", "V4",
"V5", "Price", "Volume", "V8"),
	index = 1:3,
	FUN = function(tt) times(paste(tt[,1], tt[,2], tt[,3], sep = ":")),
	FUN2 = function(tt) trunc(tt, "00:00:05"),
	aggregate = mean)

The result of running the above is:
> z         Price     Volume
09:30:50 73.25 32660.0000
09:40:50 73.25   166.6667

On Tue, Jan 26, 2010 at 10:48 AM, Manta <mantino84 at libero.it>
wrote:>
> Dear All,
> I have a large data set that looks like this:
>
> CVX 20070201 9 30 51 73.25 81400 0
> CVX 20070201 9 30 51 73.25 100 0
> CVX 20070201 9 30 51 73.25 100 0
> CVX 20070201 9 30 51 73.25 300 0
>
> First, I would like to import it by merging column 3 4 and 5, since that is
> the timestamp. Then, I would like to aggregate the data by splitting them
in
> bins of 5 minutes size, therefore from 93000 up to 93459 etc, givin as
> output the average price and volume in the 5 minutes bin.
>
> Hope this helps,
> Best,
>
> Marco
> --
> View this message in context:
http://n4.nabble.com/Large-dataset-importing-columns-merging-and-splitting-tp1294668p1294668.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

Tom Short

2010-Jan-26 18:55 UTC

head link

[R] Large dataset importing, columns merging and splitting

If you need more aggregations on the stock (I assume that's what the
first column is), I'd use the data.table package. It allows fast
indexing and merge operations. That's handy if you have other features
of a stock (like company size or industry sector) that you'd like to
include in the aggregation. Like Gabor, I'd probably use chron for
keeping track of the dates.

Here's some code to get you started:

Lines <- "CVX 20070201 9 30 51 73.25 81400 0
CVX 20070201 9 30 51 73.25 100 0
CVX 20070201 9 30 51 73.25 100 0
CVX 20070201 9 30 51 73.25 300 0
CVX 20070201 9 30 51 73.25 81400 0
CVX 20070201 9 40 51 74.25 100 0
CVX 20070201 9 40 52 74.25 100 0
CVX 20070201 9 40 53 74.25 300 0
CVX 20070301 9 30 51 74.25 100 0
CVX 20070301 9 30 51 74.25 100 0
CVX 20070301 9 30 51 74.25 300 0
CVX 20070301 9 30 51 74.25 81400 0
CVX 20070301 9 40 51 74.25 100 0
CVX 20070301 9 40 52 74.25 100 0
CVX 20070301 9 40 53 74.25 300 0
DVX 20070201 9 30 51 73.25 81400 0
DVX 20070201 9 30 51 73.25 100 0
DVX 20070201 9 30 51 73.25 100 0
DVX 20070201 9 30 51 73.25 300 0
DVX 20070201 9 30 51 73.25 81400 0
DVX 20070201 9 40 51 74.25 100 0
DVX 20070201 9 40 52 74.25 100 0
DVX 20070201 9 40 53 74.25 300 0
DVX 20070301 9 30 51 74.25 100 0
DVX 20070301 9 30 51 74.25 100 0
DVX 20070301 9 30 51 74.25 300 0
DVX 20070301 9 30 51 74.25 81400 0
DVX 20070301 9 40 51 74.25 100 0
DVX 20070301 9 40 52 74.25 100 0
DVX 20070301 9 40 53 74.25 300 0"


library(data.table)
library(chron)
dt <- data.table(read.table(textConnection(Lines),
? ? ? ? ? ? ? ? ? ? ? ? ? ?colClasses = c("character",
"numeric",
"numeric", "numeric", "numeric",
"numeric",
? ? ? ? ? ? ? ? ? ? ? ? ? ?"numeric", "numeric"),
? ? ? ? ? ? ? ? ? ? ? ? ? ?col.names = c("stock", "date",
"h", "m",
"s", "Price", "Volume", "xx")))
dt$date <- as.chron(as.Date(as.character(dt$date), format =
"%Y%m%d"))
+ dt$h/24 + dt$m/(60*24) + dt$s/(60*60*24)
dt$roundeddate <- as.integer(floor(as.numeric(dt$date) * (24 * 12))) #
data.table likes integers

dt[,list(meanprice = mean(Price), volume = sum(Volume)), by =
"roundeddate"]
dt[,list(meanprice = mean(Price), volume = sum(Volume)), by
"stock,roundeddate"]

You'd still probably want to turn the roundeddate back into a real
chron object. If you use aggregation a lot, the development version of
data.table has faster aggregations:
http://r-forge.r-project.org/projects/datatable/

- Tom

On Tue, Jan 26, 2010 at 11:23 AM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:> Try this using the development version of read.zoo in zoo (which we
> source from the R-Forge on the fly).
>
> We use "NULL" in colClasses for those columns we don't need
but in
> col.names we still have to include dummy names for
> them. ?Of what is left the index is the first three columns (1:3)
> which we convert to chron class times in FUN and then truncate to 5
> seconds in FUN2. ?Finally we use aggregate = mean to average over the
> 5 second intervals.
>
> Lines <- "CVX 20070201 9 30 51 73.25 81400 0
> CVX 20070201 9 30 51 73.25 100 0
> CVX 20070201 9 30 51 73.25 100 0
> CVX 20070201 9 30 51 73.25 300 0
> CVX 20070201 9 30 51 73.25 81400 0
> CVX 20070201 9 40 51 73.25 100 0
> CVX 20070201 9 40 52 73.25 100 0
> CVX 20070201 9 40 53 73.25 300 0"
>
>
> library(zoo)
>
source("http://r-forge.r-project.org/plugins/scmsvn/viewcvs.php/*checkout*/pkg/zoo/R/read.zoo.R?rev=611&root=zoo")
> library(chron)
>
> z <- read.zoo(textConnection(Lines),
> ? ? ? ?colClasses = c("NULL", "NULL",
"numeric", "numeric", "numeric",
"numeric",
> ? ? ? ? ? ? ? ?"numeric", "NULL"),
> ? ? ? ?col.names = c("V1", "V2", "V3",
"V4", "V5", "Price", "Volume",
"V8"),
> ? ? ? ?index = 1:3,
> ? ? ? ?FUN = function(tt) times(paste(tt[,1], tt[,2], tt[,3], sep =
":")),
> ? ? ? ?FUN2 = function(tt) trunc(tt, "00:00:05"),
> ? ? ? ?aggregate = mean)
>
> The result of running the above is:
>
>> z
> ? ? ? ? Price ? ? Volume
> 09:30:50 73.25 32660.0000
> 09:40:50 73.25 ? 166.6667
>
> On Tue, Jan 26, 2010 at 10:48 AM, Manta <mantino84 at libero.it>
wrote:
>>
>> Dear All,
>> I have a large data set that looks like this:
>>
>> CVX 20070201 9 30 51 73.25 81400 0
>> CVX 20070201 9 30 51 73.25 100 0
>> CVX 20070201 9 30 51 73.25 100 0
>> CVX 20070201 9 30 51 73.25 300 0
>>
>> First, I would like to import it by merging column 3 4 and 5, since
that is
>> the timestamp. Then, I would like to aggregate the data by splitting
them in
>> bins of 5 minutes size, therefore from 93000 up to 93459 etc, givin as
>> output the average price and volume in the 5 minutes bin.
>>
>> Hope this helps,
>> Best,
>>
>> Marco
>> --
>> View this message in context:
http://n4.nabble.com/Large-dataset-importing-columns-merging-and-splitting-tp1294668p1294668.html
>> Sent from the R help mailing list archive at Nabble.com.
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

Maybe Matching Threads

Search for more reasonably related threads

R help - Jan 2010 - Large dataset importing, columns merging and splitting

[R] Large dataset importing, columns merging and splitting

[R] Large dataset importing, columns merging and splitting

[R] Large dataset importing, columns merging and splitting

Maybe Matching Threads