Ted Byers
2010-Jul-08 18:23 UTC
[R] Query about using timestamps returned by SQL as 'factor' for split
I have a simple query as follows:

"SELECT m_id, sale_date, YEAR(sale_date), WEEK(sale_date), return_type, DATEDIFF(return_date, sale_date) AS elapsed_time FROM risk_input"

I can get, and view, all the data that query returns. The question is: sale_date is a timestamp, and I need to call split to group this data by m_id and the week in which the sale occurred. Obviously, I would normally need both YEAR and WEEK so that data from April this year is not combined with data from April last year (the system is non-autonomous). And then I need to use lapply to apply fitdistr to each subsample.

Obviously, I can handle all this data in either a data.frame or a data.table.

There are two aspects to the question.

1) Is there a function (or package) that will let me group (or regroup) time-series data into the week in which the data apply, properly taking into account the year that applies, in a single call passing sale_date as the argument? If so, I can reduce the amount of data I draw from my MySQL server and the computational load it bears.

2) The example provided for split splits according to only a single variable (g <- airquality$Month; l <- split(airquality, g)). How would that example be changed if there were two or more columns in the data.frame needed to define the groups? I.e., in my example, I'd need to group by m_id plus the year and week values that can be computed from sale_date.

Thanks

Ted
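[For reference, a minimal base-R sketch of both questions. The data below are invented (the column names mirror Ted's query); the year-week format string and split() call are standard base R.]

```r
## Invented example data mirroring the columns in Ted's query.
risk_input <- data.frame(
  m_id         = c(1, 1, 2, 2),
  sale_date    = as.Date(c("2009-04-06", "2010-04-05",
                           "2010-04-05", "2010-04-07")),
  elapsed_time = c(10, 12, 9, 15)
)

## Q1: collapse a date to "year-week" in a single call. Including %Y keeps
## weeks from different years distinct (April 2009 vs April 2010).
yw <- format(risk_input$sale_date, "%Y-%W")

## Q2: split() accepts a list of factors and groups by every combination;
## drop = TRUE discards combinations that do not occur.
groups <- split(risk_input, list(risk_input$m_id, yw), drop = TRUE)

## Each element of 'groups' can then be passed to fitdistr() via lapply().
sapply(groups, nrow)
```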
Matthew Dowle
2010-Jul-09 16:54 UTC
[R] Query about using timestamps returned by SQL as 'factor' for split
Hi Ted,

Well, since you mentioned data.table (!) ...

If risk_input is a data.table consisting of 3 columns (m_id, sale_date, return_date), where the dates are of class IDate (recently added to data.table by Tom), then try:

risk_input[, fitdistr(return_date - sale_date, "normal"), by = list(m_id, year(sale_date), week(sale_date))]

Notice that 'by' can contain expressions of columns, which lets you group by more than one expression. You don't have to repeat the 'group by' expressions in the select, as you would in SQL. data.table returns those group columns automatically in the result, alongside the result of the j expression applied to each group.

If you need to aggregate by m_id, year and month rather than week, another way is:

risk_input[, fitdistr(return_date - sale_date, "normal"), by = list(m_id, round(sale_date, "month"))]

plyr and sqldf can do this task too, by the way, and I'd highly recommend you take a look at those packages. There are also many excellent date-time classes around which you could also consider.

The reason we need IDate in data.table is that data.table uses radix sorting (see ?sort.list), which is ultra fast for integers. Again, radix is something Tom added to data.table. The radix algorithm (see Wikipedia) is specifically designed to sort integers only. We would use Date, but that is stored as numeric; IDate is the same as Date but stored as integer.

HTH,
Matthew

"Ted Byers" <r.ted.byers at gmail.com> wrote in message news:AANLkTincHf3TFZKNDcWolRWSXEKgpfpJeS3F8M5tqLm3 at mail.gmail.com...
> [quoted message trimmed; see Ted's original post above]