Ted Byers
2010-Jun-30 21:34 UTC
[R] I need guidance on better data management in preparation for time series analysis
OK, I have managed the basics: getting data from my DB and passing it as a whole to something like fitdistr, etc. I know I can implement most of what I need with a brute-force algorithm built on a series of nested loops. I also know I can handle some of this logic by brute force using a blend of Perl and R, with considerable file I/O. But some of what I need calls for a smarter/faster way.

To understand what I am after, consider the following. I have transaction data consisting of sales and refunds, each of which has a timestamp. Each refund record has a timestamp for when the refund was issued and an "original transaction ID" identifying the sale it refunds. I have massaged this data in my schema so that there is a table with one record per refund, and this record includes, among other things, the timestamps of both the original sale and the refund. I can construct a SQL query to get these along with the elapsed time (in days, as a real number) between the sale and the refund.

For some merchants, I have such data going back years. I know, from the amount of data I have examined, that the rate at which sales result in refunds changes through time, though I have not run tests to determine whether the changes I see are significant. In most cases, I can break the data for a merchant into weekly subsamples. Obviously, I can construct loops that iterate over merchant ID and year/week (or day), covering the entire period for which I have data for a given merchant. What I am asking is, "Is there a smarter way?" I can't load all the data at once, as there are many GB of it, but the data for an individual merchant varies from a few hundred kB to a few dozen MB. Thus, I expect an outer loop iterating over merchant ID will be inevitable.
But is there a smarter way to apply fitdistr (or a similar function) to samples representing the sales in each week of each year (or each day of the year when there is sufficient data), and then test whether the parameter of the exponential distribution that best fits the data varies significantly through time? (There are both theoretical and empirical reasons to expect an exponential distribution, but the specific distribution doesn't really matter for the purpose of this question.) That is one question I need to deal with. Is there a simple way to specify a function, a dataset, and a rule for determining all the subsamples, and then tell R to apply the function to each subsample and say whether or not the estimated parameters for the subsamples are significantly different? Or do I have to resort to the simple brute-force approach of a set of nested loops to get what I need?

The other question I have at present is more of a statistical one. Integrating an exponential pdf over a given time period is simple enough, but I need to learn how confidence intervals for that integral can be computed when you have the estimate and standard deviation of the parameter of the exponential distribution from something like fitdistr. This gets to how to obtain confidence intervals when dealing with integrals of functions of uncertain numbers. Not only is there a confidence interval for the parameter of the exponential distribution, but to estimate how many refunds to expect next week, one needs the confidence interval of the integral of the pdf over the next week for a given sample, and one also needs to integrate this over all the samples that could produce a refund in the coming week.

I'd appreciate any information anyone can provide, even if it consists only of a URL pointing to a resource that deals with the specific questions I have.
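To make the two questions above concrete, here is a minimal sketch in R of the kind of thing being asked for. It is only an illustration under stated assumptions: the data are simulated, the column names `sale_week` and `elapsed_days` and the helper `refund_prob_ci` are hypothetical, the likelihood-ratio test is just one possible way to compare weekly rates, and the interval is a first-order (delta-method) approximation rather than an exact result.

```r
library(MASS)  # provides fitdistr

# Simulated stand-in for one merchant's refund records.
# Column names are hypothetical; real data would come from the SQL query.
set.seed(1)
n <- 600
refunds <- data.frame(
  sale_week    = sample(sprintf("2010-%02d", 1:6), n, replace = TRUE),
  elapsed_days = rexp(n, rate = 0.1)
)

# Rule for the subsamples: one group of elapsed times per sale week.
by_week <- split(refunds$elapsed_days, refunds$sale_week)

# Apply the same fit to every subsample, with no explicit loop.
fits <- lapply(by_week, fitdistr, densfun = "exponential")

# Collect the per-week estimates and their standard errors.
weekly <- data.frame(
  week = names(fits),
  n    = sapply(fits, function(f) f$n),
  rate = sapply(fits, function(f) unname(f$estimate)),
  se   = sapply(fits, function(f) unname(f$sd))
)

# One possible test of "does the rate vary by week?":
# likelihood-ratio test of a single pooled rate against one rate per week.
pooled  <- fitdistr(refunds$elapsed_days, "exponential")
lr_stat <- 2 * (sum(sapply(fits, function(f) f$loglik)) - pooled$loglik)
df      <- length(fits) - 1
p_value <- pchisq(lr_stat, df = df, lower.tail = FALSE)

# Hypothetical helper: approximate confidence interval for the probability
# that a refund arrives between a and b days after the sale, i.e. the
# integral of the exponential pdf over [a, b], using the delta method.
refund_prob_ci <- function(rate, se, a, b, level = 0.95) {
  p    <- exp(-rate * a) - exp(-rate * b)            # integral of the pdf
  grad <- -a * exp(-rate * a) + b * exp(-rate * b)   # d p / d rate
  z    <- qnorm(1 - (1 - level) / 2)
  half <- z * abs(grad) * se                         # first-order std. error
  c(estimate = p, lower = max(0, p - half), upper = min(1, p + half))
}

# Example: chance that a sale made a week ago is refunded in the coming week.
next_week <- refund_prob_ci(unname(pooled$estimate), unname(pooled$sd),
                            a = 7, b = 14)
```

In this sketch `split` plus `lapply` replaces the inner year/week loops, so only the outer loop over merchant ID would remain. Multiplying an interval like `next_week` by the number of sales in the corresponding past week would give a rough expected refund count from that week, though combining the uncertainty across several weeks properly is a separate question.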
I am afraid all the resources I have found by searching so far have been at a more introductory level: simply making a connection to a DB and then submitting a SQL statement to it. Something in between that level and the maze of documentation for the plethora of relevant packages is needed here (there is such an embarrassment of riches that I find myself getting confused as to how to proceed).

Thanks

Ted