Stratos Laskarides
2010-Oct-21 20:20 UTC
[R] Limitations and scale of R, and performance issues if and when limit reached
Hi there

Thank you for everyone's help with all my previous questions.

By way of intro, I am a masters student in actuarial science at the
University of Cape Town, and I am doing a project in R on some healthcare
cost data. For clarity, before I embark on further research, may I please
ask the following.

I want to take the direction of modelling health insurance claims data
with Tweedie compound Poisson models for over 2 million beneficiaries.
I'd also like to work in a double GLM framework so that the dispersion
parameter captures as much variance as possible. In addition, I'd like
these results to feed into a stochastic model application, which will
form part of a Dynamic Financial Analysis model of a health insurer.

My question, in light of the above broad overview: how large must data
sets be before R runs into performance problems? In other words, what
"scale" can R handle?

Thanks ever so much once again.

Kind regards
Stratos

On Tue, Oct 12, 2010 at 11:31 AM, Dennis Murphy <djmuser@gmail.com> wrote:
> Hi:
>
> On Tue, Oct 12, 2010 at 12:51 AM, Stratos Laskarides
> <stratlask@gmail.com> wrote:
>> Dear Madam/Sir
>>
>> This may be quite a long shot...
>>
>> By way of intro, I am a masters student in actuarial science at the
>> University of Cape Town, and I am doing a project in R on some
>> healthcare cost data. During my coding in R I encountered an error
>> message, which I then googled, but I am still unable to resolve the
>> issue.
>>
>> May I please ask whether it is possible to resolve the problem raised
>> by the error message
>>
>> "Error: NA/NaN/Inf in foreign function call (arg 1)
>> In addition: Warning message: step size truncated due to divergence"
>>
>> in R, and if so, how?
>
> That error message can arise if division by zero occurs somewhere in
> the computation. Try using ftable() or some related function that will
> print out your complete table (4-way?) and check whether you have zero
> frequency in one or more cells. If there are zero frequencies, that
> does not necessarily explain the problem, but it's a reasonable initial
> hypothesis. Merging some categories to get enough frequencies per cell
> may be useful if you do have zero frequencies; then try the fit again
> to see whether you get more sensible results.
>
> When the error is thrown, it can be useful to run
>
> traceback()
>
> as it recalls the sequence of function calls that led up to the error,
> but it helps to have enough R experience to make heads or tails of the
> output :)
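A minimal sketch of that cell-count check, assuming a data frame claims
with factors AgeBand, Gender and Region (the names are illustrative, not
from the original post):

    # Cross-tabulate the predictors and look for empty cells
    tab <- xtabs(~ AgeBand + Gender + Region, data = claims)
    ftable(tab)                       # compact printout of the full table
    sum(tab == 0)                     # how many combinations are empty?
    which(tab == 0, arr.ind = TRUE)   # and which ones they are

    # Immediately after the error, inspect the call stack that led to it
    traceback()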
>> As for some background on my specific data and research problem at
>> hand, I am fitting a gamma regression model to 13 000 lines of
>> insurance claims data, which will be regressed against categorical
>> variables such as Age Band, Gender, and Region.
>
> The more variables you have in the model, the greater the number of
> cell combinations. A 15 x 2 x 5 crossing of your three variables, for
> example, would generate 150 combinations, and it's entirely possible
> for a few of those combinations to have small or zero frequencies. In
> addition, adding a new variable to the model would at least double the
> number of cells, spreading the data out even more thinly.
>
>> Perhaps my problem arises because the data set is too large and the
>> iteratively reweighted least squares algorithm therefore cannot
>> converge, in which case I perhaps need another GLM type. Or maybe the
>> categorical explanatory variables can take on too many values (e.g.
>> there are 15 Age Bands and 5 Regions).
>
> If your response is continuous and positive valued with a right-skewed
> distribution, then a Gamma model would appear to be sensible.
>
> The data set is not too large; successful GLMs have been fit with much
> larger data sets. Your second hypothesis sounds more plausible, though.
>
> HTH,
> Dennis
>
>> Any insights you could provide would be much appreciated.
>>
>> Thank you ever so much.
>>
>> Kind regards
>> Stratos Laskarides
>> South Africa
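To make both of Dennis's suggestions concrete, here is a minimal sketch,
assuming the same illustrative claims data frame with a positive cost
column (the log link is a common choice for claims severity, not
something stated in the thread):

    # Fit the Gamma GLM
    fit <- glm(cost ~ AgeBand + Gender + Region,
               family = Gamma(link = "log"), data = claims)

    # If some age bands are sparse, merge adjacent levels and refit:
    # assigning one name to two levels collapses them into a single level
    claims$AgeBand2 <- claims$AgeBand
    levels(claims$AgeBand2)[14:15] <- "14-15"
    fit2 <- update(fit, . ~ . - AgeBand + AgeBand2)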
Uwe Ligges
2010-Oct-23 15:42 UTC
[R] Limitations and scale of R, and performance issues if and when limit reached
On 21.10.2010 22:20, Stratos Laskarides wrote:
> [...]
>
> My question is, in light of the above broad overview, how large must
> data sets be before R faces any performance problems or issues? In
> other words what "scale" can R handle?

Depends on the available memory, the kind of data and the methods you
are going to apply.

Uwe Ligges
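For a rough sense of the memory side of that answer, a back-of-envelope
sketch (the number of columns is an assumption):

    # A data frame costs about 8 bytes per numeric cell
    n <- 2e6; k <- 10            # k = 10 columns is a made-up figure
    n * k * 8 / 2^20             # ~153 MB for the raw data alone

    # glm() additionally builds a model matrix (15 + 2 + 5 factor levels
    # expand to roughly 20 dummy columns) and keeps several n-length
    # vectors, so the working set is a multiple of the raw data size
    x <- data.frame(y = rgamma(1e5, 2), g = gl(15, 1, 1e5))
    print(object.size(x), units = "Mb")   # measure an object directly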
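Finally, a sketch of how the modelling plan in the question (a Tweedie
compound Poisson mean model plus a double GLM for the dispersion) might
be prototyped, assuming the CRAN packages statmod, tweedie and dglm; the
data and formulas are illustrative, not the poster's:

    library(statmod)   # tweedie() family object for glm()
    library(tweedie)   # tweedie.profile() to estimate the index p
    library(dglm)      # joint mean and dispersion (double GLM) fits

    # Toy claims data standing in for the real beneficiary file
    set.seed(1)
    claims <- data.frame(
      cost    = rgamma(1e4, shape = 2, rate = 0.01) * rbinom(1e4, 1, 0.7),
      AgeBand = factor(sample(1:15, 1e4, replace = TRUE)),
      Region  = factor(sample(1:5,  1e4, replace = TRUE))
    )

    # 1 < p < 2 gives the compound Poisson case (point mass at zero plus
    # a continuous positive part); profile the likelihood to choose p
    prof <- tweedie.profile(cost ~ AgeBand + Region, data = claims,
                            p.vec = seq(1.2, 1.8, by = 0.1),
                            do.plot = FALSE)

    # Mean model with a log link at the estimated p
    fit <- glm(cost ~ AgeBand + Region, data = claims,
               family = tweedie(var.power = prof$p.max, link.power = 0))

    # Double GLM: model the dispersion as a function of covariates too
    dfit <- dglm(cost ~ AgeBand + Region,
                 dformula = ~ AgeBand + Region,
                 family = tweedie(var.power = prof$p.max, link.power = 0),
                 data = claims)
    summary(dfit)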