Hello!

(I don't know if I can raise this query on this forum; I had already raised it on the finance forum but did not receive any suggestion, so I am now raising it on this list. Sorry for the same. The query is about what to do if no statistical distribution fits the data.)

I work in risk management and deal with operational risk. Under the Basel II guidelines, we need to arrive at the capital charge that banks must set aside to cover operational risk losses, should they occur. As part of the Loss Distribution Approach (LDA), we collate past loss events and use these loss amounts. The usual process, as practised in the industry, is as follows:

Using these historical loss amounts and various statistical tests and plots (the KS test, the AD test, P-P plots, Q-Q plots, etc.), we try to identify the continuous statistical distribution that best fits the historical loss data. Then, using the parameters estimated for that distribution, we simulate, say, 1 million loss amounts, and by taking an appropriate percentile (say 99.9%) we arrive at the capital charge.

However, many a time the loss data is such that fitting a distribution is simply not possible; perhaps the data is multimodal or has significant variability. Can someone guide me on how to deal with such data, and on what can be done in R to simulate losses from this historical loss data?

My data is as follows:

mydat <- c(829.53,4000,6000,1000,1063904,102400,22000,4000,4200,2000,10000,400,
459006,7276,4000,100,4000,10000,613803.36,825,1000,5000,4000,3000,84500,200,
2000,68000,97400,6267.8,49500,27000,2100,10489.92,2200,2000,2000,1000,1900,
6000,5600,100,4000,14300,100,94100,1200,7000,2000,3000,1100,6900,1000,18500,
6000,2000,4000,8400,11200,1000,15100,23300,4000,13100,4500,200,2000,50000,
3900,3200,2000,2000,67000,2000,500,2000,1000,1900,10400,1900,2000,3200,6500,
10000,2900,1000,14300,1000,2700,1500,12000,40000,25000,2800,5000,15000,4000,
1000,21000,15000,16000,54000,1500,19200,2000,2000,1000,39000,5000,1100,18000,
10000,3500,1000,10000,5000,14000,1800,4000,1000,300,4000,1000,100,1000,4400,
2000,2000,12000,200,100,1000,1000,2000,1600,2000,4000,14000,4000,13500,1000,
200,200,1000,18000,23000,41400,60000,500,3000,21000,6900,14600,1900,4000,
4500,1000,2000,2000,1000,4100,2000,1000,2000,8000,3000,1500,2000,2000,3500,
2000,2000,1000,3800,30000,55000,500,1000,1000,2000,62400,2000,3000,200,2000,
3500,2000,2000,500,3000,4500,1000,10000,2000,3000,3600,1000,2000,2000,5000,
23000,2000,1900,2000,60000,2000,60000,20000,2000,2000,4600,1000,2000,1000,
18000,6000,62000,68000,26800,50000,45900,16900,21500,2000,22700,2000,2000,
32000,10000,5000,138000,159700,13000,2000,17619,2000,1000,4000,2000,1500,
4000,20000,158900,74100,6000,24900,60000,500,1000,40000,10000,50000,800,4000,
4900,6500,5000,400,500,3000,32300,24000,300,11500,2000,5000,1000,500,5000,
5500,17450,56800,2000,1000,21400,22000,60000,3000,7500,3000,1000,1000,2000,
1500,83700,2000,4000,170005,70000,6700,1500,3500,2000,10563.97,1500,25000,
2000,2000,2267.57,1100,3100,2000,3500,10000,2000,6000,1500,200,20000,4000,
46400,296900,150000,3700,7500,20000,48500,3500,12000,2500,4000,8500,1000,
14500,1000,11000,2000,2000,120000,20000,7600,3000,2000,8000,1600,40000,2000,
5000,34187.67,279100,9900,31300,814000,43500,5100,49500,4500,6262.38,100,
10400,2400,1500,5000,2500,15000,40000,32500,41100,358600,109600,514300,
258200,225900,402700,274300,75000,1000,56000,10000,4100,1000,15000,100,40000,
7900,5000,105000,15100,2000,1100,2900,1500,600,500,1300,100,5000,5000,10000,
10100,7000,40000,10500,5000,9500,1000,15200,2000,10000,10000,100,7800,3500,
189900,58000,345000,151700,11000,6000,7000,15700,6000,3000,5000,10000,2000,
1000,36000,1000,500,8000,9000,6000,2000,26500,6000,5000,97200,2000,5100,
17000,2500,25500,24000,5400,90000,41500,6200,7500,5000,7000,41000,25000,
1500,40000,5000,10000,21500,100,32000,32500,70000,500,66400,21000,5000,5000,
12600,3000,6200,38900,10000,1000,60000,41100,1200,31300,2500,58000,4100,
58000,42500)

Sorry for the inconvenience. I do understand that fitting a distribution to such data is not a foolproof method, but this is the procedure that has been followed in the risk management industry. Please note that my question does not pertain to operational risk as such; it is: if no distribution fits a particular data set, how do we proceed to simulate data based on it?

Regards

Amelia Marsh
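For concreteness, here is a minimal sketch of the simulate-and-take-a-percentile step described above. The lognormal is purely a placeholder for whatever distribution the tests select; the one-million draw count and the 99.9% level are taken from the description in the question.

library(MASS)  # for fitdistr()

## Fit a candidate severity distribution by maximum likelihood
## (lognormal is an assumption for illustration; in practice the
## KS/AD tests and P-P/Q-Q plots would guide the choice)
fit <- fitdistr(mydat, "lognormal")

## Simulate 1 million loss amounts from the fitted distribution
set.seed(1)
sim <- rlnorm(1e6, meanlog = fit$estimate["meanlog"],
                   sdlog  = fit$estimate["sdlog"])

## Capital charge as the 99.9th percentile of the simulated losses
quantile(sim, 0.999)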
Amelia Marsh <amelia_marsh08 <at> yahoo.com> writes:

> [original question quoted in full; snipped]

A skew-(log)-normal fit doesn't look too bad ... (whenever you have
positive data that are this strongly skewed, log-transforming is a
good step)

hist(log10(mydat),col="gray",breaks="FD",freq=FALSE)
## default breaks are much coarser:
## hist(log10(mydat),col="gray",breaks="Sturges",freq=FALSE)
lines(density(log10(mydat)),col=2,lwd=2)
library(fGarch)
ss <- snormFit(log10(mydat))
xvec <- seq(2,6.5,length=101)
lines(xvec,do.call(dsnorm,c(list(x=xvec),as.list(ss$par))),
      col="blue",lwd=2)
## or try a skew-Student-t: not very different:
ss2 <- sstdFit(log10(mydat))
lines(xvec,do.call(dsstd,c(list(x=xvec),as.list(ss2$estimate))),
      col="purple",lwd=2)

There are more flexible distributional families (Johnson, log-spline ...)

Multimodal data are a different can of worms -- consider fitting a
finite mixture model ...
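A minimal sketch of the two alternatives named above, continuing from the histogram and xvec objects in the code just shown. The logspline and mixtools packages are one possible choice of implementation; the reply names the techniques, not specific packages.

library(logspline)
## Flexible log-spline density estimate on the log scale
lfit <- logspline(log10(mydat))
lines(xvec, dlogspline(xvec, lfit), col = "darkgreen", lwd = 2)
## New losses can be simulated by sampling from the fitted density
new_losses <- 10^rlogspline(1e5, lfit)

library(mixtools)
## Two-component normal mixture on the log scale, for multimodal data
mix <- normalmixEM(log10(mydat), k = 2)
mix$lambda  # mixing weights
mix$mu      # component means
mix$sigma   # component standard deviations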
Try:

qqnorm(log(mydat))

That doesn't look too bad, does it? Now: where is the problem?

Cheers,
B.

On Jul 22, 2015, at 12:41 PM, Amelia Marsh <amelia_marsh08 at yahoo.com> wrote:

> [original question quoted in full; snipped]
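A minimal sketch of that diagnostic, with a reference line added for easier reading (the qqline call is an addition, not part of the reply):

## Normal Q-Q plot of the log losses; rough linearity suggests the
## data are not far from lognormal
qqnorm(log(mydat))
qqline(log(mydat), col = 2, lwd = 2)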
So - as you can see, your data can be modelled. Now the interesting question is: what do you do with that knowledge?

I know nearly nothing about your domain, but given that the data look log-normal, I am curious about the following:

- Most of the events are in the small-loss category, but most of the damage is done by the rare large losses. Is it even meaningful to guard against a single 1/1000 event? Shouldn't you be saying: my contingency funds need to be large enough to allow survival of, say, a fiscal year with 99.9% probability? This is a very different question (see the sketch after this message).

- If a loss occurs, in what time do the funds need to be replenished? Do you need to take series of events into account?

- The model assumes that the data are independent. This is probably a poor (and dangerous) assumption.

Cheers,
B.

On Jul 22, 2015, at 3:56 PM, Ben Bolker <bbolker at gmail.com> wrote:

> [Ben Bolker's reply quoted in full; snipped]
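A minimal sketch of the fiscal-year version of the question raised in the first bullet, reusing the illustrative lognormal fit from the earlier sketch. The Poisson frequency model, the annual event rate (the posted data carry no time stamps), and the independence of events are all assumptions.

## Hypothetical mean number of loss events per year
lambda <- 50

set.seed(1)
n_years <- 1e5
annual_total <- replicate(n_years, {
  n <- rpois(1, lambda)  # number of losses in one simulated year
  sum(rlnorm(n, meanlog = fit$estimate["meanlog"],
                sdlog  = fit$estimate["sdlog"]))
})

## Contingency funds needed to survive a year with 99.9% probability
quantile(annual_total, 0.999)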