Stefan Sobernig
2013-Feb-15 13:13 UTC
[R] Fitting pareto distribution / plotting observed & fitted dists
Some background: I have some data on structural dependencies in a base of code artifacts. The dependency structure is reflected in terms of relative node degrees, with each node representing some code unit (just as an example). This gives me real data of the following form (sorry for the longish posting): dat1 <- c(0.00245098039215686, 0, 0, 0, 0, 0, 0, 0, 0.0563725490196078, 0, 0, 0, 0.071078431372549, 0, 0, 0, 0, 0, 0, 0, 0.00490196078431373, 0.00245098039215686, 0, 0, 0, 0.00245098039215686, 0.0196078431372549, 0, 0.0588235294117647, 0.00490196078431373, 0.00980392156862745, 0, 0, 0.00245098039215686, 0, 0, 0, 0.0220588235294118, 0, 0, 0.00245098039215686, 0, 0, 0.00245098039215686, 0, 0.00245098039215686, 0.00980392156862745, 0.0294117647058824, 0.0637254901960784, 0, 0, 0, 0.00245098039215686, 0.0392156862745098, 0, 0.0147058823529412, 0, 0.017156862745098, 0.00245098039215686, 0, 0.00980392156862745, 0, 0.00735294117647059, 0.00490196078431373, 0.0514705882352941, 0.00245098039215686, 0, 0, 0, 0, 0.00245098039215686, 0, 0.00245098039215686, 0, 0.00245098039215686, 0, 0, 0.0147058823529412, 0.0367647058823529, 0.0269607843137255, 0.0269607843137255, 0.00735294117647059, 0.0441176470588235, 0, 0, 0.0196078431372549, 0, 0.00490196078431373, 0.0245098039215686, 0.00490196078431373, 0.00490196078431373, 0.0196078431372549, 0, 0.0318627450980392, 0.0245098039215686, 0, 0.00245098039215686, 0, 0.0416666666666667, 0, 0, 0.00490196078431373, 0.00490196078431373, 0.00245098039215686, 0.0122549019607843, 0.00490196078431373, 0.00490196078431373, 0.071078431372549, 0, 0, 0, 0, 0.00245098039215686, 0.00245098039215686, 0.00490196078431373, 0.0343137254901961, 0.00980392156862745, 0.00245098039215686, 0.053921568627451, 0, 0.0245098039215686, 0.00245098039215686, 0.0245098039215686, 0, 0.0294117647058824, 0.00490196078431373, 0.00980392156862745, 0.0367647058823529, 0, 0, 0.017156862745098, 0, 0.0245098039215686, 0, 0.071078431372549, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.00980392156862745, 0.00980392156862745, 0.00490196078431373, 0.0343137254901961, 0.0147058823529412, 0.0122549019607843, 0.00735294117647059, 0, 0.00245098039215686, 0, 0.00245098039215686, 0.110294117647059, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.00490196078431373, 0.00735294117647059, 0, 0.00245098039215686, 0, 0, 0.00735294117647059, 0.0735294117647059, 0, 0, 0, 0, 0, 0.00490196078431373, 0, 0.00245098039215686, 0.105392156862745, 0, 0, 0, 0, 0, 0.00735294117647059, 0.00980392156862745, 0.00245098039215686, 0.0147058823529412, 0, 0, 0.00245098039215686, 0.00490196078431373, 0.00245098039215686, 0.00735294117647059, 0.0563725490196078, 0, 0, 0, 0, 0.0318627450980392, 0, 0, 0.00490196078431373, 0.0465686274509804, 0, 0.0147058823529412, 0.017156862745098, 0.00735294117647059, 0.0245098039215686, 0.017156862745098, 0, 0, 0, 0.071078431372549, 0, 0, 0, 0, 0, 0, 0, 0.00980392156862745, 0.0122549019607843, 0.00980392156862745, 0.0196078431372549, 0, 0, 0, 0.00980392156862745, 0.00490196078431373, 0.00735294117647059, 0, 0.0196078431372549, 0.0220588235294118, 0, 0, 0.00490196078431373, 0.0661764705882353, 0, 0, 0, 0, 0, 0, 0, 0, 0.00490196078431373, 0.0955882352941176, 0, 0, 0, 0, 0, 0.00490196078431373, 0.00490196078431373, 0, 0.00245098039215686, 0.0588235294117647, 0, 0, 0, 0, 0.0857843137254902, 0, 0, 0, 0, 0.017156862745098, 0, 0, 0.0294117647058824, 0, 0, 0.0122549019607843, 0, 0, 0.0147058823529412, 0, 0, 0.0318627450980392, 0, 0, 0.00245098039215686, 0.00245098039215686, 0.00980392156862745, 0.00245098039215686, 0.00245098039215686, 0.00245098039215686, 0.0784313725490196, 0, 0, 0.0784313725490196, 0, 0, 0, 0.00735294117647059, 0.00245098039215686, 0.00490196078431373, 0.00245098039215686, 0, 0.00245098039215686, 0, 0.0122549019607843, 0, 0.0857843137254902, 0, 0, 0, 0, 0.0196078431372549, 0, 0.00980392156862745, 0.00245098039215686, 0, 0, 0.00490196078431373, 0.00735294117647059, 0.00735294117647059, 0.00735294117647059, 0.00735294117647059, 0.00735294117647059, 0.00735294117647059, 0.00735294117647059, 0, 0, 0.00735294117647059, 0.00980392156862745, 0.0122549019607843, 0.0245098039215686, 0.00980392156862745, 0.00245098039215686, 0.00490196078431373, 0.00490196078431373, 0.00245098039215686, 0.00245098039215686, 0.00735294117647059, 0.00490196078431373, 0, 0.00245098039215686, 0.017156862745098, 0.00490196078431373, 0.00490196078431373, 0.00245098039215686, 0.00245098039215686, 0.0269607843137255, 0, 0.00490196078431373, 0.00490196078431373, 0.00490196078431373, 0.00980392156862745, 0.00735294117647059, 0.00980392156862745, 0.0245098039215686, 0, 0.0563725490196078, 0, 0, 0, 0, 0, 0, 0.053921568627451, 0, 0, 0.0220588235294118, 0, 0, 0, 0.00245098039215686, 0, 0, 0, 0, 0.00490196078431373, 0.00735294117647059, 0.00735294117647059, 0.0122549019607843, 0.0147058823529412, 0, 0.00980392156862745, 0, 0, 0.0196078431372549, 0, 0, 0.0122549019607843, 0, 0, 0.0147058823529412, 0, 0.00490196078431373, 0.0220588235294118, 0, 0.00980392156862745, 0.0220588235294118, 0.0269607843137255, 0) Out of curiosity, I plotted the double-log CDF and the Pareto Q-Q plots. This provided me with a visual cue of a potential power-law distribution (at least, that's what I am boldly thinking). So I went ahead and performed the procedure as suggested by Clauset, Shalizi, and Newman (2007/09), by reusing the essentials found in http://tuvalu.santafe.edu/~aaronc/powerlaws/plfit.r. I obtained and tested the following fitting estimates based on dat1: xmin <- 0.01715686 alpha <- 2.381581 I then simply wanted to print the observed and fitted dists in one plot, so I ran: library(ggplot2) library(VGAM) dat2 <- rpareto(length(dat1), location=xmin, shape=alpha) hi1 <- hist(dat1, plot=FALSE, breaks="FD") hi2 <- hist(dat2, plot=FALSE, breaks="FD") y1 <- hi1$counts/sum(hi1$counts) y2 <- hi2$counts/sum(hi2$counts) x1 <- hi1$mids x2 <- hi2$mids qplot() + geom_line(aes(x=x1,y=y1)) + geom_line(aes(x=x2,y=y2), color="red") y1.c <- rev(cumsum(rev(y1))) y2.c <- rev(cumsum(rev(y2))) qplot() + geom_line(aes(x=x1,y=y1.c)) + geom_line(aes(x=x2,y=y2.c), color="red") In the plot, the fitted dist line, while ressembling the observed, is shifted to the right; IMO because of the xmin/location parameter, correct? Is it appropriate to correct for xmin like so? x2 <- x2 - xmin Am I missing anything? Do I need to be more careful at an earlier stage (e.g., breaks as this is binned data)? I apologize in advance, I am a statistical autodidact, so I might just be confusing the obvious. I'd highly appreciate your hints :) Stefan -- Institute for Information Systems and New Media Vienna University of Economics and Business Augasse 2-6, A-1090 Vienna `- http://nm.wu.ac.at/en/sobernig `- stefan.sobernig at wu.ac.at `- ss at thinkersfoot.net `- +43-1-31336-4878 [phone] `- +43-1-31336-746 [fax]
Dear List, In my previous posting (https://stat.ethz.ch/pipermail/r-help/2013-February/347593.html), I refer to playing around with distribution fitting for a particular sort of data (see dat1 in the previous posting): I collected absolute node degrees of some network structure and computed the relative node degrees. This is because I want to compare different networks at another stage. Based on relative degrees, I attempted some distribution fitting procedures (see my previous posting). However, I realized that zero values (representing isolates in the network) in an otherwise continuous variable are problematic (well, not much of a surprise). I am wondering how to best deal with those isolates/zeros? Excluding them entirely would be an option but isolates are relevant for my analysis. Are there best practises for dealing with such a hybrid variable? I am aware of censored methods (for regression analysis etc.), but not when it comes down to distribution fitting, I fear. Many thanks, Stefan