I wonder whether the functions for skewness and kurtosis in the e1071 package are based on correct formulas. The functions in the package e1071 are: # -------------------------------------------- skewness <- function (x, na.rm = FALSE) { if (na.rm) x <- x[!is.na(x)] sum((x - mean(x))^3)/(length(x) * sd(x)^3) } # -------------------------------------------- and # -------------------------------------------- kurtosis <- function (x, na.rm = FALSE) { if (na.rm) x <- x[!is.na(x)] sum((x - mean(x))^4)/(length(x) * var(x)^2) - 3 } # -------------------------------------------- However, sd() and var() are the estimated population parameters. To calculate the sample statistics of skewness and kurtosis, shouldn't one use the sample statistics of the standard deviation (and variance), as well? For example: # -------------------------------------------- # Function to calculate the sample statistic of skewness: skew_s=function(x) { x = x[!is.na(x)] n = length(x) if (n < 3) { cat('valid cases = ',n,'\nskewness is not defined for less than 3 valid cases!\n') } else { z = sqrt(n/(n-1))*scale(x) mean(z^3) } } # -------------------------------------------- and # -------------------------------------------- # Function to calculate the sample statistic of kurtosis: kurt_s=function(x) { x = x[!is.na(x)] n = length(x) if (n < 4) { cat('valid cases = ',n,'\nkurtosis is not defined for less than 4 valid cases!\n') } else { z = sqrt(n/(n-1))*scale(x) mean(z^4)-3 } } # -------------------------------------------- Whereas, to calculate the (unbiased) estimated population parameter of skewness and kurtosis, the correction should also include the number of cases in the following way: # -------------------------------------------- # Function to calculate the unbiased populataion estimate of skewness: skew=function(x) { x = x[!is.na(x)] n = length(x) if (n < 3) { cat('valid cases = ',n,'\nskewness is not defined for less than 3 valid cases!\n') } else { z = scale(x) sum(z^3)*n/((n-1)*(n-2)) } } # -------------------------------------------- and # -------------------------------------------- # Function to calculate the unbiased population estimate of kurtosis: kurt=function(x) { x = x[!is.na(x)] n = length(x) if (n < 4) { cat('valid cases = ',n,'\nkurtosis is not defined for less than 4 valid cases!\n') } else { z = scale(x) sum(z^4)*n*(n+1)/((n-1)*(n-2)*(n-3))-3*(n-1)^2/((n-2)*(n-3)) } } # -------------------------------------------- Thus, it seems that the formulas used in the e1071 package are neither formulas for the sample statistics nor for the (unbiased) estimates of the population parameters. Is there another definition of kurtosis and skewness that I am not aware of? Dirk -- ************************************************* Dr. Dirk Enzmann Institute of Criminal Sciences Dept. of Criminology Edmund-Siemers-Allee 1 D-20146 Hamburg Germany phone: +49-040-42838.7498 (office) +49-040-42838.4591 (Billon) fax: +49-040-42838.2344 email: dirk.enzmann at jura.uni-hamburg.de www: http://www2.jura.uni-hamburg.de/instkrim/kriminologie/Mitarbeiter/Enzmann/Enzmann.html
This is probably an issue over definitions rather than the correct answer. To me skewness and kurtosis are functions of the distribution rather than the population, they are equivalent to expectation rather than mean. For the normal distribution it makes no sense to estimate them as the distribution is uniquely defined by its first two moments. However there are two defnitions of kurotsis as it is often standardized such that the expectation is 0. HTH Phineas www.pwp.phineas.blueyonder.co.uk>>> Dirk Enzmann <dirk.enzmann at jura.uni-hamburg.de> 05/23/05 3:17 AM >>>I wonder whether the functions for skewness and kurtosis in the e1071 package are based on correct formulas. The functions in the package e1071 are: # -------------------------------------------- skewness <- function (x, na.rm = FALSE) { if (na.rm) x <- x[!is.na(x)] sum((x - mean(x))^3)/(length(x) * sd(x)^3) } # -------------------------------------------- and # -------------------------------------------- kurtosis <- function (x, na.rm = FALSE) { if (na.rm) x <- x[!is.na(x)] sum((x - mean(x))^4)/(length(x) * var(x)^2) - 3 } # -------------------------------------------- However, sd() and var() are the estimated population parameters. To calculate the sample statistics of skewness and kurtosis, shouldn't one use the sample statistics of the standard deviation (and variance), as well? For example: # -------------------------------------------- # Function to calculate the sample statistic of skewness: skew_s=function(x) { x = x[!is.na(x)] n = length(x) if (n < 3) { cat('valid cases = ',n,'\nskewness is not defined for less than 3 valid cases!\n') } else { z = sqrt(n/(n-1))*scale(x) mean(z^3) } } # -------------------------------------------- and # -------------------------------------------- # Function to calculate the sample statistic of kurtosis: kurt_s=function(x) { x = x[!is.na(x)] n = length(x) if (n < 4) { cat('valid cases = ',n,'\nkurtosis is not defined for less than 4 valid cases!\n') } else { z = sqrt(n/(n-1))*scale(x) mean(z^4)-3 } } # -------------------------------------------- Whereas, to calculate the (unbiased) estimated population parameter of skewness and kurtosis, the correction should also include the number of cases in the following way: # -------------------------------------------- # Function to calculate the unbiased populataion estimate of skewness: skew=function(x) { x = x[!is.na(x)] n = length(x) if (n < 3) { cat('valid cases = ',n,'\nskewness is not defined for less than 3 valid cases!\n') } else { z = scale(x) sum(z^3)*n/((n-1)*(n-2)) } } # -------------------------------------------- and # -------------------------------------------- # Function to calculate the unbiased population estimate of kurtosis: kurt=function(x) { x = x[!is.na(x)] n = length(x) if (n < 4) { cat('valid cases = ',n,'\nkurtosis is not defined for less than 4 valid cases!\n') } else { z = scale(x) sum(z^4)*n*(n+1)/((n-1)*(n-2)*(n-3))-3*(n-1)^2/((n-2)*(n-3)) } } # -------------------------------------------- Thus, it seems that the formulas used in the e1071 package are neither formulas for the sample statistics nor for the (unbiased) estimates of the population parameters. Is there another definition of kurtosis and skewness that I am not aware of? Dirk -- ************************************************* Dr. Dirk Enzmann Institute of Criminal Sciences Dept. of Criminology Edmund-Siemers-Allee 1 D-20146 Hamburg Germany phone: +49-040-42838.7498 (office) +49-040-42838.4591 (Billon) fax: +49-040-42838.2344 email: dirk.enzmann at jura.uni-hamburg.de www: http://www2.jura.uni-hamburg.de/instkrim/kriminologie/Mitarbeiter/Enzmann/Enzmann.html ______________________________________________ R-help at stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
To me your answer is opaque (but that seems to be rather a problem of language ;-) ). Perhaps my question has not been expressed clearly enough. Let me state it differently: In the R package e1071 the formulas (implicit) used are (3) and (4) (see below), the standard deviation used in these formulas, however is based on (2) (see below). This seems to be inconsistent and my question is, whether there is a commonly used third definition of skewness and kurtosis in which the formulas for the "biased" skewness and kurtosis _but_ with the "unbiased" standard deviation are employed. The standard deviation can be defined as the _sample_ statistic: sd = 1/n * sum( (x - mean(x))^2 ) # (1) and as the estimated population parameter: sd = 1/(n-1) * sum( (x-mean(x))^2 ) # (2). In R the function sd() calculates the latter. In the same way, expressed via z-values skewness and kurtosis can be defined as the _sample_ statistic (also called "biased estimator" , see: http://www.mathdaily.com/lessons/Skewness ): skewness = mean(z^3) # (3) kurtosis = mean(z^4)-3 # (4) with z = (x - mean(x))/sd(x) with sd = 1/n * sum( (x - mean(x)^2 ) (thus: here sd is the _sample_ statistic, see (1) above!) but they can also be defined as the estimated population parameters (also called "unbiased", see: http://www.mathdaily.com/lessons/Kurtosis#Sample_kurtosis ): skewness = n/((n-1)*(n-2)) * sum(z^3) # (5) kurtosis = n*(n+1)/((n-1)*(n-2)*(n-3)) * sum(z^4) - 3*(n-1)^2/((n-2)*(n-3)) # (6) with z = (x - mean(x))/sd(x) with sd = 1/(n-1) * sum( (x - mean(x)^2 ) (thus: here sd is the estimated population parameter, see (2) above!. BTW: The R function scale() calculates the z-values based on this definition, as well.) Campbell wrote:> This is probably an issue over definitions rather than the correct > answer. To me skewness and kurtosis are functions of the distribution > rather than the population, they are equivalent to expectation rather > than mean. For the normal distribution it makes no sense to estimate > them as the distribution is uniquely defined by its first two moments. > However there are two defnitions of kurotsis as it is often > standardized such that the expectation is 0.************************************************* Dr. Dirk Enzmann Institute of Criminal Sciences Dept. of Criminology Edmund-Siemers-Allee 1 D-20146 Hamburg Germany phone: +49-040-42838.7498 (office) +49-040-42838.4591 (Billon) fax: +49-040-42838.2344 email: dirk.enzmann at jura.uni-hamburg.de www: http://www2.jura.uni-hamburg.de/instkrim/kriminologie/Mitarbeiter/Enzmann/Enzmann.html
My last question to the R list is another example of asking too fast before reading (sorry). But by the way and because it might be interesting for others, an answer can be found in: Joanes, D. N. & Gill, C. A. (1998) Comparing measures of sample skewness and kurtosis. Journal of the Royal Statistical Society (Series D): The Statistician 47 (1), 183?189. The measures discussed in this article are: g1 : skewness as defined in formula (3) of my mail (and in STATA) G1 : skewness as defined in formula (5) of my mail (and in SAS, SPSS, and EXCEL) b1 : skewness as defined in the package e1071 (and in MINITAB and BMDP) g2 : kurtosis as defined in formula (4) of my mail (and in STATA) G2 : kurtosis as defined in formula (6) of my mail (and in SAS, SPSS, and EXCEL) b2 : kurtosis as defined in the package e1071 (and in MINITAB and BMDP) ------------------------ Off topic: I don't know why the thread of these mails are not included in the correct thread. Dirk -- ************************************************* Dr. Dirk Enzmann Institute of Criminal Sciences Dept. of Criminology Edmund-Siemers-Allee 1 D-20146 Hamburg Germany phone: +49-040-42838.7498 (office) +49-040-42838.4591 (Billon) fax: +49-040-42838.2344 email: dirk.enzmann at jura.uni-hamburg.de www: http://www2.jura.uni-hamburg.de/instkrim/kriminologie/Mitarbeiter/Enzmann/Enzmann.html
Possibly Parallel Threads
- Problem with read.spss() and as.data.frame(), or: alternative to subset()?
- skewness and kurtosis in e1071 correct? (correction)
- par(new=TRUE) vs par(new=FALSE)
- standardized random effects with ranef.lme()
- How to reply to a thread if receiving R-help mails in digest form