Bug summary:
glm() causes a segfault if the argument 'data'
is a data frame with more than 16384 rows.

Bug demonstration:

-------input ---------------
N <- 16400
df <- data.frame(x=runif(N, min=1,max=2),y=rpois(N, 2))
glm(y ~ x, family=poisson, data=df)

------ output ---------------
 *** caught segfault ***
address (nil), cause 'unknown'

Traceback:
 1: ifelse(y == 0, 1, y/mu)
 2: dev.resids(y, mu, weights)
 3: glm.fit(x = X, y = Y, weights = weights, start = start, etastart = etastart, mustart = mustart, offset = offset, family = family, control = control, intercept = attr(mt, "intercept") > 0)
 4: glm(y ~ x, family = poisson, data = df)

--------------------------------

The code generates a segfault if the value of 'N' is greater than 16384.

regards
Adrian Baddeley

////////////////////////////////////////////////////////////

--please do not edit the information below--

Version:
 platform = x86_64-unknown-linux-gnu
 arch = x86_64
 os = linux-gnu
 system = x86_64, linux-gnu
 status =
 major = 2
 minor = 10.1
 year = 2009
 month = 12
 day = 14
 svn rev = 50720
 language = R
 version.string = R version 2.10.1 (2009-12-14)

Locale:
LC_CTYPE=en_AU.UTF-8;LC_NUMERIC=C;LC_TIME=en_AU.UTF-8;LC_COLLATE=en_AU.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_AU.UTF-8;LC_PAPER=en_AU.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_AU.UTF-8;LC_IDENTIFICATION=C

Search Path:
 .GlobalEnv, package:stats, package:graphics, package:grDevices, package:utils, package:datasets, package:methods, Autoloads, package:base

>>>>> "AB" == Adrian Baddeley <adrian at maths.uwa.edu.au>
>>>>>     on Thu, 17 Dec 2009 08:35:09 +0100 (CET) writes:

    AB> Bug summary:
    AB> glm() causes a segfault if the argument 'data'
    AB> is a data frame with more than 16384 rows.

Not on my desktop.

    AB> Bug demonstration:

    AB> -------input ---------------
    AB> N <- 16400
    AB> df <- data.frame(x=runif(N, min=1,max=2),y=rpois(N, 2))
    AB> glm(y ~ x, family=poisson, data=df)

I don't get a problem, nor do I if I do the above in a for() loop
100 times, nor do I get a problem when I do it 100 times with
N <- 50000 (where you need to wait a few minutes), using the
following (on x86_64-unknown-linux-gnu):

  N <- 50000
  for(n in 1:100) {
      cat("\nn = ",n,"\n----\n")
      df <- data.frame(x=runif(N, min=1,max=2),y=rpois(N, 2))
      print(glm(y ~ x, family=poisson, data=df))
  }

So, I guess it's a problem only on *some* platforms.
Mine is also x86_64-unknown-linux-gnu
[with 8 gigabytes of RAM and a quad-core
 AMD Phenom(tm) II X4 925 Processor].

Martin

    AB> ------ output ---------------
    AB> *** caught segfault ***
    AB> address (nil), cause 'unknown'

    AB> Traceback:
    AB> 1: ifelse(y == 0, 1, y/mu)
    AB> 2: dev.resids(y, mu, weights)
    AB> 3: glm.fit(x = X, y = Y, weights = weights, start = start, etastart =
    AB> etastart, mustart = mustart, offset = offset, family = family,
    AB> control = control, intercept = attr(mt, "intercept") > 0)
    AB> 4: glm(y ~ x, family = poisson, data = df)

    AB> --------------------------------

    AB> The code generates a segfault if the value of 'N' is greater than 16384.

    AB> ______________________________________________
    AB> R-devel at r-project.org mailing list
    AB> https://stat.ethz.ch/mailman/listinfo/r-devel

I cannot reproduce this on our x86_64 Fedora systems (and I tried all
the usual tricks such as gctorture and valgrind to provoke a problem).
And I have fitted much larger GLMs many times over the last decade, so
your 'bug summary' cannot be the whole story.

Your example is random and you haven't set a seed: to rule out
something specific about the data you tried, can you set one and tell
us which seed failed?

One possibility is a compiler optimization bug, so can you please tell
us what compilers were used with what flags to build this version of
R, and if you built it yourself, try it without optimization. (The
machines I used had GCC 4.3.2 and 4.4.1 with

CFLAGS="-g -O3 -Wall -pedantic -mtune=core2"
FFLAGS="-g -O -mtune=core2"

: higher levels of optimization have known problems with recent x86_64
versions of gfortran, and I am wondering if that is an underlying
issue.)

On Thu, 17 Dec 2009, adrian at maths.uwa.edu.au wrote:

> Bug summary:
> glm() causes a segfault if the argument 'data'
> is a data frame with more than 16384 rows.
>
> Bug demonstration:
>
> -------input ---------------
> N <- 16400
> df <- data.frame(x=runif(N, min=1,max=2),y=rpois(N, 2))
> glm(y ~ x, family=poisson, data=df)
>
> ------ output ---------------
> *** caught segfault ***
> address (nil), cause 'unknown'
>
> Traceback:
> 1: ifelse(y == 0, 1, y/mu)
> 2: dev.resids(y, mu, weights)
> 3: glm.fit(x = X, y = Y, weights = weights, start = start, etastart =
> etastart, mustart = mustart, offset = offset, family = family,
> control = control, intercept = attr(mt, "intercept") > 0)
> 4: glm(y ~ x, family = poisson, data = df)
>
> --------------------------------
>
> The code generates a segfault if the value of 'N' is greater than 16384.

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
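
For anyone following up on the request for a seeded example, the following is one possible sketch of how the report could be made reproducible. It is not from the original poster: the helper name fit_one, the log file name and the seed range 1:5 are all hypothetical choices made for illustration. The idea is to write each seed to a file before fitting, because a segfault kills the R process and nothing can be caught afterwards.

```r
## Sketch only: a seeded variant of the reported example.
## 'fit_one', 'log_file' and the seed range are hypothetical choices,
## not part of the original report.
fit_one <- function(seed, N = 16400) {
    set.seed(seed)   # make the random data reproducible
    df <- data.frame(x = runif(N, min = 1, max = 2), y = rpois(N, 2))
    glm(y ~ x, family = poisson, data = df)
}

log_file <- file.path(tempdir(), "glm-seeds.log")
for (seed in 1:5) {
    ## Record the seed *before* fitting: if R segfaults, the last seed
    ## in the log is the one that was being fitted at the time.
    cat(seed, "\n", file = log_file, append = TRUE)
    fit <- fit_one(seed)
    cat("seed", seed, "ok, deviance =", format(deviance(fit)), "\n")
}

## To stress the garbage collector as well (very slow), a single fit
## can be wrapped like this:
##   gctorture(TRUE); fit_one(1); gctorture(FALSE)
```

If a particular seed fails, reporting that seed together with the output of 'R CMD config CC', 'R CMD config CFLAGS' and 'R CMD config FFLAGS' should cover both the seed and the compiler questions raised above.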