Frank van Berkum
2019-Mar-20 09:15 UTC
[R] [mgcv] Memory issues with bam() on computer cluster
Dear Simon,
Thank you for your response! I could not provide the requested information
earlier since I am not a full-time academic / researcher.
An example of a bam call that may result in an error is:
bam(formula = Di ~ 1 + Gender + I(L_Dis == 0) +
      s(DisPerc, by = as.numeric(L_Dis == 2), bs = 'cr'),
    offset = log(Ei * Mi), family = poisson, data = dtPF,
    method = "fREML", discrete = TRUE, gc.level = 2)
Here, dtPF is a data.table object with 22 million rows and 21 columns
(variables). Gender is a factor variable. L_Dis is an integer variable that
equals 0 if DisPerc is missing (in which case DisPerc is manually set to 0.1),
1 if DisPerc==0, and 2 if DisPerc>0. DisPerc ranges from 0 to 0.25.
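A minimal sketch of this coding, using hypothetical toy values rather than the real data (the vector below and the name by_var are illustrative assumptions, not part of dtPF):

```r
# Hypothetical toy values; NA marks a missing DisPerc.
DisPerc <- c(NA, 0, 0.05, 0.25, NA, 0)

# Code the three cases: 0 = missing, 1 = exactly zero, 2 = positive.
L_Dis <- ifelse(is.na(DisPerc), 0L, ifelse(DisPerc == 0, 1L, 2L))

# Replace missing values by the placeholder 0.1 so that no NA remains
# in the covariate of the smooth.
DisPerc[is.na(DisPerc)] <- 0.1

# Numeric by-variable as used in s(DisPerc, by = as.numeric(L_Dis == 2)):
# it is 1 for rows with a positive DisPerc and 0 otherwise, so the smooth
# is multiplied by zero for the missing and zero cases.
by_var <- as.numeric(L_Dis == 2)

print(data.frame(DisPerc, L_Dis, by_var))
```

With this coding the placeholder value 0.1 only needs to be a valid number; rows with L_Dis equal to 0 are handled by the I(L_Dis == 0) term instead of the smooth.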
The sessionInfo() provides the following output:

R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)

Matrix products: default
BLAS/LAPACK: /sara/eb/Debian9/OpenBLAS/0.2.20-GCC-6.4.0-2.28/lib/libopenblas_sandybridgep-r0.2.20.so

locale:
 [1] LC_CTYPE=en_US LC_NUMERIC=C LC_TIME=en_US
 [4] LC_COLLATE=en_US LC_MONETARY=en_US LC_MESSAGES=en_US
 [7] LC_PAPER=en_US LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_US LC_IDENTIFICATION=C

attached base packages:
[1] methods stats graphics grDevices utils datasets base

other attached packages:
[1] mgcv_1.8-27 nlme_3.1-137 data.table_1.12.0

loaded via a namespace (and not attached):
[1] compiler_3.4.3 Matrix_1.2-16 tools_3.4.3 splines_3.4.3
[5] grid_3.4.3 lattice_0.20-38
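
For scale, the sizes of the failed 'Calloc' requests in the error messages of the original post (quoted below) can be worked out directly. This is only a rough sketch: the counts and element sizes are copied from those messages, the labels are the functions named in them, and the 64 GB figure is the stated minimum node memory, taken here as 64e9 bytes by assumption.

```r
# Element counts and element sizes copied from the three 'Calloc' messages.
alloc <- data.frame(
  call       = c("XWyd", "Xbd", "XWXd"),
  count      = c(22624897, 18590685, 55315650),
  bytes_each = c(8, 8, 24)
)

# Size of each individual request in bytes and in GB (1 GB = 1e9 bytes).
alloc$bytes <- alloc$count * alloc$bytes_each
alloc$GB    <- alloc$bytes / 1e9

# Share of the assumed 64 GB node that each single request represents.
node_GB <- 64
alloc$share_of_node <- alloc$GB / node_GB

print(alloc)
```

Each single request is a small fraction of the node's memory, so one possible reading is that memory was already close to exhausted (or capped by some limit) when these requests were made. The numbers alone cannot confirm that.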
Thank you for your help!
Frank
________________________________
Date: Fri, 15 Mar 2019 12:31:31 +0000
From: Simon Wood <simon.wood at bath.edu>
To: r-help at r-project.org
Subject: Re: [R] [mgcv] Memory issues with bam() on computer cluster
Message-ID: <d8e2643a-d960-0d86-4296-f0c7fcf149cb at bath.edu>
Content-Type: text/plain; charset="utf-8"
Can you supply the results of sessionInfo() please, and the full bam
call that causes this.
best,
Simon (mgcv maintainer)
On 15/03/2019 09:09, Frank van Berkum wrote:
> Dear Community,
>
> In our current research we are trying to fit Generalized Additive Models
> to a large dataset. We are using the package mgcv in R.
>
> Our dataset contains about 22 million records with fewer than 20 risk
> factors for each observation, so in our case n>>p. The dataset covers the
> period 2006 until 2011, and we analyse both the complete dataset and
> datasets in which we leave out a single year. The latter is done to
> analyse the robustness of the results. We understand k-fold cross
> validation may seem more appropriate, but our approach is closer to what
> is done in practice (how will one additional year of information affect
> your estimates?).
>
> We use the function bam as advocated in Wood et al. (2017), and we apply
> the following options: bam(..., discrete=TRUE, chunk.size=10000,
> gc.level=1). We run these analyses on a computer cluster (see
> https://userinfo.surfsara.nl/systems/lisa/description for details), and
> the job is allocated to a node within the computer cluster. A node has at
> least 16 cores and 64 GB of memory.
>
> We had expected 64 GB of memory to be sufficient for these analyses,
> especially since the bam function is built specifically for large
> datasets. However, when applying this function to the different datasets
> described above with different regression specifications (different risk
> factors included in the linear predictor), we sometimes obtain errors of
> the following form:
>
> Error in XWyd(G$Xd, w, z, G$kd, G$ks, G$ts, G$dt, G$v, G$qc, G$drop, ar.stop, :
>   'Calloc' could not allocate memory (22624897 of 8 bytes)
> Calls: fnEstimateModel_bam -> bam -> bgam.fitd -> XWyd
> Execution halted
> Warning message:
> system call failed: Cannot allocate memory
>
> Error in Xbd(G$Xd, coef, G$kd, G$ks, G$ts, G$dt, G$v, G$qc, G$drop) :
>   'Calloc' could not allocate memory (18590685 of 8 bytes)
> Calls: fnEstimateModel_bam -> bam -> bgam.fitd -> Xbd
> Execution halted
> Warning message:
> system call failed: Cannot allocate memory
>
> Error: cannot allocate vector of size 1.7 Gb
> Timing stopped at: 2 0.556 4.831
> Error in system.time(oo <- .C(C_XWXd0, XWX = as.double(rep(0, (pt + nt)^2)), :
>   'Calloc' could not allocate memory (55315650 of 24 bytes)
> Calls: fnEstimateModel_bam -> bam -> bgam.fitd -> XWXd -> system.time -> .C
> Timing stopped at: 1.056 1.396 2.459
> Execution halted
> Warning message:
> system call failed: Cannot allocate memory
>
> The errors seem to arise at different stages in the optimization process.
> We have checked whether these errors disappear with different settings
> (different chunk.size, different gc.level), but this does not resolve our
> problem. Also, the errors occur on different datasets when using different
> settings, and even with the same settings an error that occurred on
> dataset X in one run does not necessarily occur on dataset X in a
> different run. When using the discrete=TRUE option, optimization can be
> parallelized, but we have chosen not to employ this feature so that memory
> does not have to be shared between parallel processes.
>
> Naturally I cannot share our dataset with you, which makes the problem
> difficult to analyse. However, based on your collective knowledge, could
> you point us to where the problem may occur? Is it something within the
> C code used within the package (as the last error seems to indicate), or
> is it related to the computer cluster?
>
> Any help or insight is much appreciated.
>
> Kind regards,
>
> Frank
--
Simon Wood, School of Mathematics, University of Bristol, BS8 1TW UK
https://people.maths.bris.ac.uk/~sw15190/