Roy Mendelssohn - NOAA Federal
2016-Mar-19 03:28 UTC
[R] Reshaping an array - how does it work in R
Hi Henrik: I want to do want in oceanography is called an EOF, which is just a PCA analysis. Unless I am missing something, in R I need to flatten my 3-D matrix into a 2-D data matrix. I can fit the entire 30GB matrix into memory, and I believe I have enough memory to do the PCA by constraining the number of components returned . What I don?t think I have enough memory for is an operation that makes a copy of the matrix. As I said, in theory I know how to do the flattening, it a simple command, but in practice I don?t have enough memory. So I spent the afternoon rewriting my code to read in parts of the data at a time and then putting those in the appropriate places of a matrix already flattened of appropriate size. In case someone is wondering, on the 3D grid the matrix is [1001,1001,3650]. So I create an empty matrix size [1001*1001,3650], and read in a slice of the lat-lon grid, and map those into the appropriate places in the flattened matrix. By reading in appropriately sized chunks my memory usage is not pushed too far. -Roy> On Mar 18, 2016, at 7:37 PM, Henrik Bengtsson <henrik.bengtsson at gmail.com> wrote: > > On Fri, Mar 18, 2016 at 3:15 PM, Roy Mendelssohn - NOAA Federal > <roy.mendelssohn at noaa.gov> wrote: >> Thanks. That is what I needed to know. I don?t want to play around with some of the other suggestions, as I don?t totally understand what they do, and don?t want to risk messing up something and not be aware of it. >> >> There is a way to read in the data chunks at a time and reshape it and put, it into the (reshaped) larger array, harder to program but probably worth the pain to be certain of what I am doing. > > I recommend this approach; whenever I work with reasonable large data > (that may become even larger in the future), I try to implement a > constant-memory version of the algorithm, which often comes down to > processing data in chunks. The simplest version of this is to read > all data into memory and the subset, but if you can read data in in > chunks that is even better. > > Though, I'm curious to what matrix operations you wish to perform. > Because if you wish to do regular summation, then base::.rowSums() and > base::.colSums() allow you to override the default dimensions on the > fly without inducing extra copies, e.g. > >> X <- array(1:24, dim=c(2,3,4)) >> .rowSums(X, m=6, n=4) > [1] 40 44 48 52 56 60 >> rowSums(matrix(X, nrow=6, ncol=4)) > [1] 40 44 48 52 56 60 > > For other types of calculations, you might want to look at > matrixStats. It has partial(*) support for overriding the default > dimensions in a similar fashion. For instance, > >> library("matrixStats") >> rowVars(X, dim.=c(6,4)) > > The above effectively calculates rowVars(matrix(X, nrow=6, ncol=4)) > without making copies. > > (*) By partial I mean that this is a feature that hasn't been pushed > through to all of matrixStats functions, cf. > https://github.com/HenrikBengtsson/matrixStats/issues/83. > > Cheers, > > Henrik > (author of matrixStats) > >> >> I had a feeling a copy was made, just wanted to make certain of it. >> >> Thanks again, >> >> -Roy >> >>> On Mar 18, 2016, at 2:56 PM, D?nes T?th <toth.denes at ttk.mta.hu> wrote: >>> >>> Hi Roy, >>> >>> R (usually) makes a copy if the dimensionality of an array is modified, even if you use this syntax: >>> x <- array(1:24, c(2, 3, 4)) >>> dim(x) <- c(6, 4) >>> >>> See also ?tracemem, ?data.table::address, ?pryr::address and other tools to trace if an internal copy is done. >>> >>> Workaround: use data.table::setattr or bit::setattr to modify the dimensions in place (i.e., without making a copy). Risk: if you modify an object by reference, all other objects which point to the same memory address will be modified silently, too. >>> >>> HTH, >>> Denes >>> >>> >>> >>> On 03/18/2016 10:28 PM, Roy Mendelssohn - NOAA Federal wrote: >>>> Hi All: >>>> >>>> I am working with a very large array. if noLat is the number of latitudes, noLon the number of longitudes and noTime the number of time periods, the array is of the form: >>>> >>>> myData[noLat, no Lon, noTime]. >>>> >>>> It is read in this way because that is how it is stored in a (series) of netcdf files. For the analysis I need to do, I need instead the array: >>>> >>>> myData[noLat*noLon, noTime]. Normally this would be easy: >>>> >>>> myData<- array(myData,dim=c(noLat*noLon,noTime)) >>>> >>>> My question is how does this command work in R - does it make a copy of the existing array, with different indices for the dimensions, or does it just redo the indices and leave the given array as is? The reason for this question is my array is 30GB in memory, and I don?t have enough space to have a copy of the array in memory. If the latter I will have to figure out a work around to bring in only part of the data at a time and put it into the proper locations. >>>> >>>> Thanks, >>>> >>>> -Roy >>>> >>>> >>>> >>>> ********************** >>>> "The contents of this message do not reflect any position of the U.S. Government or NOAA." >>>> ********************** >>>> Roy Mendelssohn >>>> Supervisory Operations Research Analyst >>>> NOAA/NMFS >>>> Environmental Research Division >>>> Southwest Fisheries Science Center >>>> ***Note new address and phone*** >>>> 110 Shaffer Road >>>> Santa Cruz, CA 95060 >>>> Phone: (831)-420-3666 >>>> Fax: (831) 420-3980 >>>> e-mail: Roy.Mendelssohn at noaa.gov www: http://www.pfeg.noaa.gov/ >>>> >>>> "Old age and treachery will overcome youth and skill." >>>> "From those who have been given much, much will be expected" >>>> "the arc of the moral universe is long, but it bends toward justice" -MLK Jr. >>>> >>>> ______________________________________________ >>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >>>> https://stat.ethz.ch/mailman/listinfo/r-help >>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html >>>> and provide commented, minimal, self-contained, reproducible code. >>>> >> >> ********************** >> "The contents of this message do not reflect any position of the U.S. Government or NOAA." >> ********************** >> Roy Mendelssohn >> Supervisory Operations Research Analyst >> NOAA/NMFS >> Environmental Research Division >> Southwest Fisheries Science Center >> ***Note new address and phone*** >> 110 Shaffer Road >> Santa Cruz, CA 95060 >> Phone: (831)-420-3666 >> Fax: (831) 420-3980 >> e-mail: Roy.Mendelssohn at noaa.gov www: http://www.pfeg.noaa.gov/ >> >> "Old age and treachery will overcome youth and skill." >> "From those who have been given much, much will be expected" >> "the arc of the moral universe is long, but it bends toward justice" -MLK Jr. >> >> ______________________________________________ >> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code.********************** "The contents of this message do not reflect any position of the U.S. Government or NOAA." ********************** Roy Mendelssohn Supervisory Operations Research Analyst NOAA/NMFS Environmental Research Division Southwest Fisheries Science Center ***Note new address and phone*** 110 Shaffer Road Santa Cruz, CA 95060 Phone: (831)-420-3666 Fax: (831) 420-3980 e-mail: Roy.Mendelssohn at noaa.gov www: http://www.pfeg.noaa.gov/ "Old age and treachery will overcome youth and skill." "From those who have been given much, much will be expected" "the arc of the moral universe is long, but it bends toward justice" -MLK Jr.
On Fri, Mar 18, 2016 at 8:28 PM, Roy Mendelssohn - NOAA Federal <roy.mendelssohn at noaa.gov> wrote:> Hi Henrik: > > I want to do want in oceanography is called an EOF, which is just a PCA analysis. Unless I am missing something, in R I need to flatten my 3-D matrix into a 2-D data matrix. I can fit the entire 30GB matrix into memory, and I believe I have enough memory to do the PCA by constraining the number of components returned . What I don?t think I have enough memory for is an operation that makes a copy of the matrix. > > As I said, in theory I know how to do the flattening, it a simple command, but in practice I don?t have enough memory. So I spent the afternoon rewriting my code to read in parts of the data at a time and then putting those in the appropriate places of a matrix already flattened of appropriate size. In case someone is wondering, on the 3D grid the matrix is [1001,1001,3650]. So I create an empty matrix size [1001*1001,3650], and read in a slice of the lat-lon grid, and map those into the appropriate places in the flattened matrix. By reading in appropriately sized chunks my memory usage is not pushed too far.Sounds good. There's another small caveat. Make sure to specify the 'data' argument for matrix() we allocating an "empty" matrix, e.g. X <- matrix(NA_real_, nrow=1001*1001, ncol=3650) This will give you a double matrix with all missing value. If you use the default X <- matrix(nrow=1001*1001, ncol=3650) you'll get a logical matrix, which will introduce a copy as soon as you assign a double value (e.g. X[1,1] <- 3.14). The latter is a complete waste of memory/time. See http://www.jottr.org/2014/06/matrixNA-wrong-way.html for details. /Henrik> > -Roy > > >> On Mar 18, 2016, at 7:37 PM, Henrik Bengtsson <henrik.bengtsson at gmail.com> wrote: >> >> On Fri, Mar 18, 2016 at 3:15 PM, Roy Mendelssohn - NOAA Federal >> <roy.mendelssohn at noaa.gov> wrote: >>> Thanks. That is what I needed to know. I don?t want to play around with some of the other suggestions, as I don?t totally understand what they do, and don?t want to risk messing up something and not be aware of it. >>> >>> There is a way to read in the data chunks at a time and reshape it and put, it into the (reshaped) larger array, harder to program but probably worth the pain to be certain of what I am doing. >> >> I recommend this approach; whenever I work with reasonable large data >> (that may become even larger in the future), I try to implement a >> constant-memory version of the algorithm, which often comes down to >> processing data in chunks. The simplest version of this is to read >> all data into memory and the subset, but if you can read data in in >> chunks that is even better. >> >> Though, I'm curious to what matrix operations you wish to perform. >> Because if you wish to do regular summation, then base::.rowSums() and >> base::.colSums() allow you to override the default dimensions on the >> fly without inducing extra copies, e.g. >> >>> X <- array(1:24, dim=c(2,3,4)) >>> .rowSums(X, m=6, n=4) >> [1] 40 44 48 52 56 60 >>> rowSums(matrix(X, nrow=6, ncol=4)) >> [1] 40 44 48 52 56 60 >> >> For other types of calculations, you might want to look at >> matrixStats. It has partial(*) support for overriding the default >> dimensions in a similar fashion. For instance, >> >>> library("matrixStats") >>> rowVars(X, dim.=c(6,4)) >> >> The above effectively calculates rowVars(matrix(X, nrow=6, ncol=4)) >> without making copies. >> >> (*) By partial I mean that this is a feature that hasn't been pushed >> through to all of matrixStats functions, cf. >> https://github.com/HenrikBengtsson/matrixStats/issues/83. >> >> Cheers, >> >> Henrik >> (author of matrixStats) >> >>> >>> I had a feeling a copy was made, just wanted to make certain of it. >>> >>> Thanks again, >>> >>> -Roy >>> >>>> On Mar 18, 2016, at 2:56 PM, D?nes T?th <toth.denes at ttk.mta.hu> wrote: >>>> >>>> Hi Roy, >>>> >>>> R (usually) makes a copy if the dimensionality of an array is modified, even if you use this syntax: >>>> x <- array(1:24, c(2, 3, 4)) >>>> dim(x) <- c(6, 4) >>>> >>>> See also ?tracemem, ?data.table::address, ?pryr::address and other tools to trace if an internal copy is done. >>>> >>>> Workaround: use data.table::setattr or bit::setattr to modify the dimensions in place (i.e., without making a copy). Risk: if you modify an object by reference, all other objects which point to the same memory address will be modified silently, too. >>>> >>>> HTH, >>>> Denes >>>> >>>> >>>> >>>> On 03/18/2016 10:28 PM, Roy Mendelssohn - NOAA Federal wrote: >>>>> Hi All: >>>>> >>>>> I am working with a very large array. if noLat is the number of latitudes, noLon the number of longitudes and noTime the number of time periods, the array is of the form: >>>>> >>>>> myData[noLat, no Lon, noTime]. >>>>> >>>>> It is read in this way because that is how it is stored in a (series) of netcdf files. For the analysis I need to do, I need instead the array: >>>>> >>>>> myData[noLat*noLon, noTime]. Normally this would be easy: >>>>> >>>>> myData<- array(myData,dim=c(noLat*noLon,noTime)) >>>>> >>>>> My question is how does this command work in R - does it make a copy of the existing array, with different indices for the dimensions, or does it just redo the indices and leave the given array as is? The reason for this question is my array is 30GB in memory, and I don?t have enough space to have a copy of the array in memory. If the latter I will have to figure out a work around to bring in only part of the data at a time and put it into the proper locations. >>>>> >>>>> Thanks, >>>>> >>>>> -Roy >>>>> >>>>> >>>>> >>>>> ********************** >>>>> "The contents of this message do not reflect any position of the U.S. Government or NOAA." >>>>> ********************** >>>>> Roy Mendelssohn >>>>> Supervisory Operations Research Analyst >>>>> NOAA/NMFS >>>>> Environmental Research Division >>>>> Southwest Fisheries Science Center >>>>> ***Note new address and phone*** >>>>> 110 Shaffer Road >>>>> Santa Cruz, CA 95060 >>>>> Phone: (831)-420-3666 >>>>> Fax: (831) 420-3980 >>>>> e-mail: Roy.Mendelssohn at noaa.gov www: http://www.pfeg.noaa.gov/ >>>>> >>>>> "Old age and treachery will overcome youth and skill." >>>>> "From those who have been given much, much will be expected" >>>>> "the arc of the moral universe is long, but it bends toward justice" -MLK Jr. >>>>> >>>>> ______________________________________________ >>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >>>>> https://stat.ethz.ch/mailman/listinfo/r-help >>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html >>>>> and provide commented, minimal, self-contained, reproducible code. >>>>> >>> >>> ********************** >>> "The contents of this message do not reflect any position of the U.S. Government or NOAA." >>> ********************** >>> Roy Mendelssohn >>> Supervisory Operations Research Analyst >>> NOAA/NMFS >>> Environmental Research Division >>> Southwest Fisheries Science Center >>> ***Note new address and phone*** >>> 110 Shaffer Road >>> Santa Cruz, CA 95060 >>> Phone: (831)-420-3666 >>> Fax: (831) 420-3980 >>> e-mail: Roy.Mendelssohn at noaa.gov www: http://www.pfeg.noaa.gov/ >>> >>> "Old age and treachery will overcome youth and skill." >>> "From those who have been given much, much will be expected" >>> "the arc of the moral universe is long, but it bends toward justice" -MLK Jr. >>> >>> ______________________________________________ >>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >>> https://stat.ethz.ch/mailman/listinfo/r-help >>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html >>> and provide commented, minimal, self-contained, reproducible code. > > ********************** > "The contents of this message do not reflect any position of the U.S. Government or NOAA." > ********************** > Roy Mendelssohn > Supervisory Operations Research Analyst > NOAA/NMFS > Environmental Research Division > Southwest Fisheries Science Center > ***Note new address and phone*** > 110 Shaffer Road > Santa Cruz, CA 95060 > Phone: (831)-420-3666 > Fax: (831) 420-3980 > e-mail: Roy.Mendelssohn at noaa.gov www: http://www.pfeg.noaa.gov/ > > "Old age and treachery will overcome youth and skill." > "From those who have been given much, much will be expected" > "the arc of the moral universe is long, but it bends toward justice" -MLK Jr. >
Roy Mendelssohn - NOAA Federal
2016-Mar-19 16:03 UTC
[R] Reshaping an array - how does it work in R
> On Mar 19, 2016, at 8:18 AM, Henrik Bengtsson <henrik.bengtsson at gmail.com> wrote: > > On Fri, Mar 18, 2016 at 8:28 PM, Roy Mendelssohn - NOAA Federal > <roy.mendelssohn at noaa.gov> wrote: >> Hi Henrik: >> >> I want to do want in oceanography is called an EOF, which is just a PCA analysis. Unless I am missing something, in R I need to flatten my 3-D matrix into a 2-D data matrix. I can fit the entire 30GB matrix into memory, and I believe I have enough memory to do the PCA by constraining the number of components returned . What I don?t think I have enough memory for is an operation that makes a copy of the matrix. >> >> As I said, in theory I know how to do the flattening, it a simple command, but in practice I don?t have enough memory. So I spent the afternoon rewriting my code to read in parts of the data at a time and then putting those in the appropriate places of a matrix already flattened of appropriate size. In case someone is wondering, on the 3D grid the matrix is [1001,1001,3650]. So I create an empty matrix size [1001*1001,3650], and read in a slice of the lat-lon grid, and map those into the appropriate places in the flattened matrix. By reading in appropriately sized chunks my memory usage is not pushed too far. > > Sounds good. There's another small caveat. Make sure to specify the > 'data' argument for matrix() we allocating an "empty" matrix, e.g. > > X <- matrix(NA_real_, nrow=1001*1001, ncol=3650) > > This will give you a double matrix with all missing value. If you use > the default > > X <- matrix(nrow=1001*1001, ncol=3650) > > you'll get a logical matrix, which will introduce a copy as soon as > you assign a double value (e.g. X[1,1] <- 3.14). The latter is a > complete waste of memory/time. See > http://www.jottr.org/2014/06/matrixNA-wrong-way.html for details. > > /HenrikThanks. Yes one time for some reason I can?t remember I did ?NA where that is documented but it is not something you would think of offhand. -Roy ********************** "The contents of this message do not reflect any position of the U.S. Government or NOAA." ********************** Roy Mendelssohn Supervisory Operations Research Analyst NOAA/NMFS Environmental Research Division Southwest Fisheries Science Center ***Note new address and phone*** 110 Shaffer Road Santa Cruz, CA 95060 Phone: (831)-420-3666 Fax: (831) 420-3980 e-mail: Roy.Mendelssohn at noaa.gov www: http://www.pfeg.noaa.gov/ "Old age and treachery will overcome youth and skill." "From those who have been given much, much will be expected" "the arc of the moral universe is long, but it bends toward justice" -MLK Jr.