Hi,

I have got a lot of SPSS data for the years 1993-2010. I load all the data into lists so I can easily index the values over the years. Unfortunately, the loaded data occupy quite a lot of memory (10 GB), so my question is: what's the best approach to working with big data files? Can R read a value from a data file without loading it fully into memory? How can a slower computer without enough memory work with such data?

I use the following commands:

data1993 = vector("list", 4)
data1993[[1]] = read.spss(...)  # first trimester
data1993[[2]] = read.spss(...)  # second trimester
...
data_all = vector("list", 17)
data_all[[1993]] = data1993
...

and indexing, e.g.: data_all[[1993]][[1]]$DISTRICT, etc.

Thanks,
Petr Kurtin
On Monday, 30 January 2012, at 09:54 +0100, Petr Kurtin wrote:

Have a look at the "Large memory and out-of-memory data" section of the High Performance Computing task view [1]. In particular, you may want to use the "ff" package and its ffdf object, which allows backing a data frame by a file on disk so that RAM can be freed when needed.

Another piece of advice I'd give you is to convert the data from SPSS format to .RData once, and to always use the latter. In my experience, importing often creates memory fragmentation, in addition to being very slow (don't hesitate to save, quit and restart R to reduce this problem).

What use do you make of the different years? If you need, e.g., to run a model on all of them at the same time, then you'll need to concatenate all the data frames from the "data_all" list, and I guess that's where RAM will be the problem: you'll have two copies of the data at the same time. Once you've succeeded in doing this, loading the full data set will use less RAM, and so may work on lower-end computers.

A general solution is also to load only the variables you really need. The "saves" package allows you to save the whole data set into an archive of several .RData files, and to load only what you want from it.

It all depends on your needs, constraints, and failed attempts. ;-)

Regards

1: http://cran.r-project.org/web/views/HighPerformanceComputing.html
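To make the convert-once advice concrete, here is a minimal sketch; the file names ("y1993q1.sav", "y1993q1.RData") are placeholders, and the commented "saves" lines are an untested approximation of that package's interface:

library(foreign)  # read.spss()
library(ff)       # as.ffdf(): file-backed data frames

# Import once (slow), then store as .RData (fast to reload):
d <- read.spss("y1993q1.sav", to.data.frame = TRUE)
save(d, file = "y1993q1.RData")

# In later sessions, skip read.spss() entirely:
load("y1993q1.RData")   # restores the object 'd'

# Optionally back the data frame by files on disk, so columns stay
# out of RAM until touched:
fd <- as.ffdf(d)
head(fd$DISTRICT[])     # [] pulls only this column into memory

# The "saves" package stores each variable as its own .RData file
# inside one archive; roughly (untested, see ?saves and ?loads):
# library(saves)
# saves(d, file = "y1993q1.tar")
# loads(file = "y1993q1.tar", variables = c("DISTRICT"))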
This won't help with the large memory issues, but just a pointer: when you start to construct data_all with these commands,

data_all = vector("list", 17)
data_all[[1993]] = data1993

the first line pre-allocates a list of length 17, but the second assigns to slot 1993, which forces R to grow the list to length 1993 (look at length(data_all) afterwards). You'd be better off in general with something like this:

data_all <- vector("list", 18)   # note: 1993-2010 is 18 years, not 17
names(data_all) <- 1993:2010
data_all[["1993"]] <- data1993

etc., which creates a list with components named after the years. If you want to automate that last bit over each year, this would work:

for (yr in 1993:2010) {
  data_all[[as.character(yr)]] <- get(paste("data", yr, sep = ""))
}

It's also been pointed out to me that the Oarray package allows one to start indexing at an arbitrary point (e.g., 1993 for the first slot), which might be helpful for managing your data_all object.

Michael
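With the named list above, the lookup from the original question only changes from a numeric index to a character one; a minimal usage sketch (assuming data1993 holds the four trimester data frames as in the question):

data_all[["1993"]][[1]]$DISTRICT  # same lookup as data_all[[1993]][[1]]$DISTRICT,
                                  # without growing the list to length 1993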