[Sending it again in plain text mode]
Greetings,
We have a fairly large dataset (around 60GB) that needs to be loaded and
crunched in real time. The operations performed on this data are simple
read-only aggregates after filtering the data.table instance based on
parameters passed in at query time. We need more than one such R process
running to serve different testing environments (each testing environment
has a fairly identical dataset, but with a *small amount of changes*). As
we all know, data.table loads the entire dataset into memory for
processing, and hence we face a constraint on the number of such processes
we can run on one machine. On a 128GB RAM machine, we are looking for ways
to reduce the memory footprint so that we can spawn more instances and use
the resources efficiently. One approach we tried was memory de-duplication
using UKSM (http://kerneldedup.org/en/projects/uksm/introduction), given
that we had a few idle CPU cores. The outcome of the experiment was quite
impressive, considering that the effort to set it up was small and the
approach treats the application layer as a black box.
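
To make the workload concrete, here is a minimal sketch of the kind of
query we run; the column names (env_id, region, value) are made up for
illustration and the real schema differs:

    library(data.table)

    # Toy stand-in for the real ~60GB table.
    dt <- data.table(
      env_id = sample(1:5, 1e6, replace = TRUE),
      region = sample(c("us", "eu", "apac"), 1e6, replace = TRUE),
      value  = rnorm(1e6)
    )

    # Keying the table lets filters use binary search rather than a full scan.
    setkey(dt, env_id, region)

    # Read-only aggregate after filtering on parameters passed in at query time.
    query <- function(env, reg) {
      dt[.(env, reg), .(total = sum(value), avg = mean(value))]
    }

    query(3, "eu")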
Quick snapshot of the results:
1 instance (without UKSM):  ~60GB RAM used
1 instance (with UKSM):     ~53GB RAM used

2 instances (without UKSM): ~125GB RAM used
2 instances (with UKSM):    ~81GB RAM used
We can see that around 44GB of RAM was saved once UKSM merged similar
pages, and all this at the cost of about one CPU core on a 48-core
machine. We did not notice any degradation in performance, because the
data is refreshed by a batch job only once every morning; UKSM kicks in
at that point to do its page merging, and for the rest of the day it is
just read-only analysis. The queries we fire scan at most 2-3GB of the
dataset, so the per-query memory spike was low as well.
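
As a side note, one quick way to spot-check an instance's resident memory
from inside R on Linux is to read /proc/self/status; a minimal sketch:

    # Print this R process's resident set size (Linux only).
    status <- readLines("/proc/self/status")  # per-process memory counters
    cat(grep("^VmRSS", status, value = TRUE), sep = "\n")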
We're interested in knowing whether this is a plausible solution to the
problem. Are there any other points/solutions we should be considering?
--
Thanks,
M. Varadharajan
------------------------------------------------
"Experience is what you get when you didn't get what you wanted"
-By Prof. Randy Pausch in "The Last Lecture"
My Journal :- www.thinkasgeek.wordpress.com