Eitan Rubin
2007-Apr-18 03:44 UTC
[Rd] Performing Merge and Duplicated on very large files
Hi,

I am working with very large matrices (>1 million records), and need to

1. Join the files (can be achieved with merge()).
2. Find lines that have the same value in some field (after the join)
   and randomly sample 1 row from each such group.

I am concerned with the complexity of merge() - how (un)efficient is
it? I don't have access to the real data; I need to send the script to
someone who does, so I can't just try and see what happens.

Similarly, I am worried about the duplicated() function - will it run
on the merged matrix? It is expected to be ~500,000 rows long, with
small clusters of duplicated values (1-10 repeats of the same value).

ER

- - - - - -
Eitan Rubin
Dept. of Microbiology and Immunology
AND
National Institute of Biotechnology in the Negev
Ben Gurion University
Beer Sheva, Israel
Phone: 08-6479197
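PS: To make step 2 concrete, here is roughly what I have in mind on toy
data; the column names ("id", "field") are made up, and the real tables
are of course much larger:

set.seed(1)
a <- data.frame(id = 1:10, x = rnorm(10))
b <- data.frame(id = sample(10, 15, replace = TRUE),
                field = sample(letters[1:4], 15, replace = TRUE))

## step 1: the join
m <- merge(a, b, by = "id")

## step 2: keep one randomly chosen row per value of "field";
## the length-1 guard avoids sample()'s surprising scalar behaviour
pick <- function(i) if (length(i) == 1L) i else sample(i, 1L)
keep <- unlist(tapply(seq_len(nrow(m)), m$field, pick))
sampled <- m[keep, ]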
On Tuesday 17 April 2007 23:44, Eitan Rubin wrote:
> Hi,
>
> I am working with very large matrices (>1 million records), and need to
> 1. Join the files (can be achieved with merge()).
> 2. Find lines that have the same value in some field (after the join)
>    and randomly sample 1 row from each such group.
>
> I am concerned with the complexity of merge() - how (un)efficient is
> it? I don't have access to the real data; I need to send the script to
> someone who does, so I can't just try and see what happens.
>
> Similarly, I am worried about the duplicated() function - will it run
> on the merged matrix? It is expected to be ~500,000 rows long, with
> small clusters of duplicated values (1-10 repeats of the same value).

Eitan,

This is a question better asked on R-help. You will need to test,
perhaps with simulated data, but merge() and certainly duplicated() can
typically be used on 1M rows. However, you imply that you have several
such tables. You might also consider using a SQL database for your
data. RSQLite is a self-contained database and gives you the power of
SQL with little hassle.

Sean
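P.S. To get a feel for the timings, you (or the person with the data)
can simulate tables of about the right size; the sizes and key
distribution below are guesses, not your real data:

set.seed(1)
n <- 1e6
left  <- data.frame(key = sample(n), v1 = runif(n))
right <- data.frame(key = sample(n), v2 = runif(n))

system.time(m <- merge(left, right, by = "key"))  # time the join
system.time(dups <- duplicated(m$key))            # duplicate detection

And the RSQLite route would look roughly like this (table and column
names invented):

library(RSQLite)
con <- dbConnect(SQLite(), dbname = ":memory:")   # or a file on disk
dbWriteTable(con, "t_left",  left)
dbWriteTable(con, "t_right", right)
m2 <- dbGetQuery(con,
    "SELECT l.key, l.v1, r.v2
       FROM t_left AS l JOIN t_right AS r ON l.key = r.key")
dbDisconnect(con)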