Magnus Thor Torfason
2013-Nov-01 13:32 UTC
[R] Inserting 17M entries into env took 18h, inserting 34M entries taking 5+ days
Pretty much what the subject says:

I used an env as the basis for a hash table in R, based on information that this is in fact the way environments are implemented under the hood.

I've been experimenting with doubling the number of entries, and so far it has seemed to be scaling more or less linearly, as expected. But as I went from 17 million entries to 34 million entries, the completion time has gone from 18 hours to 5 days and counting.

The keys and values are in all cases strings of equal length.

One might suspect that the slow-down has to do with memory being swapped to disk, but from what I know about my computing environment, that should not be the case.

So my first question: Is anyone familiar with anything in the implementation of environments that would limit their use or slow them down (faster than O(n log(n))) as the number of entries is increased?

And my second question: I realize that this is not strictly what R environments were designed for, but it is what my algorithm requires: I must go through these millions of entries, storing them in the hash table and sometimes retrieving them along the way, in a more or less random manner that is contingent on the data I am encountering and on the contents of the hash table at each moment.

Does anyone have a good recommendation for alternatives for implementing huge, fast, table-like structures in R?

Best,
Magnus
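P.S. In outline, the pattern is essentially the following (a minimal, self-contained sketch with made-up names and a toy size, not my actual script):

    ## Environments hash their bindings; hash = TRUE is the default for
    ## new.env(), and size is a hint for the initial hash table size.
    ht <- new.env(hash = TRUE, size = 1e6L)

    n      <- 1e6L                               # toy size; real runs are 17M / 34M
    keys   <- sprintf("key%08d", seq_len(n))     # fixed-length string keys
    values <- sprintf("val%08d", seq_len(n))     # fixed-length string values

    system.time(
      for (i in seq_len(n)) {
        assign(keys[i], values[i], envir = ht)   # insert
        if (i %% 100L == 0L) {                   # occasional, data-dependent lookup
          v <- get(keys[sample.int(i, 1L)], envir = ht)
        }
      }
    )

Timing runs like this at n, 2n, 4n, ... entries is how I have been checking the scaling.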
jim holtman
2013-Nov-01 13:49 UTC
[R] Inserting 17M entries into env took 18h, inserting 34M entries taking 5+ days
It would be nice if you followed the posting guidelines and at least showed the script that is creating your entries, so that we understand the problem you are trying to solve. A bit more explanation of why you want this would be useful. This gets to the second part of my tag line: tell me what you want to do, not how you want to do it. There may be other solutions to your problem.

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.

On Fri, Nov 1, 2013 at 9:32 AM, Magnus Thor Torfason <zulutime.net at gmail.com> wrote:
> Pretty much what the subject says: I used an env as the basis for a
> hash table in R, based on information that this is in fact the way
> environments are implemented under the hood. [...]
Magnus Thor Torfason
2013-Nov-04 11:19 UTC
[R] Inserting 17M entries into env took 18h, inserting 34M entries taking 5+ days
There are around 16M unique values. After accounting for equivalence, the number is much smaller (I don't know how much smaller, since my program has not completed yet :-)

Yes, I meant that "B and C are also equivalent". The original version was a typo.

Best,
Magnus

On 11/1/2013 3:45 PM, jim holtman wrote:
> In the 20M pairs, how many unique values are there? In your statement
> above, "But equivalence is transitive, so if A and B occur together in
> one pair, and A and C occur together in another pair, then A and C are
> also equivalent.", did you mean that "B and C are also equivalent"?
>
> Jim Holtman
> Data Munger Guru
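For reference, the equivalence computation amounts to a classic union-find (disjoint sets) over the pairs. A minimal sketch of that idea, assuming the string keys have already been mapped to integer ids (all names and sizes below are illustrative, not my actual code):

    ## Each element starts as its own root; parent holds the union-find forest.
    parent <- seq_len(5L)                 # toy universe of 5 elements

    find <- function(x) {
      while (parent[x] != x) {
        parent[x] <<- parent[parent[x]]   # path halving keeps the trees shallow
        x <- parent[x]
      }
      x
    }

    unite <- function(a, b) {             # merge the classes containing a and b
      ra <- find(a); rb <- find(b)
      if (ra != rb) parent[rb] <<- ra
    }

    pairs <- rbind(c(1L, 2L), c(1L, 3L), c(4L, 5L))   # toy pairs: A-B, A-C, D-E
    for (i in seq_len(nrow(pairs))) unite(pairs[i, 1L], pairs[i, 2L])

    find(2L) == find(3L)   # TRUE: B and C end up in the same class by transitivity

With 16M unique values, parent is a single integer vector of length 16M, which fits comfortably in memory.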