Dear all,

for the past two weeks, I've been working on a script to retrieve word pairs and calculate some of their statistics using R. Everything seemed to work fine until I switched from a small test dataset to the 'real thing' and noticed what a runtime monster I had devised!

I could reduce processing time significantly when I realized that with R, I did not have to do everything in loops and count things vector element by vector element, but could just have the program count everything with tables, e.g. with

    freq.w1w2.2 <- table(all.word.pairs)[all.word.pairs]

However, now I seem to have run into a performance problem that I cannot solve. I hope there's a kind soul on this list who has some advice for me. On to the problem:

The last relic of the aforementioned for-loop that goes through all the word pairs and tries to calculate some statistics on them is the following line of code:

    typefreq.after1[i] <- length(unique(word2[which(word1 == word1[i])]))

(where word1 and word2 are the first and second words within the two-word sequences in all.word.pairs, above).

Here, I am trying to count the number of 'types', linguistically speaking, that occur after the first word of each two-word sequence (later, I do the analogous count for the distinct first words preceding each second word). The expression works, but given my ~400,000 word pairs (and hence word1's, word2's, etc.), this takes quite some time: about 10 hours on my machine, in fact, since R cannot use the other three of the four cores. Since I want to repeat the process for another 20 corpora of similar size, I would definitely appreciate some help on this subject.

I have been trying

    typefreq.after1 <- table(unique(word2[word1]))[2]

as well as the subset() function, and both seem to work (though I haven't checked whether all the numbers are in fact correctly calculated), but they take about the same amount of time. So that's no use for me.

Does anybody have any tips to speed this up?

Thank you very much!
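P.S. For reference, here is a minimal, self-contained version of the loop on toy data (the words are made up for illustration; the real vectors hold ~400,000 tokens):

    # toy token vectors standing in for the real data
    word1 <- c("the", "the", "a", "the")
    word2 <- c("cat", "dog", "cat", "cat")

    typefreq.after1 <- integer(length(word1))
    for (i in seq_along(word1)) {
      # number of distinct second words observed after this token's first word
      typefreq.after1[i] <- length(unique(word2[which(word1 == word1[i])]))
    }
    typefreq.after1  # 2 2 1 2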
Two ideas (haven't tried them):

1. If your data is in a data frame, did you try using the by() function? Seems it would do the grouping for you.

2. Since you mention the CPU cores, you could use libraries like foreach with %dopar%, or mclapply.

I would try 1. first and see if it provides a sufficient speed-up.

On Mon, Dec 8, 2014 at 9:21 PM, apeshifter <ch_koch at gmx.de> wrote:

> [...] The last relic of the aforementioned for-loop that goes through all
> the word pairs and tries to calculate some statistics on them is the
> following line of code:
>
>     typefreq.after1[i] <- length(unique(word2[which(word1 == word1[i])]))
>
> [...] given my ~400,000 word pairs (and hence word1's, word2's, etc.),
> this takes quite some time: about 10 hours on my machine, in fact, since
> R cannot use the other three of the four cores.
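P.S. Untested sketches of both ideas, assuming word1 and word2 are plain character vectors as in your post:

    # idea 1: let by() do the grouping, then map the per-group counts
    # back onto the tokens by name
    df <- data.frame(word1, word2, stringsAsFactors = FALSE)
    types.per.w1 <- by(df$word2, df$word1, function(x) length(unique(x)))
    typefreq.after1 <- as.vector(types.per.w1[df$word1])

    # idea 2: the same grouping, with the counting spread over the cores
    # (mclapply is in the 'parallel' package; it forks, so Unix-alikes only)
    library(parallel)
    types.per.w1 <- unlist(mclapply(split(word2, word1),
                                    function(x) length(unique(x)),
                                    mc.cores = 4))
    typefreq.after1 <- as.vector(types.per.w1[word1])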
The data.table package might be of use to you, but lacking a reproducible example [1] I think I will leave figuring out just how to you.

Being on Nabble you may not be able to see the footer appended to every message on this MAILING LIST. For your benefit, here it is:

* R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
* https://stat.ethz.ch/mailman/listinfo/r-help
* PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
* and provide commented, minimal, self-contained, reproducible code.

[1] http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example

On Mon, 8 Dec 2014, apeshifter wrote:

> [...] The last relic of the aforementioned for-loop [...] is the
> following line of code:
>
>     typefreq.after1[i] <- length(unique(word2[which(word1 == word1[i])]))

---------------------------------------------------------------------------
Jeff Newmiller <jdnewmil at dcn.davis.ca.us>
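P.S. The general shape of that approach, untested since I don't have your data, would be something like:

    library(data.table)
    dt <- data.table(word1, word2)
    # for each word1 group, count the distinct word2 values and attach
    # the result as a new per-token column (':=' modifies dt by reference)
    dt[, typefreq.after1 := length(unique(word2)), by = word1]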
On 8 Dec 2014, at 21:21, apeshifter <ch_koch at gmx.de> wrote:

> The last relic of the aforementioned for-loop that goes through all the
> word pairs and tries to calculate some statistics on them is the
> following line of code:
>
>     typefreq.after1[i] <- length(unique(word2[which(word1 == word1[i])]))
>
> (where word1 and word2 are the first and second words within the
> two-word sequences in all.word.pairs, above)

It is difficult to tell without a fully reproducible example, but from this code I get the impression that word1 and word2 represent word pair _tokens_ rather than pair _types_ (otherwise you wouldn't need the unique()). That's a very inefficient way of dealing with co-occurrence data, especially since you've already computed the set of pair types in order to get the co-occurrence counts.

If word1 and word2 are type vectors (i.e. every pair occurs just once), then this should give you what you want:

    tapply(BB$word2, BB$word1, length)

If they are token vectors, you need to supply your own type-counting function, which will be a bit slower:

    tapply(BB$word2, BB$word1, function (x) length(unique(x)))

On my machine, this takes about 0.2s for 770,000 word pairs.

BTW, you might want to take a look at Unit 4 of the SIGIL course

    http://sigil.r-forge.r-project.org/

which has some tips on how you can deal efficiently with co-occurrence data in R.

Best,
Stefan
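P.S. Note that tapply() returns a vector indexed by the word1 types, so if you need the counts per token again (to line up with your other vectors), you can map them back by name; a sketch, reusing the hypothetical data frame BB from above:

    types.per.w1 <- tapply(BB$word2, BB$word1, function (x) length(unique(x)))
    # the names of types.per.w1 are the distinct word1 values, so indexing
    # by the token vector expands the counts to one entry per token
    # (as.character() guards against word1 being stored as a factor)
    typefreq.after1 <- as.vector(types.per.w1[as.character(BB$word1)])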
Thank you all for your suggestions! I must say I am amazed by the number of people who are willing to help out another! Feels like it was a good idea to start using R. Back when I was still using Perl for such tasks, I'd have been happy to have this kind of support!

@Gheorghe Postelnicu: Unfortunately, the data is not yet in a data frame when this part of the program starts. At this point, I am still filling in all the relevant vectors (all.word.pairs, word1, word2, freq.word1, freq.word2, typefreq.w1, typefreq.w2, ...) and will only then combine them into a data frame. I will try to get my head around the doParallel package for the foreach loop, since parallel computing would certainly be helpful.

@Jeff Newmiller: Sounds interesting, but I fear the same problem applies as with Gheorghe's suggestion: I would need a data frame first, and I do not yet have all the correct values for it... Will keep the package in mind, though, for future projects.

@Stefan Evert-3: I am not sure I understand what you mean in the second example. Since the counting of types is exactly my problem at the moment, I do not see how I could provide a function that would work more efficiently in the context you are describing. The line of code I gave is exactly my attempt at doing this... Sorry, I might just not be getting what you are aiming at... :-/

However, your assumptions are quite correct: word1 and word2 do indeed contain word tokens, as does all.word.pairs. The reason for this is that I need the word pairs within the vector to be in the same order as they appeared in the original corpus files.

Also, thank you for the link. I will check it out when I am analysing collocates, although I didn't find notes on my specific problem in the slides.

Please do not think I was not using reference material when designing my script. I was in fact working with Gries 2009, "Quantitative Corpus Linguistics with R" <http://www.amazon.de/Quantitative-Corpus-Linguistics-Practical-Introduction-ebook/dp/B001Y35H5A/ref=sr_1_1?ie=UTF8&qid=1418119630&sr=8-1&keywords=gries+quantitative+corpus+linguistics>. The trouble is that the methods in the book help as far as simple n-gram frequency calculations are concerned (since, e.g., table() would just do the trick), but methods for repeated checks on tables at this scale are not included.

Best,
Christopher
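P.S. In case it makes my setup clearer, this is (much simplified) how the vectors are built before any data frame exists; 'tokens' here is just a stand-in for the real corpus input:

    tokens <- c("the", "cat", "sat", "on", "the", "mat")  # stand-in corpus
    word1 <- tokens[-length(tokens)]  # first word of each adjacent pair
    word2 <- tokens[-1]               # second word of each adjacent pair
    all.word.pairs <- paste(word1, word2)
    # token frequency of each pair, in corpus order (the table() trick)
    freq.w1w2.2 <- table(all.word.pairs)[all.word.pairs]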