Dear all,

for the past two weeks, I've been working on a script to retrieve word pairs and calculate some of their statistics using R. Everything seemed to work fine until I switched from a small test dataset to the 'real thing' and noticed what a runtime monster I had devised!

I could reduce processing time significantly when I realized that with R, I did not have to do everything in loops and count things vector element by vector element, but could just have the program count everything with tables, e.g. with

    freq.w1w2.2 <- table(all.word.pairs)[all.word.pairs]

However, now I seem to have run into a performance problem that I cannot solve. I hope there's a kind soul on this list who has some advice for me. On to the problem:

The last relic of the aforementioned for-loop that goes through all the word pairs and tries to calculate some statistics on them is the following line of code:

    typefreq.after1[i] <- length(unique(word2[which(word1 == word1[i])]))

(where word1 and word2 are the first and second words within the two-word sequences in all.word.pairs, above).

Here, I am trying to count the number of 'types', linguistically speaking, that occur after the first word of each two-word sequence (later, I do the analogous count for the distinct first words preceding each second word). The expression works, but given my ~400,000 word pairs (and hence word1's, word2's, etc.), this takes quite some time: about 10 hours on my machine, in fact, since R cannot use the other three of the four cores. Since I want to repeat the process for another 20 corpora of similar size, I would definitely appreciate some help on this subject.

I have been trying

    typefreq.after1 <- table(unique(word2[word1]))[2]

as well as the subset() function, and both seem to work (though I haven't checked whether all the numbers are in fact correctly calculated), but they take about the same amount of time. So that's no use for me.

Does anybody have any tips to speed this up?

Thank you very much!
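P.S. For reference, here is a minimal, self-contained version of the loop on toy data (the words are made up for illustration; the real vectors hold ~400,000 tokens):

    # toy token vectors standing in for the real data
    word1 <- c("the", "the", "a", "the")
    word2 <- c("cat", "dog", "cat", "cat")

    typefreq.after1 <- integer(length(word1))
    for (i in seq_along(word1)) {
      # number of distinct second words observed after this token's first word
      typefreq.after1[i] <- length(unique(word2[which(word1 == word1[i])]))
    }
    typefreq.after1  # 2 2 1 2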
Two ideas (haven't tried them):

1. If your data is in a data frame, did you try using the by() function? Seems it would do the grouping for you.

2. Since you mention the CPU cores, you could use libraries like foreach with %dopar%, or mclapply.

I would try 1. first and see if it provides a sufficient speed-up.

On Mon, Dec 8, 2014 at 9:21 PM, apeshifter <ch_koch at gmx.de> wrote:

> [...] The last relic of the aforementioned for-loop that goes through all
> the word pairs and tries to calculate some statistics on them is the
> following line of code:
>
>     typefreq.after1[i] <- length(unique(word2[which(word1 == word1[i])]))
>
> [...] given my ~400,000 word pairs (and hence word1's, word2's, etc.),
> this takes quite some time: about 10 hours on my machine, in fact, since
> R cannot use the other three of the four cores.
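P.S. Untested sketches of both ideas, assuming word1 and word2 are plain character vectors as in your post:

    # idea 1: let by() do the grouping, then map the per-group counts
    # back onto the tokens by name
    df <- data.frame(word1, word2, stringsAsFactors = FALSE)
    types.per.w1 <- by(df$word2, df$word1, function(x) length(unique(x)))
    typefreq.after1 <- as.vector(types.per.w1[df$word1])

    # idea 2: the same grouping, with the counting spread over the cores
    # (mclapply is in the 'parallel' package; it forks, so Unix-alikes only)
    library(parallel)
    types.per.w1 <- unlist(mclapply(split(word2, word1),
                                    function(x) length(unique(x)),
                                    mc.cores = 4))
    typefreq.after1 <- as.vector(types.per.w1[word1])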
The data.table package might be of use to you, but lacking a reproducible example [1] I think I will leave figuring out just how to you.

Being on Nabble you may not be able to see the footer appended to every message on this MAILING LIST. For your benefit, here it is:

* R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
* https://stat.ethz.ch/mailman/listinfo/r-help
* PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
* and provide commented, minimal, self-contained, reproducible code.

[1] http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example

On Mon, 8 Dec 2014, apeshifter wrote:

> [...] The last relic of the aforementioned for-loop [...] is the
> following line of code:
>
>     typefreq.after1[i] <- length(unique(word2[which(word1 == word1[i])]))

---------------------------------------------------------------------------
Jeff Newmiller <jdnewmil at dcn.davis.ca.us>
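P.S. The general shape of that approach, untested since I don't have your data, would be something like:

    library(data.table)
    dt <- data.table(word1, word2)
    # for each word1 group, count the distinct word2 values and attach
    # the result as a new per-token column (':=' modifies dt by reference)
    dt[, typefreq.after1 := length(unique(word2)), by = word1]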
On 8 Dec 2014, at 21:21, apeshifter <ch_koch at gmx.de> wrote:

> The last relic of the aforementioned for-loop that goes through all the
> word pairs and tries to calculate some statistics on them is the
> following line of code:
>
>     typefreq.after1[i] <- length(unique(word2[which(word1 == word1[i])]))
>
> (where word1 and word2 are the first and second words within the
> two-word sequences in all.word.pairs, above)

It is difficult to tell without a fully reproducible example, but from this code I get the impression that word1 and word2 represent word pair _tokens_ rather than pair _types_ (otherwise you wouldn't need the unique()). That's a very inefficient way of dealing with co-occurrence data, especially since you've already computed the set of pair types in order to get the co-occurrence counts.

If word1 and word2 are type vectors (i.e. every pair occurs just once), then this should give you what you want:

    tapply(BB$word2, BB$word1, length)

If they are token vectors, you need to supply your own type-counting function, which will be a bit slower:

    tapply(BB$word2, BB$word1, function (x) length(unique(x)))

On my machine, this takes about 0.2s for 770,000 word pairs.

BTW, you might want to take a look at Unit 4 of the SIGIL course

    http://sigil.r-forge.r-project.org/

which has some tips on how you can deal efficiently with co-occurrence data in R.

Best,
Stefan
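P.S. Note that tapply() returns a vector indexed by the word1 types, so if you need the counts per token again (to line up with your other vectors), you can map them back by name; a sketch, reusing the hypothetical data frame BB from above:

    types.per.w1 <- tapply(BB$word2, BB$word1, function (x) length(unique(x)))
    # the names of types.per.w1 are the distinct word1 values, so indexing
    # by the token vector expands the counts to one entry per token
    # (as.character() guards against word1 being stored as a factor)
    typefreq.after1 <- as.vector(types.per.w1[as.character(BB$word1)])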
Thank you all for your suggestions! I must say I am amazed by the number of people who are willing to help out another! Feels like it was a good idea to start using R. Back when I was still using Perl for such tasks, I'd have been happy to have this kind of support!

@Gheorghe Postelnicu: Unfortunately, the data is not yet in a data frame when this part of the program starts. At this point, I am still filling in all the relevant vectors (all.word.pairs, word1, word2, freq.word1, freq.word2, typefreq.w1, typefreq.w2, ...) and will only then combine them into a data frame. I will try to get my head around the doParallel package for the foreach loop, since parallel computing would certainly be helpful.

@Jeff Newmiller: Sounds interesting, but I fear the same problem applies as with Gheorghe's suggestion: I would need a data frame first, and I do not yet have all the correct values for it... Will keep the package in mind, though, for future projects.

@Stefan Evert-3: I am not sure I understand what you mean in the second example. Since the counting of types is exactly my problem at the moment, I do not see how I could provide a function that would work more efficiently in the context you are describing. The line of code I gave is exactly my attempt at doing this... Sorry, I might just not be getting what you are aiming at... :-/

However, your assumptions are quite correct: word1 and word2 do indeed contain word tokens, as does all.word.pairs. The reason for this is that I need the word pairs within the vector to be in the same order as they appeared in the original corpus files.

Also, thank you for the link. I will check it out when I am analysing collocates, although I didn't find notes on my specific problem in the slides.

Please do not think I was not using reference material when designing my script. I was in fact working with Gries 2009, "Quantitative Corpus Linguistics with R" <http://www.amazon.de/Quantitative-Corpus-Linguistics-Practical-Introduction-ebook/dp/B001Y35H5A/ref=sr_1_1?ie=UTF8&qid=1418119630&sr=8-1&keywords=gries+quantitative+corpus+linguistics>. The trouble is that the methods in the book help as far as simple n-gram frequency calculations are concerned (since, e.g., table() would just do the trick), but methods for repeated checks on tables at this scale are not included.

Best,
Christopher
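P.S. In case it makes my setup clearer, this is (much simplified) how the vectors are built before any data frame exists; 'tokens' here is just a stand-in for the real corpus input:

    tokens <- c("the", "cat", "sat", "on", "the", "mat")  # stand-in corpus
    word1 <- tokens[-length(tokens)]  # first word of each adjacent pair
    word2 <- tokens[-1]               # second word of each adjacent pair
    all.word.pairs <- paste(word1, word2)
    # token frequency of each pair, in corpus order (the table() trick)
    freq.w1w2.2 <- table(all.word.pairs)[all.word.pairs]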