Fabien Tarrade
2016-Apr-10  19:27 UTC
[R] what is the faster way to search for a pattern in a few million entries data frame ?
Hi Duncan,> Didn't you post the same question yesterday? Perhaps nobody answered > because your question is unanswerable.sorry, I got a email that my message was waiting for approval and when I look at the forum I didn't see my message and this is why I sent it again and this time I did check that the format of my message was text only. Sorry for the noise.> You need to describe what the strings are like and what the patterns > are like if you want advice on speeding things up.my strings are 1-gram up to 5-grams (sequence of 1 work up to 5 words) and I am searching for the frequency in my DF of the strings starting with a sequence of few words. I guess these days it is standard to use DF with millions of entries so I was wondering how people are doing that in the faster way. Thanks Cheers Fabien -- Dr Fabien Tarrade Quantitative Analyst/Developer - Data Scientist Senior data analyst specialised in the modelling, processing and statistical treatment of data. PhD in Physics, 10 years of experience as researcher at the forefront of international scientific research. Fascinated by finance and data modelling. Geneva, Switzerland Email : <mailto:contact at fabien-tarrade.eu>contact at fabien-tarrade.eu Phone : <http://www.fabien-tarrade.eu>www.fabien-tarrade.eu Phone : +33 (0)6 14 78 70 90 LinkedIn <http://ch.linkedin.com/in/fabientarrade/> Twitter <https://twitter.com/fabtar> Google <https://plus.google.com/+FabienTarradeProfile/posts> Facebook <https://www.facebook.com/fabien.tarrade.eu> Google <skype:fabtarhiggs?call> Xing <https://www.xing.com/profile/Fabien_Tarrade>
Bert Gunter
2016-Apr-10  22:41 UTC
[R] what is the faster way to search for a pattern in a few million entries data frame ?
Fabien: I was unable to make any sense of your latest response (maybe I'm just dense). If others have similar difficulties, and you fail to get a satisfactory response, I suggest that you read and follow the posting guide's request for a **small, reproducible example* (perhaps the first few dozen rows of your data frame) in which you show the code you tried and your desired result. Cheers, Bert Bert Gunter "The trouble with having an open mind is that people keep coming along and sticking things into it." -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip ) On Sun, Apr 10, 2016 at 12:27 PM, Fabien Tarrade <fabien.tarrade at gmail.com> wrote:> Hi Duncan, >> >> Didn't you post the same question yesterday? Perhaps nobody answered >> because your question is unanswerable. > > sorry, I got a email that my message was waiting for approval and when I > look at the forum I didn't see my message and this is why I sent it again > and this time I did check that the format of my message was text only. Sorry > for the noise. >> >> You need to describe what the strings are like and what the patterns are >> like if you want advice on speeding things up. > > my strings are 1-gram up to 5-grams (sequence of 1 work up to 5 words) and I > am searching for the frequency in my DF of the strings starting with a > sequence of few words. > > I guess these days it is standard to use DF with millions of entries so I > was wondering how people are doing that in the faster way. > > Thanks > Cheers > Fabien > > -- > Dr Fabien Tarrade > > Quantitative Analyst/Developer - Data Scientist > > Senior data analyst specialised in the modelling, processing and statistical > treatment of data. > PhD in Physics, 10 years of experience as researcher at the forefront of > international scientific research. > Fascinated by finance and data modelling. > > Geneva, Switzerland > > Email : <mailto:contact at fabien-tarrade.eu>contact at fabien-tarrade.eu > Phone : <http://www.fabien-tarrade.eu>www.fabien-tarrade.eu > Phone : +33 (0)6 14 78 70 90 > > LinkedIn <http://ch.linkedin.com/in/fabientarrade/> Twitter > <https://twitter.com/fabtar> Google > <https://plus.google.com/+FabienTarradeProfile/posts> Facebook > <https://www.facebook.com/fabien.tarrade.eu> Google <skype:fabtarhiggs?call> > Xing <https://www.xing.com/profile/Fabien_Tarrade> > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.
Jim Lemon
2016-Apr-10  22:51 UTC
[R] what is the faster way to search for a pattern in a few million entries data frame ?
Hi Fabien,
I was going to send this last night, but I thought it was too simple.
Runs in about one millisecond.
df<-data.frame(freq=runif(1000),
 strings=apply(matrix(sample(LETTERS,10000,TRUE),ncol=10),
 1,paste,collapse=""))
match.ind<-grep("DF",df$strings)
match.ind
 [1]   2  11  91 133 169 444 547 605 734 943
Jim
On Mon, Apr 11, 2016 at 5:27 AM, Fabien Tarrade
<fabien.tarrade at gmail.com> wrote:> Hi Duncan,
>>
>> Didn't you post the same question yesterday?  Perhaps nobody
answered
>> because your question is unanswerable.
>
> sorry, I got a email that my message was waiting for approval and when I
> look at the forum I didn't see my message and this is why  I sent it
again
> and this time I did check that the format of my message was text only.
Sorry
> for the noise.
>>
>> You need to describe what the strings are like and what the patterns
are
>> like if you want advice on speeding things up.
>
> my strings are 1-gram up to 5-grams (sequence of 1 work up to 5 words) and
I
> am searching for the frequency in my DF of the strings starting with a
> sequence of few words.
>
> I guess these days it is standard to use DF with millions of entries so I
> was wondering how people are doing that in the faster way.
>
> Thanks
> Cheers
> Fabien
>
> --
> Dr Fabien Tarrade
>
> Quantitative Analyst/Developer - Data Scientist
>
> Senior data analyst specialised in the modelling, processing and
statistical
> treatment of data.
> PhD in Physics, 10 years of experience as researcher at the forefront of
> international scientific research.
> Fascinated by finance and data modelling.
>
> Geneva, Switzerland
>
> Email : <mailto:contact at fabien-tarrade.eu>contact at
fabien-tarrade.eu
> Phone : <http://www.fabien-tarrade.eu>www.fabien-tarrade.eu
> Phone : +33 (0)6 14 78 70 90
>
> LinkedIn <http://ch.linkedin.com/in/fabientarrade/> Twitter
> <https://twitter.com/fabtar> Google
> <https://plus.google.com/+FabienTarradeProfile/posts> Facebook
> <https://www.facebook.com/fabien.tarrade.eu> Google
<skype:fabtarhiggs?call>
> Xing <https://www.xing.com/profile/Fabien_Tarrade>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
Fabien Tarrade
2016-Apr-10  22:59 UTC
[R] what is the faster way to search for a pattern in a few million entries data frame ?
Hi Jim, I didn't know this one. I will have a look. Thanks Cheers Fabien> Hi Fabien, > I was going to send this last night, but I thought it was too simple. > Runs in about one millisecond. > > df<-data.frame(freq=runif(1000), > strings=apply(matrix(sample(LETTERS,10000,TRUE),ncol=10), > 1,paste,collapse="")) > match.ind<-grep("DF",df$strings) > match.ind > [1] 2 11 91 133 169 444 547 605 734 943 > > Jim-- Dr Fabien Tarrade Quantitative Analyst/Developer - Data Scientist Senior data analyst specialised in the modelling, processing and statistical treatment of data. PhD in Physics, 10 years of experience as researcher at the forefront of international scientific research. Fascinated by finance and data modelling. Geneva, Switzerland Email : contact at fabien-tarrade.eu <mailto:contact at fabien-tarrade.eu> Phone : www.fabien-tarrade.eu <http://www.fabien-tarrade.eu> Phone : +33 (0)6 14 78 70 90 LinkedIn <http://ch.linkedin.com/in/fabientarrade/> Twitter <https://twitter.com/fabtar> Google <https://plus.google.com/+FabienTarradeProfile/posts> Facebook <https://www.facebook.com/fabien.tarrade.eu> Google <skype:fabtarhiggs?call> Xing <https://www.xing.com/profile/Fabien_Tarrade>
Martin Morgan
2016-Apr-10  23:32 UTC
[R] what is the faster way to search for a pattern in a few million entries data frame ?
On 04/10/2016 03:27 PM, Fabien Tarrade wrote:> Hi Duncan, >> Didn't you post the same question yesterday? Perhaps nobody answered >> because your question is unanswerable. > sorry, I got a email that my message was waiting for approval and when I > look at the forum I didn't see my message and this is why I sent it > again and this time I did check that the format of my message was text > only. Sorry for the noise. >> You need to describe what the strings are like and what the patterns >> are like if you want advice on speeding things up. > my strings are 1-gram up to 5-grams (sequence of 1 work up to 5 words) > and I am searching for the frequency in my DF of the strings starting > with a sequence of few words. > > I guess these days it is standard to use DF with millions of entries so > I was wondering how people are doing that in the faster way.I did this to generate and search 40 million unique strings > grams <- as.character(1:4e7) ## a long time passes... > system.time(grep("^900001", grams)) ## similar times to grepl user system elapsed 10.384 0.168 10.543 Is that the basic task you're trying to accomplish? grep(l) goes quickly to C, so I don't think data.table or other will be markedly faster if you're looking for an arbitrary regular expression (use fixed=TRUE if looking for an exact match). If you're looking for strings that start with a pattern, then in R-3.3.0 there is > system.time(res0 <- startsWith(grams, "900001")) user system elapsed 0.658 0.012 0.669 which returns the same result as grepl > identical(res0, res1 <- grepl("^900001", grams)) [1] TRUE One can also parallelize the already vectorized grepl function with parallel::pvec, with some opportunity for gain (compared to grepl) on non-Windows > system.time(res2 <- pvec(seq_along(grams), function(i) grepl("^900001", grams[i]), mc.cores=8)) user system elapsed 24.996 1.709 3.974 > identical(res0, res2) [[1]] TRUE I think anything else would require pre-processing of some kind, and then some more detail about what your data looks like is required. Martin Morgan> > Thanks > Cheers > Fabien >This email message may contain legally privileged and/or confidential information. If you are not the intended recipient(s), or the employee or agent responsible for the delivery of this message to the intended recipient(s), you are hereby notified that any disclosure, copying, distribution, or use of this email message is prohibited. If you have received this message in error, please notify the sender immediately by e-mail and delete this email message from your computer. Thank you.
Fabien Tarrade
2016-Apr-29  19:03 UTC
[R] what is the faster way to search for a pattern in a few million entries data frame ?
Hi Martin and everybody, sorry for the long delay. Thanks for all the suggestions. With my code and my training data I found similar numbers to the one below. Thanks Cheers Fabien> I did this to generate and search 40 million unique strings > > > grams <- as.character(1:4e7) ## a long time passes... > > system.time(grep("^900001", grams)) ## similar times to grepl > user system elapsed > 10.384 0.168 10.543 > > Is that the basic task you're trying to accomplish? grep(l) goes > quickly to C, so I don't think data.table or other will be markedly > faster if you're looking for an arbitrary regular expression (use > fixed=TRUE if looking for an exact match). > > If you're looking for strings that start with a pattern, then in > R-3.3.0 there is > > > system.time(res0 <- startsWith(grams, "900001")) > user system elapsed > 0.658 0.012 0.669 > > which returns the same result as grepl > > > identical(res0, res1 <- grepl("^900001", grams)) > [1] TRUE > > One can also parallelize the already vectorized grepl function with > parallel::pvec, with some opportunity for gain (compared to grepl) on > non-Windows > > > system.time(res2 <- pvec(seq_along(grams), function(i) > grepl("^900001", grams[i]), mc.cores=8)) > user system elapsed > 24.996 1.709 3.974 > > identical(res0, res2) > [[1]] TRUE > > I think anything else would require pre-processing of some kind, and > then some more detail about what your data looks like is required.-- Dr Fabien Tarrade Quantitative Analyst/Developer - Data Scientist Senior data analyst specialised in the modelling, processing and statistical treatment of data. PhD in Physics, 10 years of experience as researcher at the forefront of international scientific research. Fascinated by finance and data modelling. Geneva, Switzerland Email : contact at fabien-tarrade.eu <mailto:contact at fabien-tarrade.eu> Phone : www.fabien-tarrade.eu <http://www.fabien-tarrade.eu> Phone : +33 (0)6 14 78 70 90 LinkedIn <http://ch.linkedin.com/in/fabientarrade/> Twitter <https://twitter.com/fabtar> Google <https://plus.google.com/+FabienTarradeProfile/posts> Facebook <https://www.facebook.com/fabien.tarrade.eu> Google <skype:fabtarhiggs?call> Xing <https://www.xing.com/profile/Fabien_Tarrade>
Possibly Parallel Threads
- what is the faster way to search for a pattern in a few million entries data frame ?
- what is the faster way to search for a pattern in a few million entries data frame ?
- fast way to search for a pattern in a few million entries data frame
- what is the faster way to search for a pattern in a few million entries data frame ?
- what is the faster way to search for a pattern in a few million entries data frame ?