Fabien Tarrade
2016-Apr-10 18:03 UTC
[R] what is the fastest way to search for a pattern in a few-million-entry data frame?
Hi there,

I have a data frame DF with 40 million strings and their frequencies. I am searching for strings with a given pattern and I am trying to speed up this part of my code. I have tried many options but so far I am not satisfied:

- grepl and subset are equivalent in terms of processing time:
    grepl(paste0("^", pattern), df$Strings)
    subset(df, grepl(paste0("^", pattern), df$Strings))

- lookup(pattern, df) is not what I am looking for, since it does exact matching.

- I tried converting my data frame to a data.table, but it didn't improve things (although reading/writing the data.table will probably be much faster).

- The only thing that helped was removing the third of the data frame containing the lowest-frequency strings, which sped up the search by a factor of 10.

- I haven't tried parRapply yet; on a multicore machine that could gain another factor. I did use parLapply for some other code, but I had many memory issues (crashing my Mac). I had to sub-divide the dataset to make it work, and I never fully understood the problem.

I am sure there is some smarter way to do this. Any good articles/blogs or suggestions that could give me some guidance?

Thanks a lot.
Cheers,
Fabien

--
Dr Fabien Tarrade
Quantitative Analyst/Developer - Data Scientist
Senior data analyst specialised in the modelling, processing and statistical treatment of data. PhD in Physics, 10 years of experience as a researcher at the forefront of international scientific research. Fascinated by finance and data modelling.
Geneva, Switzerland
Email : contact at fabien-tarrade.eu
Web   : www.fabien-tarrade.eu
Phone : +33 (0)6 14 78 70 90
LinkedIn <http://ch.linkedin.com/in/fabientarrade/> Twitter <https://twitter.com/fabtar> Google <https://plus.google.com/+FabienTarradeProfile/posts> Facebook <https://www.facebook.com/fabien.tarrade.eu> Skype <skype:fabtarhiggs?call> Xing <https://www.xing.com/profile/Fabien_Tarrade>
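[Editor's note: for an anchored literal prefix like `^pattern`, the regex engine can be bypassed entirely by comparing the leading characters directly. A sketch on simulated data; the names `df`, `Strings`, and `pattern` mirror the post, but the data here is made up, not the poster's 40-million-row set.]

```r
# Simulated stand-in for the poster's data frame (the real one has 40M rows).
set.seed(1)
words <- c("the", "quick", "brown", "fox", "jumps")
df <- data.frame(
  Strings   = replicate(10000, paste(sample(words, 3, replace = TRUE),
                                     collapse = " ")),
  Frequency = sample.int(100, 10000, replace = TRUE),
  stringsAsFactors = FALSE
)
pattern <- "the quick"

# Anchored regex, as in the post:
hits_regex <- grepl(paste0("^", pattern), df$Strings)

# Same prefix test without the regex engine: compare the leading characters.
hits_substr <- substr(df$Strings, 1L, nchar(pattern)) == pattern

identical(hits_regex, hits_substr)   # same rows are selected either way
result <- subset(df, hits_substr)    # matching strings with their frequencies
```

Note that `fixed = TRUE` is not usable here because it would disable the `^` anchor; the `substr()` comparison is the regex-free equivalent. Whether it is faster on 40M rows would need benchmarking on the real data.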
Duncan Murdoch
2016-Apr-10 18:40 UTC
On 10/04/2016 2:03 PM, Fabien Tarrade wrote:
> Hi there,
>
> I have a data frame DF with 40 millions strings and their frequency. I
> am searching for strings with a given pattern and I am trying to speed
> up this part of my code. I try many options but so far I am not
> satisfied.
> [...]
> I am sure their is some other smart way to do that. Any good
> article/blogs or suggestion that can give me some guidance ?

Didn't you post the same question yesterday? Perhaps nobody answered because your question is unanswerable. You need to describe what the strings are like and what the patterns are like if you want advice on speeding things up.

Duncan Murdoch
Fabien Tarrade
2016-Apr-10 19:27 UTC
Hi Duncan,

> Didn't you post the same question yesterday? Perhaps nobody answered
> because your question is unanswerable.

Sorry, I got an email that my message was waiting for approval, and when I looked at the forum I didn't see it, which is why I sent it again (this time checking that the format of my message was text only). Sorry for the noise.

> You need to describe what the strings are like and what the patterns
> are like if you want advice on speeding things up.

My strings are 1-grams up to 5-grams (sequences of 1 to 5 words), and I am searching my DF for the frequencies of the strings starting with a given sequence of a few words. I guess these days it is standard to work with data frames with millions of entries, so I was wondering how people do this in the fastest way.

Thanks.
Cheers,
Fabien