Simon Pickert
2013-Nov-06 13:11 UTC
[R] Multiple String word replacements: Performance Issue
Dear experts,

I've been on this for weeks now and couldn't find a solution. Sorry for the long description; I figured I'd post many details so you get the problem entirely, although it's not hard to grasp.

**Situation:**
Data frame consisting of 4 million entries (total size: 250 MB). Two columns: `ID` and `TEXT`. Text strings are each up to 200 characters.

**Task:**
Preprocessing the text strings.

Example data:

    +-----+----------------------------------------+
    | ID  | Text                                   |
    +-----+----------------------------------------+
    | 123 | $AAPL is up +5%                        |
    | 456 | $MSFT , $EBAY doing great. www.url.com |
    | ..  |                                        |
    +-----+----------------------------------------+

Should become:

    +-----+---------------------------------------------------+--------------+-------------+--------------+
    | ID  | Text clean                                        | First Ticker | All Ticker  | Ticker Count |
    +-----+---------------------------------------------------+--------------+-------------+--------------+
    | 123 | [ticker] is up [positive_percentage]              | $aapl        | $aapl       | 1            |
    | 456 | [ticker] [ticker] doing great [url] [pos_emotion] | $msft        | $msft,$ebay | 2            |
    | ..  |                                                   |              |             |              |
    +-----+---------------------------------------------------+--------------+-------------+--------------+

**Problem:**
It takes too long. On my 8 GB RAM dual-core machine: cancelled after 1 day. On a 70 GB 8-core Amazon EC2 instance: cancelled after 1 day.

**Details:**
I am basically

- counting how often certain words appear in one string,
- writing this number into a new column (COUNT),
- replacing this (counted) word,
- replacing other words (which I don't need to count beforehand),
- replacing some regular expressions.

The vectors which are used as patterns look like this:

    "\\bWORD1\\b|\\bWORD2\\b|\\bWORD3\\b|\\bWORD4\\b..."
Thus, those 'replacement vectors' are character vectors of length 1, each containing up to 800 words.

**Main code:**

    library("parallel")
    library("stringr")

    preprocessText <- function(x) {

      # Replace the HTML 'and' entity
      arguments <- list(pattern="\\&\\;", replacement="and", x=x, ignore.case=TRUE)
      y <- do.call(gsub, arguments)

      # Remove some special characters
      arguments <- list(pattern="[^-[:alnum:]\\'\\:\\/\\$\\%\\.\\,\\+\\-\\#\\@\\_\\!\\?+[:space:]]", replacement="", x=y, ignore.case=TRUE)
      y <- do.call(gsub, arguments)

      # Find the first ticker (pattern lowercased)
      arguments <- list(string=y, pattern=tolower(rep_ticker))
      first <- do.call(str_match, arguments)

      # Identify signal words and count them
      # Needs to be done in parts, because otherwise R can't handle this many at once
      arguments <- list(string=x, pattern=rep_words_part1)
      t1 <- do.call(str_extract_all, arguments)

      arguments <- list(string=x, pattern=rep_words_part2)
      t2 <- do.call(str_extract_all, arguments)

      arguments <- list(string=x, pattern=rep_words_part3)
      t3 <- do.call(str_extract_all, arguments)

      arguments <- list(string=x, pattern=rep_words_part4)
      t4 <- do.call(str_extract_all, arguments)

      count <- length(t1[[1]]) + length(t2[[1]]) + length(t3[[1]]) + length(t4[[1]])
      signal_words <- c(t1[[1]], t2[[1]], t3[[1]], t4[[1]])

      # Replacements
      arguments <- list(pattern=rep_wordsA, replacement="[ticker]", x=y, ignore.case=TRUE)
      y <- do.call(gsub, arguments)

      arguments <- list(pattern=rep_wordB_part1, replacement="[ticker] ", x=y, ignore.case=TRUE)
      y <- do.call(gsub, arguments)

      arguments <- list(pattern=rep_wordB_part2, replacement="[ticker] ", x=y, ignore.case=TRUE)
      y <- do.call(gsub, arguments)

      arguments <- list(pattern=rep_wordB_part3, replacement="[ticker2] ", x=y, ignore.case=TRUE)
      y <- do.call(gsub, arguments)

      arguments <- list(pattern=rep_wordB_part4, replacement="[ticker2] ", x=y, ignore.case=TRUE)
      y <- do.call(gsub, arguments)

      arguments <- list(pattern=rep_email, replacement=" [email_address] ", x=y, ignore.case=TRUE)
      y <- do.call(gsub, arguments)

      arguments <- list(pattern=rep_url, replacement=" [url] ", x=y, ignore.case=TRUE)
      y <- do.call(gsub, arguments)

      arguments <- list(pattern=rep_wordC, replacement=" [wordC] ", x=y, ignore.case=TRUE)
      y <- do.call(gsub, arguments)

      # Some regular expressions
      arguments <- list(pattern="\\+[[:digit:]]*.?[[:digit:]]+%", replacement=" [positive_percentage] ", x=y, ignore.case=TRUE)
      y <- do.call(gsub, arguments)

      arguments <- list(pattern="-[[:digit:]]*.?[[:digit:]]+%", replacement=" [negative_percentage] ", x=y, ignore.case=TRUE)
      y <- do.call(gsub, arguments)

      arguments <- list(pattern="[[:digit:]]*.?[[:digit:]]+%", replacement=" [percentage] ", x=y, ignore.case=TRUE)
      y <- do.call(gsub, arguments)

      arguments <- list(pattern="\\$[[:digit:]]*.?[[:digit:]]+", replacement=" [dollar_value] ", x=y, ignore.case=TRUE)
      y <- do.call(gsub, arguments)

      # Remaining numbers
      arguments <- list(pattern="\\+[[:digit:]]*.?[[:digit:]]+", replacement=" [pos_number] ", x=y, ignore.case=TRUE)
      y <- do.call(gsub, arguments)

      arguments <- list(pattern="\\-[[:digit:]]*.?[[:digit:]]+", replacement=" [neg_number] ", x=y, ignore.case=TRUE)
      y <- do.call(gsub, arguments)

      arguments <- list(pattern="[[:digit:]]*.?[[:digit:]]+", replacement=" [number] ", x=y, ignore.case=TRUE)
      y <- do.call(gsub, arguments)

      arguments <- list(pattern=rep_question, replacement=" [question] ", x=y, ignore.case=TRUE)
      y <- do.call(gsub, arguments)

      # Unify synonyms
      arguments <- list(pattern=rep_syno1, replacement="happy", x=y, ignore.case=TRUE)
      y <- do.call(gsub, arguments)

      arguments <- list(pattern=rep_syno2, replacement="sad", x=y, ignore.case=TRUE)
      y <- do.call(gsub, arguments)

      arguments <- list(pattern=rep_syno3, replacement="people", x=y, ignore.case=TRUE)
      y <- do.call(gsub, arguments)

      arguments <- list(pattern=rep_syno4, replacement="father", x=y, ignore.case=TRUE)
      y <- do.call(gsub, arguments)

      arguments <- list(pattern=rep_syno5, replacement="mother", x=y, ignore.case=TRUE)
      y <- do.call(gsub, arguments)

      arguments <- list(pattern=rep_syno6, replacement="money", x=y, ignore.case=TRUE)
      y <- do.call(gsub, arguments)

      # Remove punctuation (I know there are pre-defined R commands for this,
      # but I need to customize it)
      arguments <- list(pattern=rem_punct, replacement="", x=y, ignore.case=TRUE)
      y <- do.call(gsub, arguments)

      # Remove line breaks
      arguments <- list(pattern=rem_linebreak, replacement=" ", x=y, ignore.case=TRUE)
      y <- do.call(gsub, arguments)

      # Append positive or negative emotion
      arguments <- list(x=y)
      y <- do.call(appendEmotion, arguments)

      # Output
      result <- list(
        textclean    = y,
        first_ticker = first,
        all_ticker   = signal_words,
        ticker_count = count)

      return(result)
    }

    resultList <- mclapply(dataframe$text_column, preprocessText)

** end main code **

(The return value is a list, which I plan to convert to a data.frame. I don't get that far, though.)

Before, I also tried to call each `gsub` separately, thus performing the first `gsub` on every text string, then the second `gsub`, and so on, but I guess that was even less efficient.

The code itself works, but it seems to me that this could be sped up. Unfortunately I'm not familiar with hash tables, which is what I heard could be a solution.

I appreciate your ideas and help very much!

*Definition of the one function called inside `preprocessText`:*

    appendEmotion <- function(x) {
      if (grepl(app_pos, x)) {
        x <- paste(x, " [pos_emotion] ")
      }
      if (grepl(app_neg, x)) {
        x <- paste(x, " [neg_emotion] ")
      }
      # Output
      return(x)
    }
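To illustrate how such a length-1 'replacement vector' is built and applied, here is a minimal sketch; the word list and sample texts are made up for illustration and stand in for one of the `rep_words_part*` vectors:

```r
# Hypothetical word list standing in for one of the rep_words_part* vectors
words <- c("bull", "bear", "long", "short")

# Collapse into a single alternation pattern with word boundaries
pattern <- paste0("\\b", words, "\\b", collapse = "|")
# -> "\\bbull\\b|\\bbear\\b|\\blong\\b|\\bshort\\b"

texts <- c("going long on $AAPL", "bear market, stay short")

# gsub() is vectorised over x, so one call handles a whole column
gsub(pattern, "[signal]", texts, ignore.case = TRUE)
# -> "going [signal] on $AAPL"  "[signal] market, stay [signal]"
```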
jim holtman
2013-Nov-07 13:19 UTC
[R] Multiple String word replacements: Performance Issue
Here is a start. I was wondering how long it would take to at least substitute 800 different patterns into 4M vectors. Here is my test. It took longer (99 secs) to create the test data than to do the substitutions (52 secs). Some variations on this can provide the other information that you are probably after in less than a day (I would guess less than an hour).

> n <- 1000
> x <- paste0("$"
+   , sample(LETTERS, n, TRUE)
+   , sample(LETTERS, n, TRUE)
+   , sample(LETTERS, n, TRUE)
+   , sample(LETTERS, n, TRUE)
+ )
> x <- x[!duplicated(x)][1:800]
>
> n <- 4000000
> system.time({
+   output <- replicate(n, paste(sample(x, 2), collapse = ' '))
+ })
   user  system elapsed
  99.85    0.22  100.37
>
> system.time({
+   pattern <- paste0("\\", x, collapse = "|")
+   z <- gsub(pattern, "[ticker]", output, perl = TRUE)
+ })
   user  system elapsed
  52.05    0.00   52.21
>
> str(output)
 chr [1:4000000] "$JHVN $VKOL" "$GTEU $CEGL" "$LOEY $ETQK" "$AFDO $SDLH" "$MOIN $WEVR" ...
> str(z)
 chr [1:4000000] "[ticker] [ticker]" "[ticker] [ticker]" "[ticker] [ticker]" ...
> str(pattern)
 chr "\\$MATF|\\$GFGC|\\$SRYC|\\$HLWS|\\$GHFB|\\$BGVU|\\$GFDW|\\$PSFN|\\$ONDY|\\$SXUH|\\$EBDJ|\\$YNQY|\\$NDBT|\\$TOQK|\\$IUBN|\\$VSMT"| __truncated__
>

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.

On Wed, Nov 6, 2013 at 8:11 AM, Simon Pickert <simon.pickert at t-online.de> wrote:

> [quoted original message trimmed]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
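The collapsed-pattern idea above extends naturally to the counting and extraction columns as well, one vectorised pass each instead of per-string calls. A minimal sketch, assuming a small hypothetical ticker list in place of the real 800-entry one (the sample strings mirror the example data):

```r
# Hypothetical small ticker list standing in for the real 800-entry one
tickers <- c("$AAPL", "$MSFT", "$EBAY")
pattern <- paste0("\\", tickers, collapse = "|")   # "\\$AAPL|\\$MSFT|\\$EBAY"

texts <- c("$AAPL is up +5%",
           "$MSFT , $EBAY doing great. www.url.com")

# One vectorised pass finds every ticker in every string
m          <- gregexpr(pattern, texts, perl = TRUE)
all_ticker <- lapply(regmatches(texts, m), tolower)

first_ticker <- vapply(all_ticker,
                       function(t) if (length(t)) t[1] else NA_character_,
                       character(1))
ticker_count <- lengths(all_ticker)

# And one vectorised pass does the substitution, as in the timing above
clean <- gsub(pattern, "[ticker]", texts, perl = TRUE)
```

This yields `first_ticker`, `all_ticker`, `ticker_count`, and the cleaned text for the whole column without an `mclapply` over individual strings.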