Hi everyone!

I have a question about data processing efficiency.

My data are as follows: I have a data set on quarterly institutional ownership of equities; some of them have had recent IPOs, some have not (I have a binary flag set). The total dataset size is 700k+ rows.

My goal is this: for every quarter since issue for each IPO, I need to find a "matched" firm in the same industry that is close in market cap. So, e.g., for firm X, which had an IPO, I need to find a matched non-issuing firm in quarter 1 since the IPO, then a (possibly different) non-issuing firm in quarter 2 since the IPO, etc. Repeat for each issuing firm (there are about 8300 of these).

Thus it seems to me that I need to be doing a lot of data selection, subsetting, and looping (yikes!), but the result appears to be highly inefficient and takes ages (well, many hours). What I am doing, in pseudocode, is this:

1. For each quarter of data, get all the IPOs and all the eligible non-issuing firms.
2. For each IPO in a quarter, grab all the non-issuers in the same industry, sort them by size, and grab the matching firm closest in size (the exact procedure is to take the closest bigger firm if one exists, and otherwise the biggest available firm if all are smaller).
3. Assign the matched firm-observation the same "quarters since issue" as the IPO being matched.
4. rbind them all into the "matching" dataset.

The function I currently have is pasted below, for your reference. Is there any way to make it produce the same result but much faster? Specifically, I am guessing that eliminating some loops would help a lot, but I don't see how, since I need to do some fancy footwork for each IPO in each quarter to find the matching firm. I'll be doing a few things similar to this, so it's somewhat important to improve the efficiency of this. Maybe some of you R-fu masters can clue me in? :)

I would appreciate any help, tips, tricks, tweaks, you name it! :)

========== my function below ==========

fcn_create_nonissuing_match_by_quarterssinceissue = function(tfdata, quarters_since_issue=40) {

    result = matrix(nrow=0, ncol=ncol(tfdata))  # rbind for matrix is cheaper, so typecast the result to matrix
    colnames = names(tfdata)
    quarterends = sort(unique(tfdata$DATE))

    for (aquarter in quarterends) {
        tfdata_quarter = tfdata[tfdata$DATE == aquarter, ]

        tfdata_quarter_fitting_nonissuers = tfdata_quarter[ (tfdata_quarter$Quarters.Since.Latest.Issue > quarters_since_issue) & (tfdata_quarter$IPO.Flag == 0), ]
        tfdata_quarter_ipoissuers = tfdata_quarter[ tfdata_quarter$IPO.Flag == 1, ]

        for (i in 1:nrow(tfdata_quarter_ipoissuers)) {
            arow = tfdata_quarter_ipoissuers[i, ]
            industrypeers = tfdata_quarter_fitting_nonissuers[ tfdata_quarter_fitting_nonissuers$HSICIG == arow$HSICIG, ]
            industrypeers = industrypeers[ order(industrypeers$Market.Cap.13f), ]
            if ( nrow(industrypeers) > 0 ) {
                if ( nrow(industrypeers[industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ]) > 0 ) {
                    bestpeer = industrypeers[industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ][1, ]
                }
                else {
                    bestpeer = industrypeers[nrow(industrypeers), ]
                }
                bestpeer$Quarters.Since.IPO.Issue = arow$Quarters.Since.IPO.Issue
                #tfdata_quarter$Match.Dummy.By.Quarter[tfdata_quarter$PERMNO == bestpeer$PERMNO] = 1
                result = rbind(result, as.matrix(bestpeer))
            }
        }
        #result = rbind(result, tfdata_quarter)
        print(aquarter)
    }

    result = as.data.frame(result)
    names(result) = colnames
    return(result)
}

========= end of my function =============
Maybe you should provide a minimal, working example with data, so that we can all give it a try. In the meantime, take a look at the Rprof function to see where your code can be improved; a quick sketch of how to call it is below.
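Something along these lines is usually enough to see where the time goes (the output file name here is arbitrary, and you might want to profile a run on just a subset of the data to keep it short):

    Rprof("match_prof.out")        # start the profiler, writing samples to this file
    res <- fcn_create_nonissuing_match_by_quarterssinceissue(tfdata, 40)
    Rprof(NULL)                    # stop profiling
    summaryRprof("match_prof.out")$by.self    # time spent inside each function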
Good luck
Bart

Daniel Folkinshteyn-2 wrote:
> Hi everyone!
>
> I have a question about data processing efficiency.
> [...]
Anybody have any thoughts on this? Please? :)

on 06/05/2008 02:09 PM Daniel Folkinshteyn said the following:
> Hi everyone!
>
> I have a question about data processing efficiency.
> [...]
On Fri, Jun 6, 2008 at 2:28 PM, Greg Snow <Greg.Snow at imail.org> wrote:
>> -----Original Message-----
>> From: r-help-bounces at r-project.org
>> [mailto:r-help-bounces at r-project.org] On Behalf Of Patrick Burns
>> Sent: Friday, June 06, 2008 12:04 PM
>> To: Daniel Folkinshteyn
>> Cc: r-help at r-project.org
>> Subject: Re: [R] Improving data processing efficiency
>>
>> That is going to be situation dependent, but if you have a
>> reasonable upper bound, then that will be much easier and not
>> far from optimal.
>>
>> If you pick the possibly too small route, then increasing the
>> size in largish junks is much better than adding a row at a time.
>
> Pat,
>
> I am unfamiliar with the use of the word "junk" as a unit of measure for data objects. I figure there are a few different possibilities:
>
> 1. You are using the term intentionally, meaning that you suggest he increases the size in terms of old cars and broken pianos rather than used-up pens and broken pencils.
>
> 2. This was a Freudian slip based on your opinion of some datasets you have seen.
>
> 3. Somewhere between your mind and the final product "jumps/chunks" became "junks" (possibly a Microsoft "correction", or just typing too fast combined with number 2).
>
> 4. "junks" is an official measure of data/object size that I need to learn more about (the history of the term possibly being related to 2 and 3 above).

5. Chinese sailing vessel.
http://en.wikipedia.org/wiki/Junk_(ship)
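Setting the vocabulary aside, the advice being quoted (preallocate, and grow the result in largish chunks rather than one row at a time) might look roughly like this; the chunk size, column count, and loop body here are made up purely for illustration:

    chunk  <- 5000                                  # made-up "largish chunk" size
    result <- matrix(NA_real_, nrow = chunk, ncol = 3)
    n_used <- 0
    for (i in 1:12000) {                            # stand-in for the matching loop
        newrow <- c(i, i^2, sqrt(i))                # stand-in for the matched row
        if (n_used == nrow(result)) {               # out of room: add another chunk of empty rows
            result <- rbind(result, matrix(NA_real_, nrow = chunk, ncol = ncol(result)))
        }
        n_used <- n_used + 1
        result[n_used, ] <- newrow
    }
    result <- result[seq_len(n_used), , drop = FALSE]   # finally, drop the unused rows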
My guess is that number 2 is closest to the mark. Typing too fast is unfortunately not one of my habitual attributes.

Gabor Grothendieck wrote:
> On Fri, Jun 6, 2008 at 2:28 PM, Greg Snow <Greg.Snow at imail.org> wrote:
>> 2. This was a Freudian slip based on your opinion of some datasets you have seen.
> [...]
Greg Snow
2008-Jun-06 19:14 UTC
[R] New vocabulary on a Friday afternoon. Was: Improving data processing efficiency
I still like the number 4 option, so I think we need to come up with a formal definition for a "junk" of data. I read somewhere that Tukey coined the word "bit" as it applies to computers; we can share the credit/blame for "junks" of data.

My proposal for a statistical/data definition of the word junk:

Junk (noun): A quantity of data just large enough to get the client excited about the "great" dataset they provided, but not large enough to support any useful conclusions.

Example sentence: We just received another junk of data from the boss; who gets to give him the bad news that it still does not prove his pet theory?

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
(801) 408-8111

> -----Original Message-----
> From: Patrick Burns [mailto:pburns at pburns.seanet.com]
> Sent: Friday, June 06, 2008 12:58 PM
> To: Gabor Grothendieck
> Cc: Greg Snow; r-help at r-project.org
> Subject: Re: [R] Improving data processing efficiency
>
> My guess is that number 2 is closest to the mark.
> Typing too fast is unfortunately not one of my habitual attributes.
> [...]
In a case like this, if you can possibly work with matrices instead of data frames, you might get a significant speedup. (More accurately, I have had situations where I obtained a speedup by working with matrices instead of data frames.) Even if you have to code character columns as numeric, it can be worth it. Data frames have overhead that matrices do not. (Here's where profiling might have given a clue.) Granted, there has been recent work in reducing the overhead associated with data frames, but I think it's worth a try.

Carrying along extra columns and doing row subsetting, rbinding, etc., means a lot more happening in memory. So, for example, if all of your matching is based on just a few columns, extract those columns, convert them to a matrix, do all the matching, and then, based on some sort of row index, retrieve all of the associated columns. A rough sketch of that approach, using the column names from the original post, is below.
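Untested, and only meant to show the shape of the idea: the function name is made up, DATE and HSICIG are simply re-coded as numbers so that everything fits in one numeric matrix, and the tie-breaking mirrors the original if/else.

    fcn_match_via_matrix <- function(tfdata, quarters_since_issue = 40) {
        ## keep just the columns needed for matching, coded as numbers,
        ## plus the original row number so everything else can be retrieved later
        m <- cbind(row  = seq_len(nrow(tfdata)),
                   date = as.numeric(factor(tfdata$DATE)),
                   ind  = as.numeric(factor(tfdata$HSICIG)),
                   cap  = tfdata$Market.Cap.13f,
                   ipo  = tfdata$IPO.Flag,
                   qsli = tfdata$Quarters.Since.Latest.Issue)

        ipos  <- m[m[, "ipo"] == 1, , drop = FALSE]
        peers <- m[m[, "ipo"] == 0 & m[, "qsli"] > quarters_since_issue, , drop = FALSE]

        matched <- integer(nrow(ipos))        # matched row number in tfdata; 0 = no match
        for (i in seq_len(nrow(ipos))) {
            p <- peers[peers[, "date"] == ipos[i, "date"] &
                       peers[, "ind"]  == ipos[i, "ind"], , drop = FALSE]
            if (nrow(p) == 0) next
            bigger <- p[, "cap"] >= ipos[i, "cap"]
            matched[i] <- if (any(bigger)) {
                p[bigger, "row"][which.min(p[bigger, "cap"])]   # smallest peer at least as big
            } else {
                p[which.max(p[, "cap"]), "row"]                 # otherwise the biggest peer
            }
        }

        ## one big subset at the end instead of rbind() inside the loop
        keep <- matched > 0
        result <- tfdata[matched[keep], ]
        result$Quarters.Since.IPO.Issue <- tfdata$Quarters.Since.IPO.Issue[ipos[keep, "row"]]
        result
    }

The point is that all of the per-IPO work happens on a small numeric matrix, and the full data frame is touched only once at the very end.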
-Don

At 2:09 PM -0400 6/5/08, Daniel Folkinshteyn wrote:
> Hi everyone!
>
> I have a question about data processing efficiency.
> [...]

--
--------------------------------------
Don MacQueen
Environmental Protection Department
Lawrence Livermore National Laboratory
Livermore, CA, USA
925-423-1062
Esmail Bonakdarian wrote:
> hadley wickham wrote:
>
> Hi,
>
> I tried this suggestion as I am curious about bottlenecks in my own
> R code ...
>
>> Why not try profiling? The profr package provides an alternative
>> display that I find more helpful than the default tools:
>>
>> install.packages("profr")
>
> install.packages("profr")
> Warning message:
> package 'profr' is not available

I selected a different mirror in place of the Iowa one and it worked. Odd, I just assumed all the same packages are available on all mirrors.
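Once it installs, usage is roughly along these lines (a sketch based on the package documentation; the object name and the single test call are arbitrary):

    library(profr)
    p <- profr(fcn_create_nonissuing_match_by_quarterssinceissue(tfdata, 40))
    head(p)    # a data frame of calls with their timings
    plot(p)    # graphical overview of where the time goes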