Aarushi Kaushal
2016-Sep-24 18:49 UTC
[R] Query regarding Approximate/Fuzzy matching & String Extraction(numeric) in R
Hey there,

I work for an organisation named Bullero Capital Pvt. Ltd. in New Delhi, which is involved in financial services, portfolio management to be precise. Recently we've started building a database in R for all the stocks etc., so that the data (which is already available and in our possession) can be processed automatically and analyzed for future investment purposes.

A colleague and I are currently at the data cleaning stage, where we need to organize and format the data according to how we want it in the database. The problem lies in the notation & symbols used in the original csv data files acquired from the government website: we have to do approximate matching (for efficiency) and then extract only the numerics from the strings in the respective columns of the data frame.

1.) As of now we are looking at using the agrep function to detect & locate the pattern matches, namely DIVIDEND, SPLIT, BONUS.

2.) From there on, carry out the extraction of the respective numeric values associated with these actions into the corresponding columns: BONUS_NUM (numerator for the ratio), BONUS_DEN (denominator for the ratio), SPLIT_NUM (numerator for the ratio), SPLIT_DEN (denominator for the ratio), Final Dividend, Interim Dividend & Special Dividend.

COLUMN PURPOSE

1. DIVIDEND-RE.1/- PER SHARE
2. AGM/DIV-RS.3.50 PER SHARE
3. SPL DIV-RS.2.70 PER SHARE
4. DIV - FIN 3.50RE PER SHARE + SPL-Rs.1.4
5. FV SPLIT Rs.10 to RE.1
6. BON 3:2 + SPLT Rs. 5 to Rs.2.5
7. BONUS 4:1
8. DIV:10%

Ex.
DIVIDEND-RE.1/- PER SHARE
FINAL_DIV-1

AGM/DIV-RS.3.50 PER SHARE
FINAL_DIV-3.50

SPL DIV-RS.2.70 PER SHARE
SPECIAL DIV-2.70

Ex.
FV SPLIT Rs.10 to RE.1
SPLIT_NUM - 1
SPLIT_DEN - 10

Ex.
BONUS 4:1
BONUS_NUM - 4
BONUS_DEN - 1

However, the problem with that is that agrep returns the vector indices instead of the string indices, which makes it cumbersome to extract the numeric values following the respective matches.

So I want a Fuzzy logic approach to

- check for the presence of SPLIT, DIVIDEND, BONUS
- get the index of whichever cell the pattern match occurs in, in the column PURPOSE of the data frame
- get the index position of that particular pattern in the string, to extract the numerical value following the matched pattern

Basically, is there any way in R to match the patterns approximately while also returning the positions of the matches within the respective strings? (That way, if the symbols change further in the future according to the government website's notation in the data storage, or the format/positioning/spacing changes, it could account for all those changes automatically.)

I am attaching below the .csv file consisting of just the column we need to carry out the cleaning in, for your convenience.

It would be very helpful if we could get some guidance as to how to proceed further at the earliest.

regards,
aarushi kaushal
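For the eight sample strings in the post, ordinary regular expressions with capture groups may already be enough for step 2. The following is only a sketch under that assumption: the helper `capture_numbers`, the variable names and the patterns are illustrative and not from the thread, and they are tuned to the sample values rather than to the full government data.

```r
# The eight sample PURPOSE values quoted in the post.
purpose <- c(
  "DIVIDEND-RE.1/- PER SHARE",
  "AGM/DIV-RS.3.50 PER SHARE",
  "SPL DIV-RS.2.70 PER SHARE",
  "DIV - FIN 3.50RE PER SHARE + SPL-Rs.1.4",
  "FV SPLIT Rs.10 to RE.1",
  "BON 3:2 + SPLT Rs. 5 to Rs.2.5",
  "BONUS 4:1",
  "DIV:10%"
)

# Hypothetical helper (not from the thread): apply a regex with capture
# groups to every string and return the groups as a numeric matrix,
# one row per string, NA where the pattern does not match.
capture_numbers <- function(pattern, x, n_groups) {
  hits <- regmatches(x, regexec(pattern, x, ignore.case = TRUE))
  do.call(rbind, lapply(hits, function(h) {
    if (length(h) == 0) rep(NA_real_, n_groups) else as.numeric(h[-1])
  }))
}

num <- "([0-9]+\\.?[0-9]*)"  # an amount such as 10, 3.50 or 2.5

# "BON 3:2", "BONUS 4:1": two integers separated by a colon.
bonus <- capture_numbers("BON[A-Z]* *([0-9]+) *: *([0-9]+)", purpose, 2)

# "SPLIT Rs.10 to RE.1", "SPLT Rs. 5 to Rs.2.5": old, then new face value.
face_split <- capture_numbers(
  paste0("SPLI?T[^0-9]*", num, " *to[^0-9]*", num), purpose, 2)

result <- data.frame(
  PURPOSE   = purpose,
  BONUS_NUM = bonus[, 1],
  BONUS_DEN = bonus[, 2],
  SPLIT_NUM = face_split[, 2],  # new face value, following the post's example
  SPLIT_DEN = face_split[, 1],  # old face value
  stringsAsFactors = FALSE
)
print(result)
```

The dividend columns would need analogous patterns, plus a separate rule for percentage entries such as "DIV:10%". Hand-written alternations like `SPLI?T` cover only the spelling variants seen so far, so they do not by themselves address future changes in notation.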
David Winsemius
2016-Sep-25 02:18 UTC
[R] Query regarding Approximate/Fuzzy matching & String Extraction(numeric) in R
> On Sep 24, 2016, at 11:49 AM, Aarushi Kaushal <kaushalaarushi at gmail.com> wrote:
>
> 1.) As of now we are looking at using the agrep function, to detect &
> locate the pattern matches namely - DIVIDEND , SPLIT, BONUS
>
> However, the problem with that is that agrep returns the vector indices
> instead of the string indices which makes it cumbersome to extract the
> numeric values following the respective matches.

Please read ?agrep, which was my starting point. (I needed to see if `agrep` was like grep in being capable of returning character values of matches.)

Can you explain what that actually means? What would be a "string index" if it is not the value returned when the parameter to `agrep` is set as: value=TRUE?

> So I want a Fuzzy logic approach to
>
> - check for the presence of SPLIT, DIVIDEND, BONUS
> - index of which ever cell the pattern match occurs in the column
> PURPOSE of the data frame
> - index position of that particular pattern in the string to extract the
> numerical value following the matched pattern
>
> I am attaching below the .csv file consisting of just the column we need to
> carry out the cleaning in for your convenience.
>
> It would be very helpful, if we could get some guidance as to how to
> proceed further at the earliest.

It would be helpful for us for _you_ to construct a simple example and explain what was desired from it (as is described in the Posting Guide).

--
David Winsemius
Alameda, CA, USA
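One possible reading of "string index" is the character position of the approximate match inside each string. The small, self-contained example below contrasts the two return modes of `agrep` with `aregexec`, which reports match positions. It is a sketch only: the four strings are taken from the post, while `max.distance = 1` (at most one edit) and the variable names are arbitrary choices for illustration.

```r
purpose <- c(
  "DIVIDEND-RE.1/- PER SHARE",
  "FV SPLIT Rs.10 to RE.1",
  "BON 3:2 + SPLT Rs. 5 to Rs.2.5",
  "BONUS 4:1"
)

# Default: positions of the matching elements within the vector.
hit_index <- agrep("SPLIT", purpose, max.distance = 1)

# value = TRUE: the matching strings themselves.
hit_value <- agrep("SPLIT", purpose, max.distance = 1, value = TRUE)

# aregexec() works per string: it gives the start of the approximate
# match inside each element, or -1 where there is no match.
m <- aregexec("SPLIT", purpose, max.distance = 1)
starts <- vapply(m, function(x) as.integer(x[1]), integer(1))

# regmatches() returns the text that was actually matched, if any.
matched <- regmatches(purpose, m)

print(hit_index)
print(hit_value)
print(starts)
print(matched)
```

The text after a match can then be taken with `substring()` starting from the reported position plus the `match.length` attribute, and searched for numbers with an ordinary regular expression.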
Bert Gunter
2016-Sep-25 04:03 UTC
[R] Query regarding Approximate/Fuzzy matching & String Extraction(numeric) in R
"So I want a **Fuzzy logic approach** to..."

That is a near meaningless buzzword. I suggest you search on "fuzzy logic" on the rseek.org website and see if any of the hits there does whatever it is that you have in mind.

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.