thr3ads.net - R help - [R] Query regarding Approximate/Fuzzy matching & String Extraction(numeric) in R [Sep 2016]

Aarushi Kaushal

2016-Sep-24 18:49 UTC

[R] Query regarding Approximate/Fuzzy matching & String Extraction(numeric) in R

Hey there,

I work for an organisation named Bullero Capital Pvt. Ltd. in New Delhi,
which is involved in financial services, portfolio management to be
precise. Recently we started building a database in R for all the stocks
we follow, so that the data (which is already available and in our
possession) can be processed automatically and analyzed for future
investment purposes.

A colleague and I are currently at the data cleaning stage, where we need
to organize and format the data according to how we want it in the
database. The problem lies in the notation and symbols used in the
original csv data files acquired from the government website: we have to
do approximate matching (for efficiency) and then extract only the numeric
values from the strings in the respective columns of the data frame.

1.) As of now we are looking at using the agrep function to detect and
locate the pattern matches, namely DIVIDEND, SPLIT and BONUS.

2.) From there on, we want to extract the numeric values associated with
these actions into the corresponding columns: BONUS_NUM (numerator of the
ratio), BONUS_DEN (denominator of the ratio), SPLIT_NUM (numerator of the
ratio), SPLIT_DEN (denominator of the ratio), Final Dividend, Interim
Dividend and Special Dividend.


Sample values from the column PURPOSE:

   1. DIVIDEND-RE.1/- PER SHARE
   2. AGM/DIV-RS.3.50 PER SHARE
   3. SPL DIV-RS.2.70 PER SHARE
   4. DIV - FIN 3.50RE PER SHARE + SPL-Rs.1.4
   5. FV SPLIT Rs.10 to RE.1
   6. BON 3:2 + SPLT Rs. 5 to Rs.2.5
   7. BONUS 4:1
   8. DIV:10%

Examples of the desired output:

DIVIDEND-RE.1/- PER SHARE
FINAL_DIV - 1

AGM/DIV-RS.3.50 PER SHARE
FINAL_DIV - 3.50

SPL DIV-RS.2.70 PER SHARE
SPECIAL_DIV - 2.70

FV SPLIT Rs.10 to RE.1
SPLIT_NUM - 1
SPLIT_DEN - 10

BONUS 4:1
BONUS_NUM - 4
BONUS_DEN - 1
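
The mapping in these examples can be sketched with ordinary (exact) regular expressions in base R. This is only an illustration under stated assumptions: the helper `capture_num`, the three patterns, and the rule "a dividend is special when preceded by SPL, otherwise final" are not from the post, and interim dividends and percentages such as DIV:10% are not handled.

```r
# Sample strings copied from the PURPOSE examples in the post.
purpose <- c(
  "DIVIDEND-RE.1/- PER SHARE",
  "AGM/DIV-RS.3.50 PER SHARE",
  "SPL DIV-RS.2.70 PER SHARE",
  "FV SPLIT Rs.10 to RE.1",
  "BONUS 4:1"
)

num <- "([0-9]+\\.?[0-9]*)"  # an integer or decimal number

# Hypothetical helper: capture group `i` of `pattern` in each string,
# converted to numeric (NA where the pattern does not match).
capture_num <- function(pattern, x, i = 1) {
  m <- regmatches(x, regexec(pattern, x, ignore.case = TRUE))
  as.numeric(vapply(
    m,
    function(g) if (length(g) > i) g[i + 1] else NA_character_,
    character(1)
  ))
}

# Assumed patterns: a keyword stem, any non-digits, then the number(s).
div_pat   <- paste0("DIV[A-Z]*[^0-9]*", num)
split_pat <- paste0("SPL[A-Z]*T[^0-9]*", num, "[^0-9]+", num)
bonus_pat <- paste0("BON[A-Z]*[^0-9]*", num, " *: *", num)

# Assumption: "SPL" directly before "DIV" marks a special dividend;
# every other dividend is treated as final.
is_special <- grepl("SPL[. ]*DIV", purpose, ignore.case = TRUE)
div        <- capture_num(div_pat, purpose)

result <- data.frame(
  PURPOSE     = purpose,
  FINAL_DIV   = ifelse(is_special, NA_real_, div),
  SPECIAL_DIV = ifelse(is_special, div, NA_real_),
  SPLIT_NUM   = capture_num(split_pat, purpose, 2),  # new face value
  SPLIT_DEN   = capture_num(split_pat, purpose, 1),  # old face value
  BONUS_NUM   = capture_num(bonus_pat, purpose, 1),
  BONUS_DEN   = capture_num(bonus_pat, purpose, 2),
  stringsAsFactors = FALSE
)
print(result)
```

Exact patterns like these will not tolerate misspelled keywords; that is the part approximate matching would have to cover.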

However, the problem is that agrep returns the vector indices instead of
the string indices, which makes it cumbersome to extract the numeric
values following the respective matches.
So I want a Fuzzy logic approach to

   - check for the presence of SPLIT, DIVIDEND, BONUS
   - return the index of whichever cell of the column PURPOSE of the data
   frame the pattern match occurs in
   - return the position of that pattern within the string, so that the
   numerical value following the matched pattern can be extracted
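
A minimal sketch of these three steps, assuming base R's `aregexec` (the approximate counterpart of `regexec`) is acceptable. The helpers `locate_fuzzy` and `first_number_from` and the edit limit of 2 are hypothetical choices for illustration. Because the exact extent of an approximate match is hard to predict, the sketch reads the first number at or after the match start rather than after the match end.

```r
purpose <- c(
  "DIVIDEND-RE.1/- PER SHARE",
  "FV SPLIT Rs.10 to RE.1",
  "BON 3:2 + SPLT Rs. 5 to Rs.2.5",
  "BONUS 4:1"
)

# Hypothetical helper: approximate search for `keyword` in every string.
# Returns the cell index, whether a match was found, and the character
# position where the match starts (NA if there is no match).
locate_fuzzy <- function(keyword, x, max_edits = 2L) {
  hits  <- aregexec(keyword, x, max.distance = max_edits, ignore.case = TRUE)
  start <- vapply(hits, function(h) as.integer(h[1]), integer(1))
  found <- start > 0  # aregexec reports -1 when nothing matches
  data.frame(
    cell  = seq_along(x),
    found = found,
    start = ifelse(found, start, NA_integer_)
  )
}

# Hypothetical helper: first number at or after position `start`.
first_number_from <- function(x, start) {
  from      <- ifelse(is.na(start), 1L, start)
  tail_text <- substring(x, from)
  num <- sub("^[^0-9]*([0-9]+\\.?[0-9]*).*$", "\\1", tail_text)
  num[is.na(start) | !grepl("[0-9]", tail_text)] <- NA_character_
  as.numeric(num)
}

bonus <- locate_fuzzy("BONUS", purpose, max_edits = 2L)
bonus$value <- first_number_from(purpose, bonus$start)
print(bonus)
```

Short keywords with a generous edit limit can produce false positives (for example, SPLIT against "SPL DIV"), so the limit probably needs tuning per keyword.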

Basically, is there any way in R to check and match the patterns
approximately while also returning the indices of the matches within the
respective strings? (That way, if the symbols change in the future
according to the government website's notation in the data storage, or
the format/positioning/spacing changes, it could account for all those
changes automatically.)

I am attaching below the .csv file consisting of just the column we need
to clean, for your convenience.

It would be very helpful if we could get some guidance on how to proceed
further at the earliest.

regards,
aarushi kaushal

David Winsemius

2016-Sep-25 02:18 UTC

> On Sep 24, 2016, at 11:49 AM, Aarushi Kaushal <kaushalaarushi at gmail.com> wrote:
> However, the problem with that is that agrep returns the vector indices
> instead of the string indices which makes it cumbersome to extract the
> numeric values following the respective matches.
Please read ?agrep, which was my starting point. (I needed to see whether
`agrep` was like grep in being capable of returning character values of
matches.)

Can you explain what that actually means? What would be a "string
index" if it is not the value returned when the parameter to `agrep` is
set as: value=TRUE?
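
For reference, the two return modes can be compared with a small sketch. The sample vector is taken from the strings in the post; the `max.distance` of 2 is an arbitrary choice for illustration.

```r
purpose <- c(
  "DIVIDEND-RE.1/- PER SHARE",
  "FV SPLIT Rs.10 to RE.1",
  "BONUS 4:1"
)

# Default: positions of the matching elements within the vector.
idx <- agrep("DIVIDEND", purpose, max.distance = 2L, ignore.case = TRUE)

# value = TRUE: the matching strings themselves.
vals <- agrep("DIVIDEND", purpose, max.distance = 2L, ignore.case = TRUE,
              value = TRUE)

# agrepl: one TRUE/FALSE per element, convenient for a data frame column.
hit <- agrepl("DIVIDEND", purpose, max.distance = 2L, ignore.case = TRUE)

print(idx)
print(vals)
print(hit)
```

None of the three reports where inside a string the match occurred; `aregexec` is the base R function that returns match positions for approximate matches.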

> So I want a Fuzzy logic approach to
> 
>   - check for the presence of SPLIT, DIVIDEND, BONUS
>   - index of which ever cell the pattern match occurs in the column
>   PURPOSE of the data frame
>   - index position of that particular pattern in the string to extract the
>   numerical value following the matched pattern
> 
> *Basically Is there any way in R to determine if the patterns can be
> checked and matched approximately while returning for value - the indices
> for the same in the respective strings?**(such that if in case the symbols
> change furthermore in the future according to the government website's
> notation in the data storage, or the format/positioning/spacing changes -
> it could account for all those changes automatically.)*
> I am attaching below the .csv file consisting of just the column we need to
> carry out the cleaning in for your convenience.
> 
> It would be very helpful, if we could get some guidance as to how to
> proceed further at the earliest.
It would be helpful for us for _you_ to construct a simple example and explain
what was desired from it (as is described in the Posting Guide).
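
As an illustration of the kind of example meant here, a post could include a few input rows plus the output wanted for them. The sketch below uses three strings and their target values from the original message; the layout of `wanted` is an assumption about how the poster would arrange the columns.

```r
# A few input rows, small enough to paste into a message.
dat <- data.frame(
  PURPOSE = c(
    "DIVIDEND-RE.1/- PER SHARE",
    "FV SPLIT Rs.10 to RE.1",
    "BONUS 4:1"
  ),
  stringsAsFactors = FALSE
)

# The result wanted for those rows, written out by hand.
wanted <- data.frame(
  PURPOSE   = dat$PURPOSE,
  FINAL_DIV = c(1, NA, NA),
  SPLIT_NUM = c(NA, 1, NA),
  SPLIT_DEN = c(NA, 10, NA),
  BONUS_NUM = c(NA, NA, 4),
  BONUS_DEN = c(NA, NA, 1),
  stringsAsFactors = FALSE
)

# dput() prints R code that recreates the object, so readers of a
# plain-text list can rebuild the data without an attachment.
dput(dat)
print(wanted)
```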

-- 

David Winsemius
Alameda, CA, USA

Bert Gunter

2016-Sep-25 04:03 UTC

"So I want a **Fuzzy logic approach** to..."

That is a near meaningless buzzword.

I suggest you search on "fuzzy logic" on the rseek.org website and see
if any of the hits there does whatever it is that you have in mind.

Cheers,
Bert




Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
