thr3ads.net - R help - [R] matching similar character strings [Jun 2013]

If this information is useful, please help other people find it:
Share via:

A M Lavezzi

2013-Jun-21 09:56 UTC

[R] matching similar character strings

Hello everybody

I have this problem: I need to match an addresses database F1 with the
information contained in a toponymic database F2.

The format of F1 is given by three columns and 800 rows, with the
columns being:

A1. Street/Road/Avenue
A2. Name
A3. Number

Consider for instance Avenue J. Kennedy , 3011. In F1 this is:

A1. Avenue
A2. J. Kennedy
A3. 3011

The format of F2 file is instead given by 20000 rows and five columns:

B1. Street/Road/Avenue
B2. Name
B3. Starting Street Number
B4. Ending Street Number
B5. Census section

So my problem is attributing the  B5 Census section to every
observation of F1 if: A1=B1, A2=B2, and A3 is comprised between B3 and
B4.

The problem is that while the information in A2 is irregularly
recorded, B2 has a given format that is Family name (space) Given
name.

So I could have that while in B2 the information is:

Kennedy John

In A2 it could be:

John Kennedy
JF Kennedy
J. Kennedy

and so on.

Thanks,

Mario

-- 
Andrea Mario Lavezzi
Dipartimento di Scienze Giuridiche, della Societ? e dello Sport
Sezione Diritto e Societ?
Universit? di Palermo
Piazza Bologni 8
90134 Palermo, Italy
tel. ++39 091 23892208
fax ++39 091 6111268
skype: lavezzimario
email: mario.lavezzi (at) unipa.it
web: http://www.unipa.it/~mario.lavezzi

David Carlson

2013-Jun-21 14:25 UTC

head link

[R] matching similar character strings

I think you have to assume there could be other coding errors as
well, such as misspellings or abbreviating Street as St. You
probably will need to use sub() to correct A2 and possibly A1 before
trying to merge. To figure where the problems are, you might try
something like this. The last command lists the paste(A1, A2)
entries that do not match anything in paste(B1, B2).
> set.seed(42)
> a <- paste(sample(c(letters, LETTERS[1:5]), 150, replace=TRUE),+   sample(c("St", "Rd", "Ave"), 150,
replace=TRUE))> b <- paste(sample(letters, 1000, replace=TRUE), +   sample(c("St", "Rd", "Ave"), 1000,
replace=TRUE))> (ua <- sort(unique(a))) [1] "a Ave" "a Rd"  "A Rd"  "a St" 
"A St"  "b Rd"  "B Rd"  "b St"
"c Ave"
[10] "C Ave" "c Rd"  "C Rd"  "C St" 
"D Ave" "D Rd"  "d St"  "D St"
"e Ave"
[19] "E Ave" "e Rd"  "E Rd"  "e St" 
"E St"  "f Ave" "f Rd"  "g Ave"
"g Rd" 
[28] "g St"  "h Ave" "h Rd"  "h St" 
"i Ave" "i Rd"  "i St"  "j Rd"
"j St" 
[37] "k Ave" "k Rd"  "k St"  "l St" 
"m Ave" "m Rd"  "m St"  "n Ave"
"n St" 
[46] "o Ave" "o Rd"  "o St"  "p St" 
"q Ave" "q Rd"  "q St"  "r Ave"
"r Rd" 
[55] "r St"  "s Ave" "s Rd"  "s St" 
"t Ave" "t Rd"  "t St"  "u Ave"
"u Rd" 
[64] "u St"  "v Ave" "v Rd"  "v St" 
"w Ave" "w Rd"  "w St"  "x Rd"
"x St" 
[73] "y Ave" "y Rd"  "z Ave" "z Rd" 
"z St" > (ub <- sort(unique(b))) [1] "a Ave" "a Rd"  "a St"  "b Ave"
"b Rd"  "b St"  "c Ave" "c Rd"
"c St" 
[10] "d Ave" "d Rd"  "d St"  "e Ave"
"e Rd"  "e St"  "f Ave" "f Rd"
"f St" 
[19] "g Ave" "g Rd"  "g St"  "h Ave"
"h Rd"  "h St"  "i Ave" "i Rd"
"i St" 
[28] "j Ave" "j Rd"  "j St"  "k Ave"
"k Rd"  "k St"  "l Ave" "l Rd"
"l St" 
[37] "m Ave" "m Rd"  "m St"  "n Ave"
"n Rd"  "n St"  "o Ave" "o Rd"
"o St" 
[46] "p Ave" "p Rd"  "p St"  "q Ave"
"q Rd"  "q St"  "r Ave" "r Rd"
"r St" 
[55] "s Ave" "s Rd"  "s St"  "t Ave"
"t Rd"  "t St"  "u Ave" "u Rd"
"u St" 
[64] "v Ave" "v Rd"  "v St"  "w Ave"
"w Rd"  "w St"  "x Ave" "x Rd"
"x St" 
[73] "y Ave" "y Rd"  "y St"  "z Ave"
"z Rd"  "z St" > ua[!(ua %in% ub)] [1] "A Rd"  "A St"  "B Rd"  "C Ave"
"C Rd"  "C St"  "D Ave" "D Rd"
"D St" 
[10] "E Ave" "E Rd"  "E St" 

-------------------------------------
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77840-4352

-----Original Message-----
From: r-help-bounces at r-project.org
[mailto:r-help-bounces at r-project.org] On Behalf Of A M Lavezzi
Sent: Friday, June 21, 2013 4:56 AM
To: r-help
Subject: [R] matching similar character strings

Hello everybody

I have this problem: I need to match an addresses database F1 with
the
information contained in a toponymic database F2.

The format of F1 is given by three columns and 800 rows, with the
columns being:

A1. Street/Road/Avenue
A2. Name
A3. Number

Consider for instance Avenue J. Kennedy , 3011. In F1 this is:

A1. Avenue
A2. J. Kennedy
A3. 3011

The format of F2 file is instead given by 20000 rows and five
columns:

B1. Street/Road/Avenue
B2. Name
B3. Starting Street Number
B4. Ending Street Number
B5. Census section

So my problem is attributing the  B5 Census section to every
observation of F1 if: A1=B1, A2=B2, and A3 is comprised between B3
and
B4.

The problem is that while the information in A2 is irregularly
recorded, B2 has a given format that is Family name (space) Given
name.

So I could have that while in B2 the information is:

Kennedy John

In A2 it could be:

John Kennedy
JF Kennedy
J. Kennedy

and so on.

Thanks,

Mario

-- 
Andrea Mario Lavezzi
Dipartimento di Scienze Giuridiche, della Societ? e dello Sport
Sezione Diritto e Societ?
Universit? di Palermo
Piazza Bologni 8
90134 Palermo, Italy
tel. ++39 091 23892208
fax ++39 091 6111268
skype: lavezzimario
email: mario.lavezzi (at) unipa.it
web: http://www.unipa.it/~mario.lavezzi

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

A M Lavezzi

2013-Jul-02 14:22 UTC

head link

[R] matching similar character strings

Dear Arun,
please excuse me for this late reply, we had to stop working on this
temporaririly.

Let me reproduce here two examples of rows from F1 and F2 (sorry, but
with dput() I am not able to produce a clear example)
> F1_ex        Nome.azienda                   Indirizzo
17     Alterego             Via Edmondo De Amicis, 18

On row 17 of F1 we have a firm named ("Nome.azienda")
'Alterego' whose
address ("indirizzo") is  'Via Edmondo de Amicis, 18'

Below I reproduce the portion of F2 with information on the street
mentioned in F1_ex$Indirizzo.
> F2_ex
  CODICE    STRADA       AREADICIRCOLAZIONE          NUMBER1 BARRATO1
NUMBER2 BARRATO2 SECTION
1  15620        VIA            DE AMICIS EDMONDO                     1
                          5                                 1288
2  15620        VIA            DE AMICIS EDMONDO                     2
                         34                                 1261
3  15620        VIA            DE AMICIS EDMONDO                     7
                         17                                 1287
4  15620        VIA            DE AMICIS EDMONDO                    36
                         62                                1264
5  15620        VIA            DE AMICIS EDMONDO                    37
                         37                                1287
6  15620        VIA            DE AMICIS EDMONDO                    64
                         84                                1262


Line 1 says that the portion of VIA DE AMICIS EDMONDO
("STRADA"+"AREADICIRCOLAZIONE"), with street numbers between
1 and 5
belongs to SECTION 1288 (these are census sections). ("BARRATO1" and
"BARRATO2" refer to the letter in street numbers such as 12/A, 28/D,
etc. In the present example they are empty)

Line 2 says that the portion of VIA DE AMICIS EDMONDO, with street
numbers between 2 and 34 belongs to SECTION 1261,

etc.

Our problem is to assign SECTION 1261 to 'Alterego', exploting the
information on its address. The problem is that the syntax of the
street address in F1 is different from the syntax in F2.

Hope I have clarified the issue

thanks a lot
Mario








On Fri, Jun 21, 2013 at 5:25 PM, arun <smartpink111 at yahoo.com>
wrote:> Dear Mario,
> I didn't find any difference between 1st and 2nd row of F2, except for
the last three columns.  Question is that why should F1 1st row should be merged
to 2nd row of F2 instead of 1st row of F2. In your previous example, you
mentioned about A1, A2, ... and B1, B2, etc.  Here, it is not provided.  As I
mentioned before, it is better to provide the output of ?dput() from a subset of
dataset.
> dput(head(F1,20))
>
> dput(head(F2,20))
>
> #so that there would be atleast some matching pairs within the example
dataset.  Also, please post it to r-help as I will be able to check only after a
couple of hours
> Tx.
> Arun
>
>
>
> ----- Original Message -----
> From: Mario Lavezzi <mario.lavezzi at unipa.it>
> To: arun <smartpink111 at yahoo.com>
> Cc:
> Sent: Friday, June 21, 2013 11:08 AM
> Subject: Re: [R] matching similar character strings
>
> dear Arun
> thank you very much. Let me explain the problem:
>
> Imagine that a portion of the row in F1 is:
>
> ----------------------------
> F1
>
> 1) Street | J.F. Kennedy | 30
> ----------------------------
>
> it means that our unit of interest (a firm) has address: J.F. Kennedy
Street, 30
>
>
> The F2 database contains the list of all the streets of the city, with
additional variables characterizing that street (Census data). The database
> contains sometimes street divided in some parts, according to the street
number. For example:
>
>
> Example of three rows of F2 concerning Kennedy street and Kennedy Road:
>
> F2
>
> 1) Street | Kennedy John Fitzgerald  | 1  | 20 | A12
> 2) Street | Kennedy John Fitzgerald  | 20 | 50 | A15
> 3) Road   | Kennedy John             | 1  | 50 | A23
>
>
> We'd like to have an algorithm able to understand that, notwithstanding
the name is slightly different, element A15 should be added to row 1) of F1,
> producing an output such as:
>
> 1) Street | J.F. Kennedy | 30 | A15
>
>
> hope this clarifies the issue.
>
> thanks a lot! Mario
>
>
>
> Il 21/06/2013 15:29, arun ha scritto:
>> HI,
>> Could you dput() your example datasets and also your expected result? 
The Census section is not clear.
>> A.K.
>>
>>
>>
>>
>> ----- Original Message -----
>> From: A M Lavezzi <mario.lavezzi at unipa.it>
>> To: r-help <r-help at r-project.org>
>> Cc:
>> Sent: Friday, June 21, 2013 5:56 AM
>> Subject: [R] matching similar character strings
>>
>> Hello everybody
>>
>> I have this problem: I need to match an addresses database F1 with the
>> information contained in a toponymic database F2.
>>
>> The format of F1 is given by three columns and 800 rows, with the
>> columns being:
>>
>> A1. Street/Road/Avenue
>> A2. Name
>> A3. Number
>>
>> Consider for instance Avenue J. Kennedy , 3011. In F1 this is:
>>
>> A1. Avenue
>> A2. J. Kennedy
>> A3. 3011
>>
>> The format of F2 file is instead given by 20000 rows and five columns:
>>
>> B1. Street/Road/Avenue
>> B2. Name
>> B3. Starting Street Number
>> B4. Ending Street Number
>> B5. Census section
>>
>> So my problem is attributing the  B5 Census section to every
>> observation of F1 if: A1=B1, A2=B2, and A3 is comprised between B3 and
>> B4.
>>
>> The problem is that while the information in A2 is irregularly
>> recorded, B2 has a given format that is Family name (space) Given
>> name.
>>
>> So I could have that while in B2 the information is:
>>
>> Kennedy John
>>
>> In A2 it could be:
>>
>> John Kennedy
>> JF Kennedy
>> J. Kennedy
>>
>> and so on.
>>
>> Thanks,
>>
>> Mario
>>
>
> --
> PLEASE NOTICE NEW EMAIL ADDRESS AND HOME PAGE URL
>
> Andrea Mario Lavezzi
> Dipartimento di Studi su Politica, Diritto e Societ?
> Universit? di Palermo
> Piazza Bologni 8
> 90134 Palermo, Italy
> tel. ++39 091 23892208
> fax ++39 091 6111268
> skype: lavezzimario
> email: mario.lavezzi (at) unipa.it
> web: http://www.unipa.it/~mario.lavezzi


-- 
Andrea Mario Lavezzi
Dipartimento di Scienze Giuridiche, della Societ? e dello Sport
Sezione Diritto e Societ?
Universit? di Palermo
Piazza Bologni 8
90134 Palermo, Italy
tel. ++39 091 23892208
fax ++39 091 6111268
skype: lavezzimario
email: mario.lavezzi (at) unipa.it
web: http://www.unipa.it/~mario.lavezzi

A M Lavezzi

2013-Jul-09 11:19 UTC

head link

[R] matching similar character strings

Dear Arun

below I present the issue in dput(). Sorry for not being able to do it before.

In particular, following your previous suggestions, the databases are
as simplified as possible (all capital letters, the issue odd/even
solved by adding an additional 0-1 variable).

Now F2 contains all possible cases of street definition (there cannot
be overlapping odd or even number intervals), and the examples in F1
should cover all possible cases (there cannot be numbers not included
in any number interval of a given street in F2).

The table "expected" contains the way we would like to add the
variable "SECTION" to the items in F1.

It is not difficult to proceed with the matching through some of the variables:
- F1$Tipo vs F2$STRADA
- F1$Numero vs the interval (F2$NUMBER1, F2$NUMBER2)
- F1$Barrato vs the interval (F2$BARRATO1, F2$BARRATO2)

The problem I am unable to solve is how to compare and match
F1$Indirizzo and F2?$AREADICIRCOLAZIONE

F2?$AREADICIRCOLAZIONE has a regular structure, while F1$Indirizzo has
not. I tried various ways (grep, etc.), with no success.

Thanks for your time and patience, Mario
============================================================F1 <-
structure(list(
Nome.azienda = c("Rossi",
"Verdi","Bianchi","Viola","Neri","Gialli"),
Tipo = c("VIA","PIAZZA",
"V.","VIA","LARGO","VIA"),
Indirizzo = c("E DE AMICIS", "DE AMICIS E.", "DE
AMICIS","EDMONDO DE
AMICIS","EUROPA","G. GARIBALDI"),
Numero =
c("5","18","65","40","10","5"),
Barrato =
c("","","","","D",""),
Odd =
c("1","0","1","0","0","1")),
.Names =
c("Nome.azienda","Tipo","Indirizzo","Numero","Barrato","Odd"),
class = "data.frame",row.names =
c("17","18","19","20","21","22"))

F2 <-
structure(list(
CODICE = c(15620L, 15620L, 15620L, 15620L,
15620L,15620L,15800L,15800L,17450L,17450L,17450L,18000L,18000L),
STRADA = c("VIA", "VIA", "VIA", "VIA",
"VIA",
"VIA","PIAZZA","PIAZZA","LARGO","LARGO","LARGO","VIA","VIA"),
AREADICIRCOLAZIONE = c("DE AMICIS EDMONDO", "DE AMICIS
EDMONDO",
"DE AMICIS EDMONDO", "DE AMICIS EDMONDO", "DE AMICIS
EDMONDO",
"DE AMICIS EDMONDO","DE AMICIS EDMONDO", "DE AMICIS
EDMONDO","EUROPA","EUROPA","EUROPA","GARIBALDI
GIUSEPPE","GARIBALDI
GIUSEPPE"),
NUMBER1 = c(1L, 2L, 7L, 36L, 37L, 64L,2L,3L,2L,10L,3L,3L,4L),
BARRATO1 = c("", "", "", "",
"",
"","","","","B","","",""),
NUMBER2 = c(5L,34L, 17L, 62L, 67L, 84L,20L,25L,10L,30L,37L,13L,26L),
BARRATO2 = c("", "", "", "",
"","","","","A","","","",""),
ODD =
c("1","0","1","0","1","0","0","1","0","0","1","1","0"),
SECTION = c(1288, 1261, 1287, 1264, 1287,
1262,1500,1505,1510,1520,1530,1610,1615)),
.Names = c("CODICE","STRADA",
"AREADICIRCOLAZIONE", "NUMBER1",
"BARRATO1",
"NUMBER2","BARRATO2","ODD","SECTION"),
class = "data.frame",row.names = c("1","2",
"3", "4", "5",
"6","7","8","9","10","11","12","13"))

expected <-
structure(list(
Nome.azienda = c("Rossi",
"Verdi","Bianchi","Viola","Neri","Gialli"),
Tipo = c("VIA","PIAZZA",
"V.","VIA","LARGO","VIA"),
Indirizzo = c("E DE AMICIS", "DE AMICIS E.", "DE
AMICIS","EDMONDO DE
AMICIS","EUROPA","G. GARIBALDI"),
Numero =
c("5","18","65","40","10","5"),
Barrato =
c("","","","","D",""),
Odd =
c("1","0","1","0","0","1"),
SECTION =
c("1288","1500","1287","1264","1520","1610")),
.Names =
c("Nome.azienda","Tipo","Indirizzo","Numero","Barrato","Odd","SECTION"),
class = "data.frame",row.names =
c("17","18","19","20","21","22"))



On Sat, Jul 6, 2013 at 2:55 PM, arun <smartpink111 at yahoo.com>
wrote:> Dear Mario,
>
> It would be better if you post the examples using dput() and also the
expected outcome.  Also, suppose your F1_ex has 65, but F2_ex doesn't have
any odd number in that range but only even number from 38 to 72.  In that case,
it should return NA? Another possibility is overlapping ranges, using the same
example, if there are multiple ranges ex: 37 to 77 and 57 to 67, which one do
you use?
>   Arun
>
>
>
>
> ----- Original Message -----
> From: A M Lavezzi <mario.lavezzi at unipa.it>
> To: r-help <r-help at r-project.org>
> Cc:
> Sent: Saturday, July 6, 2013 8:36 AM
> Subject: Re: [R] matching similar character strings
>
> Dear Arun,
>
> thank you so much! The code you suggest captures what we have in mind.
> However, what we are looking for is something a bit more general
> (sorry: I realised that maybe this was not so clear from the
> beginning).
>
> In particular:
>
> - in F1_ex the address in the "Indirizzo" field could be spelled
more
> irregularly (ex: "Via De Amicis 18", "V. De Amicis 18",
"Via E. De
> Amicis 18", etc.)
>
> - in F2 the classification of the portions of the street is based on
> odd and even numbers. For example, if we had number "15" in F1 it
> should be matched to row 3 and not to row 2 of F2 (I actually provided
> a wrong example with number 65: row 2 of F1_ex is currently matched to
> row 6 of F2_ex which contains even numbers. Moreover, there are no odd
> street numbers in this street higher than 37)
>
> Thank you very much once again
>
> Mario
>
>
> On Wed, Jul 3, 2013 at 6:47 AM, arun <smartpink111 at yahoo.com>
wrote:
>> Dear Mario,
>> Not sure if this is what you wanted:
>> F1_ex<- read.table(text="
>>    Nome.azienda;Indirizzo
>> 17;Alterego;Via Edmondo De Amicis, 18
>> 18;Alterego;Via Edmondo De Amicis, 65
>> ",sep=";",header=TRUE,stringsAsFactors=FALSE)
>>
>> F2_ex<- read.table(text="
>>   
CODICE;STRADA;AREADICIRCOLAZIONE;NUMBER1;BARRATO1;NUMBER2;BARRATO2;SECTION
>> 1;15620;VIA;DE AMICIS EDMONDO;1;;5;;1288
>> 2;15620;VIA;DE AMICIS EDMONDO;2;;34;;1261
>> 3;15620;VIA;DE AMICIS EDMONDO;7;;17;;1287
>> 4;15620;VIA;DE AMICIS EDMONDO;36;;62;;1264
>> 5;15620;VIA;DE AMICIS EDMONDO;37;;37;;1287
>> 6;15620;VIA;DE AMICIS EDMONDO;64;;84;;1262
>> ",sep=";",header=TRUE,stringsAsFactors=FALSE)
>> library(stringr)
>> 
vec1<-sapply(lapply(toupper(str_trim(gsub("[0-9,]","",F1_ex[,2]))),word,c(1,3,4,2)),paste,collapse="
")
>>  vec2<- as.numeric(gsub("\\D+","",F1_ex[,2]))
>>  F1_ex[,1]<-F2_ex[sapply(vec2,function(x) which((x>F2_ex[,4]
& x< F2_ex[,6]) &
paste(F2_ex[,2],F2_ex[,3])%in%vec1)),"SECTION"]
>>  F1_ex
>> #   Nome.azienda                 Indirizzo
>> #17         1261 Via Edmondo De Amicis, 18
>> #18         1262 Via Edmondo De Amicis, 65
>> A.K.
>>
>>
>>
>>
>>
>> ----- Original Message -----
>> From: A M Lavezzi <mario.lavezzi at unipa.it>
>> To: r-help <r-help at r-project.org>
>> Cc:
>> Sent: Tuesday, July 2, 2013 10:22 AM
>> Subject: Re: [R] matching similar character strings
>>
>> Dear Arun,
>> please excuse me for this late reply, we had to stop working on this
>> temporaririly.
>>
>> Let me reproduce here two examples of rows from F1 and F2 (sorry, but
>> with dput() I am not able to produce a clear example)
>>
>>> F1_ex
>>         Nome.azienda                   Indirizzo
>> 17     Alterego             Via Edmondo De Amicis, 18
>>
>> On row 17 of F1 we have a firm named ("Nome.azienda")
'Alterego' whose
>> address ("indirizzo") is  'Via Edmondo de Amicis, 18'
>>
>> Below I reproduce the portion of F2 with information on the street
>> mentioned in F1_ex$Indirizzo.
>>
>>> F2_ex
>>
>>   CODICE    STRADA       AREADICIRCOLAZIONE          NUMBER1 BARRATO1
>> NUMBER2 BARRATO2 SECTION
>> 1  15620        VIA            DE AMICIS EDMONDO                     1
>>                           5                                 1288
>> 2  15620        VIA            DE AMICIS EDMONDO                     2
>>                          34                                 1261
>> 3  15620        VIA            DE AMICIS EDMONDO                     7
>>                          17                                 1287
>> 4  15620        VIA            DE AMICIS EDMONDO                    36
>>                          62                                1264
>> 5  15620        VIA            DE AMICIS EDMONDO                    37
>>                          37                                1287
>> 6  15620        VIA            DE AMICIS EDMONDO                    64
>>                          84                                1262
>>
>>
>> Line 1 says that the portion of VIA DE AMICIS EDMONDO
>> ("STRADA"+"AREADICIRCOLAZIONE"), with street
numbers between 1 and 5
>> belongs to SECTION 1288 (these are census sections).
("BARRATO1" and
>> "BARRATO2" refer to the letter in street numbers such as
12/A, 28/D,
>> etc. In the present example they are empty)
>>
>> Line 2 says that the portion of VIA DE AMICIS EDMONDO, with street
>> numbers between 2 and 34 belongs to SECTION 1261,
>>
>> etc.
>>
>> Our problem is to assign SECTION 1261 to 'Alterego', exploting
the
>> information on its address. The problem is that the syntax of the
>> street address in F1 is different from the syntax in F2.
>>
>> Hope I have clarified the issue
>>
>> thanks a lot
>> Mario
>>
>>
>>
>>
>>
>>
>>
>>
>> On Fri, Jun 21, 2013 at 5:25 PM, arun <smartpink111 at yahoo.com>
wrote:
>>> Dear Mario,
>>> I didn't find any difference between 1st and 2nd row of F2,
except for the last three columns.  Question is that why should F1 1st row
should be merged to 2nd row of F2 instead of 1st row of F2. In your previous
example, you mentioned about A1, A2, ... and B1, B2, etc.  Here, it is not
provided.  As I mentioned before, it is better to provide the output of ?dput()
from a subset of dataset.
>>> dput(head(F1,20))
>>>
>>> dput(head(F2,20))
>>>
>>> #so that there would be atleast some matching pairs within the
example dataset.  Also, please post it to r-help as I will be able to check only
after a couple of hours
>>> Tx.
>>> Arun
>>>
>>>
>>>
>>> ----- Original Message -----
>>> From: Mario Lavezzi <mario.lavezzi at unipa.it>
>>> To: arun <smartpink111 at yahoo.com>
>>> Cc:
>>> Sent: Friday, June 21, 2013 11:08 AM
>>> Subject: Re: [R] matching similar character strings
>>>
>>> dear Arun
>>> thank you very much. Let me explain the problem:
>>>
>>> Imagine that a portion of the row in F1 is:
>>>
>>> ----------------------------
>>> F1
>>>
>>> 1) Street | J.F. Kennedy | 30
>>> ----------------------------
>>>
>>> it means that our unit of interest (a firm) has address: J.F.
Kennedy Street, 30
>>>
>>>
>>> The F2 database contains the list of all the streets of the city,
with additional variables characterizing that street (Census data). The database
>>> contains sometimes street divided in some parts, according to the
street number. For example:
>>>
>>>
>>> Example of three rows of F2 concerning Kennedy street and Kennedy
Road:
>>>
>>> F2
>>>
>>> 1) Street | Kennedy John Fitzgerald  | 1  | 20 | A12
>>> 2) Street | Kennedy John Fitzgerald  | 20 | 50 | A15
>>> 3) Road   | Kennedy John             | 1  | 50 | A23
>>>
>>>
>>> We'd like to have an algorithm able to understand that,
notwithstanding the name is slightly different, element A15 should be added to
row 1) of F1,
>>> producing an output such as:
>>>
>>> 1) Street | J.F. Kennedy | 30 | A15
>>>
>>>
>>> hope this clarifies the issue.
>>>
>>> thanks a lot! Mario
>>>
>>>
>>>
>>> Il 21/06/2013 15:29, arun ha scritto:
>>>> HI,
>>>> Could you dput() your example datasets and also your expected
result?  The Census section is not clear.
>>>> A.K.
>>>>
>>>>
>>>>
>>>>
>>>> ----- Original Message -----
>>>> From: A M Lavezzi <mario.lavezzi at unipa.it>
>>>> To: r-help <r-help at r-project.org>
>>>> Cc:
>>>> Sent: Friday, June 21, 2013 5:56 AM
>>>> Subject: [R] matching similar character strings
>>>>
>>>> Hello everybody
>>>>
>>>> I have this problem: I need to match an addresses database F1
with the
>>>> information contained in a toponymic database F2.
>>>>
>>>> The format of F1 is given by three columns and 800 rows, with
the
>>>> columns being:
>>>>
>>>> A1. Street/Road/Avenue
>>>> A2. Name
>>>> A3. Number
>>>>
>>>> Consider for instance Avenue J. Kennedy , 3011. In F1 this is:
>>>>
>>>> A1. Avenue
>>>> A2. J. Kennedy
>>>> A3. 3011
>>>>
>>>> The format of F2 file is instead given by 20000 rows and five
columns:
>>>>
>>>> B1. Street/Road/Avenue
>>>> B2. Name
>>>> B3. Starting Street Number
>>>> B4. Ending Street Number
>>>> B5. Census section
>>>>
>>>> So my problem is attributing the  B5 Census section to every
>>>> observation of F1 if: A1=B1, A2=B2, and A3 is comprised between
B3 and
>>>> B4.
>>>>
>>>> The problem is that while the information in A2 is irregularly
>>>> recorded, B2 has a given format that is Family name (space)
Given
>>>> name.
>>>>
>>>> So I could have that while in B2 the information is:
>>>>
>>>> Kennedy John
>>>>
>>>> In A2 it could be:
>>>>
>>>> John Kennedy
>>>> JF Kennedy
>>>> J. Kennedy
>>>>
>>>> and so on.
>>>>
>>>> Thanks,
>>>>
>>>> Mario
>>>>
>>>
>>> --
>>> PLEASE NOTICE NEW EMAIL ADDRESS AND HOME PAGE URL
>>>
>>> Andrea Mario Lavezzi
>>> Dipartimento di Studi su Politica, Diritto e Societ?
>>> Universit? di Palermo
>>> Piazza Bologni 8
>>> 90134 Palermo, Italy
>>> tel. ++39 091 23892208
>>> fax ++39 091 6111268
>>> skype: lavezzimario
>>> email: mario.lavezzi (at) unipa.it
>>> web: http://www.unipa.it/~mario.lavezzi
>>
>>
>>
>> --
>> Andrea Mario Lavezzi
>> Dipartimento di Scienze Giuridiche, della Societ? e dello Sport
>> Sezione Diritto e Societ?
>> Universit? di Palermo
>> Piazza Bologni 8
>> 90134 Palermo, Italy
>> tel. ++39 091 23892208
>> fax ++39 091 6111268
>> skype: lavezzimario
>> email: mario.lavezzi (at) unipa.it
>> web: http://www.unipa.it/~mario.lavezzi
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
>
>
> --
> Andrea Mario Lavezzi
> Dipartimento di Scienze Giuridiche, della Societ? e dello Sport
> Sezione Diritto e Societ?
> Universit? di Palermo
> Piazza Bologni 8
> 90134 Palermo, Italy
> tel. ++39 091 23892208
> fax ++39 091 6111268
> skype: lavezzimario
> email: mario.lavezzi (at) unipa.it
> web: http://www.unipa.it/~mario.lavezzi
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.


-- 
Andrea Mario Lavezzi
Dipartimento di Scienze Giuridiche, della Societ? e dello Sport
Sezione Diritto e Societ?
Universit? di Palermo
Piazza Bologni 8
90134 Palermo, Italy
tel. ++39 091 23892208
fax ++39 091 6111268
skype: lavezzimario
email: mario.lavezzi (at) unipa.it
web: http://www.unipa.it/~mario.lavezzi

R help - Jun 2013 - matching similar character strings

[R] matching similar character strings

[R] matching similar character strings

[R] matching similar character strings

[R] matching similar character strings