thr3ads.net - R help - [R] Character manipulation using "strsplit" & vectorization [Sep 2009]

Steven Kang

2009-Sep-08 04:39 UTC

[R] Character manipulation using "strsplit" & vectorization

Dear R users,


Suppose I have a data set with inconsistent names for a field.

I would like to make these names consistent.

i.e.

"University of New Jersey", "New Jersey Uni", "New Jersey University" (3
different inconsistent names) to "The University of New Jersey" (consistent
name)

Below is an arbitrary data set produced from "state.name" (a built-in data
set in R) and the associated script.


d <- as.data.frame(c(state.name[30:40],
                     paste(state.name[30:40], "University", sep=" "),
                     paste("Th University of", state.name[30:40], sep=" "),
                     paste("University o", state.name[30:40], sep=" ")))
da <- sapply(d, as.character)   # factor to character transformation

spl <- strsplit(da, " ")   # splitting components

dd <- character(dim(da)[1])   # initializing empty vector
for (i in 1:dim(da)[1]) {
    if (sum(c("New", "Jersey", "University") %in% spl[[i]]) >= 3)
        dd[i] <- "The University of New Jersey"
    else if (sum(c("New", "Mexico", "University") %in% spl[[i]]) >= 3)
        dd[i] <- "The University of New Mexico"
    # two words are tested here, so the threshold is 2 (a test of > 2 can never be TRUE)
    else if (sum(c("New", "York") %in% spl[[i]]) >= 2)
        dd[i] <- "The University of New York"
    else if (sum(c("North", "Carolina") %in% spl[[i]]) >= 2)
        dd[i] <- "The University of North Carolina"
}

Note: the above shows only some of the (if/else if) conditions.

Q1: The above "for" loop works fine (but is very slow on a large data set),
so I would like to explore whether there is an alternative VECTORIZED
method that may speed up the process.


Q2: Also, is there another way to find a string within a phrase without using
"%in%"?

i.e.

> "ac" %in% unlist(strsplit("ac dc", " "))
[1] TRUE

Your expertise in resolving this problem would be greatly appreciated.


Steven Kang


David Winsemius

2009-Sep-08 12:19 UTC

On Sep 8, 2009, at 12:39 AM, Steven Kang wrote:
> Dear R users,
>
>
> Suppose I have a data set with inconsistent names for a field.
>
> I would like to make these names consistent.
>
> i.e.
>
> "University of New Jersey", "New Jersey Uni", "New Jersey University" (3
> different inconsistent names) to "The University of New Jersey" (consistent
> name)
>
> Below is an arbitrary data set produced from "state.name" (a built-in data
> set in R) and the associated script.
>
>
> d <- as.data.frame(c(state.name[30:40],
>                      paste(state.name[30:40], "University", sep=" "),
>                      paste("Th University of", state.name[30:40], sep=" "),
>                      paste("University o", state.name[30:40], sep=" ")))
> da <- sapply(d, as.character)   # factor to character transformation
>
> spl <- strsplit(da, " ")   # splitting components
>
> dd <- character(dim(da)[1])   # initializing empty vector
> for (i in 1:dim(da)[1]) {
>     if (sum(c("New", "Jersey", "University") %in% spl[[i]]) >= 3)
>         dd[i] <- "The University of New Jersey"
>     else if (sum(c("New", "Mexico", "University") %in% spl[[i]]) >= 3)
>         dd[i] <- "The University of New Mexico"
>     # two words are tested here, so the threshold is 2 (a test of > 2 can never be TRUE)
>     else if (sum(c("New", "York") %in% spl[[i]]) >= 2)
>         dd[i] <- "The University of New York"
>     else if (sum(c("North", "Carolina") %in% spl[[i]]) >= 2)
>         dd[i] <- "The University of North Carolina"
> }
>
> Note: the above shows only some of the (if/else if) conditions.

The if (cond) { } else { } construct is for program control rather than
revision of vectors. You should consider using the <- ifelse(cond, val1, val2)
construct.
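
A minimal sketch of the ifelse() idea, under some assumptions: the names sit in a plain character vector (called nm here, a made-up name), whole-word matching as in the loop is wanted, and has_all() is a hypothetical helper written only for this illustration. It is one possible arrangement, not necessarily the fastest.

```r
# Hypothetical input: a few inconsistent names in the style of the example
nm <- c("New Jersey University", "Th University of New Mexico",
        "New York", "University o North Carolina", "Ohio")

spl <- strsplit(nm, " ")   # split every name once, up front

# has_all() is an illustrative helper: TRUE where every word occurs in the name
has_all <- function(words) {
    vapply(spl, function(s) all(words %in% s), logical(1))
}

# Each condition is a logical vector over all names, so the nested
# ifelse() calls assign every element without an explicit for loop
dd <- ifelse(has_all(c("New", "Jersey", "University")), "The University of New Jersey",
      ifelse(has_all(c("New", "Mexico", "University")), "The University of New Mexico",
      ifelse(has_all(c("New", "York")),                 "The University of New York",
      ifelse(has_all(c("North", "Carolina")),           "The University of North Carolina",
             ""))))   # "" marks names that matched no condition
```

Note that vapply() still visits each name once per condition, so any gain over the loop comes mainly from dropping the element-by-element if/else chain. Whether it is faster on a given data set would need to be timed.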
>
> Q1: The above "for" loop works fine (but is very slow on a large data set),
> so I would like to explore whether there is an alternative VECTORIZED
> method that may speed up the process.
>
>
> Q2: Also, is there another way to find a string within a phrase without
> using "%in%"?

Many grep-ish functions are available that are vectorised regular
expression "machines".

?grep will show quite a few.
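
One way the grep family might be applied here, sketched under assumptions: nm and the rules lookup table are hypothetical names, and the patterns assume the state name appears as adjacent words. Unlike the original loop, this sketch does not require the word "University" to be present. grepl() returns one logical value per element, so the loop below runs over the handful of rules rather than over the rows.

```r
# Hypothetical input: a few inconsistent names in the style of the example
nm <- c("New Jersey University", "Th University of New Mexico",
        "New York", "University o North Carolina", "Ohio")

# Hypothetical lookup table: pattern to search for -> consistent name
rules <- c("New Jersey"     = "The University of New Jersey",
           "New Mexico"     = "The University of New Mexico",
           "New York"       = "The University of New York",
           "North Carolina" = "The University of North Carolina")

dd <- character(length(nm))   # "" means no rule has matched yet
for (pat in names(rules)) {
    # \\b restricts the match to whole words; earlier rules take precedence
    hit <- grepl(paste0("\\b", pat, "\\b"), nm) & dd == ""
    dd[hit] <- rules[[pat]]
}

# Q2: test for a whole word without strsplit() and %in%
found <- grepl("\\bac\\b", "ac dc")
```

Here found is TRUE, the same answer as the %in% version, while the same pattern applied to "acdc" would give FALSE because "ac" is not a separate word there.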

>
> i.e
>> "ac" %in% unlist(strsplit("ac dc", " "))
> [1] TRUE
>-- 

David Winsemius, MD
Heritage Laboratories
West Hartford, CT
