Sorry. Bad example on my part. Try this. V1 is ...

V1
alabama
bates
tuscaloosa
smith
arkansas
fayette
little rock
alaska
juneau
nome

And I want:

V1        V2
alabama   bates
alabama   tuscaloosa
alabama   smith
arkansas  fayette
arkansas  little rock
alaska    juneau
alaska    nome

This is more representative of the problem, extended to all 50 states.

- Nick

On Jan 3, 2015, at 9:22 PM, Ista Zahn wrote:

> I'm not sure what's so complicated about that (am I missing
> something?). You can search using grep, and replace using gsub, so
>
> tmpDF <- read.table(text="V1 V2
> A 5
> a1 1
> a2 1
> a3 1
> a4 1
> a5 1
> B 4
> b1 1
> b2 1
> b3 1
> b4 1",
> header=TRUE)
> tmpDF <- tmpDF[grepl("[0-9]", tmpDF$V1), ]
> data.frame(tmpDF, V3 = toupper(gsub("[0-9]", "", tmpDF$V1)))
>
> Seems to do the trick.
>
> Best,
> Ista
>
> On Sat, Jan 3, 2015 at 9:41 PM, npretnar <npretnar at gmail.com> wrote:
>> I have a string variable (V1) in a data frame structured as follows:
>>
>> V1 V2
>> A  5
>> a1 1
>> a2 1
>> a3 1
>> a4 1
>> a5 1
>> B  4
>> b1 1
>> b2 1
>> b3 1
>> b4 1
>>
>> I want the following:
>>
>> V1 V2 V3
>> a1 1  A
>> a2 1  A
>> a3 1  A
>> a4 1  A
>> a5 1  A
>> b1 1  B
>> b2 1  B
>> b3 1  B
>> b4 1  B
>>
>> I am not sure how to go about making this transformation besides
>> writing a long vector that contains each of the categorical string
>> names (these are state names, so it would be a really long vector).
>> Any help would be greatly appreciated.
>>
>> Thanks,
>>
>> Nicholas Pretnar
>> Mizzou Economics Grad Assistant
>> npretnar at gmail.com
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
On Jan 3, 2015, at 9:20 PM, npretnar wrote:

> Sorry. Bad example on my part. Try this. V1 is ...
>
> V1
> alabama
> bates
> tuscaloosa
> smith
> arkansas
> fayette
> little rock
> alaska
> juneau
> nome
>
> And I want:
>
> V1        V2
> alabama   bates
> alabama   tuscaloosa
> alabama   smith
> arkansas  fayette
> arkansas  little rock
> alaska    juneau
> alaska    nome

dat$is_state <- grepl(tolower(paste(state.name, collapse="|")), dat$V1)
dat$thisstate <- cumsum(rownames(dat) %in% which(dat$is_state))
dat2 <- data.frame(V1 = dat$V1[dat$is_state][dat$thisstate[!dat$is_state]],
                   V2 = dat$V1[!dat$is_state])

> dat2
        V1         V2
1  alabama      bates
2  alabama tuscaloosa
3  alabama      smith
4 arkansas    fayette
5 arkansas     little
6 arkansas       rock
7   alaska     juneau
8   alaska       nome

--
David.

David Winsemius
Alameda, CA, USA
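A note on the output above: "little rock" ends up on two rows, presumably because the data were read with whitespace as the separator. The sketch below is one way to keep multi-word cities intact, assuming the input has one entry per line. The variable `txt` is a hypothetical stand-in for the real data, and exact matching with `%in%` is swapped in for the regular expression so that a city whose name merely contains a state name is not flagged as a state.

```r
# Hypothetical input: one entry per line (not whitespace-separated).
txt <- "alabama
bates
tuscaloosa
smith
arkansas
fayette
little rock
alaska
juneau
nome"

# Split on newlines only, so "little rock" stays a single entry.
dat <- data.frame(V1 = strsplit(txt, "\n", fixed = TRUE)[[1]],
                  stringsAsFactors = FALSE)

# Exact match against the built-in state names (lower-cased).
dat$is_state <- dat$V1 %in% tolower(state.name)

# Running count of state rows: every row gets the index of its state.
dat$thisstate <- cumsum(dat$is_state)

dat2 <- data.frame(V1 = dat$V1[dat$is_state][dat$thisstate[!dat$is_state]],
                   V2 = dat$V1[!dat$is_state],
                   stringsAsFactors = FALSE)
print(dat2)
```

This still assumes the first row is a state and that no city shares its full name with a state; either case would need separate handling.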
I'm coming to R from Python, so I coded a Python3 solution:
#####################
data = """alabama
bates
tuscaloosa
smith
arkansas
fayette
little rock
alaska
juneau
nome
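""".splitlines()  # split on lines, not whitespace, so "little rock" stays together
state_list = ["alabama", "arkansas", "alaska"]  # etc.
return_list = []
current_state = None  # set when the first state name is seen
for word in data:
    if word in state_list:
        current_state = word
    else:
        return_list.append([current_state, word])
print(return_list)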
#####################
... and then translated it to R:
#####################
data = "alabama
bates
tuscaloosa
smith
arkansas
fayette
little rock
alaska
juneau
nome
"
data = strsplit(data, split="\n")[[1]]
states = vector()
cities = vector()
for (word in data) {
  if (word %in% tolower(state.name)) {
    # a state name starts a new group
    current_state = word
  } else {
    # anything else is a city belonging to the most recent state
    states = c(states, current_state)
    cities = c(cities, word)
  }
}
print(data.frame(V1=states, V2=cities))
#####################
-John
> -----Original Message-----
> From: R-help [mailto:r-help-bounces at r-project.org] On Behalf Of David
> Winsemius
> Sent: Sunday, January 04, 2015 2:48 AM
> To: npretnar
> Cc: R-help at r-project.org
> Subject: Re: [R] Separating a Complicated String Vector
f <- function (x) {
    isState <- is.element(tolower(x), tolower(state.name))
    w <- which(isState)
    data.frame(State = x[rep(w, diff(c(w, length(x) + 1)) - 1L)],
               City = x[!isState])
}
E.g.,
V1 <- c("alabama", "bates", "tuscaloosa",
        "smith", "arkansas", "fayette",
        "little rock", "alaska", "juneau",
        "nome")
> f(V1)
     State        City
1  alabama       bates
2  alabama  tuscaloosa
3  alabama       smith
4 arkansas     fayette
5 arkansas little rock
6   alaska      juneau
7   alaska        nome
Bill Dunlap
TIBCO Software
wdunlap tibco.com
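For readers unpacking the `rep(w, diff(...))` step in the function above, here is a small sketch of the intermediate values on the same example vector. The name `n_cities` is introduced only for illustration and is not part of the original function.

```r
V1 <- c("alabama", "bates", "tuscaloosa",
        "smith", "arkansas", "fayette",
        "little rock", "alaska", "juneau",
        "nome")

isState <- is.element(tolower(V1), tolower(state.name))

# Positions of the state rows.
w <- which(isState)

# Gap between consecutive state positions (with a sentinel one past the end),
# minus the state row itself, gives the number of cities under each state.
n_cities <- diff(c(w, length(V1) + 1)) - 1L

# Repeat each state's position once per city it owns.
idx <- rep(w, n_cities)

print(w)         # 1 5 8
print(n_cities)  # 3 2 2
print(V1[idx])
```

A state with no cities gets a count of zero and simply drops out of the result, while entries appearing before the first state would make the state and city columns differ in length.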