Sorry. Bad example on my part. Try this. V1 is ...

V1
alabama
bates
tuscaloosa
smith
arkansas
fayette
little rock
alaska
juneau
nome

And I want:

V1        V2
alabama   bates
alabama   tuscaloosa
alabama   smith
arkansas  fayette
arkansas  little rock
alaska    juneau
alaska    nome

This is more representative of the problem, extended to all 50 states.

- Nick

On Jan 3, 2015, at 9:22 PM, Ista Zahn wrote:

> I'm not sure what's so complicated about that (am I missing
> something?). You can search using grep, and replace using gsub, so
>
> tmpDF <- read.table(text="V1 V2
> A 5
> a1 1
> a2 1
> a3 1
> a4 1
> a5 1
> B 4
> b1 1
> b2 1
> b3 1
> b4 1",
> header=TRUE)
> tmpDF <- tmpDF[grepl("[0-9]", tmpDF$V1), ]
> data.frame(tmpDF, V3 = toupper(gsub("[0-9]", "", tmpDF$V1)))
>
> Seems to do the trick.
>
> Best,
> Ista
>
> On Sat, Jan 3, 2015 at 9:41 PM, npretnar <npretnar at gmail.com> wrote:
>> I have a string variable (V1) in a data frame structured as follows:
>>
>> V1 V2
>> A 5
>> a1 1
>> a2 1
>> a3 1
>> a4 1
>> a5 1
>> B 4
>> b1 1
>> b2 1
>> b3 1
>> b4 1
>>
>> I want the following:
>>
>> V1 V2 V3
>> a1 1 A
>> a2 1 A
>> a3 1 A
>> a4 1 A
>> a5 1 A
>> b1 1 B
>> b2 1 B
>> b3 1 B
>> b4 1 B
>>
>> I am not sure how to go about making this transformation besides
>> writing a long vector that contains each of the categorical string
>> names (these are state names, so it would be a really long vector).
>> Any help would be greatly appreciated.
>>
>> Thanks,
>>
>> Nicholas Pretnar
>> Mizzou Economics Grad Assistant
>> npretnar at gmail.com
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
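The quoted grep/gsub suggestion relies on each label in the first example embedding its own group name ("a1" belongs to "A"), which is why it does not carry over to the state/city data. A rough Python rendering of that idea, offered only as a sketch (Python because a later reply in this thread also uses it; the names `rows` and `add_group_column` are made up for illustration):

```python
import re

# The first example as (V1, V2) pairs; the header rows "A" and "B" contain no digit.
rows = [("A", 5), ("a1", 1), ("a2", 1), ("a3", 1), ("a4", 1), ("a5", 1),
        ("B", 4), ("b1", 1), ("b2", 1), ("b3", 1), ("b4", 1)]

def add_group_column(rows):
    """Keep rows whose label contains a digit and derive V3 from the label."""
    out = []
    for label, value in rows:
        if re.search(r"[0-9]", label):                   # like grepl("[0-9]", V1)
            group = re.sub(r"[0-9]", "", label).upper()  # like toupper(gsub(...))
            out.append((label, value, group))
    return out

result = add_group_column(rows)
for row in result:
    print(*row)
```

With city names there is nothing in "bates" that points back to "alabama", so the group has to come from row order instead.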
On Jan 3, 2015, at 9:20 PM, npretnar wrote:

> Sorry. Bad example on my part. Try this. V1 is ...
>
> V1
> alabama
> bates
> tuscaloosa
> smith
> arkansas
> fayette
> little rock
> alaska
> juneau
> nome
>
> And I want:
>
> V1        V2
> alabama   bates
> alabama   tuscaloosa
> alabama   smith
> arkansas  fayette
> arkansas  little rock
> alaska    juneau
> alaska    nome

# flag rows whose V1 contains a (lower-case) state name
dat$is_state <- grepl(tolower(paste(state.name, collapse="|")), dat$V1)

# running count of state rows, i.e. the index of the current state
dat$thisstate <- cumsum(rownames(dat) %in% which(dat$is_state))

# pair every non-state row with the state it falls under
dat2 <- data.frame(V1 = dat$V1[dat$is_state][dat$thisstate[!dat$is_state]],
                   V2 = dat$V1[!dat$is_state])

> dat2
        V1         V2
1  alabama      bates
2  alabama tuscaloosa
3  alabama      smith
4 arkansas    fayette
5 arkansas     little
6 arkansas       rock
7   alaska     juneau
8   alaska       nome

--
David.

David Winsemius
Alameda, CA, USA
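The running-count idea in the reply above (flag the state rows, take a cumulative sum, and use it to index the states) can be sketched in Python as follows. This is an illustration under stated assumptions, not David's code: `states` is a three-name stand-in for R's built-in `state.name`, matching is exact rather than by substring, each entry is kept whole (so "little rock" stays one city), and the first entry is assumed to be a state.

```python
from itertools import accumulate

v1 = ["alabama", "bates", "tuscaloosa", "smith", "arkansas", "fayette",
      "little rock", "alaska", "juneau", "nome"]

# Assumption: a short stand-in for R's state.name, already lower-cased.
states = {"alabama", "arkansas", "alaska"}

# Exact-match flag for each entry (a substring match could also flag cities
# whose names contain a state name).
is_state = [x in states for x in v1]

# Running count of states seen so far: 1 for alabama's block, 2 for arkansas's, ...
group = list(accumulate(int(flag) for flag in is_state))

# The state names in order of appearance.
state_rows = [x for x, flag in zip(v1, is_state) if flag]

# Pair each non-state entry with the state its running count points at.
# Assumes the first entry is a state; otherwise group would start at 0.
pairs = [(state_rows[g - 1], x)
         for x, flag, g in zip(v1, is_state, group) if not flag]

for state, city in pairs:
    print(state, city, sep="\t")
```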
I'm coming to R from Python, so I coded a Python3 solution:

#####################
# split on lines, not on whitespace, so "little rock" stays one entry
data = """alabama
bates
tuscaloosa
smith
arkansas
fayette
little rock
alaska
juneau
nome
""".splitlines()

state_list = ["alabama", "arkansas", "alaska"]  # etc.
return_list = []
current_state = None  # set once the first state name is seen

for word in data:
    if word in state_list:
        current_state = word
    else:
        return_list.append([current_state, word])

print(return_list)
#####################

... and then translated it to R:

#####################
data = "alabama
bates
tuscaloosa
smith
arkansas
fayette
little rock
alaska
juneau
nome
"
data = strsplit(data, split="\n")[[1]]

states = vector()
cities = vector()
current_state = NA  # set once the first state name is seen

for (word in data) {
    if (word %in% tolower(state.name)) {
        current_state = word
    } else {
        states = c(states, current_state)
        cities = c(cities, word)
    }
}

print(data.frame(V1=states, V2=cities))
#####################

-John

> -----Original Message-----
> From: R-help [mailto:r-help-bounces at r-project.org] On Behalf Of David Winsemius
> Sent: Sunday, January 04, 2015 2:48 AM
> To: npretnar
> Cc: R-help at r-project.org
> Subject: Re: [R] Separating a Complicated String Vector
f <- function (x) {
    isState <- is.element(tolower(x), tolower(state.name))
    w <- which(isState)
    data.frame(State = x[rep(w, diff(c(w, length(x) + 1)) - 1L)],
               City = x[!isState])
}

E.g.,

V1 <- c("alabama", "bates", "tuscaloosa", "smith", "arkansas",
        "fayette", "little rock", "alaska", "juneau", "nome")
> f(V1)
     State        City
1  alabama       bates
2  alabama  tuscaloosa
3  alabama       smith
4 arkansas     fayette
5 arkansas little rock
6   alaska      juneau
7   alaska        nome

Bill Dunlap
TIBCO Software
wdunlap tibco.com
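The function above works by locating the state positions and repeating each state once per city that follows it (the gap to the next state, minus one). A Python sketch of the same counting logic, for readers following the Python version earlier in the thread; `split_states` and the three-name `states` set are illustrative assumptions standing in for `f` and `state.name`, and the input is assumed to start with a state.

```python
def split_states(x, states):
    """Pair each city with the most recent state, using position gaps."""
    is_state = [s.lower() in states for s in x]
    # Positions of the state entries, like which(isState) in R (0-based here).
    w = [i for i, flag in enumerate(is_state) if flag]
    # Cities under each state: distance to the next state (or the end), minus one.
    counts = [nxt - cur - 1 for cur, nxt in zip(w, w[1:] + [len(x)])]
    # Repeat each state name once per city, like rep(w, counts) in R.
    state_col = [x[i] for i, n in zip(w, counts) for _ in range(n)]
    city_col = [s for s, flag in zip(x, is_state) if not flag]
    # Assumes x starts with a state; otherwise the two columns differ in length.
    return list(zip(state_col, city_col))


v1 = ["alabama", "bates", "tuscaloosa", "smith", "arkansas", "fayette",
      "little rock", "alaska", "juneau", "nome"]
states = {"alabama", "arkansas", "alaska"}  # stand-in for all 50 names

table = split_states(v1, states)
for state, city in table:
    print(state, city, sep="\t")
```

Because the counts come from positions alone, a state with no cities under it simply contributes zero rows.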