Afternoon,

I unfortunately inherited a dataframe with a column that has many fields
smashed together. My goal is to split the strings in the column into
separate columns based on patterns.

Example of what I'm working with:

ugly <- c("Water temp:14: F Waterbody type:Permanent Lake/Pond: Water pH:Unkwn: Conductivity:Unkwn: Water color: Clear: Water turbidity: clear: Manmade:no Permanence:permanent: Max water depth: <3: Primary substrate: Silt/Mud: Evidence of cattle grazing: none: Shoreline Emergent Veg(%): 1-25: Fish present: yes: Fish species: unkwn: no amphibians observed")
ugly

Far as I can tell, there is not a single pattern that would work for
splitting this string. Splitting on ":" is close but not quite consistent.
Each of these attributes should be a separate column:

attributes <- c("Water temp", "Waterbody type", "Water pH", "Conductivity",
  "Water color", "Water turbidity", "Manmade", "Permanence",
  "Max water depth", "Primary substrate", "Evidence of cattle grazing",
  "Shoreline Emergent Veg(%)", "Fish present", "Fish species")

So, conceptually, I want to do something like this, where the string is
split for each of the patterns in attributes. However, strsplit only uses
the 1st value of attributes:

strsplit(ugly, attributes)

Should I loop through the values of "attributes"?
Is there an argument in strsplit I'm missing that will do what I want?
Different approach altogether?

Thanks! Happy Friday.
Joe

[[alternative HTML version deleted]]
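
The strsplit behaviour described above can be reproduced on a toy input. This is a minimal sketch, not part of the original post: the string `x` and the vector `keys` are made up for illustration. The `split` argument of strsplit is recycled along the elements of `x`, so with a single string only the first pattern is used. Collapsing the patterns into one regular expression with "|" makes every name act as a split point.

```r
# Hypothetical toy data standing in for `ugly` and `attributes`.
x <- "a=1 b=2 c=3"
keys <- c("a=", "b=", "c=")

# `split` is recycled along `x`; x has length 1, so only keys[1] is used.
first_only <- strsplit(x, keys)[[1]]

# One regex with alternation splits on any of the keys.
all_keys <- strsplit(x, paste(keys, collapse = "|"))[[1]]

print(first_only)
print(all_keys)
```

A match at the very start of the string produces a leading empty element, which usually has to be dropped before the pieces are used.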

> On Oct 14, 2016, at 4:16 PM, Joe Ceradini <joeceradini at gmail.com> wrote:
>
> Afternoon,
>
> I unfortunately inherited a dataframe with a column that has many fields
> smashed together. My goal is to split the strings in the column into
> separate columns based on patterns.
>
> Example of what I'm working with:
>
> ugly <- c("Water temp:14: F Waterbody type:Permanent Lake/Pond: Water
> pH:Unkwn:
> Conductivity:Unkwn: Water color: Clear: Water turbidity: clear:
> Manmade:no Permanence:permanent: Max water depth: <3: Primary
> substrate: Silt/Mud: Evidence of cattle grazing: none:
> Shoreline Emergent Veg(%): 1-25: Fish present: yes: Fish species: unkwn: no
> amphibians observed")
> ugly
>
> Far as I can tell, there is not a single pattern that would work for
> splitting this string. Splitting on ":" is close but not quite consistent.
> Each of these attributes should be a separate column:
>
> attributes <- c("Water temp", "Waterbody type", "Water pH", "Conductivity",
> "Water color", "Water turbidity", "Manmade", "Permanence", "Max water
> depth", "Primary substrate", "Evidence of cattle grazing", "Shoreline
> Emergent Veg(%)", "Fish present", "Fish species")
>
> So, conceptually, I want to do something like this, where the string is
> split for each of the patterns in attributes. However, strsplit only uses
> the 1st value of attributes
> strsplit(ugly, attributes)
>
> Should I loop through the values of "attributes"?
> Is there an argument in strsplit I'm missing that will do what I want?
> Different approach altogether?
>
> Thanks! Happy Friday.
> Joe
>
> [[alternative HTML version deleted]]

Need to post in plain text. We cannot see where your "carriage returns" are
located in that data. HTML uses some other character(s?) that doesn't get
translated by our mailserver.

> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html

Yes, please do read that.

> and provide commented, minimal, self-contained, reproducible code.

David Winsemius
Alameda, CA, USA

Replace newlines and colons with a space, since they seem to be junk. Then
generate a pattern that matches any of the attribute names, replace each
match with a comma, and finally read what is left into a data frame using
the attributes as column names. (I have indented each line of code below by
2 spaces so if any line starts before that then it's been wrapped around by
the email and needs to be adjusted.)

  attributes <- c("Water temp", "Waterbody type", "Water pH", "Conductivity",
    "Water color", "Water turbidity", "Manmade", "Permanence",
    "Max water depth", "Primary substrate", "Evidence of cattle grazing",
    "Shoreline Emergent Veg(%)", "Fish present", "Fish species")

  # colons and newlines are junk: turn them into spaces
  ugly2 <- gsub("[:\n]", " ", ugly)
  # punctuation in the names becomes "." so the parentheses in "Veg(%)"
  # are not read as regex syntax
  pat <- paste(gsub("([[:punct:]])", ".", attributes), collapse = "|")
  # each attribute name becomes a field separator
  ugly3 <- gsub(pat, ",", ugly2)
  # the text before the first name is an empty first field; [-1] drops it
  dd <- read.table(text = ugly3, sep = ",", strip.white = TRUE,
    col.names = c("", attributes))[-1]

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
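
A variation on the same idea, offered only as a sketch under stated assumptions rather than as the method from the reply above: split directly on the attribute names instead of going through read.table. The shortened `ugly` string and the four-element `attributes` vector below are made up for illustration. Quoting each name with \Q...\E (which needs perl = TRUE) is a swapped-in alternative to replacing punctuation with ".", so that "(%)" is matched literally. The sketch assumes every attribute occurs exactly once and in the listed order.

```r
# Hypothetical, shortened input in the same "name: value:" style as the post.
ugly <- "Water temp:14: F Water pH:Unkwn: Shoreline Emergent Veg(%): 1-25: Fish present: yes:"
attributes <- c("Water temp", "Water pH", "Shoreline Emergent Veg(%)", "Fish present")

# Quote each name literally (PCRE \Q...\E) and join the names with "|".
pat <- paste0("\\Q", attributes, "\\E", collapse = "|")

# Split on any attribute name; the first piece is the empty text before
# the first name, so drop it.
pieces <- strsplit(ugly, pat, perl = TRUE)[[1]][-1]

# Colons are separators only: blank them out, collapse runs of whitespace,
# and trim both ends.
values <- gsub(":", " ", pieces, fixed = TRUE)
values <- trimws(gsub("\\s+", " ", values))

# One column per attribute; check.names = FALSE keeps names such as
# "Water temp" unchanged.
dd <- data.frame(as.list(setNames(values, attributes)),
                 check.names = FALSE, stringsAsFactors = FALSE)
print(dd)
```

If an attribute can be missing from some rows, the pieces no longer line up with `attributes` by position, and the matched names would have to be extracted as well (for example with regmatches and gregexpr) before assigning columns.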