Joe Ceradini
2016-Oct-15 01:53 UTC
[R] Split strings based on multiple patterns (plain text)
Hopefully this looks better. I did not realize the gmail default was HTML.

I have a data frame with a column that has many fields smashed together. I need to split the strings in that column into separate columns based on patterns.

Example of a string that needs to be split:

ugly <- c("Water temp:14: F Waterbody type:Permanent Lake/Pond: Water pH:Unkwn: Conductivity:Unkwn: Water color: Clear: Water turbidity: clear: Manmade:no Permanence:permanent: Max water depth: <3: Primary substrate: Silt/Mud: Evidence of cattle grazing: none: Shoreline Emergent Veg(%): 1-25: Fish present: yes: Fish species: unkwn: no amphibians observed")
ugly

As far as I can tell, there is no single pattern that would work for splitting. Splitting on ":" is close, but not quite right. Each of the attributes below should be in a separate column, and all of them are present in the string (above) that needs to be split:

attributes <- c("Water temp", "Waterbody type", "Water pH", "Conductivity", "Water color", "Water turbidity", "Manmade", "Permanence", "Max water depth", "Primary substrate", "Evidence of cattle grazing", "Shoreline Emergent Veg(%)", "Fish present", "Fish species")

Conceptually, I want to use the vector of attributes to split the string. However, strsplit only uses the 1st value of the attributes object:

strplit(ugly, attributes).

Should I loop through the values of "attributes"?
Is there an argument in strsplit I'm missing that will do what I want?
Different approach altogether?

Thanks! Happy Friday.
Joe
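A small illustration of the behaviour described above, as a sketch rather than anything from the original post: the `split` argument of strsplit() is recycled along `x`, so with a single input string only the first pattern is ever used. The toy string `x` and the separators in `seps` are made up for this example.

```r
# Hypothetical toy data, not from the thread.
x <- "a-b_c"
seps <- c("-", "_")

# One input string: only seps[1] ("-") is applied.
one <- strsplit(x, seps, fixed = TRUE)[[1]]

# Two input strings: element 1 is split on "-", element 2 on "_".
both <- strsplit(rep(x, 2), seps, fixed = TRUE)

print(one)
print(both)
```

This is why passing the whole `attributes` vector as `split` does not split one string on every attribute; the patterns have to be combined into a single regular expression instead.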
Joe Ceradini
2016-Oct-15 01:55 UTC
[R] Split strings based on multiple patterns (plain text)
Should be strsplit(ugly, attributes), not strplit(ugly, attributes)....

Joe
--
Cooperative Fish and Wildlife Research Unit
Zoology and Physiology Dept.
University of Wyoming
JoeCeradini at gmail.com / 914.707.8506
wyocoopunit.org
David Winsemius
2016-Oct-15 06:40 UTC
[R] Split strings based on multiple patterns (plain text)
> On Oct 14, 2016, at 6:53 PM, Joe Ceradini <joeceradini at gmail.com> wrote:
>
> Conceptually, I want to use the vector of attributes to split the
> string. However, strsplit only uses the 1st value of the attributes
> object:
>
> strplit(ugly, attributes).

I tried this:

strsplit( ugly, split=paste0(attributes, collapse="|") )

And noticed some of the attributes were not actually splitting, so went back and did the data entry after making sure that there were no "\n"'s in the middle of attribute names:

dput(attributes)
c("Water temp", "Waterbody type", "Water pH", "Conductivity",
"Water color", "Water turbidity", "Manmade", "Permanence", "Max water depth",
"Primary substrate", "Evidence of cattle grazing", "Shoreline Emergent Veg(%)",
"Fish present", "Fish species")

strsplit( ugly, split=paste0(attributes, collapse="|") )
[[1]]
 [1] ""
 [2] ":14: F "
 [3] ":Permanent Lake/Pond: Water\npH:Unkwn: "
 [4] ":Unkwn: "
 [5] ": Clear: "
 [6] ":\nclear: "
 [7] ":no "
 [8] ":permanent: "
 [9] ": <3: Primary\nsubstrate: Silt/Mud: Evidence of cattle grazing: none: Shoreline\nEmergent Veg(%): 1-25: "
[10] ": yes: Fish species: unkwn: no\namphibians observed"

> Should I loop through the values of "attributes"?
> Is there an argument in strsplit I'm missing that will do what I want?

I don't think strsplit has such an argument. There may be packages that will support this. Perhaps the gsubfn package?

> Different approach altogether?
>
> Thanks! Happy Friday.
> Joe

David Winsemius
Alameda, CA, USA
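One further wrinkle with the alternation pattern, offered as a hedged sketch: the parentheses in "Shoreline Emergent Veg(%)" are regex metacharacters, so a plain paste(..., collapse = "|") pattern cannot match that name literally even when no newlines are involved. The sketch below escapes all punctuation and uses perl = TRUE. It assumes every attribute is present and appears in the order listed; `ugly` is rebuilt with paste() so it contains no embedded newlines, and the names `esc`, `pat`, `pieces` and `values` are introduced here only for illustration.

```r
# The example string from the thread, assembled without embedded newlines.
ugly <- paste(
  "Water temp:14: F Waterbody type:Permanent Lake/Pond: Water pH:Unkwn:",
  "Conductivity:Unkwn: Water color: Clear: Water turbidity: clear:",
  "Manmade:no Permanence:permanent: Max water depth: <3:",
  "Primary substrate: Silt/Mud: Evidence of cattle grazing: none:",
  "Shoreline Emergent Veg(%): 1-25: Fish present: yes:",
  "Fish species: unkwn: no amphibians observed")

attributes <- c("Water temp", "Waterbody type", "Water pH", "Conductivity",
                "Water color", "Water turbidity", "Manmade", "Permanence",
                "Max water depth", "Primary substrate",
                "Evidence of cattle grazing", "Shoreline Emergent Veg(%)",
                "Fish present", "Fish species")

# Backslash-escape every punctuation character so names match literally
# (with perl = TRUE, an escaped non-alphanumeric is always a literal).
esc <- gsub("([[:punct:]])", "\\\\\\1", attributes)
pat <- paste(esc, collapse = "|")

# Split on any attribute name; the first piece is "" because the string
# starts with a match, so it is dropped.
pieces <- strsplit(ugly, pat, perl = TRUE)[[1]][-1]

# Trim leading/trailing colons and spaces, then label the values.
# Assumption: attributes occur in the listed order.
values <- setNames(gsub("^[: ]+|[: ]+$", "", pieces), attributes)
print(values)
```

The stray ": F" after the temperature and the trailing free-text comment are left inside the values as they are; cleaning those up is a separate decision.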
Joe Ceradini
2016-Oct-15 20:32 UTC
[R] Split strings based on multiple patterns (plain text)
Thank you David Wolfskill, David Winsemius, and Gabor! All very helpful and interesting fixes for the problem (compiled below)! Now I will see which one works best on the 944 rows that each have a cell of smooshed attributes... the attribute names should be the same in all the rows, if there is any mercy :)

Joe Ceradini
University of Wyoming

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
On 10/14/16, David Wolfskill <david at catwhisker.org> wrote:
> Happy Friday, indeed.
>
> It seems to me that the data need a bit of cleanup before attempting to
> parse -- for example, that "F" looks to be improperly delimited by ':'
> on either side. I can't tell from a single example if that's typical
> (either for that field, or for random fields throughout the complete
> dataset). On the off-chance it's the former, here's a bit of exercise
> that may lead you a bit closer to a solution:
>
> First, starting with "ugly":
>
> > ugly <- c("Water temp:14: F Waterbody type:Permanent Lake/Pond: Water pH:Unkwn: Conductivity:Unkwn: Water color: Clear: Water turbidity: clear: Manmade:no Permanence:permanent: Max water depth: <3: Primary substrate: Silt/Mud: Evidence of cattle grazing: none: Shoreline Emergent Veg(%): 1-25: Fish present: yes: Fish species: unkwn: no amphibians observed")
> > ugly
> [1] "Water temp:14: F Waterbody type:Permanent Lake/Pond: Water pH:Unkwn: Conductivity:Unkwn: Water color: Clear: Water turbidity: clear: Manmade:no Permanence:permanent: Max water depth: <3: Primary substrate: Silt/Mud: Evidence of cattle grazing: none: Shoreline Emergent Veg(%): 1-25: Fish present: yes: Fish species: unkwn: no amphibians observed"
>
> # First, see what a naive strsplit() does:
>
> > strsplit(ugly, ":")
> [[1]]
>  [1] "Water temp"                  "14"
>  [3] " F Waterbody type"           "Permanent Lake/Pond"
>  [5] " Water pH"                   "Unkwn"
>  [7] " Conductivity"               "Unkwn"
>  [9] " Water color"                " Clear"
> [11] " Water turbidity"            " clear"
> [13] " Manmade"                    "no Permanence"
> [15] "permanent"                   " Max water depth"
> [17] " <3"                         " Primary substrate"
> [19] " Silt/Mud"                   " Evidence of cattle grazing"
> [21] " none"                       " Shoreline Emergent Veg(%)"
> [23] " 1-25"                       " Fish present"
> [25] " yes"                        " Fish species"
> [27] " unkwn"                      " no amphibians observed"
>
> # OK; let's fix the "F":
>
> > ugly1 <- sub(": F ", "F: ", ugly)
> > ugly1
> [1] "Water temp:14F: Waterbody type:Permanent Lake/Pond: Water pH:Unkwn: Conductivity:Unkwn: Water color: Clear: Water turbidity: clear: Manmade:no Permanence:permanent: Max water depth: <3: Primary substrate: Silt/Mud: Evidence of cattle grazing: none: Shoreline Emergent Veg(%): 1-25: Fish present: yes: Fish species: unkwn: no amphibians observed"
>
> # Now, that substring "Manmade:no Permanence:permanent:" is problematic;
> # the " " in there should apparently be ": " -- but we can't just do that
> # to all " " substrings, because that would also affect
> # "Permanence:permanent: Max water depth: <3:" -- the difference, though,
> # is that the one we don't want to change contains ": ", so let's change
> # those. I'm assuming(!) that we don't really care about leading or
> # trailing spaces in the fields:
>
> > ugly2 <- gsub(" *: *", ":", ugly1)
> > ugly2
> [1] "Water temp:14F:Waterbody type:Permanent Lake/Pond:Water pH:Unkwn:Conductivity:Unkwn:Water color:Clear:Water turbidity:clear:Manmade:no Permanence:permanent:Max water depth:<3:Primary substrate:Silt/Mud:Evidence of cattle grazing:none:Shoreline Emergent Veg(%):1-25:Fish present:yes:Fish species:unkwn:no amphibians observed"
>
> # Now that " " shows up like a sore thumb. Just to make the point even
> # clearer, try the "naive" strsplit on what we have:
>
> > strsplit(ugly2, ":")
> [[1]]
>  [1] "Water temp"                  "14F"
>  [3] "Waterbody type"              "Permanent Lake/Pond"
>  [5] "Water pH"                    "Unkwn"
>  [7] "Conductivity"                "Unkwn"
>  [9] "Water color"                 "Clear"
> [11] "Water turbidity"             "clear"
> [13] "Manmade"                     "no Permanence"
> [15] "permanent"                   "Max water depth"
> [17] "<3"                          "Primary substrate"
> [19] "Silt/Mud"                    "Evidence of cattle grazing"
> [21] "none"                        "Shoreline Emergent Veg(%)"
> [23] "1-25"                        "Fish present"
> [25] "yes"                         "Fish species"
> [27] "unkwn"                       "no amphibians observed"
>
> # Note element [14]: that's the one we need to fix. I'll assume(!)
> # that that sort of thing may occur just about anywhere, so let's just
> # whack 'em all:
>
> > ugly3 <- gsub(" ", ":", ugly2)
> > ugly3
> [1] "Water temp:14F:Waterbody type:Permanent Lake/Pond:Water pH:Unkwn:Conductivity:Unkwn:Water color:Clear:Water turbidity:clear:Manmade:no:Permanence:permanent:Max water depth:<3:Primary substrate:Silt/Mud:Evidence of cattle grazing:none:Shoreline Emergent Veg(%):1-25:Fish present:yes:Fish species:unkwn:no amphibians observed"
>
> # Again, check a naive strsplit():
>
> > strsplit(ugly3, ":")
> [[1]]
>  [1] "Water temp"                  "14F"
>  [3] "Waterbody type"              "Permanent Lake/Pond"
>  [5] "Water pH"                    "Unkwn"
>  [7] "Conductivity"                "Unkwn"
>  [9] "Water color"                 "Clear"
> [11] "Water turbidity"             "clear"
> [13] "Manmade"                     "no"
> [15] "Permanence"                  "permanent"
> [17] "Max water depth"             "<3"
> [19] "Primary substrate"           "Silt/Mud"
> [21] "Evidence of cattle grazing"  "none"
> [23] "Shoreline Emergent Veg(%)"   "1-25"
> [25] "Fish present"                "yes"
> [27] "Fish species"                "unkwn"
> [29] "no amphibians observed"
>
> # OK; not what we want, but it's a lot closer. Now, watch this:
>
> > ugly4 <- gsub("([^:]*:[^:]*): *", "\\1\001", ugly3, perl = TRUE)
> > strsplit(ugly4, "\001")
> [[1]]
>  [1] "Water temp:14F"                  "Waterbody type:Permanent Lake/Pond"
>  [3] "Water pH:Unkwn"                  "Conductivity:Unkwn"
>  [5] "Water color:Clear"               "Water turbidity:clear"
>  [7] "Manmade:no"                      "Permanence:permanent"
>  [9] "Max water depth:<3"              "Primary substrate:Silt/Mud"
> [11] "Evidence of cattle grazing:none" "Shoreline Emergent Veg(%):1-25"
> [13] "Fish present:yes"                "Fish species:unkwn"
> [15] "no amphibians observed"
>
> # At this point, at least elements [1] - [14] are each of the form
> # "tag:value", and thus, readily parsable. Element [15] appears to be
> # a somewhat-random comment; I suppose you could check for elements that
> # lack a (single) ':' and treat them "specially"....
>
> I hope that helps. Good luck!
>
> Peace,
> david
> --
> David H. Wolfskill david at catwhisker.org

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
On 10/15/16, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> Replace newlines and colons with a space since they seem to be junk,
> generate a pattern to replace the attributes with a comma and do the
> replacement and finally read in what is left into a data frame using
> the attributes as column names.
>
> (I have indented each line of code below by 2 spaces so if any line
> starts before that then it's been wrapped around by the email and
> needs to be adjusted.)
>
>   attributes <-
>     c("Water temp", "Waterbody type", "Water pH", "Conductivity",
>       "Water color", "Water turbidity", "Manmade", "Permanence", "Max water depth",
>       "Primary substrate", "Evidence of cattle grazing", "Shoreline Emergent Veg(%)",
>       "Fish present", "Fish species")
>
>   ugly2 <- gsub("[:\n]", " ", ugly)
>
>   pat <- paste(gsub("([[:punct:]])", ".", attributes), collapse = "|")
>   ugly3 <- gsub(pat, ",", ugly2)
>
>   dd <- read.table(text = ugly3, sep = ",", strip.white = TRUE,
>     col.names = c("", attributes))[-1]

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
On 10/15/16, David Winsemius <dwinsemius at comcast.net> wrote:
> I tried this:
>
> strsplit( ugly, split=paste0(attributes, collapse="|") )
>
> And noticed some of the attributes were not actually splitting, so went
> back and did the data entry after making sure that there were no "\n"'s
> in the middle of attribute names.
>
> I don't think strsplit has such an argument. There may be packages that
> will support this. Perhaps the gsubfn package?
>
> David Winsemius
> Alameda, CA, USA
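For the 944-row case mentioned above, one possible way to apply the same idea to a whole column is sketched here. This is not any poster's method: `parse_fields()`, `rows` and `keys` are hypothetical names, the two input strings are made up, and the sketch assumes no NA inputs. It uses gregexpr() and regmatches() so that each value is matched to whichever attribute name actually precedes it, which tolerates rows where attributes are missing or in a different order.

```r
# Hypothetical helper: one row per input string, one column per key.
parse_fields <- function(x, keys) {
  # Escape punctuation so names such as "Veg(%)" match literally.
  esc <- gsub("([[:punct:]])", "\\\\\\1", keys)
  pat <- paste(esc, collapse = "|")

  out <- matrix(NA_character_, nrow = length(x), ncol = length(keys),
                dimnames = list(NULL, keys))

  m <- gregexpr(pat, x, perl = TRUE)
  found <- regmatches(x, m)                 # keys, in order of appearance
  gaps  <- regmatches(x, m, invert = TRUE)  # text before/between/after keys

  for (i in seq_along(x)) {
    if (length(found[[i]]) == 0) next       # no known key in this row
    # Drop the text before the first key, then trim colons and spaces.
    vals <- gsub("^[: ]+|[: ]+$", "", gaps[[i]][-1])
    out[i, found[[i]]] <- vals
  }
  out
}

# Made-up rows: the second has a different order and a missing attribute.
rows <- c("Manmade:no Permanence:permanent: Fish present: yes:",
          "Fish present: no: Manmade:yes")
keys <- c("Manmade", "Permanence", "Fish present")

res <- parse_fields(rows, keys)
print(res)
```

The result is a character matrix; wrapping it in as.data.frame() should give a data frame if that is more convenient. Attributes that never appear in a row are left as NA, which makes rows with inconsistent attribute names easy to spot.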