thr3ads.net - R help - [R] Split strings based on multiple patterns (plain text) [Oct 2016]

Joe Ceradini

2016-Oct-15 01:53 UTC

[R] Split strings based on multiple patterns (plain text)

Hopefully this looks better. I did not realize gmail default was html.

I have a dataframe with a column that has many fields smashed together.
I need to split the strings in the column into separate columns based
on patterns.

Example of a string that needs to be split:

ugly <- c("Water temp:14: F Waterbody type:Permanent Lake/Pond: Water
pH:Unkwn: Conductivity:Unkwn: Water color: Clear: Water turbidity:
clear: Manmade:no  Permanence:permanent:  Max water depth: <3: Primary
substrate: Silt/Mud: Evidence of cattle grazing: none: Shoreline
Emergent Veg(%): 1-25: Fish present: yes: Fish species: unkwn: no
amphibians observed")
ugly

Far as I can tell, there is not a single pattern that would work for
splitting. Splitting on ":" is close, but not quite right. Each of the
below attributes should be in a separate column, and are present in
the string (above) that needs to be split:

attributes <- c("Water temp", "Waterbody type", "Water pH",
"Conductivity", "Water color", "Water turbidity", "Manmade",
"Permanence", "Max water depth", "Primary substrate", "Evidence of
cattle grazing", "Shoreline Emergent Veg(%)", "Fish present", "Fish
species")

Conceptually, I want to use the vector of attributes to split the
string. However, strsplit only uses the 1st value of the attributes
object:

strplit(ugly, attributes).

Should I loop through the values of "attributes"?
Is there an argument in strsplit I'm missing that will do what I want?
Different approach altogether?

Thanks! Happy Friday.
Joe
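
For reference, a small sketch of the strsplit behaviour described above. When split has more than one element it is recycled along x, so element i of x is split by element i of split, and a single input string only ever sees the first pattern. The toy strings below are made up for the demonstration and are not taken from the data.

```r
# Two input strings, two patterns: each string gets "its own" pattern.
x <- c("a:b", "c;d")
res <- strsplit(x, c(":", ";"))

# One input string, two patterns: only the first pattern (":") is used,
# so the ";" is left untouched.
one <- strsplit("a:b;c", c(":", ";"))

print(res)
print(one)
```

This is why passing the whole attributes vector as split does not split on every attribute; the patterns have to be combined into one regular expression, or applied one after another.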

Joe Ceradini

2016-Oct-15 01:55 UTC

should be strsplit(ugly, attributes) not strplit(ugly, attributes)....



-- 
Cooperative Fish and Wildlife Research Unit
Zoology and Physiology Dept.
University of Wyoming
JoeCeradini at gmail.com / 914.707.8506
wyocoopunit.org

David Winsemius

2016-Oct-15 06:40 UTC

> On Oct 14, 2016, at 6:53 PM, Joe Ceradini <joeceradini at gmail.com> wrote:
> Conceptually, I want to use the vector of attributes to split the
> string. However, strsplit only uses the 1st value of the attributes
> object:
> 
> strplit(ugly, attributes).
I tried this:

strsplit( ugly, split=paste0(attributes, collapse="|")  )

And noticed some of the attributes were not actually splitting, so I went back
and re-entered the data after making sure that there were no "\n"s in
the middle of attribute names:

dput(attributes)
c("Water temp", "Waterbody type", "Water pH", "Conductivity",
"Water color", "Water turbidity", "Manmade", "Permanence", "Max water depth",
"Primary substrate", "Evidence of cattle grazing", "Shoreline Emergent Veg(%)",
"Fish present", "Fish species")

strsplit( ugly, split=paste0(attributes, collapse="|")  )
[[1]]
 [1] ""
 [2] ":14: F "
 [3] ":Permanent Lake/Pond: Water\npH:Unkwn: "
 [4] ":Unkwn: "
 [5] ": Clear: "
 [6] ":\nclear: "
 [7] ":no  "
 [8] ":permanent:  "
 [9] ": <3: Primary\nsubstrate: Silt/Mud: Evidence of cattle grazing: none: Shoreline\nEmergent Veg(%): 1-25: "
[10] ": yes: Fish species: unkwn: no\namphibians observed"
> 
> Should I loop through the values of "attributes"?
> Is there an argument in strsplit I'm missing that will do what I want?
I don't think strsplit has such an argument. There may be packages that will
support this. Perhaps the gsubfn package?

> Different approach altogether?
> 
> Thanks! Happy Friday.
> Joe
> 
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
David Winsemius
Alameda, CA, USA
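
A possible refinement of the alternation pattern above, offered as a sketch rather than a tested solution for the full data. Two things stop some attributes from splitting: the newlines that the email wrapping put inside the string, and the "(" and ")" in "Shoreline Emergent Veg(%)", which are regex metacharacters. The code below escapes punctuation and lets any run of whitespace stand in for a space. The shortened ugly string and the cleanup of the values are assumptions made for illustration; it also assumes every attribute occurs exactly once and in the listed order.

```r
# Shortened stand-in for the real string, keeping the embedded newlines.
ugly <- paste0("Water temp:14: F Waterbody type:Permanent Lake/Pond: Water\n",
               "pH:Unkwn: Primary\nsubstrate: Silt/Mud: Shoreline\n",
               "Emergent Veg(%): 1-25: Fish present: yes")
attributes <- c("Water temp", "Waterbody type", "Water pH",
                "Primary substrate", "Shoreline Emergent Veg(%)",
                "Fish present")

# Put a backslash in front of every punctuation character so it matches
# literally (safe with perl = TRUE, where "\%" simply means "%").
esc <- gsub("([[:punct:]])", "\\\\\\1", attributes)

# Let a space in the attribute name match any whitespace, including "\n".
flex <- gsub(" ", "\\\\s+", esc)
pat <- paste(flex, collapse = "|")

pieces <- strsplit(ugly, pat, perl = TRUE)[[1]]

# The first piece is "" because the string starts with an attribute name.
# Strip leading/trailing colons and spaces from the remaining pieces.
values <- gsub("^[: ]+|[: ]+$", "", pieces[-1])
names(values) <- attributes
print(values)
```

If an attribute is missing from some rows, the values would no longer line up with the names, so a check on length(pieces) before assigning names seems prudent.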

Joe Ceradini

2016-Oct-15 20:32 UTC

Thank you David Wolfskill, David Winsemius, and Gabor! All very
helpful and interesting fixes for the problem (compiled below)! Now I
will see which one works best on the 944 rows that each have a cell of
smooshed attributes...the attribute names should be the same in all
the rows, if there is any mercy :)

Joe Ceradini
University of Wyoming

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On 10/14/16, David Wolfskill <david at catwhisker.org> wrote:
> Happy Friday, indeed.
>
> It seems to me that the data need a bit of cleanup before attempting to
> parse -- for example, that "F" looks to be improperly delimited by ':'
> on either side.  I can't tell from a single example if that's typical
> (either for that field, or for random fields throughout the complete
> dataset).  On the off-chance it's the former, here's a bit of exercise
> that may lead you a bit closer to a solution:
>
> First, starting with "ugly":
>
>> ugly <- c("Water temp:14: F Waterbody type:Permanent Lake/Pond: Water
>> pH:Unkwn: Conductivity:Unkwn: Water color: Clear: Water turbidity: clear:
>> Manmade:no  Permanence:permanent:  Max water depth: <3: Primary substrate:
>> Silt/Mud: Evidence of cattle grazing: none: Shoreline Emergent Veg(%):
>> 1-25: Fish present: yes: Fish species: unkwn: no amphibians observed")
>> ugly
> [1] "Water temp:14: F Waterbody type:Permanent Lake/Pond: Water pH:Unkwn:
> Conductivity:Unkwn: Water color: Clear: Water turbidity: clear: Manmade:no
> Permanence:permanent:  Max water depth: <3: Primary substrate: Silt/Mud:
> Evidence of cattle grazing: none: Shoreline Emergent Veg(%): 1-25: Fish
> present: yes: Fish species: unkwn: no amphibians observed"
>
> # First, see what a naive strsplit() does:
>
>> strsplit(ugly, ":")
> [[1]]
>  [1] "Water temp"                  "14"
>  [3] " F Waterbody type"           "Permanent Lake/Pond"
>  [5] " Water pH"                   "Unkwn"
>  [7] " Conductivity"               "Unkwn"
>  [9] " Water color"                " Clear"
> [11] " Water turbidity"            " clear"
> [13] " Manmade"                    "no  Permanence"
> [15] "permanent"                   "  Max water depth"
> [17] " <3"                         " Primary substrate"
> [19] " Silt/Mud"                   " Evidence of cattle grazing"
> [21] " none"                       " Shoreline Emergent Veg(%)"
> [23] " 1-25"                       " Fish present"
> [25] " yes"                        " Fish species"
> [27] " unkwn"                      " no amphibians observed"
>
> # OK; let's fix the "F":
>
>> ugly1 <- sub(": F ", "F: ", ugly)
>> ugly1
> [1] "Water temp:14F: Waterbody type:Permanent Lake/Pond: Water pH:Unkwn:
> Conductivity:Unkwn: Water color: Clear: Water turbidity: clear: Manmade:no
> Permanence:permanent:  Max water depth: <3: Primary substrate: Silt/Mud:
> Evidence of cattle grazing: none: Shoreline Emergent Veg(%): 1-25: Fish
> present: yes: Fish species: unkwn: no amphibians observed"
>
> # Now, that substring "Manmade:no  Permanence:permanent:" is problematic;
> # the "  " in there should apparently be ": " -- but we can't just do that
> # to all "  " substrings, because that would also affect
> # "Permanence:permanent:  Max water depth: <3:" -- the difference, though,
> # is that the one we don't want to change contains ":  ", so let's change
> # those.  I'm assuming(!) that we don't really care about leading or
> # trailing spaces in the fields:
>
>> ugly2 <- gsub(" *: *", ":", ugly1)
>> ugly2
> [1] "Water temp:14F:Waterbody type:Permanent Lake/Pond:Water
> pH:Unkwn:Conductivity:Unkwn:Water color:Clear:Water
> turbidity:clear:Manmade:no  Permanence:permanent:Max water depth:<3:Primary
> substrate:Silt/Mud:Evidence of cattle grazing:none:Shoreline Emergent
> Veg(%):1-25:Fish present:yes:Fish species:unkwn:no amphibians observed"
>
> # Now that "  " shows up like a sore thumb.  Just to make the point even
> # clearer, try the "naive" strsplit on what we have:
>
>> strsplit(ugly2, ":")
> [[1]]
>  [1] "Water temp"                 "14F"
>  [3] "Waterbody type"             "Permanent Lake/Pond"
>  [5] "Water pH"                   "Unkwn"
>  [7] "Conductivity"               "Unkwn"
>  [9] "Water color"                "Clear"
> [11] "Water turbidity"            "clear"
> [13] "Manmade"                    "no  Permanence"
> [15] "permanent"                  "Max water depth"
> [17] "<3"                         "Primary substrate"
> [19] "Silt/Mud"                   "Evidence of cattle grazing"
> [21] "none"                       "Shoreline Emergent Veg(%)"
> [23] "1-25"                       "Fish present"
> [25] "yes"                        "Fish species"
> [27] "unkwn"                      "no amphibians observed"
>
>>
>
> # Note element [14]:  that's the one we need to fix.  I'll assume(!)
> # that that sort of thing may occur just about anywhere, so let's just
> # whack 'em all:
>
>> ugly3 <- gsub("  ", ":", ugly2)
>> ugly3
> [1] "Water temp:14F:Waterbody type:Permanent Lake/Pond:Water
> pH:Unkwn:Conductivity:Unkwn:Water color:Clear:Water
> turbidity:clear:Manmade:no:Permanence:permanent:Max water depth:<3:Primary
> substrate:Silt/Mud:Evidence of cattle grazing:none:Shoreline Emergent
> Veg(%):1-25:Fish present:yes:Fish species:unkwn:no amphibians observed"
>
> # Again, check a naive strsplit():
>
>> strsplit(ugly3, ":")
> [[1]]
>  [1] "Water temp"                 "14F"
>  [3] "Waterbody type"             "Permanent Lake/Pond"
>  [5] "Water pH"                   "Unkwn"
>  [7] "Conductivity"               "Unkwn"
>  [9] "Water color"                "Clear"
> [11] "Water turbidity"            "clear"
> [13] "Manmade"                    "no"
> [15] "Permanence"                 "permanent"
> [17] "Max water depth"            "<3"
> [19] "Primary substrate"          "Silt/Mud"
> [21] "Evidence of cattle grazing" "none"
> [23] "Shoreline Emergent Veg(%)"  "1-25"
> [25] "Fish present"               "yes"
> [27] "Fish species"               "unkwn"
> [29] "no amphibians observed"
>
>>
>
> # OK; not what we want, but it's a lot closer.  Now, watch this:
>
>> ugly4 <- gsub("([^:]*:[^:]*): *", "\\1\001", ugly3, perl = TRUE)
>> strsplit(ugly4, "\001")
> [[1]]
>  [1] "Water temp:14F"                     "Waterbody type:Permanent Lake/Pond"
>  [3] "Water pH:Unkwn"                     "Conductivity:Unkwn"
>  [5] "Water color:Clear"                  "Water turbidity:clear"
>  [7] "Manmade:no"                         "Permanence:permanent"
>  [9] "Max water depth:<3"                 "Primary substrate:Silt/Mud"
> [11] "Evidence of cattle grazing:none"    "Shoreline Emergent Veg(%):1-25"
> [13] "Fish present:yes"                   "Fish species:unkwn"
> [15] "no amphibians observed"
>
>>
>
> # At this point, at least elements [1] - [14] are each of the form
> # "tag:value", and thus, readily parsable.  Element [15] appears to be
> # a somewhat-random comment; I suppose you could check for elements that
> # lack a (single) ':' and treat them "specially"....
>
> I hope that helps.  Good luck!
>
> Peace,
> david
> --
> David H. Wolfskill				david at catwhisker.org
> Those who would murder in the name of God or prophet are blasphemous
> cowards.
>
> See http://www.catwhisker.org/~david/publickey.gpg for my public key.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
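
One way the last step above might be finished, as a sketch only: take the "tag:value" pieces, split each at its first ':', and keep anything without a ':' as a free-text comment. The pieces vector is a shortened copy of the final strsplit() output above; the names has_tag, values, and comment are hypothetical.

```r
# A few of the pieces from the final strsplit() result above.
pieces <- c("Water temp:14F", "Waterbody type:Permanent Lake/Pond",
            "Manmade:no", "no amphibians observed")

# Elements containing ':' are tag:value pairs; the rest are comments.
has_tag <- grepl(":", pieces, fixed = TRUE)
tagged <- pieces[has_tag]

# Everything after the first ':' is the value, everything before is the tag.
values <- sub("^[^:]*:", "", tagged)
names(values) <- sub(":.*$", "", tagged)

# Collapse any untagged leftovers into one comment string.
comment <- paste(pieces[!has_tag], collapse = "; ")

print(values)
print(comment)
```

Indexing by name, e.g. values[["Manmade"]], then gives one field per attribute for that row.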

On 10/15/16, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> Replace newlines and colons with a space since they seem to be junk,
> generate a pattern to replace the attributes with a comma and do the
> replacement and finally read in what is left into a data frame using
> the attributes as column names.
>
> (I have indented each line of code below by 2 spaces so if any line
> starts before that then it's been wrapped around by the email and
> needs to be adjusted.)
>
>   attributes <-
>   c("Water temp", "Waterbody type", "Water pH", "Conductivity",
>   "Water color", "Water turbidity", "Manmade", "Permanence", "Max water depth",
>   "Primary substrate", "Evidence of cattle grazing", "Shoreline Emergent Veg(%)",
>   "Fish present", "Fish species")
>
>   ugly2 <- gsub("[:\n]", " ", ugly)
>
>   pat <- paste(gsub("([[:punct:]])", ".", attributes), collapse = "|")
>   ugly3 <- gsub(pat, ",", ugly2)
>
>   dd <- read.table(text = ugly3, sep = ",", strip.white = TRUE,
>     col.names = c("", attributes))[-1]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
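
A sketch of how the read.table() approach above might carry over to several rows at once, since the real column has 944 of them. The two-row ugly vector and the three-attribute list are invented for the example; "drop" is a hypothetical placeholder name for the empty leading column, and check.names = FALSE and as.is = TRUE are additions here (not part of the code above) to keep the attribute names and string values unchanged.

```r
attributes <- c("Water temp", "Manmade", "Shoreline Emergent Veg(%)")

# Two made-up rows; the first has a newline inside an attribute name.
ugly <- c("Water temp:14: F Manmade:no  Shoreline\nEmergent Veg(%): 1-25:",
          "Water temp:20: F Manmade:yes  Shoreline Emergent Veg(%): 26-50:")

# Colons and newlines become spaces, as in the code above.
ugly2 <- gsub("[:\n]", " ", ugly)

# Punctuation in attribute names becomes ".", so "(%)" matches any 3 chars.
pat <- paste(gsub("([[:punct:]])", ".", attributes), collapse = "|")

# Each attribute name becomes a comma, giving one CSV-like line per row.
ugly3 <- gsub(pat, ",", ugly2)

# The text before the first comma is empty, hence the dropped first column.
dd <- read.table(text = ugly3, sep = ",", strip.white = TRUE,
                 as.is = TRUE, check.names = FALSE,
                 col.names = c("drop", attributes))[-1]
print(dd)
```

This relies on every row listing the same attributes in the same order, and on no value containing a comma; rows that break either assumption would need to be handled separately.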
