thr3ads.net - R help - [R] Split strings based on multiple patterns [Oct 2016]

Joe Ceradini

2016-Oct-14 23:16 UTC

[R] Split strings based on multiple patterns

Afternoon,

I unfortunately inherited a dataframe with a column that has many fields
smashed together. My goal is to split the strings in the column into
separate columns based on patterns.

Example of what I'm working with:

ugly <- c("Water temp:14: F Waterbody type:Permanent Lake/Pond: Water
pH:Unkwn:
Conductivity:Unkwn: Water color: Clear: Water turbidity: clear:
Manmade:no  Permanence:permanent:  Max water depth: <3: Primary
substrate: Silt/Mud: Evidence of cattle grazing: none:
Shoreline Emergent Veg(%): 1-25: Fish present: yes: Fish species: unkwn: no
amphibians observed")
ugly

Far as I can tell, there is not a single pattern that would work for
splitting this string. Splitting on ":" is close but not quite
consistent.
Each of these attributes should be a separate column:

attributes <- c("Water temp", "Waterbody type", "Water pH", "Conductivity",
"Water color", "Water turbidity", "Manmade", "Permanence",
"Max water depth", "Primary substrate", "Evidence of cattle grazing",
"Shoreline Emergent Veg(%)", "Fish present", "Fish species")

So, conceptually, I want to do something like this, where the string is
split for each of the patterns in attributes. However, strsplit only uses
the 1st value of attributes
strsplit(ugly, attributes)

Should I loop through the values of "attributes"?
Is there an argument in strsplit I'm missing that will do what I want?
Different approach altogether?

Thanks! Happy Friday.
Joe

	[[alternative HTML version deleted]]

David Winsemius

2016-Oct-14 23:49 UTC

> On Oct 14, 2016, at 4:16 PM, Joe Ceradini <joeceradini at gmail.com>
wrote:
> 
> 	[[alternative HTML version deleted]]
You need to post in plain text. We cannot see where the line breaks
("carriage returns") are located in that data. HTML uses some other
character(s?) that don't get translated by our mail server.

> 
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
Yes, please do read that.
> and provide commented, minimal, self-contained, reproducible code.
David Winsemius
Alameda, CA, USA

Gabor Grothendieck

2016-Oct-15 11:50 UTC

Replace newlines and colons with a space, since they seem to be junk.
Then generate a pattern that matches any of the attributes, replace each
match with a comma, and finally read what is left into a data frame
using the attributes as column names.

(I have indented each line of code below by 2 spaces so if any line
starts before that then it's been wrapped around by the email and
needs to be adjusted.)

  attributes <-
  c("Water temp", "Waterbody type", "Water pH", "Conductivity",
  "Water color", "Water turbidity", "Manmade", "Permanence",
  "Max water depth", "Primary substrate",
  "Evidence of cattle grazing", "Shoreline Emergent Veg(%)",
  "Fish present", "Fish species")

  ugly2 <- gsub("[:\n]", " ", ugly)

  pat <- paste(gsub("([[:punct:]])", ".", attributes), collapse = "|")
  ugly3 <- gsub(pat, ",", ugly2)

  dd <- read.table(text = ugly3, sep = ",", strip.white = TRUE,
    col.names = c("", attributes))[-1]
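
On the strsplit question from the original post: strsplit recycles its
split argument along x, so with a single string only the first attribute
is ever used. One possible workaround, sketched below, is to collapse all
attributes into a single alternation pattern and split once. This is only
an illustration, not the method from the thread: the helper name
split_fields is made up, the positions of the "\n" line breaks in the
example string are a guess (the HTML post lost them), and the sketch
assumes every attribute occurs exactly once and in the listed order.

```r
# Attribute names as listed in the thread.
attributes <- c("Water temp", "Waterbody type", "Water pH", "Conductivity",
                "Water color", "Water turbidity", "Manmade", "Permanence",
                "Max water depth", "Primary substrate",
                "Evidence of cattle grazing", "Shoreline Emergent Veg(%)",
                "Fish present", "Fish species")

# The example string; the placement of "\n" is an assumption.
ugly <- paste0(
  "Water temp:14: F Waterbody type:Permanent Lake/Pond: Water pH:Unkwn:\n",
  "Conductivity:Unkwn: Water color: Clear: Water turbidity: clear:\n",
  "Manmade:no  Permanence:permanent:  Max water depth: <3: ",
  "Primary substrate: Silt/Mud: Evidence of cattle grazing: none:\n",
  "Shoreline Emergent Veg(%): 1-25: Fish present: yes: ",
  "Fish species: unkwn: no amphibians observed")

# Hypothetical helper: split one string on any of the keys.
split_fields <- function(x, keys) {
  # Treat line breaks as plain spaces.
  x <- gsub("\n", " ", x, fixed = TRUE)
  # \Q...\E (PCRE) quotes each key literally, so "Veg(%)" needs no escaping.
  pat <- paste0("\\Q", keys, "\\E", collapse = "|")
  pieces <- strsplit(x, pat, perl = TRUE)[[1]]
  # The first piece is whatever precedes the first key (here ""); drop it.
  pieces <- pieces[-1]
  # Fail loudly if a key is missing or repeated.
  stopifnot(length(pieces) == length(keys))
  # Trim leading/trailing colons and whitespace from each value.
  values <- gsub("^[[:space:]:]+|[[:space:]:]+$", "", pieces)
  setNames(values, keys)
}

fields <- split_fields(ugly, attributes)
fields[["Water pH"]]
```

For a whole column, something like
do.call(rbind, lapply(column, split_fields, keys = attributes)) would
stack the results into a character matrix with one column per attribute.
Unlike the gsub/read.table approach above, colons inside a value (for
example "14: F") are left in place.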




-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
