thr3ads.net - R help - [R] extending strsplit to handle missing text that doesn't have the target on which to split [May 2009]

Chris Evans

2009-May-07 13:30 UTC

[R] extending strsplit to handle missing text that doesn't have the target on which to split

I am sure there is an obvious answer to this that I'm missing but I
can't find it.  I'm parsing headers of Emails and most have a date like
this:
   "Wed, 16 Nov 2005 05:28:00 -0800"
and I can parse that using:

tmp.dat.data <- matrix(unlist(strsplit(headers$Date.line,",")),
    ncol = 2, byrow = TRUE)
before going on to look at the day and date/time data.

However, a very few headers I want to parse are missing the initial day
of the week and look like this:
   "15 Nov 2005 09:10:00 +0100"

That means that my use of strsplit() results in that date/time part
being all of the item in the list for those entries so the effect of
matrix(unlist()) is to pull the next list entry "up" in the matrix.
Because I happened to have only two errant entries I didn't see what was
happening for a moment. (An odd number gives a warning message about
dimensions not fitting, but an even number silently moves things
up/left so doesn't: no quarrel with that from me, my stupidity that I
was slow to see what was happening!)

I'm sure I should be able to find a simple way to get around this but at
the moment I can't.

Here's a simple, reproducible example:

dat <- c("Tue, 15 Nov 2005 09:44:50 EST",
         "15 Nov 2005 09:10:00 +0100",
         "Tue, 15 Nov 2005 09:44:50 EST",
	 "Tue, 15 Nov 2005 16:29:57 +0000",
	 "Wed, 16 Nov 2005 07:00:45 EST",
	 "Wed, 16 Nov 2005 05:28:00 -0800",
	 "Wed, 16 Nov 2005 14:06:21 +0000",
         "15 Nov 2005 09:10:00 +0100")
tmp.dat.data <- matrix(unlist(strsplit(dat, ",")), ncol = 2, byrow = TRUE)


tmp.dat.data comes out as a 7x2 matrix with contents:

     [,1]                          [,2]
[1,] "Tue"                         " 15 Nov 2005 09:44:50 EST"
[2,] "15 Nov 2005 09:10:00 +0100"  "Tue"
[3,] " 15 Nov 2005 09:44:50 EST"   "Tue"
[4,] " 15 Nov 2005 16:29:57 +0000" "Wed"
[5,] " 16 Nov 2005 07:00:45 EST"   "Wed"
[6,] " 16 Nov 2005 05:28:00 -0800" "Wed"
[7,] " 16 Nov 2005 14:06:21 +0000" "15 Nov 2005 09:10:00 +0100"

I'd like an 8x2 matrix with tmp.dat.data[2,1] == "" and
tmp.dat.data[8,1] == ""

I'm sure there must be a simple way to achieve this by rolling a
slightly different variant of strsplit that pads things and then
applying that to the input vector but I'm failing to see how to do this
at the moment.

TIA,

Chris

--
Applied researcher, neither statistician nor programmer!
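
The silent shift described above can be checked directly by counting how many pieces strsplit() returns for each header. This is only a minimal sketch reusing the example vector from the post; the names pieces and n.pieces are illustrative and not from the thread.

```r
dat <- c("Tue, 15 Nov 2005 09:44:50 EST",
         "15 Nov 2005 09:10:00 +0100",
         "Tue, 15 Nov 2005 09:44:50 EST",
         "Tue, 15 Nov 2005 16:29:57 +0000",
         "Wed, 16 Nov 2005 07:00:45 EST",
         "Wed, 16 Nov 2005 05:28:00 -0800",
         "Wed, 16 Nov 2005 14:06:21 +0000",
         "15 Nov 2005 09:10:00 +0100")

pieces   <- strsplit(dat, ",")
n.pieces <- sapply(pieces, length)  # number of pieces per header

n.pieces               # 2 1 2 2 2 2 2 1
which(n.pieces != 2)   # 2 8: the headers with no comma
sum(n.pieces)          # 14: an even total, so matrix() fills without a warning

# 14 values laid out in 2 columns give 7 rows instead of the 8 expected
nrow(matrix(unlist(pieces), ncol = 2, byrow = TRUE))   # 7
```

Checking that every element of n.pieces equals 2 before calling matrix(unlist()) would have flagged the two errant headers up front.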

jim holtman

2009-May-08 11:56 UTC

[R] extending strsplit to handle missing text that doesn't have the target on which to split

Find the values that are missing a comma and add it:
> dat <- c("Tue, 15 Nov 2005 09:44:50 EST",
+         "15 Nov 2005 09:10:00 +0100",
+         "Tue, 15 Nov 2005 09:44:50 EST",
+         "Tue, 15 Nov 2005 16:29:57 +0000",
+         "Wed, 16 Nov 2005 07:00:45 EST",
+         "Wed, 16 Nov 2005 05:28:00 -0800",
+         "Wed, 16 Nov 2005 14:06:21 +0000",
+         "15 Nov 2005 09:10:00 +0100")
> # add comma if missing
> missing <- !grepl(',', dat)
> dat[missing] <- paste('', dat[missing], sep=',')
> tmp.dat.data <- matrix(unlist(strsplit(dat,",")),ncol = 2, byrow = TRUE)
>
> tmp.dat.data
     [,1]  [,2]
[1,] "Tue" " 15 Nov 2005 09:44:50 EST"
[2,] ""    "15 Nov 2005 09:10:00 +0100"
[3,] "Tue" " 15 Nov 2005 09:44:50 EST"
[4,] "Tue" " 15 Nov 2005 16:29:57 +0000"
[5,] "Wed" " 16 Nov 2005 07:00:45 EST"
[6,] "Wed" " 16 Nov 2005 05:28:00 -0800"
[7,] "Wed" " 16 Nov 2005 14:06:21 +0000"
[8,] ""    "15 Nov 2005 09:10:00 +0100"
>

-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
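
The original question asked for a variant of strsplit() that pads missing fields. One possible sketch of that idea follows. It is not code from the thread: the function name split.first and the column names day and datetime are hypothetical, and it assumes that only the first comma matters and that leading spaces in the date/time part are unwanted (the output shown in the thread keeps them).

```r
# Hypothetical helper: split each string at its first separator and
# return a two-column matrix, using "" on the left when none is found.
split.first <- function(x, split = ",") {
  # position of the first separator, -1 where there is none;
  # as.integer() drops the attributes regexpr() attaches
  pos <- as.integer(regexpr(split, x, fixed = TRUE))
  has <- pos > 0
  left  <- ifelse(has, substr(x, 1, pos - 1), "")
  right <- ifelse(has, substr(x, pos + 1, nchar(x)), x)
  right <- sub("^ +", "", right)   # drop leading spaces after the comma
  cbind(day = left, datetime = right)
}

dat <- c("Tue, 15 Nov 2005 09:44:50 EST",
         "15 Nov 2005 09:10:00 +0100",
         "Tue, 15 Nov 2005 09:44:50 EST",
         "Tue, 15 Nov 2005 16:29:57 +0000",
         "Wed, 16 Nov 2005 07:00:45 EST",
         "Wed, 16 Nov 2005 05:28:00 -0800",
         "Wed, 16 Nov 2005 14:06:21 +0000",
         "15 Nov 2005 09:10:00 +0100")

tmp.dat.data <- split.first(dat)
dim(tmp.dat.data)    # 8 2
tmp.dat.data[2, 1]   # ""
tmp.dat.data[8, 1]   # ""
```

Unlike the paste-a-comma approach, this leaves dat unmodified and always yields one row per input element, at the cost of handling only a single split point.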
