Chris Evans
2009-May-07 13:30 UTC
[R] extending strsplit to handle missing text that doesn't have the target on which to split
I am sure there is an obvious answer to this that I'm missing but I
can't find it. I'm parsing headers of emails and most have a date like
this:
"Wed, 16 Nov 2005 05:28:00 -0800"
and I can parse that using:
tmp.dat.data <- matrix(unlist(strsplit(headers$Date.line, ",")),
                       ncol = 2, byrow = TRUE)
before going on to look at the day and date/time data.
However, a very few headers I want to parse are missing the initial day
of the week and look like this:
"15 Nov 2005 09:10:00 +0100"
That means that strsplit() returns the whole date/time string as the
only element of the list item for those entries, so the effect of
matrix(unlist()) is to pull the next list entry "up" in the matrix.
Because I happened to have only two errant entries I didn't see what was
happening for a moment. (An odd number gives a warning message about
dimensions not fitting, but an even number silently moves things
up/left and so doesn't: no quarrel with that from me, my stupidity that I
was slow to see what was happening!)
I'm sure I should be able to find a simple way to get around this but at
the moment I can't.
Here's a simple, reproducible example:
dat <- c("Tue, 15 Nov 2005 09:44:50 EST",
         "15 Nov 2005 09:10:00 +0100",
         "Tue, 15 Nov 2005 09:44:50 EST",
         "Tue, 15 Nov 2005 16:29:57 +0000",
         "Wed, 16 Nov 2005 07:00:45 EST",
         "Wed, 16 Nov 2005 05:28:00 -0800",
         "Wed, 16 Nov 2005 14:06:21 +0000",
         "15 Nov 2005 09:10:00 +0100")
tmp.dat.data <- matrix(unlist(strsplit(dat, ",")), ncol = 2, byrow = TRUE)
tmp.dat.data comes out as a 7x2 matrix with these contents:
     [,1]                          [,2]
[1,] "Tue"                         " 15 Nov 2005 09:44:50 EST"
[2,] "15 Nov 2005 09:10:00 +0100"  "Tue"
[3,] " 15 Nov 2005 09:44:50 EST"   "Tue"
[4,] " 15 Nov 2005 16:29:57 +0000" "Wed"
[5,] " 16 Nov 2005 07:00:45 EST"   "Wed"
[6,] " 16 Nov 2005 05:28:00 -0800" "Wed"
[7,] " 16 Nov 2005 14:06:21 +0000" "15 Nov 2005 09:10:00 +0100"
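The shift can be confirmed by counting how many pieces strsplit() returns for each entry. A minimal sketch on a shortened version of the data (the name n.pieces is only illustrative):

```r
# Shortened sample: the second entry lacks the leading "Day," part
dat <- c("Tue, 15 Nov 2005 09:44:50 EST",
         "15 Nov 2005 09:10:00 +0100",
         "Wed, 16 Nov 2005 07:00:45 EST")

# Number of pieces strsplit() produces for each element
n.pieces <- sapply(strsplit(dat, ","), length)
n.pieces

# Positions of the entries that did not split into exactly two pieces
which(n.pieces != 2)
```

Any position reported by which() marks an entry that will push later values out of their column in matrix(unlist()).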
I'd like an 8x2 matrix with tmp.dat.data[2,1] == "" and
tmp.dat.data[8,1] == ""
I'm sure there must be a simple way to achieve this by rolling a
slightly different variant of strsplit that pads things and then
applying that to the input vector but I'm failing to see how to do this
at the moment.
TIA,
Chris
--
Applied researcher, neither statistician nor programmer!
jim holtman
2009-May-08 11:56 UTC
[R] extending strsplit to handle missing text that doesn't have the target on which to split
Find the values that are missing a comma and add it:

> dat <- c("Tue, 15 Nov 2005 09:44:50 EST",
+ "15 Nov 2005 09:10:00 +0100",
+ "Tue, 15 Nov 2005 09:44:50 EST",
+ "Tue, 15 Nov 2005 16:29:57 +0000",
+ "Wed, 16 Nov 2005 07:00:45 EST",
+ "Wed, 16 Nov 2005 05:28:00 -0800",
+ "Wed, 16 Nov 2005 14:06:21 +0000",
+ "15 Nov 2005 09:10:00 +0100")
> # add comma if missing
> missing <- !grepl(',', dat)
> dat[missing] <- paste('', dat[missing], sep=',')
> tmp.dat.data <- matrix(unlist(strsplit(dat,",")),ncol = 2, byrow = TRUE)
>
> tmp.dat.data
     [,1]  [,2]
[1,] "Tue" " 15 Nov 2005 09:44:50 EST"
[2,] ""    "15 Nov 2005 09:10:00 +0100"
[3,] "Tue" " 15 Nov 2005 09:44:50 EST"
[4,] "Tue" " 15 Nov 2005 16:29:57 +0000"
[5,] "Wed" " 16 Nov 2005 07:00:45 EST"
[6,] "Wed" " 16 Nov 2005 05:28:00 -0800"
[7,] "Wed" " 16 Nov 2005 14:06:21 +0000"
[8,] ""    "15 Nov 2005 09:10:00 +0100"
>

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
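An alternative that leaves the input vector untouched is to pad after splitting, along the lines of the "variant of strsplit that pads things" asked about in the original question. The following is only a sketch: split_day_date() is a made-up name, not an existing R function, and it assumes each string contains at most one comma, followed by the date text.

```r
# split_day_date() is a hypothetical helper, not part of base R.
# Assumption: each string contains at most one comma.
split_day_date <- function(x) {
  pieces <- strsplit(x, ",", fixed = TRUE)
  padded <- lapply(pieces, function(p) {
    # Prepend "" until the element has exactly two pieces
    c(rep("", 2 - length(p)), p)
  })
  matrix(unlist(padded), ncol = 2, byrow = TRUE)
}

dat <- c("Tue, 15 Nov 2005 09:44:50 EST",
         "15 Nov 2005 09:10:00 +0100",
         "Tue, 15 Nov 2005 09:44:50 EST",
         "Tue, 15 Nov 2005 16:29:57 +0000",
         "Wed, 16 Nov 2005 07:00:45 EST",
         "Wed, 16 Nov 2005 05:28:00 -0800",
         "Wed, 16 Nov 2005 14:06:21 +0000",
         "15 Nov 2005 09:10:00 +0100")

res <- split_day_date(dat)
dim(res)          # 8 2
res[c(2, 8), 1]   # "" ""
```

A string with two or more commas makes rep() fail with an error in this sketch, so such input is reported instead of silently shifting the columns.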