Michael Boulineau
2019-May-17 20:18 UTC
[R] how to separate string from numbers in a large txt file
Very interesting. I'm sure I'll be trying to get rid of the byte order mark eventually. But right now, I'm more worried about getting the character vector into either a csv file or a data.frame, so that I can work with the data neatly tabulated into four columns: date, time, person, comment. I assume it's a write.csv function, but I don't know what arguments to put in it. header=FALSE? fill=T?

Michael

On Fri, May 17, 2019 at 1:03 PM Jeff Newmiller <jdnewmil at dcn.davis.ca.us> wrote:
>
> If byte order mark is the issue then you can specify the file encoding as "UTF-8-BOM" and it won't show up in your data any more.
>
> On May 17, 2019 12:12:17 PM PDT, William Dunlap via R-help <r-help at r-project.org> wrote:
>> The pattern I gave worked for the lines that you originally showed from the data file ('a'), before you put commas into them. If the name is either of the form "<name>" or "***" then the "(<[^>]*>)" needs to be changed to something like "(<[^>]*>|[*]{3})".
>>
>> The " ???" at the start of the imported data may come from the byte order mark that Windows apps like to put at the front of a text file in UTF-8 or UTF-16 format.
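[A note on the write.csv question above: header= and fill= are arguments to the read.table() family, not to write.csv(). A minimal sketch, not from the thread (the sample vector and output file name are made up), of parsing a comma-joined character vector like 'b' into a data frame and writing it out. Note that commas inside comments would still break this, which is why strcapture() on the original lines is safer:]

```r
# 'b' stands in for the comma-joined vector built earlier with sub()
b <- c("2016-07-01,02:50:35,<John Doe>,hey",
       "2016-07-01,02:51:26,<John Doe>,waiting for plane to Edinburgh")

# read.csv(text = ...) parses a character vector as if it were a file;
# its comment.char defaults to "", so a field like "Admit#8242" survives
df <- read.csv(text = b, header = FALSE,
               col.names = c("date", "time", "person", "comment"),
               stringsAsFactors = FALSE)

# write.csv() takes row.names, not header= or fill=
write.csv(df, "hangouts.csv", row.names = FALSE)
```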
>>
>> Bill Dunlap
>> TIBCO Software
>> wdunlap tibco.com
>>
>> On Fri, May 17, 2019 at 11:53 AM Michael Boulineau <michael.p.boulineau at gmail.com> wrote:
>>
>>> This seemed to work:
>>>
>>> > a <- readLines("hangouts-conversation-6.csv.txt")
>>> > b <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", a)
>>> > b[1:84]
>>>
>>> And the first 84 lines look like this:
>>>
>>> [83] "2016-06-28 21:02:28 *** Jane Doe started a video chat"
>>> [84] "2016-06-28 21:12:43 *** John Doe ended a video chat"
>>>
>>> Then they transition to the commas:
>>>
>>> > b[84:100]
>>> [1] "2016-06-28 21:12:43 *** John Doe ended a video chat"
>>> [2] "2016-07-01,02:50:35,<John Doe>,hey"
>>> [3] "2016-07-01,02:51:26,<John Doe>,waiting for plane to Edinburgh"
>>> [4] "2016-07-01,02:51:45,<John Doe>,thinking about my boo"
>>>
>>> Even the strange bit on line 6347 was caught by this:
>>>
>>> > b[6346:6348]
>>> [1] "2016-10-21,10:56:29,<John Doe>,John_Doe"
>>> [2] "2016-10-21,10:56:37,<John Doe>,Admit#8242"
>>> [3] "2016-10-21,11:00:13,<Jane Doe>,Okay so you have a discussion"
>>>
>>> Perhaps most awesomely, the code catches spaces that are interposed into the comment itself:
>>>
>>> > b[4]
>>> [1] "2016-01-27,09:15:20,<Jane Doe>,Hey "
>>> > b[85]
>>> [1] "2016-07-01,02:50:35,<John Doe>,hey"
>>>
>>> Notice whether there is a space after the "hey" or not.
>>>
>>> These are the first two lines:
>>>
>>> [1] "???2016-01-27 09:14:40 *** Jane Doe started a video chat"
>>> [2] "2016-01-27,09:15:20,<Jane Doe>,lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/GVqutPqbSuo/s0/be8ded30-87a6-4e80-bdfa-83ed51591dbf "
>>>
>>> So, who knows what happened with the ??? at the beginning of [1] directly above. But notice how there are no commas in [1] but they appear in [2].
>>> I don't see why really long ones like [2] directly above would be a problem, were they to be translated into a csv or data frame column.
>>>
>>> Now, with the commas in there, couldn't we write this into a csv or a data.frame? Some of this data will end up being garbage, I imagine. Like in [2] directly above. Or with [83] and [84] at the top of this discussion post/email. Embarrassingly, I've been trying to convert this into a data.frame or csv but I can't manage to. I've been using the write.csv function, but I don't think I've been getting the arguments correct.
>>>
>>> At the end of the day, I would like a data.frame and/or csv with the following four columns: date, time, person, comment.
>>>
>>> I tried this, too:
>>>
>>> > c <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}
>>> +   [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)",
>>> +   a, proto = data.frame(stringsAsFactors = FALSE, When = "", Who = "", What = ""))
>>>
>>> But all I got was this:
>>>
>>> > c[1:100, ]
>>>   When  Who What
>>> 1 <NA> <NA> <NA>
>>> 2 <NA> <NA> <NA>
>>> 3 <NA> <NA> <NA>
>>> 4 <NA> <NA> <NA>
>>> 5 <NA> <NA> <NA>
>>> 6 <NA> <NA> <NA>
>>>
>>> It seems to have caught nothing.
>>>
>>> > unique(c)
>>>   When  Who What
>>> 1 <NA> <NA> <NA>
>>>
>>> But I like that it converted into columns. That's a really great format. With a little tweaking, it'd be great code for this data set.
>>>
>>> Michael
>>>
>>> On Fri, May 17, 2019 at 8:20 AM William Dunlap via R-help <r-help at r-project.org> wrote:
>>>>
>>>> Consider using readLines() and strcapture() for reading such a file.
>>>> E.g., suppose readLines(files) produced a character vector like
>>>>
>>>> x <- c("2016-10-21 10:35:36 <Jane Doe> What's your login",
>>>>        "2016-10-21 10:56:29 <John Doe> John_Doe",
>>>>        "2016-10-21 10:56:37 <John Doe> Admit#8242",
>>>>        "October 23, 1819 12:34 <Jane Eyre> I am not an angel")
>>>>
>>>> Then you can make a data.frame with columns When, Who, and What by supplying a pattern containing three parenthesized capture expressions:
>>>>
>>>> > z <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)",
>>>> +     x, proto = data.frame(stringsAsFactors = FALSE, When = "", Who = "", What = ""))
>>>> > str(z)
>>>> 'data.frame': 4 obs. of 3 variables:
>>>>  $ When: chr "2016-10-21 10:35:36" "2016-10-21 10:56:29" "2016-10-21 10:56:37" NA
>>>>  $ Who : chr "<Jane Doe>" "<John Doe>" "<John Doe>" NA
>>>>  $ What: chr "What's your login" "John_Doe" "Admit#8242" NA
>>>>
>>>> Lines that don't match the pattern result in NA's - you might make a second pass over the corresponding elements of x with a new pattern.
>>>>
>>>> You can convert the When column from character to time with as.POSIXct().
>>>>
>>>> Bill Dunlap
>>>> TIBCO Software
>>>> wdunlap tibco.com
>>>>
>>>> On Thu, May 16, 2019 at 8:30 PM David Winsemius <dwinsemius at comcast.net> wrote:
>>>>
>>>>> On 5/16/19 3:53 PM, Michael Boulineau wrote:
>>>>>> OK.
>>>>>> So, I named the object test and then checked the 6347th item:
>>>>>>
>>>>>> > test <- readLines("hangouts-conversation.txt")
>>>>>> > test[6347]
>>>>>> [1] "2016-10-21 10:56:37 <John Doe> Admit#8242"
>>>>>>
>>>>>> Perhaps where it was getting screwed up is, since the end of this is a number (8242), then, given that there's no space between the number and what ought to be the next row, R didn't know where to draw the line. Sure enough, it looks like this when I go to the original file and control-f "#8242":
>>>>>>
>>>>>> 2016-10-21 10:35:36 <Jane Doe> What's your login
>>>>>> 2016-10-21 10:56:29 <John Doe> John_Doe
>>>>>> 2016-10-21 10:56:37 <John Doe> Admit#8242
>>>>>
>>>>> An octothorpe is interpreted as starting a comment that runs to the end of the line. You can prevent that interpretation with a suitable choice of parameters to `read.table` or `read.csv`. I don't understand why that should cause any error or a failure to match that pattern.
>>>>>
>>>>>> 2016-10-21 11:00:13 <Jane Doe> Okay so you have a discussion
>>>>>>
>>>>>> Again, it doesn't look like that in the file. Gmail automatically formats it like that when I paste it in. More to the point, it looks like
>>>>>>
>>>>>> 2016-10-21 10:35:36 <Jane Doe> What's your login2016-10-21 10:56:29 <John Doe> John_Doe2016-10-21 10:56:37 <John Doe> Admit#82422016-10-21 11:00:13 <Jane Doe> Okay so you have a discussion
>>>>>>
>>>>>> Notice Admit#82422016. So there's that.
>>>>>>
>>>>>> Then I built object test2.
>>>>>>
>>>>>> test2 <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", test)
>>>>>>
>>>>>> This worked for 84 lines, then this happened.
> >> > > > >> > > It may have done something but as you later discovered my first > >code > >> for > >> > > the pattern was incorrect. I had tested it (and pasted in the > >results > >> of > >> > > the test) . The way to refer to a capture class is with > >back-slashes > >> > > before the numbers, not forward-slashes. Try this: > >> > > > >> > > > >> > > > newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", > >"\\1,\\2,\\3,\\4", > >> chrvec) > >> > > > newvec > >> > > [1] "2016-07-01,02:50:35,<john>,hey" > >> > > [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh" > >> > > [3] "2016-07-01,02:51:45,<john>,thinking about my boo" > >> > > [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened, > >not > >> really" > >> > > [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, > >didn't > >> sleep" > >> > > [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or > >where I am > >> > > really" > >> > > [7] "2016-07-01,02:54:17,<john>,just know it's london" > >> > > [8] "2016-07-01,02:56:44,<jane>,you are probably asleep" > >> > > [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good > >eay" > >> > > [10] "2016-07-01 02:58:56 <jone>" > >> > > [11] "2016-07-01 02:59:34 <jane>" > >> > > [12] "2016-07-01,03:02:48,<john>,British security is a little > >more > >> > > rigorous..." > >> > > > >> > > > >> > > I made note of the fact that the 10th and 11th lines had no > >commas. > >> > > > >> > > > > >> > > >> test2 [84] > >> > > > [1] "2016-06-28 21:12:43 *** John Doe ended a video chat" > >> > > > >> > > That line didn't have any "<" so wasn't matched. 
> >> > > > >> > > > >> > > You could remove all none matching lines for pattern of > >> > > > >> > > dates<space>times<space>"<"<name>">"<space><anything> > >> > > > >> > > > >> > > with: > >> > > > >> > > > >> > > chrvec <- chrvec[ grepl("^.{10} .{8} <.+> .+$)", chrvec)] > >> > > > >> > > > >> > > Do read: > >> > > > >> > > ?read.csv > >> > > > >> > > ?regex > >> > > > >> > > > >> > > -- > >> > > > >> > > David > >> > > > >> > > > >> > > >> test2 [85] > >> > > > [1] "//1,//2,//3,//4" > >> > > >> test [85] > >> > > > [1] "2016-07-01 02:50:35 <John Doe> hey" > >> > > > > >> > > > Notice how I toggled back and forth between test and test2 > >there. So, > >> > > > whatever happened with the regex, it happened in the switch > >from 84 > >> to > >> > > > 85, I guess. It went on like > >> > > > > >> > > > [990] "//1,//2,//3,//4" > >> > > > [991] "//1,//2,//3,//4" > >> > > > [992] "//1,//2,//3,//4" > >> > > > [993] "//1,//2,//3,//4" > >> > > > [994] "//1,//2,//3,//4" > >> > > > [995] "//1,//2,//3,//4" > >> > > > [996] "//1,//2,//3,//4" > >> > > > [997] "//1,//2,//3,//4" > >> > > > [998] "//1,//2,//3,//4" > >> > > > [999] "//1,//2,//3,//4" > >> > > > [1000] "//1,//2,//3,//4" > >> > > > > >> > > > up until line 1000, then I reached max.print. > >> > > > >> > > > Michael > >> > > > > >> > > > On Thu, May 16, 2019 at 1:05 PM David Winsemius < > >> dwinsemius at comcast.net> > >> > > wrote: > >> > > >> > >> > > >> On 5/16/19 12:30 PM, Michael Boulineau wrote: > >> > > >>> Thanks for this tip on etiquette, David. I will be sure and > >not do > >> > > that again. 
>>>>>>>>
>>>>>>>> I tried read.fwf, with code like this:
>>>>>>>>
>>>>>>>> d <- read.fwf("hangouts-conversation.txt",
>>>>>>>>               widths = c(10, 10, 20, 40),
>>>>>>>>               col.names = c("date", "time", "person", "comment"),
>>>>>>>>               strip.white = TRUE)
>>>>>>>>
>>>>>>>> But it threw this error:
>>>>>>>>
>>>>>>>> Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
>>>>>>>>   line 6347 did not have 4 elements
>>>>>>>
>>>>>>> So what does line 6347 look like? (Use `readLines` and print it out.)
>>>>>>>
>>>>>>>> Interestingly, though, the error only happened when I increased the width size. But I had to increase the size, or else I couldn't "see" anything. The comment was so small that nothing was being captured by the size of the column, so to speak.
>>>>>>>>
>>>>>>>> It seems like what's throwing me is that there's no comma that demarcates the end of the text proper. For example:
>>>>>>>
>>>>>>> Not sure why you thought there should be a comma. Lines usually end with a <cr> and/or a <lf>.
>>>>>>>
>>>>>>> Once you have the raw text in a character vector from `readLines` named, say, 'chrvec', then you could selectively substitute commas for spaces with regex. (Now that you no longer desire to remove the dates and times.)
>>>>>>>
>>>>>>> sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", chrvec)
>>>>>>>
>>>>>>> This will not do any replacements when the pattern is not matched.
>>>>>>> See this test:
>>>>>>>
>>>>>>> > newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)
>>>>>>> > newvec
>>>>>>>  [1] "2016-07-01,02:50:35,<john>,hey"
>>>>>>>  [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh"
>>>>>>>  [3] "2016-07-01,02:51:45,<john>,thinking about my boo"
>>>>>>>  [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened, not really"
>>>>>>>  [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, didn't sleep"
>>>>>>>  [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or where I am really"
>>>>>>>  [7] "2016-07-01,02:54:17,<john>,just know it's london"
>>>>>>>  [8] "2016-07-01,02:56:44,<jane>,you are probably asleep"
>>>>>>>  [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good eay"
>>>>>>> [10] "2016-07-01 02:58:56 <jone>"
>>>>>>> [11] "2016-07-01 02:59:34 <jane>"
>>>>>>> [12] "2016-07-01,03:02:48,<john>,British security is a little more rigorous..."
>>>>>>>
>>>>>>> You should probably remove the "empty comment" lines.
>>>>>>>
>>>>>>> --
>>>>>>> David.
>>>>>>>
>>>>>>>> 2016-07-01 15:34:30 <John Doe> Lame. We were in a starbucks2016-07-01 15:35:02 <Jane Doe> Hmm that's interesting2016-07-01 15:35:09 <Jane Doe> You must want coffees2016-07-01 15:35:25 <John Doe> There was lots of Starbucks in my day2016-07-01 15:35:47
>>>>>>>>
>>>>>>>> It was interesting, too, when I pasted the text into the email, it self-formatted into the way I wanted it to look. I had to manually make it look like it does above, since that's the way that it looks in the txt file. I wonder if it's being organized by XML or something.
>>>>>>>> Anyways, there's always a space between the two angle brackets and the name, just like there is right now: <John Doe> See. Space. And there's always a space between the date and time. Like this: 2016-07-01 15:34:30 See. Space. But there's never a space between the end of the comment and the next date. Like this: We were in a starbucks2016-07-01 15:35:02 See. "starbucks" and "2016" are smooshed together.
>>>>>>>>
>>>>>>>> This code is also on the table right now too.
>>>>>>>>
>>>>>>>> a <- read.table("E:/working directory/-189/hangouts-conversation2.txt",
>>>>>>>>                 quote = "\"", comment.char = "", fill = TRUE)
>>>>>>>>
>>>>>>>> h <- cbind(hangouts.conversation2[, 1:2], hangouts.conversation2[, 3:5], hangouts.conversation2[, 6:9])
>>>>>>>>
>>>>>>>> aa <- gsub("[^[:digit:]]", "", h)
>>>>>>>> my.data.num <- as.numeric(str_extract(h, "[0-9]+"))
>>>>>>>>
>>>>>>>> Those last lines are a work in progress. I wish I could import a picture of what it looks like when it's translated into a data frame. The fill=TRUE helped to get the data in a table that kind of sort of works, but the comments keep bleeding into the date and time column. It's like
>>>>>>>>
>>>>>>>> 2016-07-01 15:59:17 <Jane Doe> Seriously I've never been over there
>>>>>>>> 2016-07-01 15:59:27 <Jane Doe> It confuses me :(
>>>>>>>>
>>>>>>>> And then, maybe, the "Seriously" will be in a column all to itself, as will be the "I've" and the "never", etc.
>>>>>>>>
>>>>>>>> I will use a regular expression if I have to, but it would be nice to keep the dates and times on there. Originally, I thought they were meaningless, but I've since changed my mind on that count.
>>>>>>>> The time of day isn't so important. But, especially since, say, Gmail itself knows how to quickly recognize what it is, I know it can be done. I know this data has structure to it.
>>>>>>>>
>>>>>>>> Michael
>>>>>>>>
>>>>>>>> On Wed, May 15, 2019 at 8:47 PM David Winsemius <dwinsemius at comcast.net> wrote:
>>>>>>>>> On 5/15/19 4:07 PM, Michael Boulineau wrote:
>>>>>>>>>> I have a wild and crazy text file, the head of which looks like this:
>>>>>>>>>>
>>>>>>>>>> 2016-07-01 02:50:35 <john> hey
>>>>>>>>>> 2016-07-01 02:51:26 <jane> waiting for plane to Edinburgh
>>>>>>>>>> 2016-07-01 02:51:45 <john> thinking about my boo
>>>>>>>>>> 2016-07-01 02:52:07 <jane> nothing crappy has happened, not really
>>>>>>>>>> 2016-07-01 02:52:20 <john> plane went by pretty fast, didn't sleep
>>>>>>>>>> 2016-07-01 02:54:08 <jane> no idea what time it is or where I am really
>>>>>>>>>> 2016-07-01 02:54:17 <john> just know it's london
>>>>>>>>>> 2016-07-01 02:56:44 <jane> you are probably asleep
>>>>>>>>>> 2016-07-01 02:58:45 <jane> I hope fish was fishy in a good eay
>>>>>>>>>> 2016-07-01 02:58:56 <jone>
>>>>>>>>>> 2016-07-01 02:59:34 <jane>
>>>>>>>>>> 2016-07-01 03:02:48 <john> British security is a little more rigorous...
>>>>>>>>>
>>>>>>>>> Looks entirely not-"crazy". Typical log file format.
>>>>>>>>>
>>>>>>>>> Two possibilities: 1) Use `read.fwf` (it is in the utils package, not foreign); 2) Use regex (i.e. the sub function) to strip everything up to the "<". Read `?regex`. Since "<" is not a metacharacter you could use the pattern ".+<" and replace with "".
>>>>>>>>>
>>>>>>>>> And do read the Posting Guide.
>>>>>>>>> Cross-posting to StackOverflow and R-help, at least within hours of each other, is considered poor manners.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> David.
>>>>>>>>>
>>>>>>>>>> It goes on for a while. It's a big file. But I feel like it's going to be difficult to annotate with the coreNLP library or package. I'm doing natural language processing. In other words, I'm curious as to how I would shave off the dates, that is, to make it look like:
>>>>>>>>>>
>>>>>>>>>> <john> hey
>>>>>>>>>> <jane> waiting for plane to Edinburgh
>>>>>>>>>> <john> thinking about my boo
>>>>>>>>>> <jane> nothing crappy has happened, not really
>>>>>>>>>> <john> plane went by pretty fast, didn't sleep
>>>>>>>>>> <jane> no idea what time it is or where I am really
>>>>>>>>>> <john> just know it's london
>>>>>>>>>> <jane> you are probably asleep
>>>>>>>>>> <jane> I hope fish was fishy in a good eay
>>>>>>>>>> <jone>
>>>>>>>>>> <jane>
>>>>>>>>>> <john> British security is a little more rigorous...
>>>>>>>>>>
>>>>>>>>>> To be clear, then, I'm trying to clean a large text file by writing a regular expression, such that I create a new object with no numbers or dates.
>>>>>>>>>>
>>>>>>>>>> Michael
>>>>>>>>>>
>>>>>>>>>> ______________________________________________
>>>>>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>>>>>> stat.ethz.ch/mailman/listinfo/r-help
>>>>>>>>>> PLEASE do read the posting guide R-project.org/posting-guide.html
>>>>>>>>>> and provide commented, minimal, self-contained, reproducible code.
> --
> Sent from my phone. Please excuse my brevity.
Boris Steipe
2019-May-17 23:28 UTC
[R] how to separate string from numbers in a large txt file
Don't start putting in extra commas and then reading this as csv. That approach is broken.

The correct approach is what Bill outlined: read everything with readLines(), and then use a proper regular expression with strcapture(). You need to pre-process the object that readLines() gives you: replace the contents of the video chat lines, and make them conform to the format of the other lines, before you process it into your data frame. Approximately something like

# read the raw data
tmp <- readLines("hangouts-conversation-6.csv.txt")

# process all video chat lines
patt <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+) "   # (date time )*** (word word)
tmp <- gsub(patt, "\\1<\\2> ", tmp)

# next, use strcapture()

Note that this makes the assumption that your names are always exactly two words containing only letters. If that assumption is not true, more thought needs to go into the regex. But you can test that:

patt <- " <\\w+ \\w+> "   # " <word word> "
sum( ! grepl(patt, tmp))

... will give the number of lines that remain in your file that do not have a tag that can be interpreted as "Who".

Once that is fine, use Bill's approach - or a regular expression of your own design - to create your data frame.

Hope this helps,
Boris

> On 2019-05-17, at 16:18, Michael Boulineau <michael.p.boulineau at gmail.com> wrote:
> [...]
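[Putting Boris's normalization together with Bill's strcapture(), and the four columns Michael asked for: an end-to-end sketch, not from either message. The split of the timestamp into separate date and time groups, and the as.POSIXct step, are additions:]

```r
raw <- c("2016-06-28 21:02:28 *** Jane Doe started a video chat",
         "2016-07-01 02:50:35 <John Doe> hey",
         "2016-07-01 02:51:26 <John Doe> waiting for plane to Edinburgh")

# 1. rewrite "*** Jane Doe ..." lines into the "<Jane Doe>" form
patt <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+) "
tmp  <- gsub(patt, "\\1<\\2> ", raw)

# 2. capture date, time, person, comment into a data frame
z <- strcapture(
  "^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}) ([[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)",
  tmp,
  proto = data.frame(date = "", time = "", person = "", comment = "",
                     stringsAsFactors = FALSE))

# 3. optionally combine date and time into a POSIXct timestamp
z$when <- as.POSIXct(paste(z$date, z$time), tz = "UTC")

# 4. write it out (row.names = FALSE keeps the csv to the four columns)
write.csv(z[, c("date", "time", "person", "comment")],
          "hangouts.csv", row.names = FALSE)
```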
>> >> On May 17, 2019 12:12:17 PM PDT, William Dunlap via R-help <r-help at r-project.org> wrote: >>> The pattern I gave worked for the lines that you originally showed from >>> the >>> data file ('a'), before you put commas into them. If the name is >>> either of >>> the form "<name>" or "***" then the "(<[^>]*>)" needs to be changed so >>> something like "(<[^>]*>|[*]{3})". >>> >>> The " ???" at the start of the imported data may come from the byte >>> order >>> mark that Windows apps like to put at the front of a text file in UTF-8 >>> or >>> UTF-16 format. >>> >>> Bill Dunlap >>> TIBCO Software >>> wdunlap tibco.com >>> >>> >>> On Fri, May 17, 2019 at 11:53 AM Michael Boulineau < >>> michael.p.boulineau at gmail.com> wrote: >>> >>>> This seemed to work: >>>> >>>>> a <- readLines ("hangouts-conversation-6.csv.txt") >>>>> b <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", a) >>>>> b [1:84] >>>> >>>> And the first 85 lines looks like this: >>>> >>>> [83] "2016-06-28 21:02:28 *** Jane Doe started a video chat" >>>> [84] "2016-06-28 21:12:43 *** John Doe ended a video chat" >>>> >>>> Then they transition to the commas: >>>> >>>>> b [84:100] >>>> [1] "2016-06-28 21:12:43 *** John Doe ended a video chat" >>>> [2] "2016-07-01,02:50:35,<John Doe>,hey" >>>> [3] "2016-07-01,02:51:26,<John Doe>,waiting for plane to Edinburgh" >>>> [4] "2016-07-01,02:51:45,<John Doe>,thinking about my boo" >>>> >>>> Even the strange bit on line 6347 was caught by this: >>>> >>>>> b [6346:6348] >>>> [1] "2016-10-21,10:56:29,<John Doe>,John_Doe" >>>> [2] "2016-10-21,10:56:37,<John Doe>,Admit#8242" >>>> [3] "2016-10-21,11:00:13,<Jane Doe>,Okay so you have a discussion" >>>> >>>> Perhaps most awesomely, the code catches spaces that are interposed >>>> into the comment itself: >>>> >>>>> b [4] >>>> [1] "2016-01-27,09:15:20,<Jane Doe>,Hey " >>>>> b [85] >>>> [1] "2016-07-01,02:50:35,<John Doe>,hey" >>>> >>>> Notice whether there is a space after the "hey" or not. 
>>>> >>>> These are the first two lines: >>>> >>>> [1] "???2016-01-27 09:14:40 *** Jane Doe started a video chat" >>>> [2] "2016-01-27,09:15:20,<Jane >>>> Doe>, >>>> >>> lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/GVqutPqbSuo/s0/be8ded30-87a6-4e80-bdfa-83ed51591dbf >>>> " >>>> >>>> So, who knows what happened with the ??? at the beginning of [1] >>>> directly above. But notice how there are no commas in [1] but there >>>> appear in [2]. I don't see why really long ones like [2] directly >>>> above would be a problem, were they to be translated into a csv or >>>> data frame column. >>>> >>>> Now, with the commas in there, couldn't we write this into a csv or a >>>> data.frame? Some of this data will end up being garbage, I imagine. >>>> Like in [2] directly above. Or with [83] and [84] at the top of this >>>> discussion post/email. Embarrassingly, I've been trying to convert >>>> this into a data.frame or csv but I can't manage to. I've been using >>>> the write.csv function, but I don't think I've been getting the >>>> arguments correct. >>>> >>>> At the end of the day, I would like a data.frame and/or csv with the >>>> following four columns: date, time, person, comment. >>>> >>>> I tried this, too: >>>> >>>>> c <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} >>>> + [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)", >>>> + a, proto=data.frame(stringsAsFactors=FALSE, >>> When="", >>>> Who="", >>>> + What="")) >>>> >>>> But all I got was this: >>>> >>>>> c [1:100, ] >>>> When Who What >>>> 1 <NA> <NA> <NA> >>>> 2 <NA> <NA> <NA> >>>> 3 <NA> <NA> <NA> >>>> 4 <NA> <NA> <NA> >>>> 5 <NA> <NA> <NA> >>>> 6 <NA> <NA> <NA> >>>> >>>> It seems to have caught nothing. >>>> >>>>> unique (c) >>>> When Who What >>>> 1 <NA> <NA> <NA> >>>> >>>> But I like that it converted into columns. That's a really great >>>> format. With a little tweaking, it'd be a great code for this data >>>> set. 
>>>> >>>> Michael >>>> >>>> On Fri, May 17, 2019 at 8:20 AM William Dunlap via R-help >>>> <r-help at r-project.org> wrote: >>>>> >>>>> Consider using readLines() and strcapture() for reading such a >>> file. >>>> E.g., >>>>> suppose readLines(files) produced a character vector like >>>>> >>>>> x <- c("2016-10-21 10:35:36 <Jane Doe> What's your login", >>>>> "2016-10-21 10:56:29 <John Doe> John_Doe", >>>>> "2016-10-21 10:56:37 <John Doe> Admit#8242", >>>>> "October 23, 1819 12:34 <Jane Eyre> I am not an angel") >>>>> >>>>> Then you can make a data.frame with columns When, Who, and What by >>>>> supplying a pattern containing three parenthesized capture >>> expressions: >>>>>> z <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} >>>>> [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)", >>>>> x, proto=data.frame(stringsAsFactors=FALSE, When="", >>> Who="", >>>>> What="")) >>>>>> str(z) >>>>> 'data.frame': 4 obs. of 3 variables: >>>>> $ When: chr "2016-10-21 10:35:36" "2016-10-21 10:56:29" >>> "2016-10-21 >>>>> 10:56:37" NA >>>>> $ Who : chr "<Jane Doe>" "<John Doe>" "<John Doe>" NA >>>>> $ What: chr "What's your login" "John_Doe" "Admit#8242" NA >>>>> >>>>> Lines that don't match the pattern result in NA's - you might make >>> a >>>> second >>>>> pass over the corresponding elements of x with a new pattern. >>>>> >>>>> You can convert the When column from character to time with >>> as.POSIXct(). >>>>> >>>>> Bill Dunlap >>>>> TIBCO Software >>>>> wdunlap tibco.com >>>>> >>>>> >>>>> On Thu, May 16, 2019 at 8:30 PM David Winsemius >>> <dwinsemius at comcast.net> >>>>> wrote: >>>>> >>>>>> >>>>>> On 5/16/19 3:53 PM, Michael Boulineau wrote: >>>>>>> OK. 
So, I named the object test and then checked the 6347th >>> item >>>>>>> >>>>>>>> test <- readLines ("hangouts-conversation.txt) >>>>>>>> test [6347] >>>>>>> [1] "2016-10-21 10:56:37 <John Doe> Admit#8242" >>>>>>> >>>>>>> Perhaps where it was getting screwed up is, since the end of >>> this is >>>> a >>>>>>> number (8242), then, given that there's no space between the >>> number >>>>>>> and what ought to be the next row, R didn't know where to draw >>> the >>>>>>> line. Sure enough, it looks like this when I go to the original >>> file >>>>>>> and control f "#8242" >>>>>>> >>>>>>> 2016-10-21 10:35:36 <Jane Doe> What's your login >>>>>>> 2016-10-21 10:56:29 <John Doe> John_Doe >>>>>>> 2016-10-21 10:56:37 <John Doe> Admit#8242 >>>>>> >>>>>> >>>>>> An octothorpe is an end of line signifier and is interpreted as >>>> allowing >>>>>> comments. You can prevent that interpretation with suitable >>> choice of >>>>>> parameters to `read.table` or `read.csv`. I don't understand why >>> that >>>>>> should cause anu error or a failure to match that pattern. >>>>>> >>>>>>> 2016-10-21 11:00:13 <Jane Doe> Okay so you have a discussion >>>>>>> >>>>>>> Again, it doesn't look like that in the file. Gmail >>> automatically >>>>>>> formats it like that when I paste it in. More to the point, it >>> looks >>>>>>> like >>>>>>> >>>>>>> 2016-10-21 10:35:36 <Jane Doe> What's your login2016-10-21 >>> 10:56:29 >>>>>>> <John Doe> John_Doe2016-10-21 10:56:37 <John Doe> >>>> Admit#82422016-10-21 >>>>>>> 11:00:13 <Jane Doe> Okay so you have a discussion >>>>>>> >>>>>>> Notice Admit#82422016. So there's that. >>>>>>> >>>>>>> Then I built object test2. >>>>>>> >>>>>>> test2 <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", >>> test) >>>>>>> >>>>>>> This worked for 84 lines, then this happened. >>>>>> >>>>>> It may have done something but as you later discovered my first >>> code >>>> for >>>>>> the pattern was incorrect. 
>>>>>> I had tested it (and pasted in the results of the test). The way to
>>>>>> refer to a capture class is with back-slashes before the numbers, not
>>>>>> forward-slashes. Try this:
>>>>>>
>>>>>>
>>>>>>> newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)
>>>>>>> newvec
>>>>>>  [1] "2016-07-01,02:50:35,<john>,hey"
>>>>>>  [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh"
>>>>>>  [3] "2016-07-01,02:51:45,<john>,thinking about my boo"
>>>>>>  [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened, not really"
>>>>>>  [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, didn't sleep"
>>>>>>  [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or where I am really"
>>>>>>  [7] "2016-07-01,02:54:17,<john>,just know it's london"
>>>>>>  [8] "2016-07-01,02:56:44,<jane>,you are probably asleep"
>>>>>>  [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good eay"
>>>>>> [10] "2016-07-01 02:58:56 <jone>"
>>>>>> [11] "2016-07-01 02:59:34 <jane>"
>>>>>> [12] "2016-07-01,03:02:48,<john>,British security is a little more rigorous..."
>>>>>>
>>>>>>
>>>>>> I made note of the fact that the 10th and 11th lines had no commas.
>>>>>>
>>>>>>>
>>>>>>>> test2[84]
>>>>>>> [1] "2016-06-28 21:12:43 *** John Doe ended a video chat"
>>>>>>
>>>>>> That line didn't have any "<" so wasn't matched.
>>>>>>
>>>>>>
>>>>>> You could remove all non-matching lines for the pattern of
>>>>>>
>>>>>> dates<space>times<space>"<"<name>">"<space><anything>
>>>>>>
>>>>>>
>>>>>> with:
>>>>>>
>>>>>>
>>>>>> chrvec <- chrvec[ grepl("^.{10} .{8} <.+> .+$", chrvec) ]
>>>>>>
>>>>>>
>>>>>> Do read:
>>>>>>
>>>>>> ?read.csv
>>>>>>
>>>>>> ?regex
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> David
>>>>>>
>>>>>>
>>>>>>>> test2[85]
>>>>>>> [1] "//1,//2,//3,//4"
>>>>>>>> test[85]
>>>>>>> [1] "2016-07-01 02:50:35 <John Doe> hey"
>>>>>>>
>>>>>>> Notice how I toggled back and forth between test and test2 there.
>>>>>>> So, whatever happened with the regex, it happened in the switch from 84
>>>>>>> to 85, I guess. It went on like
>>>>>>>
>>>>>>>  [990] "//1,//2,//3,//4"
>>>>>>>  [991] "//1,//2,//3,//4"
>>>>>>>  [992] "//1,//2,//3,//4"
>>>>>>>  [993] "//1,//2,//3,//4"
>>>>>>>  [994] "//1,//2,//3,//4"
>>>>>>>  [995] "//1,//2,//3,//4"
>>>>>>>  [996] "//1,//2,//3,//4"
>>>>>>>  [997] "//1,//2,//3,//4"
>>>>>>>  [998] "//1,//2,//3,//4"
>>>>>>>  [999] "//1,//2,//3,//4"
>>>>>>> [1000] "//1,//2,//3,//4"
>>>>>>>
>>>>>>> up until line 1000, then I reached max.print.
>>>>>>
>>>>>>> Michael
>>>>>>>
>>>>>>> On Thu, May 16, 2019 at 1:05 PM David Winsemius <dwinsemius at comcast.net>
>>>>>> wrote:
>>>>>>>>
>>>>>>>> On 5/16/19 12:30 PM, Michael Boulineau wrote:
>>>>>>>>> Thanks for this tip on etiquette, David. I will be sure and not do
>>>>>> that again.
>>>>>>>>>
>>>>>>>>> I tried the read.fwf from the foreign package, with a code like this:
>>>>>>>>>
>>>>>>>>> d <- read.fwf("hangouts-conversation.txt",
>>>>>>>>>               widths = c(10,10,20,40),
>>>>>>>>>               col.names = c("date","time","person","comment"),
>>>>>>>>>               strip.white = TRUE)
>>>>>>>>>
>>>>>>>>> But it threw this error:
>>>>>>>>>
>>>>>>>>> Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
>>>>>>>>>   line 6347 did not have 4 elements
>>>>>>>>
>>>>>>>> So what does line 6347 look like? (Use `readLines` and print it out.)
>>>>>>>>
>>>>>>>>> Interestingly, though, the error only happened when I increased the
>>>>>>>>> width size. But I had to increase the size, or else I couldn't "see"
>>>>>>>>> anything. The comment was so small that nothing was being captured by
>>>>>>>>> the size of the column, so to speak.
>>>>>>>>>
>>>>>>>>> It seems like what's throwing me is that there's no comma that
>>>>>>>>> demarcates the end of the text proper. For example:
>>>>>>>> Not sure why you thought there should be a comma.
>>>>>>>> Lines usually end with <cr> and/or a <lf>.
>>>>>>>>
>>>>>>>>
>>>>>>>> Once you have the raw text in a character vector from `readLines` named,
>>>>>>>> say, 'chrvec', then you could selectively substitute commas for spaces
>>>>>>>> with regex. (Now that you no longer desire to remove the dates and times.)
>>>>>>>>
>>>>>>>> sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", chrvec)
>>>>>>>>
>>>>>>>> This will not do any replacements when the pattern is not matched. See
>>>>>>>> this test:
>>>>>>>>
>>>>>>>>
>>>>>>>>> newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)
>>>>>>>>> newvec
>>>>>>>>  [1] "2016-07-01,02:50:35,<john>,hey"
>>>>>>>>  [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh"
>>>>>>>>  [3] "2016-07-01,02:51:45,<john>,thinking about my boo"
>>>>>>>>  [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened, not really"
>>>>>>>>  [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, didn't sleep"
>>>>>>>>  [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or where I am really"
>>>>>>>>  [7] "2016-07-01,02:54:17,<john>,just know it's london"
>>>>>>>>  [8] "2016-07-01,02:56:44,<jane>,you are probably asleep"
>>>>>>>>  [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good eay"
>>>>>>>> [10] "2016-07-01 02:58:56 <jone>"
>>>>>>>> [11] "2016-07-01 02:59:34 <jane>"
>>>>>>>> [12] "2016-07-01,03:02:48,<john>,British security is a little more rigorous..."
>>>>>>>>
>>>>>>>>
>>>>>>>> You should probably remove the "empty comment" lines.
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> David.
>>>>>>>>
>>>>>>>>> 2016-07-01 15:34:30 <John Doe> Lame.
>>>>>>>>> We were in a starbucks2016-07-01
>>>>>>>>> 15:35:02 <Jane Doe> Hmm that's interesting2016-07-01 15:35:09 <Jane
>>>>>>>>> Doe> You must want coffees2016-07-01 15:35:25 <John Doe> There was
>>>>>>>>> lots of Starbucks in my day2016-07-01 15:35:47
>>>>>>>>>
>>>>>>>>> It was interesting, too, when I pasted the text into the email, it
>>>>>>>>> self-formatted into the way I wanted it to look. I had to manually
>>>>>>>>> make it look like it does above, since that's the way that it looks in
>>>>>>>>> the txt file. I wonder if it's being organized by XML or something.
>>>>>>>>>
>>>>>>>>> Anyways, there's always a space between the two sideways carets, just
>>>>>>>>> like there is right now: <John Doe> See. Space. And there's always a
>>>>>>>>> space between the date and time. Like this. 2016-07-01 15:34:30 See.
>>>>>>>>> Space. But there's never a space between the end of the comment and
>>>>>>>>> the next date. Like this: We were in a starbucks2016-07-01 15:35:02
>>>>>>>>> See. starbucks and 2016 are smooshed together.
>>>>>>>>>
>>>>>>>>> This code is also on the table right now too.
>>>>>>>>>
>>>>>>>>> a <- read.table("E:/working
>>>>>>>>> directory/-189/hangouts-conversation2.txt", quote="\"",
>>>>>>>>> comment.char="", fill=TRUE)
>>>>>>>>>
>>>>>>>>> h <- cbind(hangouts.conversation2[,1:2], hangouts.conversation2[,3:5], hangouts.conversation2[,6:9])
>>>>>>>>>
>>>>>>>>> aa <- gsub("[^[:digit:]]", "", h)
>>>>>>>>> my.data.num <- as.numeric(str_extract(h, "[0-9]+"))
>>>>>>>>>
>>>>>>>>> Those last lines are a work in progress. I wish I could import a
>>>>>>>>> picture of what it looks like when it's translated into a data frame.
>>>>>>>>> The fill=TRUE helped to get the data in a table that kind of sort of
>>>>>>>>> works, but the comments keep bleeding into the date and time column.
>>>>>>>>> It's like
>>>>>>>>>
>>>>>>>>> 2016-07-01 15:59:17 <Jane Doe> Seriously I've never been over there
>>>>>>>>> 2016-07-01 15:59:27 <Jane Doe> It confuses me :(
>>>>>>>>>
>>>>>>>>> And then, maybe, the "seriously" will be in a column all to itself, as
>>>>>>>>> will be the "I've" and the "never" etc.
>>>>>>>>>
>>>>>>>>> I will use a regular expression if I have to, but it would be nice to
>>>>>>>>> keep the dates and times on there. Originally, I thought they were
>>>>>>>>> meaningless, but I've since changed my mind on that count. The time of
>>>>>>>>> day isn't so important. But, especially since, say, Gmail itself knows
>>>>>>>>> how to quickly recognize what it is, I know it can be done. I know
>>>>>>>>> this data has structure to it.
>>>>>>>>>
>>>>>>>>> Michael
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, May 15, 2019 at 8:47 PM David Winsemius <dwinsemius at comcast.net> wrote:
>>>>>>>>>> On 5/15/19 4:07 PM, Michael Boulineau wrote:
>>>>>>>>>>> I have a wild and crazy text file, the head of which looks like this:
>>>>>>>>>>>
>>>>>>>>>>> 2016-07-01 02:50:35 <john> hey
>>>>>>>>>>> 2016-07-01 02:51:26 <jane> waiting for plane to Edinburgh
>>>>>>>>>>> 2016-07-01 02:51:45 <john> thinking about my boo
>>>>>>>>>>> 2016-07-01 02:52:07 <jane> nothing crappy has happened, not really
>>>>>>>>>>> 2016-07-01 02:52:20 <john> plane went by pretty fast, didn't sleep
>>>>>>>>>>> 2016-07-01 02:54:08 <jane> no idea what time it is or where I am really
>>>>>>>>>>> 2016-07-01 02:54:17 <john> just know it's london
>>>>>>>>>>> 2016-07-01 02:56:44 <jane> you are probably asleep
>>>>>>>>>>> 2016-07-01 02:58:45 <jane> I hope fish was fishy in a good eay
>>>>>>>>>>> 2016-07-01 02:58:56 <jone>
>>>>>>>>>>> 2016-07-01 02:59:34 <jane>
>>>>>>>>>>> 2016-07-01 03:02:48 <john> British security is a little more rigorous...
>>>>>>>>>> Looks entirely not-"crazy".
>>>>>>>>>> Typical log file format.
>>>>>>>>>>
>>>>>>>>>> Two possibilities: 1) Use `read.fwf` from pkg foreign; 2) Use regex
>>>>>>>>>> (i.e. the sub-function) to strip everything up to the "<". Read
>>>>>>>>>> `?regex`. Since that's not a metacharacter you could use a pattern
>>>>>>>>>> ".+<" and replace with "".
>>>>>>>>>>
>>>>>>>>>> And do read the Posting Guide. Cross-posting to StackOverflow and
>>>>>>>>>> Rhelp, at least within hours of each, is considered poor manners.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>>
>>>>>>>>>> David.
>>>>>>>>>>
>>>>>>>>>>> It goes on for a while. It's a big file. But I feel like it's going to
>>>>>>>>>>> be difficult to annotate with the coreNLP library or package. I'm
>>>>>>>>>>> doing natural language processing. In other words, I'm curious as to
>>>>>>>>>>> how I would shave off the dates, that is, to make it look like:
>>>>>>>>>>>
>>>>>>>>>>> <john> hey
>>>>>>>>>>> <jane> waiting for plane to Edinburgh
>>>>>>>>>>> <john> thinking about my boo
>>>>>>>>>>> <jane> nothing crappy has happened, not really
>>>>>>>>>>> <john> plane went by pretty fast, didn't sleep
>>>>>>>>>>> <jane> no idea what time it is or where I am really
>>>>>>>>>>> <john> just know it's london
>>>>>>>>>>> <jane> you are probably asleep
>>>>>>>>>>> <jane> I hope fish was fishy in a good eay
>>>>>>>>>>> <jone>
>>>>>>>>>>> <jane>
>>>>>>>>>>> <john> British security is a little more rigorous...
>>>>>>>>>>>
>>>>>>>>>>> To be clear, then, I'm trying to clean a large text file by writing a
>>>>>>>>>>> regular expression, such that I create a new object with no numbers
>>>>>>>>>>> or dates.
>>>>>>>>>>>
>>>>>>>>>>> Michael
>>>>>>>>>>>
>>>>>>>>>>> ______________________________________________
>>>>>>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>>>>>>> stat.ethz.ch/mailman/listinfo/r-help
>>>>>>>>>>> PLEASE do read the posting guide R-project.org/posting-guide.html
>>>>>>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>
>> --
>> Sent from my phone. Please excuse my brevity.
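Jeff Newmiller's "UTF-8-BOM" tip from the exchange above can be applied directly at the readLines() step. A minimal sketch; the file name is the one used in this thread, and R 3.0.0 or later is assumed for "UTF-8-BOM" support:

```
# Open the file declaring the encoding as "UTF-8-BOM": R then strips the
# byte order mark, so "???" no longer prefixes the first line read.
con <- file("hangouts-conversation-6.csv.txt", encoding = "UTF-8-BOM")
a <- readLines(con)
close(con)
```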
Michael Boulineau
2019-May-18 22:32 UTC
[R] how to separate string from numbers in a large txt file
Going back and thinking through what Boris and William were saying (also
Ivan), I tried this:

a <- readLines("hangouts-conversation-6.csv.txt")
b <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+)"
c <- gsub(b, "\\1<\\2> ", a)

> head(c)
[1] "???2016-01-27 09:14:40 *** Jane Doe started a video chat"
[2] "2016-01-27 09:15:20 <Jane Doe> lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/GVqutPqbSuo/s0/be8ded30-87a6-4e80-bdfa-83ed51591dbf"
[3] "2016-01-27 09:15:20 <Jane Doe> Hey "
[4] "2016-01-27 09:15:22 <John Doe> ended a video chat"
[5] "2016-01-27 21:07:11 <Jane Doe> started a video chat"
[6] "2016-01-27 21:26:57 <John Doe> ended a video chat"

The ??? is still there, since I forgot to do what Ivan had suggested, namely,

a <- readLines(con <- file("hangouts-conversation-6.csv.txt", encoding = "UTF-8")); close(con); rm(con)

But then the new code is still turning out only NAs when I apply
strcapture(). This was what happened next:

> d <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}
+ [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)",
+ c, proto=data.frame(stringsAsFactors=FALSE, When="", Who="",
+ What=""))
> head(d)
  When  Who What
1 <NA> <NA> <NA>
2 <NA> <NA> <NA>
3 <NA> <NA> <NA>
4 <NA> <NA> <NA>
5 <NA> <NA> <NA>
6 <NA> <NA> <NA>

I've been reading up on regular expressions, too, so this code seems spot
on. What's going wrong?

Michael

On Fri, May 17, 2019 at 4:28 PM Boris Steipe <boris.steipe at utoronto.ca> wrote:
>
> Don't start putting in extra commas and then reading this as csv. That
> approach is broken. The correct approach is what Bill outlined: read
> everything with readLines(), and then use a proper regular expression
> with strcapture().
>
> You need to pre-process the object that readLines() gives you: replace
> the contents of the videochat lines, and make it conform to the format
> of the other lines before you process it into your data frame.
>
> Approximately something like
>
> # read the raw data
> tmp <- readLines("hangouts-conversation-6.csv.txt")
>
> # process all video chat lines
> patt <- "^([0-9-]{10} [0-9:]{8} )[*]{3} (\\w+ \\w+) "  # (date time )*** (word word)
> tmp <- gsub(patt, "\\1<\\2> ", tmp)
>
> # next, use strcapture()
>
> Note that this makes the assumption that your names are always exactly
> two words containing only letters. If that assumption is not true, more
> thought needs to go into the regex. But you can test that:
>
> patt <- " <\\w+ \\w+> "  # " <word word> "
> sum( ! grepl(patt, tmp))
>
> ... will give the number of lines that remain in your file that do not
> have a tag that can be interpreted as "Who".
>
> Once that is fine, use Bill's approach - or a regular expression of your
> own design - to create your data frame.
>
> Hope this helps,
> Boris
>
>
>
>
> On 2019-05-17, at 16:18, Michael Boulineau <michael.p.boulineau at gmail.com> wrote:
>
> > Very interesting. I'm sure I'll be trying to get rid of the byte order
> > mark eventually. But right now, I'm more worried about getting the
> > character vector into either a csv file or data.frame; that way, I can
> > be able to work with the data neatly tabulated into four columns:
> > date, time, person, comment. I assume it's a write.csv function, but I
> > don't know what arguments to put in it. header=FALSE? fill=T?
> >
> > Michael
> >
> > On Fri, May 17, 2019 at 1:03 PM Jeff Newmiller <jdnewmil at dcn.davis.ca.us> wrote:
> >>
> >> If byte order mark is the issue then you can specify the file encoding
> >> as "UTF-8-BOM" and it won't show up in your data any more.
> >>
> >> On May 17, 2019 12:12:17 PM PDT, William Dunlap via R-help <r-help at r-project.org> wrote:
> >>> The pattern I gave worked for the lines that you originally showed
> >>> from the data file ('a'), before you put commas into them.
If the name is > >>> either of > >>> the form "<name>" or "***" then the "(<[^>]*>)" needs to be changed so > >>> something like "(<[^>]*>|[*]{3})". > >>> > >>> The " ???" at the start of the imported data may come from the byte > >>> order > >>> mark that Windows apps like to put at the front of a text file in UTF-8 > >>> or > >>> UTF-16 format. > >>> > >>> Bill Dunlap > >>> TIBCO Software > >>> wdunlap tibco.com > >>> > >>> > >>> On Fri, May 17, 2019 at 11:53 AM Michael Boulineau < > >>> michael.p.boulineau at gmail.com> wrote: > >>> > >>>> This seemed to work: > >>>> > >>>>> a <- readLines ("hangouts-conversation-6.csv.txt") > >>>>> b <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", a) > >>>>> b [1:84] > >>>> > >>>> And the first 85 lines looks like this: > >>>> > >>>> [83] "2016-06-28 21:02:28 *** Jane Doe started a video chat" > >>>> [84] "2016-06-28 21:12:43 *** John Doe ended a video chat" > >>>> > >>>> Then they transition to the commas: > >>>> > >>>>> b [84:100] > >>>> [1] "2016-06-28 21:12:43 *** John Doe ended a video chat" > >>>> [2] "2016-07-01,02:50:35,<John Doe>,hey" > >>>> [3] "2016-07-01,02:51:26,<John Doe>,waiting for plane to Edinburgh" > >>>> [4] "2016-07-01,02:51:45,<John Doe>,thinking about my boo" > >>>> > >>>> Even the strange bit on line 6347 was caught by this: > >>>> > >>>>> b [6346:6348] > >>>> [1] "2016-10-21,10:56:29,<John Doe>,John_Doe" > >>>> [2] "2016-10-21,10:56:37,<John Doe>,Admit#8242" > >>>> [3] "2016-10-21,11:00:13,<Jane Doe>,Okay so you have a discussion" > >>>> > >>>> Perhaps most awesomely, the code catches spaces that are interposed > >>>> into the comment itself: > >>>> > >>>>> b [4] > >>>> [1] "2016-01-27,09:15:20,<Jane Doe>,Hey " > >>>>> b [85] > >>>> [1] "2016-07-01,02:50:35,<John Doe>,hey" > >>>> > >>>> Notice whether there is a space after the "hey" or not. 
> >>>> > >>>> These are the first two lines: > >>>> > >>>> [1] "???2016-01-27 09:14:40 *** Jane Doe started a video chat" > >>>> [2] "2016-01-27,09:15:20,<Jane > >>>> Doe>, > >>>> > >>> lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/GVqutPqbSuo/s0/be8ded30-87a6-4e80-bdfa-83ed51591dbf > >>>> " > >>>> > >>>> So, who knows what happened with the ??? at the beginning of [1] > >>>> directly above. But notice how there are no commas in [1] but there > >>>> appear in [2]. I don't see why really long ones like [2] directly > >>>> above would be a problem, were they to be translated into a csv or > >>>> data frame column. > >>>> > >>>> Now, with the commas in there, couldn't we write this into a csv or a > >>>> data.frame? Some of this data will end up being garbage, I imagine. > >>>> Like in [2] directly above. Or with [83] and [84] at the top of this > >>>> discussion post/email. Embarrassingly, I've been trying to convert > >>>> this into a data.frame or csv but I can't manage to. I've been using > >>>> the write.csv function, but I don't think I've been getting the > >>>> arguments correct. > >>>> > >>>> At the end of the day, I would like a data.frame and/or csv with the > >>>> following four columns: date, time, person, comment. > >>>> > >>>> I tried this, too: > >>>> > >>>>> c <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} > >>>> + [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)", > >>>> + a, proto=data.frame(stringsAsFactors=FALSE, > >>> When="", > >>>> Who="", > >>>> + What="")) > >>>> > >>>> But all I got was this: > >>>> > >>>>> c [1:100, ] > >>>> When Who What > >>>> 1 <NA> <NA> <NA> > >>>> 2 <NA> <NA> <NA> > >>>> 3 <NA> <NA> <NA> > >>>> 4 <NA> <NA> <NA> > >>>> 5 <NA> <NA> <NA> > >>>> 6 <NA> <NA> <NA> > >>>> > >>>> It seems to have caught nothing. > >>>> > >>>>> unique (c) > >>>> When Who What > >>>> 1 <NA> <NA> <NA> > >>>> > >>>> But I like that it converted into columns. That's a really great > >>>> format. 
With a little tweaking, it'd be a great code for this data > >>>> set. > >>>> > >>>> Michael > >>>> > >>>> On Fri, May 17, 2019 at 8:20 AM William Dunlap via R-help > >>>> <r-help at r-project.org> wrote: > >>>>> > >>>>> Consider using readLines() and strcapture() for reading such a > >>> file. > >>>> E.g., > >>>>> suppose readLines(files) produced a character vector like > >>>>> > >>>>> x <- c("2016-10-21 10:35:36 <Jane Doe> What's your login", > >>>>> "2016-10-21 10:56:29 <John Doe> John_Doe", > >>>>> "2016-10-21 10:56:37 <John Doe> Admit#8242", > >>>>> "October 23, 1819 12:34 <Jane Eyre> I am not an angel") > >>>>> > >>>>> Then you can make a data.frame with columns When, Who, and What by > >>>>> supplying a pattern containing three parenthesized capture > >>> expressions: > >>>>>> z <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} > >>>>> [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)", > >>>>> x, proto=data.frame(stringsAsFactors=FALSE, When="", > >>> Who="", > >>>>> What="")) > >>>>>> str(z) > >>>>> 'data.frame': 4 obs. of 3 variables: > >>>>> $ When: chr "2016-10-21 10:35:36" "2016-10-21 10:56:29" > >>> "2016-10-21 > >>>>> 10:56:37" NA > >>>>> $ Who : chr "<Jane Doe>" "<John Doe>" "<John Doe>" NA > >>>>> $ What: chr "What's your login" "John_Doe" "Admit#8242" NA > >>>>> > >>>>> Lines that don't match the pattern result in NA's - you might make > >>> a > >>>> second > >>>>> pass over the corresponding elements of x with a new pattern. > >>>>> > >>>>> You can convert the When column from character to time with > >>> as.POSIXct(). > >>>>> > >>>>> Bill Dunlap > >>>>> TIBCO Software > >>>>> wdunlap tibco.com > >>>>> > >>>>> > >>>>> On Thu, May 16, 2019 at 8:30 PM David Winsemius > >>> <dwinsemius at comcast.net> > >>>>> wrote: > >>>>> > >>>>>> > >>>>>> On 5/16/19 3:53 PM, Michael Boulineau wrote: > >>>>>>> OK. 
So, I named the object test and then checked the 6347th > >>> item > >>>>>>> > >>>>>>>> test <- readLines ("hangouts-conversation.txt) > >>>>>>>> test [6347] > >>>>>>> [1] "2016-10-21 10:56:37 <John Doe> Admit#8242" > >>>>>>> > >>>>>>> Perhaps where it was getting screwed up is, since the end of > >>> this is > >>>> a > >>>>>>> number (8242), then, given that there's no space between the > >>> number > >>>>>>> and what ought to be the next row, R didn't know where to draw > >>> the > >>>>>>> line. Sure enough, it looks like this when I go to the original > >>> file > >>>>>>> and control f "#8242" > >>>>>>> > >>>>>>> 2016-10-21 10:35:36 <Jane Doe> What's your login > >>>>>>> 2016-10-21 10:56:29 <John Doe> John_Doe > >>>>>>> 2016-10-21 10:56:37 <John Doe> Admit#8242 > >>>>>> > >>>>>> > >>>>>> An octothorpe is an end of line signifier and is interpreted as > >>>> allowing > >>>>>> comments. You can prevent that interpretation with suitable > >>> choice of > >>>>>> parameters to `read.table` or `read.csv`. I don't understand why > >>> that > >>>>>> should cause anu error or a failure to match that pattern. > >>>>>> > >>>>>>> 2016-10-21 11:00:13 <Jane Doe> Okay so you have a discussion > >>>>>>> > >>>>>>> Again, it doesn't look like that in the file. Gmail > >>> automatically > >>>>>>> formats it like that when I paste it in. More to the point, it > >>> looks > >>>>>>> like > >>>>>>> > >>>>>>> 2016-10-21 10:35:36 <Jane Doe> What's your login2016-10-21 > >>> 10:56:29 > >>>>>>> <John Doe> John_Doe2016-10-21 10:56:37 <John Doe> > >>>> Admit#82422016-10-21 > >>>>>>> 11:00:13 <Jane Doe> Okay so you have a discussion > >>>>>>> > >>>>>>> Notice Admit#82422016. So there's that. > >>>>>>> > >>>>>>> Then I built object test2. > >>>>>>> > >>>>>>> test2 <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", > >>> test) > >>>>>>> > >>>>>>> This worked for 84 lines, then this happened. 
> >>>>>> > >>>>>> It may have done something but as you later discovered my first > >>> code > >>>> for > >>>>>> the pattern was incorrect. I had tested it (and pasted in the > >>> results > >>>> of > >>>>>> the test) . The way to refer to a capture class is with > >>> back-slashes > >>>>>> before the numbers, not forward-slashes. Try this: > >>>>>> > >>>>>> > >>>>>>> newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", > >>> "\\1,\\2,\\3,\\4", > >>>> chrvec) > >>>>>>> newvec > >>>>>> [1] "2016-07-01,02:50:35,<john>,hey" > >>>>>> [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh" > >>>>>> [3] "2016-07-01,02:51:45,<john>,thinking about my boo" > >>>>>> [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened, > >>> not > >>>> really" > >>>>>> [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, > >>> didn't > >>>> sleep" > >>>>>> [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or > >>> where I am > >>>>>> really" > >>>>>> [7] "2016-07-01,02:54:17,<john>,just know it's london" > >>>>>> [8] "2016-07-01,02:56:44,<jane>,you are probably asleep" > >>>>>> [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good > >>> eay" > >>>>>> [10] "2016-07-01 02:58:56 <jone>" > >>>>>> [11] "2016-07-01 02:59:34 <jane>" > >>>>>> [12] "2016-07-01,03:02:48,<john>,British security is a little > >>> more > >>>>>> rigorous..." > >>>>>> > >>>>>> > >>>>>> I made note of the fact that the 10th and 11th lines had no > >>> commas. > >>>>>> > >>>>>>> > >>>>>>>> test2 [84] > >>>>>>> [1] "2016-06-28 21:12:43 *** John Doe ended a video chat" > >>>>>> > >>>>>> That line didn't have any "<" so wasn't matched. 
> >>>>>> > >>>>>> > >>>>>> You could remove all none matching lines for pattern of > >>>>>> > >>>>>> dates<space>times<space>"<"<name>">"<space><anything> > >>>>>> > >>>>>> > >>>>>> with: > >>>>>> > >>>>>> > >>>>>> chrvec <- chrvec[ grepl("^.{10} .{8} <.+> .+$)", chrvec)] > >>>>>> > >>>>>> > >>>>>> Do read: > >>>>>> > >>>>>> ?read.csv > >>>>>> > >>>>>> ?regex > >>>>>> > >>>>>> > >>>>>> -- > >>>>>> > >>>>>> David > >>>>>> > >>>>>> > >>>>>>>> test2 [85] > >>>>>>> [1] "//1,//2,//3,//4" > >>>>>>>> test [85] > >>>>>>> [1] "2016-07-01 02:50:35 <John Doe> hey" > >>>>>>> > >>>>>>> Notice how I toggled back and forth between test and test2 > >>> there. So, > >>>>>>> whatever happened with the regex, it happened in the switch > >>> from 84 > >>>> to > >>>>>>> 85, I guess. It went on like > >>>>>>> > >>>>>>> [990] "//1,//2,//3,//4" > >>>>>>> [991] "//1,//2,//3,//4" > >>>>>>> [992] "//1,//2,//3,//4" > >>>>>>> [993] "//1,//2,//3,//4" > >>>>>>> [994] "//1,//2,//3,//4" > >>>>>>> [995] "//1,//2,//3,//4" > >>>>>>> [996] "//1,//2,//3,//4" > >>>>>>> [997] "//1,//2,//3,//4" > >>>>>>> [998] "//1,//2,//3,//4" > >>>>>>> [999] "//1,//2,//3,//4" > >>>>>>> [1000] "//1,//2,//3,//4" > >>>>>>> > >>>>>>> up until line 1000, then I reached max.print. > >>>>>> > >>>>>>> Michael > >>>>>>> > >>>>>>> On Thu, May 16, 2019 at 1:05 PM David Winsemius < > >>>> dwinsemius at comcast.net> > >>>>>> wrote: > >>>>>>>> > >>>>>>>> On 5/16/19 12:30 PM, Michael Boulineau wrote: > >>>>>>>>> Thanks for this tip on etiquette, David. I will be sure and > >>> not do > >>>>>> that again. 
> >>>>>>>>> > >>>>>>>>> I tried the read.fwf from the foreign package, with a code > >>> like > >>>> this: > >>>>>>>>> > >>>>>>>>> d <- read.fwf("hangouts-conversation.txt", > >>>>>>>>> widths= c(10,10,20,40), > >>>>>>>>> > >>> col.names=c("date","time","person","comment"), > >>>>>>>>> strip.white=TRUE) > >>>>>>>>> > >>>>>>>>> But it threw this error: > >>>>>>>>> > >>>>>>>>> Error in scan(file = file, what = what, sep = sep, quote > >>> quote, > >>>> dec > >>>>>> = dec, : > >>>>>>>>> line 6347 did not have 4 elements > >>>>>>>> > >>>>>>>> So what does line 6347 look like? (Use `readLines` and print > >>> it > >>>> out.) > >>>>>>>> > >>>>>>>>> Interestingly, though, the error only happened when I > >>> increased the > >>>>>>>>> width size. But I had to increase the size, or else I > >>> couldn't > >>>> "see" > >>>>>>>>> anything. The comment was so small that nothing was being > >>>> captured by > >>>>>>>>> the size of the column. so to speak. > >>>>>>>>> > >>>>>>>>> It seems like what's throwing me is that there's no comma > >>> that > >>>>>>>>> demarcates the end of the text proper. For example: > >>>>>>>> Not sure why you thought there should be a comma. Lines > >>> usually end > >>>>>>>> with <cr> and or a <lf>. > >>>>>>>> > >>>>>>>> > >>>>>>>> Once you have the raw text in a character vector from > >>> `readLines` > >>>> named, > >>>>>>>> say, 'chrvec', then you could selectively substitute commas > >>> for > >>>> spaces > >>>>>>>> with regex. (Now that you no longer desire to remove the dates > >>> and > >>>>>> times.) > >>>>>>>> > >>>>>>>> sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", chrvec) > >>>>>>>> > >>>>>>>> This will not do any replacements when the pattern is not > >>> matched. 
> >>>> See > >>>>>>>> this test: > >>>>>>>> > >>>>>>>> > >>>>>>>>> newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", > >>> "\\1,\\2,\\3,\\4", > >>>>>> chrvec) > >>>>>>>>> newvec > >>>>>>>> [1] "2016-07-01,02:50:35,<john>,hey" > >>>>>>>> [2] "2016-07-01,02:51:26,<jane>,waiting for plane to > >>> Edinburgh" > >>>>>>>> [3] "2016-07-01,02:51:45,<john>,thinking about my boo" > >>>>>>>> [4] "2016-07-01,02:52:07,<jane>,nothing crappy has > >>> happened, not > >>>>>> really" > >>>>>>>> [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, > >>> didn't > >>>>>> sleep" > >>>>>>>> [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or > >>> where > >>>> I am > >>>>>>>> really" > >>>>>>>> [7] "2016-07-01,02:54:17,<john>,just know it's london" > >>>>>>>> [8] "2016-07-01,02:56:44,<jane>,you are probably asleep" > >>>>>>>> [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a > >>> good > >>>> eay" > >>>>>>>> [10] "2016-07-01 02:58:56 <jone>" > >>>>>>>> [11] "2016-07-01 02:59:34 <jane>" > >>>>>>>> [12] "2016-07-01,03:02:48,<john>,British security is a little > >>> more > >>>>>>>> rigorous..." > >>>>>>>> > >>>>>>>> > >>>>>>>> You should probably remove the "empty comment" lines. > >>>>>>>> > >>>>>>>> > >>>>>>>> -- > >>>>>>>> > >>>>>>>> David. > >>>>>>>> > >>>>>>>>> 2016-07-01 15:34:30 <John Doe> Lame. We were in a > >>>> starbucks2016-07-01 > >>>>>>>>> 15:35:02 <Jane Doe> Hmm that's interesting2016-07-01 15:35:09 > >>> <Jane > >>>>>>>>> Doe> You must want coffees2016-07-01 15:35:25 <John Doe> > >>> There was > >>>>>>>>> lots of Starbucks in my day2016-07-01 15:35:47 > >>>>>>>>> > >>>>>>>>> It was interesting, too, when I pasted the text into the > >>> email, it > >>>>>>>>> self-formatted into the way I wanted it to look. I had to > >>> manually > >>>>>>>>> make it look like it does above, since that's the way that it > >>>> looks in > >>>>>>>>> the txt file. I wonder if it's being organized by XML or > >>> something. 
> >>>>>>>>> Anyways, there's always a space between the two angle brackets,
> >>>>>>>>> just like there is right now: <John Doe> See. Space. And there's
> >>>>>>>>> always a space between the date and time. Like this: 2016-07-01
> >>>>>>>>> 15:34:30 See. Space. But there's never a space between the end of
> >>>>>>>>> the comment and the next date. Like this: We were in a
> >>>>>>>>> starbucks2016-07-01 15:35:02 See. starbucks and 2016 are smooshed
> >>>>>>>>> together.
> >>>>>>>>>
> >>>>>>>>> This code is also on the table right now:
> >>>>>>>>>
> >>>>>>>>> a <- read.table("E:/working directory/-189/hangouts-conversation2.txt",
> >>>>>>>>>                 quote = "\"", comment.char = "", fill = TRUE)
> >>>>>>>>>
> >>>>>>>>> h <- cbind(hangouts.conversation2[, 1:2],
> >>>>>>>>>            hangouts.conversation2[, 3:5],
> >>>>>>>>>            hangouts.conversation2[, 6:9])
> >>>>>>>>>
> >>>>>>>>> aa <- gsub("[^[:digit:]]", "", h)
> >>>>>>>>> my.data.num <- as.numeric(str_extract(h, "[0-9]+"))
> >>>>>>>>>
> >>>>>>>>> Those last lines are a work in progress. I wish I could import a
> >>>>>>>>> picture of what it looks like when it's translated into a data
> >>>>>>>>> frame. The fill=TRUE helped to get the data into a table that kind
> >>>>>>>>> of sort of works, but the comments keep bleeding into the date and
> >>>>>>>>> time columns. It's like
> >>>>>>>>>
> >>>>>>>>> 2016-07-01 15:59:17 <Jane Doe> Seriously I've never been over there
> >>>>>>>>> 2016-07-01 15:59:27 <Jane Doe> It confuses me :(
> >>>>>>>>>
> >>>>>>>>> And then, maybe, the "Seriously" will be in a column all to itself,
> >>>>>>>>> as will the "I've" and the "never", etc.
> >>>>>>>>>
> >>>>>>>>> I will use a regular expression if I have to, but it would be nice
> >>>>>>>>> to keep the dates and times on there. Originally, I thought they
> >>>>>>>>> were meaningless, but I've since changed my mind on that count.
> >>>>>>>>> The time of day isn't so important. But, especially since, say,
> >>>>>>>>> Gmail itself knows how to quickly recognize what it is, I know it
> >>>>>>>>> can be done. I know this data has structure to it.
> >>>>>>>>>
> >>>>>>>>> Michael
> >>>>>>>>>
> >>>>>>>>> On Wed, May 15, 2019 at 8:47 PM David Winsemius <dwinsemius at comcast.net> wrote:
> >>>>>>>>>> On 5/15/19 4:07 PM, Michael Boulineau wrote:
> >>>>>>>>>>> I have a wild and crazy text file, the head of which looks like this:
> >>>>>>>>>>>
> >>>>>>>>>>> 2016-07-01 02:50:35 <john> hey
> >>>>>>>>>>> 2016-07-01 02:51:26 <jane> waiting for plane to Edinburgh
> >>>>>>>>>>> 2016-07-01 02:51:45 <john> thinking about my boo
> >>>>>>>>>>> 2016-07-01 02:52:07 <jane> nothing crappy has happened, not really
> >>>>>>>>>>> 2016-07-01 02:52:20 <john> plane went by pretty fast, didn't sleep
> >>>>>>>>>>> 2016-07-01 02:54:08 <jane> no idea what time it is or where I am really
> >>>>>>>>>>> 2016-07-01 02:54:17 <john> just know it's london
> >>>>>>>>>>> 2016-07-01 02:56:44 <jane> you are probably asleep
> >>>>>>>>>>> 2016-07-01 02:58:45 <jane> I hope fish was fishy in a good eay
> >>>>>>>>>>> 2016-07-01 02:58:56 <jone>
> >>>>>>>>>>> 2016-07-01 02:59:34 <jane>
> >>>>>>>>>>> 2016-07-01 03:02:48 <john> British security is a little more rigorous...
> >>>>>>>>>>
> >>>>>>>>>> Looks entirely not-"crazy". Typical log-file format.
> >>>>>>>>>>
> >>>>>>>>>> Two possibilities: 1) use `read.fwf` from the utils package; 2) use
> >>>>>>>>>> regex (i.e. the sub function) to strip everything up to the "<".
> >>>>>>>>>> Read `?regex`. Since "<" is not a metacharacter, you could use the
> >>>>>>>>>> pattern ".+<" and replace it with "".
> >>>>>>>>>>
> >>>>>>>>>> And do read the Posting Guide.
> >>>>>>>>>> Cross-posting to StackOverflow and R-help, at least within hours
> >>>>>>>>>> of each other, is considered poor manners.
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>>
> >>>>>>>>>> David.
> >>>>>>>>>>
> >>>>>>>>>>> It goes on for a while. It's a big file. But I feel like it's
> >>>>>>>>>>> going to be difficult to annotate with the coreNLP library or
> >>>>>>>>>>> package. I'm doing natural language processing. In other words,
> >>>>>>>>>>> I'm curious as to how I would shave off the dates, that is, to
> >>>>>>>>>>> make it look like:
> >>>>>>>>>>>
> >>>>>>>>>>> <john> hey
> >>>>>>>>>>> <jane> waiting for plane to Edinburgh
> >>>>>>>>>>> <john> thinking about my boo
> >>>>>>>>>>> <jane> nothing crappy has happened, not really
> >>>>>>>>>>> <john> plane went by pretty fast, didn't sleep
> >>>>>>>>>>> <jane> no idea what time it is or where I am really
> >>>>>>>>>>> <john> just know it's london
> >>>>>>>>>>> <jane> you are probably asleep
> >>>>>>>>>>> <jane> I hope fish was fishy in a good eay
> >>>>>>>>>>> <jone>
> >>>>>>>>>>> <jane>
> >>>>>>>>>>> <john> British security is a little more rigorous...
> >>>>>>>>>>>
> >>>>>>>>>>> To be clear, then, I'm trying to clean a large text file by
> >>>>>>>>>>> writing a regular expression, such that I create a new object
> >>>>>>>>>>> with no numbers or dates.
> >>>>>>>>>>>
> >>>>>>>>>>> Michael
> >>>>>>>>>>>
> >>>>>>>>>>> ______________________________________________
> >>>>>>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>>>>>>>>>> stat.ethz.ch/mailman/listinfo/r-help
> >>>>>>>>>>> PLEASE do read the posting guide R-project.org/posting-guide.html
> >>>>>>>>>>> and provide commented, minimal, self-contained, reproducible code.
> >> --
> >> Sent from my phone. Please excuse my brevity.
>
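[Editor's note, not from the thread: Bill Dunlap's message near the top of the thread suggests the alternation "(<[^>]*>|[*]{3})" so that the "*** Jane Doe started a video chat" system lines are caught along with the "<name>" lines. A small sketch of that pattern in action, with sample lines taken from the thread:]

```r
## Sketch: extend the comma-inserting sub() so "***" system lines are
## captured too, using the alternation "(<[^>]*>|[*]{3})" suggested above.
chrvec <- c("2016-06-28 21:02:28 *** Jane Doe started a video chat",
            "2016-07-01 02:50:35 <John Doe> hey")
b <- sub("^(.{10}) (.{8}) (<[^>]*>|[*]{3}) (.+)$", "\\1,\\2,\\3,\\4", chrvec)
```

After this, both line types carry commas in the same four positions, so they land in the same four columns (date, time, person, comment) on import.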