thr3ads.net - R help - [R] how to separate string from numbers in a large txt file [May 2019]

William Dunlap

2019-May-17 15:20 UTC

[R] how to separate string from numbers in a large txt file

Consider using readLines() and strcapture() for reading such a file.  E.g.,
suppose readLines(files) produced a character vector like

x <- c("2016-10-21 10:35:36 <Jane Doe> What's your login",
       "2016-10-21 10:56:29 <John Doe> John_Doe",
       "2016-10-21 10:56:37 <John Doe> Admit#8242",
       "October 23, 1819 12:34 <Jane Eyre> I am not an angel")

Then you can make a data.frame with columns When, Who, and What by
supplying a pattern containing three parenthesized capture expressions:

> z <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)",
+              x, proto=data.frame(stringsAsFactors=FALSE, When="", Who="", What=""))
> str(z)
'data.frame':   4 obs. of  3 variables:
 $ When: chr  "2016-10-21 10:35:36" "2016-10-21 10:56:29" "2016-10-21 10:56:37" NA
 $ Who : chr  "<Jane Doe>" "<John Doe>" "<John Doe>" NA
 $ What: chr  "What's your login" "John_Doe" "Admit#8242" NA

Lines that don't match the pattern result in NA's - you might make a second
pass over the corresponding elements of x with a new pattern.
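
As a rough sketch of such a second pass: the example below first applies the "<name>" pattern, then re-runs strcapture() only on the elements that came back NA. The second pattern, which treats "***" status lines as a kind of name, is an assumption made for illustration and is not taken from this message; the helper names `stamp`, `proto` and `missed` are likewise hypothetical.

```r
x <- c("2016-10-21 10:56:29 <John Doe> John_Doe",
       "2016-06-28 21:12:43 *** John Doe ended a video chat",
       "October 23, 1819 12:34 <Jane Eyre> I am not an angel")

# Timestamp capture shared by both passes (yyyy-mm-dd hh:mm:ss).
stamp <- "([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2})"
proto <- data.frame(stringsAsFactors = FALSE, When = "", Who = "", What = "")

# First pass: ordinary chat lines of the form "<name> text".
z <- strcapture(paste0("^", stamp, " +(<[^>]*>) *(.*)$"), x, proto = proto)

# Second pass: only the elements the first pattern missed, using an
# assumed pattern for "*** text" status lines.
missed <- is.na(z$When)
z[missed, ] <- strcapture(paste0("^", stamp, " +([*]{3}) *(.*)$"),
                          x[missed], proto = proto)
```

Elements that match neither pattern (the "October 23, 1819" line here) stay NA, so they remain easy to find and inspect.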

You can convert the When column from character to time with as.POSIXct().
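
A minimal sketch of that conversion, assuming the timestamps are in "yyyy-mm-dd hh:mm:ss" form. The time zone is not stated anywhere in the thread, so tz = "UTC" below is only a placeholder assumption.

```r
z <- data.frame(When = c("2016-10-21 10:35:36", NA),
                stringsAsFactors = FALSE)

# tz = "UTC" is an assumption; use the zone the log was recorded in.
z$When <- as.POSIXct(z$When, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
```

Rows that were NA after strcapture() stay NA after the conversion.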

Bill Dunlap
TIBCO Software
wdunlap tibco.com


On Thu, May 16, 2019 at 8:30 PM David Winsemius <dwinsemius at comcast.net> wrote:
>
> On 5/16/19 3:53 PM, Michael Boulineau wrote:
> > OK. So, I named the object test and then checked the 6347th item
> >
> >> test <- readLines ("hangouts-conversation.txt")
> >> test [6347]
> > [1] "2016-10-21 10:56:37 <John Doe> Admit#8242"
> >
> > Perhaps where it was getting screwed up is, since the end of this is a
> > number (8242), then, given that there's no space between the number
> > and what ought to be the next row, R didn't know where to draw the
> > line. Sure enough, it looks like this when I go to the original file
> > and control f "#8242"
> >
> > 2016-10-21 10:35:36 <Jane Doe> What's your login
> > 2016-10-21 10:56:29 <John Doe> John_Doe
> > 2016-10-21 10:56:37 <John Doe> Admit#8242
>
>
> An octothorpe is an end of line signifier and is interpreted as allowing
> comments. You can prevent that interpretation with suitable choice of
> parameters to `read.table` or `read.csv`. I don't understand why that
> should cause any error or a failure to match that pattern.
>
> > 2016-10-21 11:00:13 <Jane Doe> Okay so you have a discussion
> >
> > Again, it doesn't look like that in the file. Gmail automatically
> > formats it like that when I paste it in. More to the point, it looks
> > like
> >
> > 2016-10-21 10:35:36 <Jane Doe> What's your login2016-10-21 10:56:29
> > <John Doe> John_Doe2016-10-21 10:56:37 <John Doe> Admit#82422016-10-21
> > 11:00:13 <Jane Doe> Okay so you have a discussion
> >
> > Notice Admit#82422016. So there's that.
> >
> > Then I built object test2.
> >
> > test2 <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", test)
> >
> > This worked for 84 lines, then this happened.
>
> It may have done something, but as you later discovered, my first code for
> the pattern was incorrect. I had tested it (and pasted in the results of
> the test). The way to refer to a capture group is with back-slashes
> before the numbers, not forward-slashes. Try this:
>
>
>  > newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)
>  > newvec
>   [1] "2016-07-01,02:50:35,<john>,hey"
>   [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh"
>   [3] "2016-07-01,02:51:45,<john>,thinking about my boo"
>   [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened, not really"
>   [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, didn't sleep"
>   [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or where I am really"
>   [7] "2016-07-01,02:54:17,<john>,just know it's london"
>   [8] "2016-07-01,02:56:44,<jane>,you are probably asleep"
>   [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good eay"
> [10] "2016-07-01 02:58:56 <jone>"
> [11] "2016-07-01 02:59:34 <jane>"
> [12] "2016-07-01,03:02:48,<john>,British security is a little more rigorous..."
>
>
> I made note of the fact that the 10th and 11th lines had no commas.
>
> >
> >> test2 [84]
> > [1] "2016-06-28 21:12:43 *** John Doe ended a video chat"
>
> That line didn't have any "<" so wasn't matched.
>
>
> You could remove all non-matching lines for a pattern of
>
>
> dates<space>times<space>"<"<name>">"<space><anything>
>
>
> with:
>
>
> chrvec <- chrvec[ grepl("^.{10} .{8} <.+> .+$", chrvec)]
>
>
> Do read:
>
> ?read.csv
>
> ?regex
>
>
> --
>
> David
>
>
> >> test2 [85]
> > [1] "//1,//2,//3,//4"
> >> test [85]
> > [1] "2016-07-01 02:50:35 <John Doe> hey"
> >
> > Notice how I toggled back and forth between test and test2 there. So,
> > whatever happened with the regex, it happened in the switch from 84 to
> > 85, I guess. It went on like
> >
> > [990] "//1,//2,//3,//4"
> >   [991] "//1,//2,//3,//4"
> >   [992] "//1,//2,//3,//4"
> >   [993] "//1,//2,//3,//4"
> >   [994] "//1,//2,//3,//4"
> >   [995] "//1,//2,//3,//4"
> >   [996] "//1,//2,//3,//4"
> >   [997] "//1,//2,//3,//4"
> >   [998] "//1,//2,//3,//4"
> >   [999] "//1,//2,//3,//4"
> > [1000] "//1,//2,//3,//4"
> >
> > up until line 1000, then I reached max.print.
>
> > Michael
> >
> > On Thu, May 16, 2019 at 1:05 PM David Winsemius <dwinsemius at comcast.net> wrote:
> >>
> >> On 5/16/19 12:30 PM, Michael Boulineau wrote:
> >>> Thanks for this tip on etiquette, David. I will be sure and not do
> >>> that again.
> >>>
> >>> I tried the read.fwf from the foreign package, with a code like this:
> >>>
> >>>    d <- read.fwf("hangouts-conversation.txt",
> >>>                   widths= c(10,10,20,40),
> >>>                   col.names=c("date","time","person","comment"),
> >>>                   strip.white=TRUE)
> >>>
> >>> But it threw this error:
> >>>
> >>> Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
> >>>     line 6347 did not have 4 elements
> >>
> >> So what does line 6347 look like? (Use `readLines` and print it out.)
> >>
> >>> Interestingly, though, the error only happened when I increased the
> >>> width size. But I had to increase the size, or else I couldn't "see"
> >>> anything.  The comment was so small that nothing was being captured by
> >>> the size of the column, so to speak.
> >>>
> >>> It seems like what's throwing me is that there's no comma that
> >>> demarcates the end of the text proper. For example:
> >> Not sure why you thought there should be a comma. Lines usually end
> >> with  <cr> and or a <lf>.
> >>
> >>
> >> Once you have the raw text in a character vector from `readLines` named,
> >> say, 'chrvec', then you could selectively substitute commas for spaces
> >> with regex. (Now that you no longer desire to remove the dates and
> >> times.)
> >>
> >> sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", chrvec)
> >>
> >> This will not do any replacements when the pattern is not matched. See
> >> this test:
> >>
> >>
> >>   > newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)
> >>   > newvec
> >>    [1] "2016-07-01,02:50:35,<john>,hey"
> >>    [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh"
> >>    [3] "2016-07-01,02:51:45,<john>,thinking about my boo"
> >>    [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened, not really"
> >>    [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, didn't sleep"
> >>    [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or where I am really"
> >>    [7] "2016-07-01,02:54:17,<john>,just know it's london"
> >>    [8] "2016-07-01,02:56:44,<jane>,you are probably asleep"
> >>    [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good eay"
> >> [10] "2016-07-01 02:58:56 <jone>"
> >> [11] "2016-07-01 02:59:34 <jane>"
> >> [12] "2016-07-01,03:02:48,<john>,British security is a little more rigorous..."
> >>
> >>
> >> You should probably remove the "empty comment" lines.
> >>
> >>
> >> --
> >>
> >> David.
> >>
> >>> 2016-07-01 15:34:30 <John Doe> Lame. We were in a starbucks2016-07-01
> >>> 15:35:02 <Jane Doe> Hmm that's interesting2016-07-01 15:35:09 <Jane
> >>> Doe> You must want coffees2016-07-01 15:35:25 <John Doe> There was
> >>> lots of Starbucks in my day2016-07-01 15:35:47
> >>>
> >>> It was interesting, too, when I pasted the text into the email, it
> >>> self-formatted into the way I wanted it to look. I had to manually
> >>> make it look like it does above, since that's the way that it looks in
> >>> the txt file. I wonder if it's being organized by XML or something.
> >>>
> >>> Anyways, there's always a space between the two sideways carets, just
> >>> like there is right now: <John Doe> See. Space. And there's always a
> >>> space between the date and time. Like this. 2016-07-01 15:34:30 See.
> >>> Space. But there's never a space between the end of the comment and
> >>> the next date. Like this: We were in a starbucks2016-07-01 15:35:02
> >>> See. starbucks and 2016 are smooshed together.
> >>>
> >>> This code is also on the table right now too.
> >>>
> >>> a <- read.table("E:/working directory/-189/hangouts-conversation2.txt",
> >>>                 quote="\"", comment.char="", fill=TRUE)
> >>>
> >>>
> >>> h<-cbind(hangouts.conversation2[,1:2],hangouts.conversation2[,3:5],hangouts.conversation2[,6:9])
> >>>
> >>> aa<-gsub("[^[:digit:]]","",h)
> >>> my.data.num <- as.numeric(str_extract(h, "[0-9]+"))
> >>>
> >>> Those last lines are a work in progress. I wish I could import a
> >>> picture of what it looks like when it's translated into a data frame.
> >>> The fill=TRUE helped to get the data in table that kind of sort of
> >>> works, but the comments keep bleeding into the date and time column.
> >>> It's like
> >>>
> >>> 2016-07-01 15:59:17 <Jane Doe> Seriously I've never been
> >>> over               there
> >>> 2016-07-01 15:59:27 <Jane Doe> It confuses me :(
> >>>
> >>> And then, maybe, the "seriously" will be in a column all to itself, as
> >>> will be the "I've" and the "never" etc.
> >>>
> >>> I will use a regular expression if I have to, but it would be nice to
> >>> keep the dates and times on there. Originally, I thought they were
> >>> meaningless, but I've since changed my mind on that count. The time of
> >>> day isn't so important. But, especially since, say, Gmail itself knows
> >>> how to quickly recognize what it is, I know it can be done. I know
> >>> this data has structure to it.
> >>>
> >>> Michael
> >>>
> >>>
> >>>
> >>> On Wed, May 15, 2019 at 8:47 PM David Winsemius <dwinsemius at comcast.net> wrote:
> >>>> On 5/15/19 4:07 PM, Michael Boulineau wrote:
> >>>>> I have a wild and crazy text file, the head of which looks like this:
> >>>>>
> >>>>> 2016-07-01 02:50:35 <john> hey
> >>>>> 2016-07-01 02:51:26 <jane> waiting for plane to Edinburgh
> >>>>> 2016-07-01 02:51:45 <john> thinking about my boo
> >>>>> 2016-07-01 02:52:07 <jane> nothing crappy has happened, not really
> >>>>> 2016-07-01 02:52:20 <john> plane went by pretty fast, didn't sleep
> >>>>> 2016-07-01 02:54:08 <jane> no idea what time it is or where I am really
> >>>>> 2016-07-01 02:54:17 <john> just know it's london
> >>>>> 2016-07-01 02:56:44 <jane> you are probably asleep
> >>>>> 2016-07-01 02:58:45 <jane> I hope fish was fishy in a good eay
> >>>>> 2016-07-01 02:58:56 <jone>
> >>>>> 2016-07-01 02:59:34 <jane>
> >>>>> 2016-07-01 03:02:48 <john> British security is a little more rigorous...
> >>>> Looks entirely not-"crazy". Typical log file format.
> >>>>
> >>>> Two possibilities: 1) Use `read.fwf` (in the utils package); 2) Use regex
> >>>> (i.e. the sub-function) to strip everything up to the "<". Read
> >>>> `?regex`. Since that's not a metacharacter you could use a pattern
> >>>> ".+<" and replace with "".
> >>>>
> >>>> And do read the Posting Guide. Cross-posting to StackOverflow and Rhelp,
> >>>> at least within hours of each, is considered poor manners.
> >>>>
> >>>>
> >>>> --
> >>>>
> >>>> David.
> >>>>
> >>>>> It goes on for a while. It's a big file. But I feel like it's going to
> >>>>> be difficult to annotate with the coreNLP library or package. I'm
> >>>>> doing natural language processing. In other words, I'm curious as to
> >>>>> how I would shave off the dates, that is, to make it look like:
> >>>>>
> >>>>> <john> hey
> >>>>> <jane> waiting for plane to Edinburgh
> >>>>> <john> thinking about my boo
> >>>>> <jane> nothing crappy has happened, not really
> >>>>> <john> plane went by pretty fast, didn't sleep
> >>>>> <jane> no idea what time it is or where I am really
> >>>>> <john> just know it's london
> >>>>> <jane> you are probably asleep
> >>>>> <jane> I hope fish was fishy in a good eay
> >>>>> <jone>
> >>>>> <jane>
> >>>>> <john> British security is a little more rigorous...
> >>>>>
> >>>>> To be clear, then, I'm trying to clean a large text file by writing a
> >>>>> regular expression such that I create a new object with no numbers or
> >>>>> dates.
> >>>>>
> >>>>> Michael
> >>>>>
> >>>>> ______________________________________________
> >>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >>>>> and provide commented, minimal, self-contained, reproducible code.

Michael Boulineau

2019-May-17 18:36 UTC

[R] how to separate string from numbers in a large txt file

This seemed to work:

> a <- readLines ("hangouts-conversation-6.csv.txt")
> b <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", a)
> b [1:84]

And the first 84 lines look like this:

[83] "2016-06-28 21:02:28 *** Jane Doe started a video chat"
[84] "2016-06-28 21:12:43 *** John Doe ended a video chat"

Then they transition to the commas:

> b [84:100]
 [1] "2016-06-28 21:12:43 *** John Doe ended a video chat"
 [2] "2016-07-01,02:50:35,<John Doe>,hey"
 [3] "2016-07-01,02:51:26,<John Doe>,waiting for plane to Edinburgh"
 [4] "2016-07-01,02:51:45,<John Doe>,thinking about my boo"

Even the strange bit on line 6347 was caught by this:

> b [6346:6348]
[1] "2016-10-21,10:56:29,<John Doe>,John_Doe"
[2] "2016-10-21,10:56:37,<John Doe>,Admit#8242"
[3] "2016-10-21,11:00:13,<Jane Doe>,Okay so you have a discussion"

Perhaps most awesomely, the code catches spaces that are interposed
into the comment itself:

> b [4]
[1] "2016-01-27,09:15:20,<Jane Doe>,Hey "
> b [85]
[1] "2016-07-01,02:50:35,<John Doe>,hey"

Notice whether there is a space after the "hey" or not.

These are the first two lines:

[1] "???2016-01-27 09:14:40 *** Jane Doe started a video chat"
[2] "2016-01-27,09:15:20,<Jane Doe>,https://lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/GVqutPqbSuo/s0/be8ded30-87a6-4e80-bdfa-83ed51591dbf"

So, who knows what happened with the ??? at the beginning of [1]
directly above. But notice how there are no commas in [1] but they
appear in [2]. I don't see why really long ones like [2] directly
above would be a problem, were they to be translated into a csv or
data frame column.

Now, with the commas in there, couldn't we write this into a csv or a
data.frame? Some of this data will end up being garbage, I imagine.
Like in [2] directly above. Or with [83] and [84] at the top of this
discussion post/email. Embarrassingly, I've been trying to convert
this into a data.frame or csv but I can't manage to. I've been using
the write.csv function, but I don't think I've been getting the
arguments correct.

At the end of the day, I would like a data.frame and/or csv with the
following four columns: date, time, person, comment.
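
One way this could be sketched, assuming the lines look like the samples in this thread: capture the four fields directly with strcapture(), drop the non-matching lines, and write the result with write.csv(). The pattern (which also strips the angle brackets from the name), the column names, and the temporary output path are illustrative assumptions, not code from the thread.

```r
a <- c("2016-06-28 21:12:43 *** John Doe ended a video chat",
       "2016-07-01 02:50:35 <John Doe> hey",
       "2016-10-21 10:56:37 <John Doe> Admit#8242")

# Four capture groups: date, time, name inside <...>, and the rest.
d <- strcapture("^(.{10}) (.{8}) <([^>]+)> (.+)$", a,
                proto = data.frame(date = "", time = "", person = "",
                                   comment = "", stringsAsFactors = FALSE))

# Lines that did not match (e.g. "***" status lines) come back as NA rows.
d <- d[!is.na(d$date), ]

out <- tempfile(fileext = ".csv")   # hypothetical output location
write.csv(d, out, row.names = FALSE)

# Reading it back to check the round trip.
back <- read.csv(out, stringsAsFactors = FALSE)
```

Because write.csv() quotes character fields by default, commas inside a comment should not split it into extra columns when the file is read back.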

I tried this, too:

> c <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}+ [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)",
+                 a, proto=data.frame(stringsAsFactors=FALSE, When="", Who="",
+                                     What=""))

But all I got was this:

> c [1:100, ]
    When  Who What
1   <NA> <NA> <NA>
2   <NA> <NA> <NA>
3   <NA> <NA> <NA>
4   <NA> <NA> <NA>
5   <NA> <NA> <NA>
6   <NA> <NA> <NA>

It seems to have caught nothing.

> unique (c)
  When  Who What
1 <NA> <NA> <NA>

But I like that it converted into columns. That's a really great
format. With a little tweaking, it'd be a great code for this data
set.

Michael

On Fri, May 17, 2019 at 8:20 AM William Dunlap via R-help
<r-help at r-project.org> wrote:>
> Consider using readLines() and strcapture() for reading such a file.  E.g.,
> suppose readLines(files) produced a character vector like
>
> x <- c("2016-10-21 10:35:36 <Jane Doe> What's your
login",
>           "2016-10-21 10:56:29 <John Doe> John_Doe",
>           "2016-10-21 10:56:37 <John Doe> Admit#8242",
>           "October 23, 1819 12:34 <Jane Eyre> I am not an
angel")
>
> Then you can make a data.frame with columns When, Who, and What by
> supplying a pattern containing three parenthesized capture expressions:
> > z <-
strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}
> [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>)
*(.*$)",
>              x, proto=data.frame(stringsAsFactors=FALSE, When="",
Who="",
> What=""))
> > str(z)
> 'data.frame':   4 obs. of  3 variables:
>  $ When: chr  "2016-10-21 10:35:36" "2016-10-21
10:56:29" "2016-10-21
> 10:56:37" NA
>  $ Who : chr  "<Jane Doe>" "<John Doe>"
"<John Doe>" NA
>  $ What: chr  "What's your login" "John_Doe"
"Admit#8242" NA
>
> Lines that don't match the pattern result in NA's - you might make
a second
> pass over the corresponding elements of x with a new pattern.
>
> You can convert the When column from character to time with as.POSIXct().
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
>
>
> On Thu, May 16, 2019 at 8:30 PM David Winsemius <dwinsemius at
comcast.net>
> wrote:
>
> >
> > On 5/16/19 3:53 PM, Michael Boulineau wrote:
> > > OK. So, I named the object test and then checked the 6347th item
> > >
> > >> test <- readLines ("hangouts-conversation.txt)
> > >> test [6347]
> > > [1] "2016-10-21 10:56:37 <John Doe> Admit#8242"
> > >
> > > Perhaps where it was getting screwed up is, since the end of this
is a
> > > number (8242), then, given that there's no space between the
number
> > > and what ought to be the next row, R didn't know where to
draw the
> > > line. Sure enough, it looks like this when I go to the original
file
> > > and control f "#8242"
> > >
> > > 2016-10-21 10:35:36 <Jane Doe> What's your login
> > > 2016-10-21 10:56:29 <John Doe> John_Doe
> > > 2016-10-21 10:56:37 <John Doe> Admit#8242
> >
> >
> > An octothorpe is an end of line signifier and is interpreted as
allowing
> > comments. You can prevent that interpretation with suitable choice of
> > parameters to `read.table` or `read.csv`. I don't understand why
that
> > should cause anu error or a failure to match that pattern.
> >
> > > 2016-10-21 11:00:13 <Jane Doe> Okay so you have a
discussion
> > >
> > > Again, it doesn't look like that in the file. Gmail
automatically
> > > formats it like that when I paste it in. More to the point, it
looks
> > > like
> > >
> > > 2016-10-21 10:35:36 <Jane Doe> What's your
login2016-10-21 10:56:29
> > > <John Doe> John_Doe2016-10-21 10:56:37 <John Doe>
Admit#82422016-10-21
> > > 11:00:13 <Jane Doe> Okay so you have a discussion
> > >
> > > Notice Admit#82422016. So there's that.
> > >
> > > Then I built object test2.
> > >
> > > test2 <- sub("^(.{10}) (.{8}) (<.+>) (.+$)",
"//1,//2,//3,//4", test)
> > >
> > > This worked for 84 lines, then this happened.
> >
> > It may have done something but as you later discovered my first code
for
> > the pattern was incorrect. I had tested it (and pasted in the results
of
> > the test) . The way to refer to a capture class is with back-slashes
> > before the numbers, not forward-slashes. Try this:
> >
> >
> >  > newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)",
"\\1,\\2,\\3,\\4", chrvec)
> >  > newvec
> >   [1] "2016-07-01,02:50:35,<john>,hey"
> >   [2] "2016-07-01,02:51:26,<jane>,waiting for plane to
Edinburgh"
> >   [3] "2016-07-01,02:51:45,<john>,thinking about my
boo"
> >   [4] "2016-07-01,02:52:07,<jane>,nothing crappy has
happened, not really"
> >   [5] "2016-07-01,02:52:20,<john>,plane went by pretty
fast, didn't sleep"
> >   [6] "2016-07-01,02:54:08,<jane>,no idea what time it is
or where I am
> > really"
> >   [7] "2016-07-01,02:54:17,<john>,just know it's
london"
> >   [8] "2016-07-01,02:56:44,<jane>,you are probably
asleep"
> >   [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in
a good eay"
> > [10] "2016-07-01 02:58:56 <jone>"
> > [11] "2016-07-01 02:59:34 <jane>"
> > [12] "2016-07-01,03:02:48,<john>,British security is a
little more
> > rigorous..."
> >
> >
> > I made note of the fact that the 10th and 11th lines had no commas.
> >
> > >
> > >> test2 [84]
> > > [1] "2016-06-28 21:12:43 *** John Doe ended a video
chat"
> >
> > That line didn't have any "<" so wasn't matched.
> >
> >
> > You could remove all none matching lines for pattern of
> >
> >
dates<space>times<space>"<"<name>">"<space><anything>
> >
> >
> > with:
> >
> >
> > chrvec <- chrvec[ grepl("^.{10} .{8} <.+> .+$)",
chrvec)]
> >
> >
> > Do read:
> >
> > ?read.csv
> >
> > ?regex
> >
> >
> > --
> >
> > David
> >
> >
> > >> test2 [85]
> > > [1] "//1,//2,//3,//4"
> > >> test [85]
> > > [1] "2016-07-01 02:50:35 <John Doe> hey"
> > >
> > > Notice how I toggled back and forth between test and test2 there.
So,
> > > whatever happened with the regex, it happened in the switch from
84 to
> > > 85, I guess. It went on like
> > >
> > > [990] "//1,//2,//3,//4"
> > >   [991] "//1,//2,//3,//4"
> > >   [992] "//1,//2,//3,//4"
> > >   [993] "//1,//2,//3,//4"
> > >   [994] "//1,//2,//3,//4"
> > >   [995] "//1,//2,//3,//4"
> > >   [996] "//1,//2,//3,//4"
> > >   [997] "//1,//2,//3,//4"
> > >   [998] "//1,//2,//3,//4"
> > >   [999] "//1,//2,//3,//4"
> > > [1000] "//1,//2,//3,//4"
> > >
> > > up until line 1000, then I reached max.print.
> >
> > > Michael
> > >
> > > On Thu, May 16, 2019 at 1:05 PM David Winsemius <dwinsemius at
comcast.net>
> > wrote:
> > >>
> > >> On 5/16/19 12:30 PM, Michael Boulineau wrote:
> > >>> Thanks for this tip on etiquette, David. I will be sure
and not do
> > that again.
> > >>>
> > >>> I tried the read.fwf from the foreign package, with a
code like this:
> > >>>
> > >>>    d <-
read.fwf("hangouts-conversation.txt",
> > >>>                   widths= c(10,10,20,40),
> > >>>                  
col.names=c("date","time","person","comment"),
> > >>>                   strip.white=TRUE)
> > >>>
> > >>> But it threw this error:
> > >>>
> > >>> Error in scan(file = file, what = what, sep = sep, quote
= quote, dec
> > = dec,  :
> > >>>     line 6347 did not have 4 elements
> > >>
> > >> So what does line 6347 look like? (Use `readLines` and print
it out.)
> > >>
> > >>> Interestingly, though, the error only happened when I
increased the
> > >>> width size. But I had to increase the size, or else I
couldn't "see"
> > >>> anything.  The comment was so small that nothing was
being captured by
> > >>> the size of the column. so to speak.
> > >>>
> > >>> It seems like what's throwing me is that there's
no comma that
> > >>> demarcates the end of the text proper. For example:
> > >> Not sure why you thought there should be a comma. Lines
usually end
> > >> with  <cr> and or a <lf>.
> > >>
> > >>
> > >> Once you have the raw text in a character vector from
`readLines` named,
> > >> say, 'chrvec', then you could selectively substitute
commas for spaces
> > >> with regex. (Now that you no longer desire to remove the
dates and
> > times.)
> > >>
> > >> sub("^(.{10}) (.{8}) (<.+>) (.+$)",
"//1,//2,//3,//4", chrvec)
> > >>
> > >> This will not do any replacements when the pattern is not
matched. See
> > >> this test:
> > >>
> > >>
> > >>   > newvec <- sub("^(.{10}) (.{8}) (<.+>)
(.+$)", "\\1,\\2,\\3,\\4",
> > chrvec)
> > >>   > newvec
> > >>    [1] "2016-07-01,02:50:35,<john>,hey"
> > >>    [2] "2016-07-01,02:51:26,<jane>,waiting for
plane to Edinburgh"
> > >>    [3] "2016-07-01,02:51:45,<john>,thinking about
my boo"
> > >>    [4] "2016-07-01,02:52:07,<jane>,nothing crappy
has happened, not
> > really"
> > >>    [5] "2016-07-01,02:52:20,<john>,plane went by
pretty fast, didn't
> > sleep"
> > >>    [6] "2016-07-01,02:54:08,<jane>,no idea what
time it is or where I am
> > >> really"
> > >>    [7] "2016-07-01,02:54:17,<john>,just know
it's london"
> > >>    [8] "2016-07-01,02:56:44,<jane>,you are
probably asleep"
> > >>    [9] "2016-07-01,02:58:45,<jane>,I hope fish was
fishy in a good eay"
> > >> [10] "2016-07-01 02:58:56 <jone>"
> > >> [11] "2016-07-01 02:59:34 <jane>"
> > >> [12] "2016-07-01,03:02:48,<john>,British security
is a little more
> > >> rigorous..."
> > >>
> > >>
> > >> You should probably remove the "empty comment"
lines.
> > >>
> > >>
> > >> --
> > >>
> > >> David.
> > >>
> > >>> 2016-07-01 15:34:30 <John Doe> Lame. We were in a
starbucks2016-07-01
> > >>> 15:35:02 <Jane Doe> Hmm that's
interesting2016-07-01 15:35:09 <Jane
> > >>> Doe> You must want coffees2016-07-01 15:35:25 <John
Doe> There was
> > >>> lots of Starbucks in my day2016-07-01 15:35:47
> > >>>
> > >>> It was interesting, too, when I pasted the text into the
email, it
> > >>> self-formatted into the way I wanted it to look. I had to
manually
> > >>> make it look like it does above, since that's the way
that it looks in
> > >>> the txt file. I wonder if it's being organized by XML
or something.
> > >>>
> > >>> Anyways, There's always a space between the two
sideways carrots, just
> > >>> like there is right now: <John Doe> See. Space. And
there's always a
> > >>> space between the data and time. Like this. 2016-07-01
15:34:30 See.
> > >>> Space. But there's never a space between the end of
the comment and
> > >>> the next date. Like this: We were in a
starbucks2016-07-01 15:35:02
> > >>> See. starbucks and 2016 are smooshed together.
> > >>>
> > >>> This code is also on the table right now too.
> > >>>
> > >>> a <- read.table("E:/working
> > >>> directory/-189/hangouts-conversation2.txt",
quote="\"",
> > >>> comment.char="", fill=TRUE)
> > >>>
> > >>>
> >
h<-cbind(hangouts.conversation2[,1:2],hangouts.conversation2[,3:5],hangouts.conversation2[,6:9])
> > >>>
> > >>> aa<-gsub("[^[:digit:]]","",h)
> > >>> my.data.num <- as.numeric(str_extract(h,
"[0-9]+"))
> > >>>
> > >>> Those last lines are a work in progress. I wish I could
import a
> > >>> picture of what it looks like when it's translated
into a data frame.
> > >>> The fill=TRUE helped to get the data in table that kind
of sort of
> > >>> works, but the comments keep bleeding into the data and
time column.
> > >>> It's like
> > >>>
> > >>> 2016-07-01 15:59:17 <Jane Doe> Seriously I've
never been
> > >>> over               there
> > >>> 2016-07-01 15:59:27 <Jane Doe> It confuses me :(
> > >>>
> > >>> And then, maybe, the "seriously" will be in a
column all to itself, as
> > >>> will be the "I've'"and the
"never" etc.
> > >>>
> > >>> I will use a regular expression if I have to, but it
would be nice to
> > >>> keep the dates and times on there. Originally, I thought
they were
> > >>> meaningless, but I've since changed my mind on that
count. The time of
> > >>> day isn't so important. But, especially since, say,
Gmail itself knows
> > >>> how to quickly recognize what it is, I know it can be
done. I know
> > >>> this data has structure to it.
> > >>>
> > >>> Michael
> > >>>
> > >>>
> > >>>
> > >>> On Wed, May 15, 2019 at 8:47 PM David Winsemius <
> > dwinsemius at comcast.net> wrote:
> > >>>> On 5/15/19 4:07 PM, Michael Boulineau wrote:
> > >>>>> I have a wild and crazy text file, the head of
which looks like this:
> > >>>>>
> > >>>>> 2016-07-01 02:50:35 <john> hey
> > >>>>> 2016-07-01 02:51:26 <jane> waiting for
plane to Edinburgh
> > >>>>> 2016-07-01 02:51:45 <john> thinking about
my boo
> > >>>>> 2016-07-01 02:52:07 <jane> nothing crappy
has happened, not really
> > >>>>> 2016-07-01 02:52:20 <john> plane went by
pretty fast, didn't sleep
> > >>>>> 2016-07-01 02:54:08 <jane> no idea what
time it is or where I am
> > really
> > >>>>> 2016-07-01 02:54:17 <john> just know
it's london
> > >>>>> 2016-07-01 02:56:44 <jane> you are probably
asleep
> > >>>>> 2016-07-01 02:58:45 <jane> I hope fish was
fishy in a good eay
> > >>>>> 2016-07-01 02:58:56 <jone>
> > >>>>> 2016-07-01 02:59:34 <jane>
> > >>>>> 2016-07-01 03:02:48 <john> British security
is a little more
> > rigorous...
> > >>>> Looks entirely not-"crazy". Typical log
file format.
> > >>>>
> > >>>> Two possibilities: 1) Use `read.fwf` from pkg
foreign; 2) Use regex
> > >>>> (i.e. the sub-function) to strip everything up to the
"<". Read
> > >>>> `?regex`. Since that's not a metacharacters you
could use a pattern
> > >>>> ".+<" and replace with "".
> > >>>>
> > >>>> And do read the Posting Guide. Cross-posting to
StackOverflow and
> > Rhelp,
> > >>>> at least within hours of each, is considered poor
manners.
> > >>>>
> > >>>>
> > >>>> --
> > >>>>
> > >>>> David.
> > >>>>
> > >>>>> It goes on for a while. It's a big file. But
I feel like it's going
> > to
> > >>>>> be difficult to annotate with the coreNLP library
or package. I'm
> > >>>>> doing natural language processing. In other
words, I'm curious as to
> > >>>>> how I would shave off the dates, that is, to make
it look like:
> > >>>>>
> > >>>>> <john> hey
> > >>>>> <jane> waiting for plane to Edinburgh
> > >>>>>     <john> thinking about my boo
> > >>>>> <jane> nothing crappy has happened, not
really
> > >>>>> <john> plane went by pretty fast,
didn't sleep
> > >>>>> <jane> no idea what time it is or where I
am really
> > >>>>> <john> just know it's london
> > >>>>> <jane> you are probably asleep
> > >>>>> <jane> I hope fish was fishy in a good eay
> > >>>>>     <jone>
> > >>>>> <jane>
> > >>>>> <john> British security is a little more
rigorous...
> > >>>>>
> > >>>>> To be clear, then, I'm trying to clean a
large text file by writing a
> > >>>>> regular expression? such that I create a new
object with no numbers
> > or
> > >>>>> dates.
> > >>>>>
> > >>>>> Michael
> > >>>>>
> > >>>>> ______________________________________________
> > >>>>> R-help at r-project.org mailing list -- To
UNSUBSCRIBE and more, see
> > >>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> > >>>>> PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > >>>>> and provide commented, minimal, self-contained,
reproducible code.

William Dunlap

2019-May-17 19:12 UTC

[R] how to separate string from numbers in a large txt file

The pattern I gave worked for the lines that you originally showed from the
data file ('a'), before you put commas into them.  If the name is either of
the form "<name>" or "***" then the "(<[^>]*>)" needs to be changed to
something like "(<[^>]*>|[*]{3})".
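
As a rough sketch of that change (not code from the thread as posted): the
sample lines below are copied from messages quoted here, and splitting the
timestamp into separate date and time captures, along with the column names
date, time, person, and comment, is an assumption made to match the four
columns Michael asked for.

```r
# Sample lines modelled on the ones quoted in this thread.
x <- c("2016-06-28 21:12:43 *** John Doe ended a video chat",
       "2016-07-01 02:50:35 <John Doe> hey",
       "2016-10-21 10:56:37 <John Doe> Admit#8242",
       "October 23, 1819 12:34 <Jane Eyre> I am not an angel")

# Four capture groups; the third accepts either "<name>" or "***".
pattern <- paste0(
  "^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}) +",  # date
  "([[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +",   # time
  "(<[^>]*>|[*]{3}) *",                                 # person or ***
  "(.*)$")                                              # comment

z <- strcapture(pattern, x,
                proto = data.frame(date = "", time = "", person = "",
                                   comment = "", stringsAsFactors = FALSE))

# Keep only the lines that matched and write them out; the tempfile() path
# is a placeholder for wherever the csv should go.
matched <- z[!is.na(z$date), ]
out <- tempfile(fileext = ".csv")
write.csv(matched, out, row.names = FALSE)
```

Lines that still fail to match come back as NA in every column, so they can
be inspected separately, e.g. with x[is.na(z$date)].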

The "???" at the start of the imported data may come from the byte order
mark that Windows apps like to put at the front of a text file in UTF-8 or
UTF-16 format.
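
One way to check for a UTF-8 byte order mark is to look at the first three
bytes of the file. This is only a sketch: the temporary file below is a
made-up stand-in for the real data file, and a UTF-16 mark would need a
different comparison.

```r
# Build a stand-in file that begins with the UTF-8 byte order mark
# (bytes EF BB BF). The path and contents are illustrative only.
path <- tempfile(fileext = ".txt")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)),
           charToRaw("2016-01-27 09:14:40 *** Jane Doe started a video chat\n")),
         path)

# Read the first three bytes as raw and compare them with the mark.
lead <- readBin(path, what = "raw", n = 3L)
has_utf8_bom <- identical(lead, as.raw(c(0xEF, 0xBB, 0xBF)))
has_utf8_bom
```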

Bill Dunlap
TIBCO Software
wdunlap tibco.com


On Fri, May 17, 2019 at 11:53 AM Michael Boulineau <
michael.p.boulineau at gmail.com> wrote:
> This seemed to work:
>
> > a <- readLines ("hangouts-conversation-6.csv.txt")
> > b <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", a)
> > b [1:84]
>
> And the first 84 lines look like this:
>
> [83] "2016-06-28 21:02:28 *** Jane Doe started a video chat"
> [84] "2016-06-28 21:12:43 *** John Doe ended a video chat"
>
> Then they transition to the commas:
>
> > b [84:100]
>  [1] "2016-06-28 21:12:43 *** John Doe ended a video chat"
>  [2] "2016-07-01,02:50:35,<John Doe>,hey"
>  [3] "2016-07-01,02:51:26,<John Doe>,waiting for plane to Edinburgh"
>  [4] "2016-07-01,02:51:45,<John Doe>,thinking about my boo"
>
> Even the strange bit on line 6347 was caught by this:
>
> > b [6346:6348]
> [1] "2016-10-21,10:56:29,<John Doe>,John_Doe"
> [2] "2016-10-21,10:56:37,<John Doe>,Admit#8242"
> [3] "2016-10-21,11:00:13,<Jane Doe>,Okay so you have a discussion"
>
> Perhaps most awesomely, the code catches spaces that are interposed
> into the comment itself:
>
> > b [4]
> [1] "2016-01-27,09:15:20,<Jane Doe>,Hey "
> > b [85]
> [1] "2016-07-01,02:50:35,<John Doe>,hey"
>
> Notice whether there is a space after the "hey" or not.
>
> These are the first two lines:
>
> [1] "???2016-01-27 09:14:40 *** Jane Doe started a video chat"
> [2] "2016-01-27,09:15:20,<Jane Doe>,
> https://lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/GVqutPqbSuo/s0/be8ded30-87a6-4e80-bdfa-83ed51591dbf
> "
>
> So, who knows what happened with the ??? at the beginning of [1]
> directly above. But notice how there are no commas in [1] but there
> appear in [2]. I don't see why really long ones like [2] directly
> above would be a problem, were they to be translated into a csv or
> data frame column.
>
> Now, with the commas in there, couldn't we write this into a csv or a
> data.frame? Some of this data will end up being garbage, I imagine.
> Like in [2] directly above. Or with [83] and [84] at the top of this
> discussion post/email. Embarrassingly, I've been trying to convert
> this into a data.frame or csv but I can't manage to. I've been using
> the write.csv function, but I don't think I've been getting the
> arguments correct.
>
> At the end of the day, I would like a data.frame and/or csv with the
> following four columns: date, time, person, comment.
>
> I tried this, too:
>
> > c <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}
> + [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)",
> +                 a, proto=data.frame(stringsAsFactors=FALSE, When="", Who="",
> +                                     What=""))
>
> But all I got was this:
>
> > c [1:100, ]
>     When  Who What
> 1   <NA> <NA> <NA>
> 2   <NA> <NA> <NA>
> 3   <NA> <NA> <NA>
> 4   <NA> <NA> <NA>
> 5   <NA> <NA> <NA>
> 6   <NA> <NA> <NA>
>
> It seems to have caught nothing.
>
> > unique (c)
>   When  Who What
> 1 <NA> <NA> <NA>
>
> But I like that it converted into columns. That's a really great
> format. With a little tweaking, it'd be a great code for this data
> set.
>
> Michael
>

Ivan Krylov

2019-May-17 19:43 UTC

[R] how to separate string from numbers in a large txt file

On Fri, 17 May 2019 11:36:22 -0700
Michael Boulineau <michael.p.boulineau at gmail.com> wrote:
> So, who knows what happened with the ??? at the beginning of [1]
> directly above.
 perl -Mutf8 -MEncode=encode,decode -Mcharnames=:full \
 -E'say charnames::viacode ord decode utf8 => encode latin1 => "ï»¿"'
# ZERO WIDTH NO-BREAK SPACE

So the text seems to have been encoded in UTF-8, then decoded as
Latin-1. If you have multiple such artefacts and want to get rid of
them, try:

a <- readLines(con <- file("hangouts-conversation-6.csv.txt",
                           encoding = "UTF-8")); close(con); rm(con)
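
If the mark itself should also be dropped while reading, file() accepts
encoding = "UTF-8-BOM", which decodes as UTF-8 and skips a leading byte
order mark when one is present. A small sketch, using a made-up temporary
file in place of the real one:

```r
# Build a small stand-in file that starts with a UTF-8 byte order mark.
# The path and its contents are made up for illustration.
path <- tempfile(fileext = ".txt")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)),
           charToRaw("2016-01-27 09:14:40 *** Jane Doe started a video chat\n")),
         path)

# "UTF-8-BOM" asks the connection to skip the mark while decoding as UTF-8.
con <- file(path, encoding = "UTF-8-BOM")
a <- readLines(con)
close(con)

# The first element should now begin with the date rather than the mark.
substr(a[1], 1, 10)
```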

-- 
Best regards,
Ivan
