Michael Boulineau
2019-May-16 22:53 UTC
[R] how to separate string from numbers in a large txt file
OK. So, I named the object test and then checked the 6347th item:

> test <- readLines("hangouts-conversation.txt")
> test[6347]
[1] "2016-10-21 10:56:37 <John Doe> Admit#8242"

Perhaps where it was getting screwed up is that, since this ends in a
number (8242) and there's no space between that number and what ought
to be the next row, R didn't know where to draw the line. Sure enough,
it looks like this when I go to the original file and Ctrl-F "#8242":

2016-10-21 10:35:36 <Jane Doe> What's your login
2016-10-21 10:56:29 <John Doe> John_Doe
2016-10-21 10:56:37 <John Doe> Admit#8242
2016-10-21 11:00:13 <Jane Doe> Okay so you have a discussion

Again, it doesn't look like that in the file. Gmail automatically
formats it like that when I paste it in. More to the point, it looks
like

2016-10-21 10:35:36 <Jane Doe> What's your login2016-10-21 10:56:29 <John Doe> John_Doe2016-10-21 10:56:37 <John Doe> Admit#82422016-10-21 11:00:13 <Jane Doe> Okay so you have a discussion

Notice Admit#82422016. So there's that.

Then I built object test2:

test2 <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", test)

This worked for 84 lines, then this happened:

> test2[84]
[1] "2016-06-28 21:12:43 *** John Doe ended a video chat"
> test2[85]
[1] "//1,//2,//3,//4"
> test[85]
[1] "2016-07-01 02:50:35 <John Doe> hey"

Notice how I toggled back and forth between test and test2 there. So,
whatever happened with the regex, it happened in the switch from 84 to
85, I guess. It went on like

 [990] "//1,//2,//3,//4"
 [991] "//1,//2,//3,//4"
 [992] "//1,//2,//3,//4"
 [993] "//1,//2,//3,//4"
 [994] "//1,//2,//3,//4"
 [995] "//1,//2,//3,//4"
 [996] "//1,//2,//3,//4"
 [997] "//1,//2,//3,//4"
 [998] "//1,//2,//3,//4"
 [999] "//1,//2,//3,//4"
[1000] "//1,//2,//3,//4"

up until line 1000, when I reached max.print.

Michael

On Thu, May 16, 2019 at 1:05 PM David Winsemius <dwinsemius at comcast.net> wrote:
>
> On 5/16/19 12:30 PM, Michael Boulineau wrote:
> > Thanks for this tip on etiquette, David. I will be sure and not do that again.
> >
> > I tried read.fwf from the foreign package, with code like this:
> >
> > d <- read.fwf("hangouts-conversation.txt",
> >               widths = c(10,10,20,40),
> >               col.names = c("date","time","person","comment"),
> >               strip.white = TRUE)
> >
> > But it threw this error:
> >
> > Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
> >   line 6347 did not have 4 elements
>
> So what does line 6347 look like? (Use `readLines` and print it out.)
>
> > Interestingly, though, the error only happened when I increased the
> > width size. But I had to increase the size, or else I couldn't "see"
> > anything. The comment was so small that nothing was being captured by
> > the size of the column, so to speak.
> >
> > It seems like what's throwing me is that there's no comma that
> > demarcates the end of the text proper. For example:
>
> Not sure why you thought there should be a comma. Lines usually end
> with a <cr> and/or a <lf>.
>
> Once you have the raw text in a character vector from `readLines` named,
> say, 'chrvec', then you could selectively substitute commas for spaces
> with regex. (Now that you no longer desire to remove the dates and times.)
>
> sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", chrvec)
>
> This will not do any replacements when the pattern is not matched. See
> this test:
>
> > newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)
> > newvec
>  [1] "2016-07-01,02:50:35,<john>,hey"
>  [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh"
>  [3] "2016-07-01,02:51:45,<john>,thinking about my boo"
>  [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened, not really"
>  [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, didn't sleep"
>  [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or where I am really"
>  [7] "2016-07-01,02:54:17,<john>,just know it's london"
>  [8] "2016-07-01,02:56:44,<jane>,you are probably asleep"
>  [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good eay"
> [10] "2016-07-01 02:58:56 <jone>"
> [11] "2016-07-01 02:59:34 <jane>"
> [12] "2016-07-01,03:02:48,<john>,British security is a little more rigorous..."
>
> You should probably remove the "empty comment" lines.
>
> --
> David.
>
> > 2016-07-01 15:34:30 <John Doe> Lame. We were in a starbucks2016-07-01
> > 15:35:02 <Jane Doe> Hmm that's interesting2016-07-01 15:35:09 <Jane
> > Doe> You must want coffees2016-07-01 15:35:25 <John Doe> There was
> > lots of Starbucks in my day2016-07-01 15:35:47
> >
> > It was interesting, too, when I pasted the text into the email, it
> > self-formatted into the way I wanted it to look. I had to manually
> > make it look like it does above, since that's the way that it looks in
> > the txt file. I wonder if it's being organized by XML or something.
> >
> > Anyways, there's always a space between the two angle brackets, just
> > like there is right now: <John Doe> See. Space. And there's always a
> > space between the date and time. Like this: 2016-07-01 15:34:30 See.
> > Space. But there's never a space between the end of the comment and
> > the next date. Like this: We were in a starbucks2016-07-01 15:35:02
> > See. starbucks and 2016 are smooshed together.
> >
> > This code is also on the table right now too.
> >
> > a <- read.table("E:/working
> > directory/-189/hangouts-conversation2.txt", quote="\"",
> > comment.char="", fill=TRUE)
> >
> > h <- cbind(hangouts.conversation2[,1:2],
> >            hangouts.conversation2[,3:5],
> >            hangouts.conversation2[,6:9])
> >
> > aa <- gsub("[^[:digit:]]", "", h)
> > my.data.num <- as.numeric(str_extract(h, "[0-9]+"))
> >
> > Those last lines are a work in progress. I wish I could import a
> > picture of what it looks like when it's translated into a data frame.
> > The fill=TRUE helped to get the data into a table that kind of sort of
> > works, but the comments keep bleeding into the date and time columns.
> > It's like
> >
> > 2016-07-01 15:59:17 <Jane Doe> Seriously I've never been
> > over there
> > 2016-07-01 15:59:27 <Jane Doe> It confuses me :(
> >
> > And then, maybe, the "Seriously" will be in a column all to itself, as
> > will be the "I've" and the "never", etc.
> >
> > I will use a regular expression if I have to, but it would be nice to
> > keep the dates and times on there. Originally, I thought they were
> > meaningless, but I've since changed my mind on that count. The time of
> > day isn't so important. But, especially since, say, Gmail itself knows
> > how to quickly recognize what it is, I know it can be done. I know
> > this data has structure to it.
> >
> > Michael
> >
> > On Wed, May 15, 2019 at 8:47 PM David Winsemius <dwinsemius at comcast.net> wrote:
> >>
> >> On 5/15/19 4:07 PM, Michael Boulineau wrote:
> >>> I have a wild and crazy text file, the head of which looks like this:
> >>>
> >>> 2016-07-01 02:50:35 <john> hey
> >>> 2016-07-01 02:51:26 <jane> waiting for plane to Edinburgh
> >>> 2016-07-01 02:51:45 <john> thinking about my boo
> >>> 2016-07-01 02:52:07 <jane> nothing crappy has happened, not really
> >>> 2016-07-01 02:52:20 <john> plane went by pretty fast, didn't sleep
> >>> 2016-07-01 02:54:08 <jane> no idea what time it is or where I am really
> >>> 2016-07-01 02:54:17 <john> just know it's london
> >>> 2016-07-01 02:56:44 <jane> you are probably asleep
> >>> 2016-07-01 02:58:45 <jane> I hope fish was fishy in a good eay
> >>> 2016-07-01 02:58:56 <jone>
> >>> 2016-07-01 02:59:34 <jane>
> >>> 2016-07-01 03:02:48 <john> British security is a little more rigorous...
> >> Looks entirely not-"crazy". Typical log-file format.
> >>
> >> Two possibilities: 1) Use `read.fwf` from pkg foreign; 2) Use regex
> >> (i.e. the sub function) to strip everything up to the "<". Read
> >> `?regex`. Since "<" is not a metacharacter, you could use the pattern
> >> ".+<" and replace with "".
> >>
> >> And do read the Posting Guide. Cross-posting to StackOverflow and R-help,
> >> at least within hours of each other, is considered poor manners.
> >>
> >> --
> >>
> >> David.
> >>
> >>> It goes on for a while. It's a big file. But I feel like it's going to
> >>> be difficult to annotate with the coreNLP library or package. I'm
> >>> doing natural language processing. In other words, I'm curious as to
> >>> how I would shave off the dates, that is, to make it look like:
> >>>
> >>> <john> hey
> >>> <jane> waiting for plane to Edinburgh
> >>> <john> thinking about my boo
> >>> <jane> nothing crappy has happened, not really
> >>> <john> plane went by pretty fast, didn't sleep
> >>> <jane> no idea what time it is or where I am really
> >>> <john> just know it's london
> >>> <jane> you are probably asleep
> >>> <jane> I hope fish was fishy in a good eay
> >>> <jone>
> >>> <jane>
> >>> <john> British security is a little more rigorous...
> >>>
> >>> To be clear, then, I'm trying to clean a large text file by writing a
> >>> regular expression, such that I create a new object with no numbers or
> >>> dates.
> >>>
> >>> Michael
> >>>
> >>> ______________________________________________
> >>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >>> and provide commented, minimal, self-contained, reproducible code.
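The "smooshed" records Michael describes (e.g. Admit#82422016-10-21) can be pulled apart before any parsing. A minimal sketch, assuming the whole file arrives as one long string with the timestamp format shown in the thread (the sample string below is hypothetical, not from the real file):

```r
# One long string: three records run together with nothing between them.
smooshed <- paste0("2016-10-21 10:56:29 <John Doe> John_Doe",
                   "2016-10-21 10:56:37 <John Doe> Admit#8242",
                   "2016-10-21 11:00:13 <Jane Doe> Okay so you have a discussion")

# Insert a newline in front of every "YYYY-MM-DD HH:MM:SS <" stamp,
# then split on those newlines.
stamp   <- "([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} <)"
records <- strsplit(gsub(stamp, "\n\\1", smooshed), "\n")[[1]]
records <- records[nzchar(records)]  # drop the empty piece before the first stamp
```

This is a heuristic, not a guarantee: it would mis-split if a chat message itself contained a full timestamp followed by "<".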
David Winsemius
2019-May-17 03:29 UTC
[R] how to separate string from numbers in a large txt file
On 5/16/19 3:53 PM, Michael Boulineau wrote:
> OK. So, I named the object test and then checked the 6347th item
>
>> test <- readLines("hangouts-conversation.txt")
>> test[6347]
> [1] "2016-10-21 10:56:37 <John Doe> Admit#8242"
>
> Perhaps where it was getting screwed up is, since the end of this is a
> number (8242), then, given that there's no space between the number
> and what ought to be the next row, R didn't know where to draw the
> line. Sure enough, it looks like this when I go to the original file
> and Ctrl-F "#8242"
>
> 2016-10-21 10:35:36 <Jane Doe> What's your login
> 2016-10-21 10:56:29 <John Doe> John_Doe
> 2016-10-21 10:56:37 <John Doe> Admit#8242

An octothorpe ("#") is interpreted by the read functions as the start of
a comment, so the rest of the line is dropped. You can prevent that
interpretation with a suitable choice of parameters to `read.table` or
`read.csv` (e.g. comment.char = ""). I don't understand why that should
cause any error or a failure to match that pattern.

> 2016-10-21 11:00:13 <Jane Doe> Okay so you have a discussion
>
> Again, it doesn't look like that in the file. Gmail automatically
> formats it like that when I paste it in. More to the point, it looks
> like
>
> 2016-10-21 10:35:36 <Jane Doe> What's your login2016-10-21 10:56:29
> <John Doe> John_Doe2016-10-21 10:56:37 <John Doe> Admit#82422016-10-21
> 11:00:13 <Jane Doe> Okay so you have a discussion
>
> Notice Admit#82422016. So there's that.
>
> Then I built object test2.
>
> test2 <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", test)
>
> This worked for 84 lines, then this happened.

It may have done something, but as you later discovered my first code for
the pattern was incorrect. I had tested it (and pasted in the results of
the test). The way to refer to a capture class is with back-slashes
before the numbers, not forward-slashes. Try this:

> newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)
> newvec
 [1] "2016-07-01,02:50:35,<john>,hey"
 [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh"
 [3] "2016-07-01,02:51:45,<john>,thinking about my boo"
 [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened, not really"
 [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, didn't sleep"
 [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or where I am really"
 [7] "2016-07-01,02:54:17,<john>,just know it's london"
 [8] "2016-07-01,02:56:44,<jane>,you are probably asleep"
 [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good eay"
[10] "2016-07-01 02:58:56 <jone>"
[11] "2016-07-01 02:59:34 <jane>"
[12] "2016-07-01,03:02:48,<john>,British security is a little more rigorous..."

I made note of the fact that the 10th and 11th lines had no commas.

>> test2[84]
> [1] "2016-06-28 21:12:43 *** John Doe ended a video chat"

That line didn't have any "<" so wasn't matched.

You could remove all non-matching lines for a pattern of

date<space>time<space>"<"name">"<space>anything

with:

chrvec <- chrvec[ grepl("^.{10} .{8} <.+> .+$", chrvec) ]

Do read:

?read.csv

?regex

--

David
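The //-versus-\\ backreference slip discussed above is easy to reproduce on a small sample. A sketch using a hypothetical two-line vector, one line of which lacks the <name> tag:

```r
chrvec <- c("2016-07-01 02:50:35 <john> hey",
            "2016-06-28 21:12:43 *** John Doe ended a video chat")

# "//1" is just literal text, so every matching line is replaced wholesale;
# "\\1" is a real backreference to the first capture group.
wrong <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", chrvec)
right <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)

# Non-matching lines (no "<...>" speaker tag) pass through sub() unchanged;
# grepl() can then drop them. Note: no stray ")" inside the pattern string.
keep <- grepl("^.{10} .{8} <.+> .+$", chrvec)
filtered <- chrvec[keep]
```

Lines that fail the pattern are left verbatim by sub(), which is why Michael's test2[84] looked untouched while test2[85] became "//1,//2,//3,//4".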
William Dunlap
2019-May-17 15:20 UTC
[R] how to separate string from numbers in a large txt file
Consider using readLines() and strcapture() for reading such a file.
E.g., suppose readLines(files) produced a character vector like

x <- c("2016-10-21 10:35:36 <Jane Doe> What's your login",
       "2016-10-21 10:56:29 <John Doe> John_Doe",
       "2016-10-21 10:56:37 <John Doe> Admit#8242",
       "October 23, 1819 12:34 <Jane Eyre> I am not an angel")

Then you can make a data.frame with columns When, Who, and What by
supplying a pattern containing three parenthesized capture expressions:

> z <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)",
+                 x,
+                 proto = data.frame(stringsAsFactors = FALSE,
+                                    When = "", Who = "", What = ""))
> str(z)
'data.frame':   4 obs. of  3 variables:
 $ When: chr  "2016-10-21 10:35:36" "2016-10-21 10:56:29" "2016-10-21 10:56:37" NA
 $ Who : chr  "<Jane Doe>" "<John Doe>" "<John Doe>" NA
 $ What: chr  "What's your login" "John_Doe" "Admit#8242" NA

Lines that don't match the pattern result in NAs - you might make a
second pass over the corresponding elements of x with a new pattern.
You can convert the When column from character to time with as.POSIXct().

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Thu, May 16, 2019 at 8:30 PM David Winsemius <dwinsemius at comcast.net> wrote:
>
> On 5/16/19 3:53 PM, Michael Boulineau wrote:
> > OK. So, I named the object test and then checked the 6347th item
> >
> >> test <- readLines("hangouts-conversation.txt")
> >> test[6347]
> > [1] "2016-10-21 10:56:37 <John Doe> Admit#8242"
> >
> > Perhaps where it was getting screwed up is, since the end of this is a
> > number (8242), then, given that there's no space between the number
> > and what ought to be the next row, R didn't know where to draw the
> > line.
Sure enough, it looks like this when I go to the original file > > and control f "#8242" > > > > 2016-10-21 10:35:36 <Jane Doe> What's your login > > 2016-10-21 10:56:29 <John Doe> John_Doe > > 2016-10-21 10:56:37 <John Doe> Admit#8242 > > > An octothorpe is an end of line signifier and is interpreted as allowing > comments. You can prevent that interpretation with suitable choice of > parameters to `read.table` or `read.csv`. I don't understand why that > should cause anu error or a failure to match that pattern. > > > 2016-10-21 11:00:13 <Jane Doe> Okay so you have a discussion > > > > Again, it doesn't look like that in the file. Gmail automatically > > formats it like that when I paste it in. More to the point, it looks > > like > > > > 2016-10-21 10:35:36 <Jane Doe> What's your login2016-10-21 10:56:29 > > <John Doe> John_Doe2016-10-21 10:56:37 <John Doe> Admit#82422016-10-21 > > 11:00:13 <Jane Doe> Okay so you have a discussion > > > > Notice Admit#82422016. So there's that. > > > > Then I built object test2. > > > > test2 <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", test) > > > > This worked for 84 lines, then this happened. > > It may have done something but as you later discovered my first code for > the pattern was incorrect. I had tested it (and pasted in the results of > the test) . The way to refer to a capture class is with back-slashes > before the numbers, not forward-slashes. 
Try this: > > > > newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec) > > newvec > [1] "2016-07-01,02:50:35,<john>,hey" > [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh" > [3] "2016-07-01,02:51:45,<john>,thinking about my boo" > [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened, not really" > [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, didn't sleep" > [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or where I am > really" > [7] "2016-07-01,02:54:17,<john>,just know it's london" > [8] "2016-07-01,02:56:44,<jane>,you are probably asleep" > [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good eay" > [10] "2016-07-01 02:58:56 <jone>" > [11] "2016-07-01 02:59:34 <jane>" > [12] "2016-07-01,03:02:48,<john>,British security is a little more > rigorous..." > > > I made note of the fact that the 10th and 11th lines had no commas. > > > > >> test2 [84] > > [1] "2016-06-28 21:12:43 *** John Doe ended a video chat" > > That line didn't have any "<" so wasn't matched. > > > You could remove all none matching lines for pattern of > > dates<space>times<space>"<"<name>">"<space><anything> > > > with: > > > chrvec <- chrvec[ grepl("^.{10} .{8} <.+> .+$)", chrvec)] > > > Do read: > > ?read.csv > > ?regex > > > -- > > David > > > >> test2 [85] > > [1] "//1,//2,//3,//4" > >> test [85] > > [1] "2016-07-01 02:50:35 <John Doe> hey" > > > > Notice how I toggled back and forth between test and test2 there. So, > > whatever happened with the regex, it happened in the switch from 84 to > > 85, I guess. It went on like > > > > [990] "//1,//2,//3,//4" > > [991] "//1,//2,//3,//4" > > [992] "//1,//2,//3,//4" > > [993] "//1,//2,//3,//4" > > [994] "//1,//2,//3,//4" > > [995] "//1,//2,//3,//4" > > [996] "//1,//2,//3,//4" > > [997] "//1,//2,//3,//4" > > [998] "//1,//2,//3,//4" > > [999] "//1,//2,//3,//4" > > [1000] "//1,//2,//3,//4" > > > > up until line 1000, then I reached max.print. 
> > > Michael > > > > On Thu, May 16, 2019 at 1:05 PM David Winsemius <dwinsemius at comcast.net> > wrote: > >> > >> On 5/16/19 12:30 PM, Michael Boulineau wrote: > >>> Thanks for this tip on etiquette, David. I will be sure and not do > that again. > >>> > >>> I tried the read.fwf from the foreign package, with a code like this: > >>> > >>> d <- read.fwf("hangouts-conversation.txt", > >>> widths= c(10,10,20,40), > >>> col.names=c("date","time","person","comment"), > >>> strip.white=TRUE) > >>> > >>> But it threw this error: > >>> > >>> Error in scan(file = file, what = what, sep = sep, quote = quote, dec > = dec, : > >>> line 6347 did not have 4 elements > >> > >> So what does line 6347 look like? (Use `readLines` and print it out.) > >> > >>> Interestingly, though, the error only happened when I increased the > >>> width size. But I had to increase the size, or else I couldn't "see" > >>> anything. The comment was so small that nothing was being captured by > >>> the size of the column. so to speak. > >>> > >>> It seems like what's throwing me is that there's no comma that > >>> demarcates the end of the text proper. For example: > >> Not sure why you thought there should be a comma. Lines usually end > >> with <cr> and or a <lf>. > >> > >> > >> Once you have the raw text in a character vector from `readLines` named, > >> say, 'chrvec', then you could selectively substitute commas for spaces > >> with regex. (Now that you no longer desire to remove the dates and > times.) > >> > >> sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", chrvec) > >> > >> This will not do any replacements when the pattern is not matched. 
See > >> this test: > >> > >> > >> > newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", > chrvec) > >> > newvec > >> [1] "2016-07-01,02:50:35,<john>,hey" > >> [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh" > >> [3] "2016-07-01,02:51:45,<john>,thinking about my boo" > >> [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened, not > really" > >> [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, didn't > sleep" > >> [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or where I am > >> really" > >> [7] "2016-07-01,02:54:17,<john>,just know it's london" > >> [8] "2016-07-01,02:56:44,<jane>,you are probably asleep" > >> [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good eay" > >> [10] "2016-07-01 02:58:56 <jone>" > >> [11] "2016-07-01 02:59:34 <jane>" > >> [12] "2016-07-01,03:02:48,<john>,British security is a little more > >> rigorous..." > >> > >> > >> You should probably remove the "empty comment" lines. > >> > >> > >> -- > >> > >> David. > >> > >>> 2016-07-01 15:34:30 <John Doe> Lame. We were in a starbucks2016-07-01 > >>> 15:35:02 <Jane Doe> Hmm that's interesting2016-07-01 15:35:09 <Jane > >>> Doe> You must want coffees2016-07-01 15:35:25 <John Doe> There was > >>> lots of Starbucks in my day2016-07-01 15:35:47 > >>> > >>> It was interesting, too, when I pasted the text into the email, it > >>> self-formatted into the way I wanted it to look. I had to manually > >>> make it look like it does above, since that's the way that it looks in > >>> the txt file. I wonder if it's being organized by XML or something. > >>> > >>> Anyways, There's always a space between the two sideways carrots, just > >>> like there is right now: <John Doe> See. Space. And there's always a > >>> space between the data and time. Like this. 2016-07-01 15:34:30 See. > >>> Space. But there's never a space between the end of the comment and > >>> the next date. Like this: We were in a starbucks2016-07-01 15:35:02 > >>> See. 
starbucks and 2016 are smooshed together. > >>> > >>> This code is also on the table right now too. > >>> > >>> a <- read.table("E:/working > >>> directory/-189/hangouts-conversation2.txt", quote="\"", > >>> comment.char="", fill=TRUE) > >>> > >>> > h<-cbind(hangouts.conversation2[,1:2],hangouts.conversation2[,3:5],hangouts.conversation2[,6:9]) > >>> > >>> aa<-gsub("[^[:digit:]]","",h) > >>> my.data.num <- as.numeric(str_extract(h, "[0-9]+")) > >>> > >>> Those last lines are a work in progress. I wish I could import a > >>> picture of what it looks like when it's translated into a data frame. > >>> The fill=TRUE helped to get the data in table that kind of sort of > >>> works, but the comments keep bleeding into the data and time column. > >>> It's like > >>> > >>> 2016-07-01 15:59:17 <Jane Doe> Seriously I've never been > >>> over there > >>> 2016-07-01 15:59:27 <Jane Doe> It confuses me :( > >>> > >>> And then, maybe, the "seriously" will be in a column all to itself, as > >>> will be the "I've'"and the "never" etc. > >>> > >>> I will use a regular expression if I have to, but it would be nice to > >>> keep the dates and times on there. Originally, I thought they were > >>> meaningless, but I've since changed my mind on that count. The time of > >>> day isn't so important. But, especially since, say, Gmail itself knows > >>> how to quickly recognize what it is, I know it can be done. I know > >>> this data has structure to it. 
> >>>
> >>> Michael
> >>>
> >>> On Wed, May 15, 2019 at 8:47 PM David Winsemius <dwinsemius at comcast.net> wrote:
> >>>> On 5/15/19 4:07 PM, Michael Boulineau wrote:
> >>>>> I have a wild and crazy text file, the head of which looks like this:
> >>>>>
> >>>>> 2016-07-01 02:50:35 <john> hey
> >>>>> 2016-07-01 02:51:26 <jane> waiting for plane to Edinburgh
> >>>>> 2016-07-01 02:51:45 <john> thinking about my boo
> >>>>> 2016-07-01 02:52:07 <jane> nothing crappy has happened, not really
> >>>>> 2016-07-01 02:52:20 <john> plane went by pretty fast, didn't sleep
> >>>>> 2016-07-01 02:54:08 <jane> no idea what time it is or where I am really
> >>>>> 2016-07-01 02:54:17 <john> just know it's london
> >>>>> 2016-07-01 02:56:44 <jane> you are probably asleep
> >>>>> 2016-07-01 02:58:45 <jane> I hope fish was fishy in a good eay
> >>>>> 2016-07-01 02:58:56 <jone>
> >>>>> 2016-07-01 02:59:34 <jane>
> >>>>> 2016-07-01 03:02:48 <john> British security is a little more rigorous...
> >>>> Looks entirely not-"crazy". Typical log file format.
> >>>>
> >>>> Two possibilities: 1) Use `read.fwf` from pkg utils; 2) Use regex (i.e. the `sub` function) to strip everything up to the "<". Read `?regex`. Since that's not a metacharacter you could use the pattern ".+<" and replace with "".
> >>>>
> >>>> And do read the Posting Guide. Cross-posting to StackOverflow and R-help, at least within hours of each other, is considered poor manners.
> >>>>
> >>>> --
> >>>>
> >>>> David.
> >>>>
> >>>>> It goes on for a while. It's a big file. But I feel like it's going to be difficult to annotate with the coreNLP library or package. I'm doing natural language processing.
> >>>>> In other words, I'm curious as to how I would shave off the dates, that is, to make it look like:
> >>>>>
> >>>>> <john> hey
> >>>>> <jane> waiting for plane to Edinburgh
> >>>>> <john> thinking about my boo
> >>>>> <jane> nothing crappy has happened, not really
> >>>>> <john> plane went by pretty fast, didn't sleep
> >>>>> <jane> no idea what time it is or where I am really
> >>>>> <john> just know it's london
> >>>>> <jane> you are probably asleep
> >>>>> <jane> I hope fish was fishy in a good eay
> >>>>> <jone>
> >>>>> <jane>
> >>>>> <john> British security is a little more rigorous...
> >>>>>
> >>>>> To be clear, then, I'm trying to clean a large text file by writing a regular expression such that I create a new object with no numbers or dates.
> >>>>>
> >>>>> Michael
> >>>>>
> >>>>> ______________________________________________
> >>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >>>>> and provide commented, minimal, self-contained, reproducible code.
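[Editor's note] On David's ".+<" suggestion: replacing the match with "" also drops the "<" that the pattern consumes, so the speaker tag comes out as "john> hey". Anchoring the match, making it non-greedy, and replacing with "<" keeps the tag intact. A sketch on a hypothetical vector `x` built from the sample lines:

```r
x <- c("2016-07-01 02:50:35 <john> hey",
       "2016-07-01 02:51:26 <jane> waiting for plane to Edinburgh",
       "2016-07-01 02:58:56 <jone>")

# Non-greedy "^.+?<" stops at the first "<" (the speaker tag);
# replacing with "<" restores the bracket the match consumed.
stripped <- sub("^.+?<", "<", x, perl = TRUE)
stripped
# [1] "<john> hey"
# [2] "<jane> waiting for plane to Edinburgh"
# [3] "<jone>"
```

The non-greedy quantifier also means a "<" inside a message body can never widen the match past the first tag on the line.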
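[Editor's note] Two closing points on the substitution itself. First, the literal "//1,//2,//3,//4" lines earlier in the thread come from writing //1 in the replacement: backreferences in R must be escaped as "\\1", "\\2", and so on. Second, picking up David's point about the "empty comment" lines: sub() returns non-matching lines unchanged, which is why [10] and [11] keep their original form in his output and why test2[84], with no <...> tag, came through untouched. grepl() on the same pattern can drop those leftovers. A sketch on a small hypothetical vector:

```r
chrvec <- c("2016-07-01 02:50:35 <john> hey",
            "2016-07-01 02:58:56 <jone>",   # no message text
            "2016-07-01 02:59:34 <jane> you are probably asleep")

pat <- "^(.{10}) (.{8}) (<.+>) (.+$)"
newvec <- sub(pat, "\\1,\\2,\\3,\\4", chrvec)

# Lines with no message never match, so sub() left them unchanged;
# keep only the lines that actually matched the four-field pattern.
newvec <- newvec[grepl(pat, chrvec)]
newvec
# [1] "2016-07-01,02:50:35,<john>,hey"
# [2] "2016-07-01,02:59:34,<jane>,you are probably asleep"
```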