Michael Boulineau
2019-May-15 23:07 UTC
[R] how to separate string from numbers in a large txt file
I have a wild and crazy text file, the head of which looks like this:

2016-07-01 02:50:35 <john> hey
2016-07-01 02:51:26 <jane> waiting for plane to Edinburgh
2016-07-01 02:51:45 <john> thinking about my boo
2016-07-01 02:52:07 <jane> nothing crappy has happened, not really
2016-07-01 02:52:20 <john> plane went by pretty fast, didn't sleep
2016-07-01 02:54:08 <jane> no idea what time it is or where I am really
2016-07-01 02:54:17 <john> just know it's london
2016-07-01 02:56:44 <jane> you are probably asleep
2016-07-01 02:58:45 <jane> I hope fish was fishy in a good eay
2016-07-01 02:58:56 <jone> ?
2016-07-01 02:59:34 <jane> ???
2016-07-01 03:02:48 <john> British security is a little more rigorous...

It goes on for a while. It's a big file. But I feel like it's going to be difficult to annotate with the coreNLP library or package. I'm doing natural language processing. In other words, I'm curious as to how I would shave off the dates, that is, to make it look like:

<john> hey
<jane> waiting for plane to Edinburgh
<john> thinking about my boo
<jane> nothing crappy has happened, not really
<john> plane went by pretty fast, didn't sleep
<jane> no idea what time it is or where I am really
<john> just know it's london
<jane> you are probably asleep
<jane> I hope fish was fishy in a good eay
<jone> ?
<jane> ???
<john> British security is a little more rigorous...

To be clear, then, I'm trying to clean a large text file by writing a regular expression, such that I create a new object with no numbers or dates.

Michael
David Winsemius
2019-May-16 03:47 UTC
[R] how to separate string from numbers in a large txt file
On 5/15/19 4:07 PM, Michael Boulineau wrote:
> I have a wild and crazy text file, the head of which looks like this:
>
> 2016-07-01 02:50:35 <john> hey
> 2016-07-01 02:51:26 <jane> waiting for plane to Edinburgh
> 2016-07-01 02:51:45 <john> thinking about my boo
> 2016-07-01 02:52:07 <jane> nothing crappy has happened, not really
> 2016-07-01 02:52:20 <john> plane went by pretty fast, didn't sleep
> 2016-07-01 02:54:08 <jane> no idea what time it is or where I am really
> 2016-07-01 02:54:17 <john> just know it's london
> 2016-07-01 02:56:44 <jane> you are probably asleep
> 2016-07-01 02:58:45 <jane> I hope fish was fishy in a good eay
> 2016-07-01 02:58:56 <jone> ?
> 2016-07-01 02:59:34 <jane> ???
> 2016-07-01 03:02:48 <john> British security is a little more rigorous...

Looks entirely not-"crazy". Typical log file format.

Two possibilities: 1) Use `read.fwf` (it is in the utils package, which ships with R, not in foreign); 2) Use regex (i.e. the `sub` function) to strip everything up to, but not including, the "<". Read `?regex`. Since "<" is not a metacharacter, you could use the pattern "^[^<]+" and replace with "". (A pattern such as ".+<" would also remove the "<" itself, and being greedy it would run to the last "<" on the line.)

And do read the Posting Guide. Cross-posting to StackOverflow and R-help, at least within hours of each other, is considered poor manners.

--
David.

______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
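For illustration, the regex route suggested in this reply might look like the sketch below. It assumes every record sits on its own line and starts with a "YYYY-MM-DD HH:MM:SS " stamp; the object names (`log_lines`, `stamp`, `stripped`) are made up for the example, and the three sample lines are copied from the original post.

```r
# Sample lines copied from the original post. For the real file one would
# presumably start from something like log_lines <- readLines("your-file.txt")
# (hypothetical file name).
log_lines <- c(
  "2016-07-01 02:50:35 <john> hey",
  "2016-07-01 02:51:26 <jane> waiting for plane to Edinburgh",
  "2016-07-01 03:02:48 <john> British security is a little more rigorous..."
)

# Match only a leading date-time stamp plus the single space after it,
# so digits inside the comments themselves are left alone.
stamp <- "^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} "

# sub() replaces the first match in each element; the anchor "^" keeps it
# at the start of the line.
stripped <- sub(stamp, "", log_lines)
print(stripped)
```

Anchoring on the stamp itself, rather than on "everything before the first <", leaves a line unchanged if it does not begin with a date, which makes malformed records easier to spot afterwards.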
Michael Boulineau
2019-May-16 19:30 UTC
[R] how to separate string from numbers in a large txt file
Thanks for this tip on etiquette, David. I will be sure not to do that again.

I tried read.fwf, with code like this:

d <- read.fwf("hangouts-conversation.txt",
              widths = c(10, 10, 20, 40),
              col.names = c("date", "time", "person", "comment"),
              strip.white = TRUE)

But it threw this error:

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
  line 6347 did not have 4 elements

Interestingly, though, the error only happened when I increased the widths. But I had to increase them, or else I couldn't "see" anything: the comment column was so narrow that almost nothing was being captured, so to speak.

It seems like what's throwing me is that there's no delimiter (a comma, say) that demarcates the end of the text proper. For example:

2016-07-01 15:34:30 <John Doe> Lame. We were in a starbucks2016-07-01 15:35:02 <Jane Doe> Hmm that's interesting2016-07-01 15:35:09 <Jane Doe> You must want coffees2016-07-01 15:35:25 <John Doe> There was lots of Starbucks in my day2016-07-01 15:35:47

It was interesting, too, that when I pasted the text into the email, it self-formatted into the way I wanted it to look. I had to manually make it look like it does above, since that's the way it looks in the txt file. I wonder if it's being organized by XML or something.

Anyway, there's always a space between the two names inside the angle brackets, just like there is right now: <John Doe> See. Space. And there's always a space between the date and the time. Like this: 2016-07-01 15:34:30 See. Space. But there's never a space between the end of the comment and the next date. Like this: We were in a starbucks2016-07-01 15:35:02 See. "starbucks" and "2016" are smooshed together.

This code is also on the table right now:

library(stringr)  # for str_extract()
# Read with fill=TRUE so that short rows are padded instead of raising an error.
hangouts.conversation2 <- read.table("E:/working directory/-189/hangouts-conversation2.txt",
                                     quote = "\"", comment.char = "", fill = TRUE)
h <- cbind(hangouts.conversation2[, 1:2], hangouts.conversation2[, 3:5], hangouts.conversation2[, 6:9])
# gsub() and str_extract() want character input, so convert the data frame first.
aa <- gsub("[^[:digit:]]", "", as.matrix(h))
my.data.num <- as.numeric(str_extract(as.matrix(h), "[0-9]+"))

Those last lines are a work in progress. I wish I could attach a picture of what it looks like when it's translated into a data frame. The fill=TRUE helped to get the data into a table that kind of sort of works, but the comments keep bleeding into the date and time columns. It's like

2016-07-01 15:59:17 <Jane Doe> Seriously I've never been over there
2016-07-01 15:59:27 <Jane Doe> It confuses me :(

And then, maybe, the "Seriously" will be in a column all to itself, as will the "I've" and the "never", etc.

I will use a regular expression if I have to, but it would be nice to keep the dates and times on there. Originally, I thought they were meaningless, but I've since changed my mind on that count. The time of day isn't so important. But I know this data has structure to it, and, especially since Gmail itself knows how to quickly recognize what it is, I know it can be done.
Michael
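One possible way to attack the "smooshed" records described in this message, while keeping the dates and times, is sketched below. It is only an illustration, not something proposed in the thread: it assumes every record starts with a date-time stamp followed by " <name>", and that no comment contains text of that same shape. The object names (`raw`, `records`, `chat`) are invented for the example; the sample text is taken from the message above.

```r
# One run-together string, built from the example in the message above.
# For the real file, something like
#   raw <- paste(readLines("hangouts-conversation.txt"), collapse = "")
# might serve as a starting point (an assumption, untested on that file).
raw <- paste0(
  "2016-07-01 15:34:30 <John Doe> Lame. We were in a starbucks",
  "2016-07-01 15:35:02 <Jane Doe> Hmm that's interesting",
  "2016-07-01 15:35:09 <Jane Doe> You must want coffees"
)

date_re <- "[0-9]{4}-[0-9]{2}-[0-9]{2}"
time_re <- "[0-9]{2}:[0-9]{2}:[0-9]{2}"

# Step 1: put a line break in front of every "date time <" and split there.
starts  <- paste0("(", date_re, " ", time_re, " <)")
broken  <- gsub(starts, "\n\\1", raw)
records <- strsplit(broken, "\n", fixed = TRUE)[[1]]
records <- records[nzchar(records)]   # drop the empty piece before record 1

# Step 2: capture date, time, person and comment from each record.
fields <- paste0("^(", date_re, ") (", time_re, ") <([^>]*)> ?(.*)$")
parts  <- do.call(rbind, regmatches(records, regexec(fields, records)))

# Column 1 of 'parts' is the whole match; columns 2 to 5 are the groups.
chat <- data.frame(
  date    = as.Date(parts[, 2]),
  time    = parts[, 3],
  person  = parts[, 4],
  comment = parts[, 5],
  stringsAsFactors = FALSE
)
print(chat)
```

Because the comment is captured as a single group, words such as "Seriously" no longer spill into separate columns, and names containing a space ("John Doe") stay together. Records that do not match the pattern are silently dropped by this sketch, so comparing `length(records)` with `nrow(chat)` is a cheap sanity check.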