Michael Boulineau
2019-May-16 19:30 UTC
[R] how to separate string from numbers in a large txt file
Thanks for this tip on etiquette, David. I will be sure not to do that again.

I tried read.fwf, with code like this:

d <- read.fwf("hangouts-conversation.txt",
  widths = c(10, 10, 20, 40),
  col.names = c("date", "time", "person", "comment"),
  strip.white = TRUE)

But it threw this error:

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
  line 6347 did not have 4 elements

Interestingly, though, the error only happened when I increased the widths. But I had to increase them, or else I couldn't "see" anything: the comment column was so narrow that almost nothing was being captured.

It seems like what's throwing me is that there's no comma that demarcates the end of the text proper. For example:

2016-07-01 15:34:30 <John Doe> Lame. We were in a starbucks2016-07-01 15:35:02 <Jane Doe> Hmm that's interesting2016-07-01 15:35:09 <Jane Doe> You must want coffees2016-07-01 15:35:25 <John Doe> There was lots of Starbucks in my day2016-07-01 15:35:47

It was interesting, too, that when I pasted the text into the email, it self-formatted into the way I wanted it to look. I had to manually make it look like it does above, since that's the way it looks in the txt file. I wonder if it's being organized by XML or something.

Anyway, there's always a space between the two sideways carets, just like there is right now: <John Doe> See. Space. And there's always a space between the date and time. Like this: 2016-07-01 15:34:30 See. Space. But there's never a space between the end of the comment and the next date. Like this: We were in a starbucks2016-07-01 15:35:02 See. starbucks and 2016 are smooshed together.

This code is also on the table right now too.

hangouts.conversation2 <- read.table("E:/working directory/-189/hangouts-conversation2.txt",
  quote = "\"", comment.char = "", fill = TRUE)

h <- cbind(hangouts.conversation2[,1:2], hangouts.conversation2[,3:5], hangouts.conversation2[,6:9])

aa <- gsub("[^[:digit:]]", "", h)
my.data.num <- as.numeric(str_extract(h, "[0-9]+"))  # str_extract() is from the stringr package

Those last lines are a work in progress. I wish I could attach a picture of what it looks like when it's read into a data frame. The fill=TRUE helped to get the data into a table that kind of sort of works, but the comments keep bleeding into the date and time columns. It's like

2016-07-01 15:59:17 <Jane Doe> Seriously I've never been over there
2016-07-01 15:59:27 <Jane Doe> It confuses me :(

And then, maybe, the "Seriously" will be in a column all to itself, as will the "I've" and the "never", etc.

I will use a regular expression if I have to, but it would be nice to keep the dates and times on there. Originally, I thought they were meaningless, but I've since changed my mind on that count. The time of day isn't so important. But, especially since, say, Gmail itself knows how to quickly recognize what it is, I know it can be done. I know this data has structure to it.
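
If the records really are run together with no line break between the end of one comment and the next timestamp, one possible approach is to insert a newline in front of every timestamp and then split on it. This is only a sketch: the names `raw`, `stamp`, `marked`, and `records` are made up for illustration, and it assumes every record starts with a `YYYY-MM-DD HH:MM:SS` stamp and that no comment contains text of that shape. It is also possible that the file does contain line breaks that the viewer simply does not display; in that case readLines() would already return one record per element and no splitting is needed.

```r
# Hypothetical input: three records glued together, as in the example above.
raw <- paste0(
  "2016-07-01 15:34:30 <John Doe> Lame. We were in a starbucks",
  "2016-07-01 15:35:02 <Jane Doe> Hmm that's interesting",
  "2016-07-01 15:35:09 <Jane Doe> You must want coffees"
)

# A timestamp is 4-2-2 digits, a space, then 2:2:2 digits.
stamp <- "([0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2})"

# Put a newline in front of every timestamp, then split on the newlines.
marked  <- gsub(stamp, "\n\\1", raw)
records <- strsplit(marked, "\n", fixed = TRUE)[[1]]

# The very first timestamp also gets a leading newline, which leaves an
# empty first element; drop it.
records <- records[nzchar(records)]

print(records)
```

Inserting a marker and splitting on it sidesteps zero-width matches in strsplit(), which can give surprising results.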

Michael

On Wed, May 15, 2019 at 8:47 PM David Winsemius <dwinsemius at comcast.net> wrote:
>
> On 5/15/19 4:07 PM, Michael Boulineau wrote:
> > I have a wild and crazy text file, the head of which looks like this:
> >
> > 2016-07-01 02:50:35 <john> hey
> > 2016-07-01 02:51:26 <jane> waiting for plane to Edinburgh
> > 2016-07-01 02:51:45 <john> thinking about my boo
> > 2016-07-01 02:52:07 <jane> nothing crappy has happened, not really
> > 2016-07-01 02:52:20 <john> plane went by pretty fast, didn't sleep
> > 2016-07-01 02:54:08 <jane> no idea what time it is or where I am really
> > 2016-07-01 02:54:17 <john> just know it's london
> > 2016-07-01 02:56:44 <jane> you are probably asleep
> > 2016-07-01 02:58:45 <jane> I hope fish was fishy in a good eay
> > 2016-07-01 02:58:56 <jone>
> > 2016-07-01 02:59:34 <jane>
> > 2016-07-01 03:02:48 <john> British security is a little more rigorous...
>
> Looks entirely not-"crazy". Typical log file format.
>
> Two possibilities: 1) Use `read.fwf` from pkg utils; 2) Use regex
> (i.e. the sub-function) to strip everything up to the "<". Read
> `?regex`. Since that's not a metacharacter you could use a pattern
> ".+<" and replace with "".
>
> And do read the Posting Guide. Cross-posting to StackOverflow and Rhelp,
> at least within hours of each, is considered poor manners.
>
> --
> David.
>
> > It goes on for a while. It's a big file. But I feel like it's going to
> > be difficult to annotate with the coreNLP library or package. I'm
> > doing natural language processing.
> > In other words, I'm curious as to
> > how I would shave off the dates, that is, to make it look like:
> >
> > <john> hey
> > <jane> waiting for plane to Edinburgh
> > <john> thinking about my boo
> > <jane> nothing crappy has happened, not really
> > <john> plane went by pretty fast, didn't sleep
> > <jane> no idea what time it is or where I am really
> > <john> just know it's london
> > <jane> you are probably asleep
> > <jane> I hope fish was fishy in a good eay
> > <jone>
> > <jane>
> > <john> British security is a little more rigorous...
> >
> > To be clear, then, I'm trying to clean a large text file by writing a
> > regular expression such that I create a new object with no numbers or
> > dates.
> >
> > Michael
> >
> > ______________________________________________
> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
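
For the original request (dropping the date and time but keeping the speaker tag), a minimal sketch could look like the following. The vector name `chat` and the variable names are hypothetical, and the pattern assumes every line begins with a timestamp of exactly the shape shown in the sample. The second call is there only to show that the suggested ".+<" pattern also consumes the "<" itself.

```r
# Sample lines copied from the head of the file quoted above.
chat <- c(
  "2016-07-01 02:50:35 <john> hey",
  "2016-07-01 02:51:26 <jane> waiting for plane to Edinburgh",
  "2016-07-01 02:58:56 <jone>"
)

# Anchor on the exact timestamp shape so nothing else is removed.
stamp <- "^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} "
no_dates <- sub(stamp, "", chat)

# For comparison: ".+<" removes everything up to and including the "<".
greedy <- sub(".+<", "", chat)

print(no_dates)
print(greedy)
```

Anchoring on the timestamp shape rather than on "<" also leaves lines untouched when they do not start with a timestamp.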
David Winsemius
2019-May-16 20:05 UTC
[R] how to separate string from numbers in a large txt file
On 5/16/19 12:30 PM, Michael Boulineau wrote:
> Thanks for this tip on etiquette, David. I will be sure not to do that again.
>
> I tried read.fwf, with code like this:
>
> d <- read.fwf("hangouts-conversation.txt",
>   widths = c(10, 10, 20, 40),
>   col.names = c("date", "time", "person", "comment"),
>   strip.white = TRUE)
>
> But it threw this error:
>
> Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
>   line 6347 did not have 4 elements

So what does line 6347 look like? (Use `readLines` and print it out.)

> Interestingly, though, the error only happened when I increased the
> width size. But I had to increase the size, or else I couldn't "see"
> anything. The comment was so small that nothing was being captured by
> the size of the column. so to speak.
>
> It seems like what's throwing me is that there's no comma that
> demarcates the end of the text proper. For example:

Not sure why you thought there should be a comma. Lines usually end with a <cr> and/or a <lf>.

Once you have the raw text in a character vector from `readLines` named, say, 'chrvec', then you could selectively substitute commas for spaces with regex. (Now that you no longer desire to remove the dates and times.)

sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)

This will not do any replacements when the pattern is not matched.
See this test:

> newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)
> newvec
 [1] "2016-07-01,02:50:35,<john>,hey"
 [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh"
 [3] "2016-07-01,02:51:45,<john>,thinking about my boo"
 [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened, not really"
 [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, didn't sleep"
 [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or where I am really"
 [7] "2016-07-01,02:54:17,<john>,just know it's london"
 [8] "2016-07-01,02:56:44,<jane>,you are probably asleep"
 [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good eay"
[10] "2016-07-01 02:58:56 <jone>"
[11] "2016-07-01 02:59:34 <jane>"
[12] "2016-07-01,03:02:48,<john>,British security is a little more rigorous..."

You should probably remove the "empty comment" lines.

--
David.

> 2016-07-01 15:34:30 <John Doe> Lame. We were in a starbucks2016-07-01
> 15:35:02 <Jane Doe> Hmm that's interesting2016-07-01 15:35:09 <Jane
> Doe> You must want coffees2016-07-01 15:35:25 <John Doe> There was
> lots of Starbucks in my day2016-07-01 15:35:47
>
> It was interesting, too, when I pasted the text into the email, it
> self-formatted into the way I wanted it to look. I had to manually
> make it look like it does above, since that's the way that it looks in
> the txt file. I wonder if it's being organized by XML or something.
>
> Anyways, There's always a space between the two sideways carets, just
> like there is right now: <John Doe> See. Space. And there's always a
> space between the date and time. Like this. 2016-07-01 15:34:30 See.
> Space. But there's never a space between the end of the comment and
> the next date. Like this: We were in a starbucks2016-07-01 15:35:02
> See. starbucks and 2016 are smooshed together.
>
> This code is also on the table right now too.
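
The two suggestions in this reply (substitute separators between the four fields, and drop the "empty comment" lines) could be combined roughly as follows. This is a sketch under assumptions, not a tested recipe for the full file: it swaps the comma for a tab because several comments in the sample contain commas themselves, and the names `has_comment`, `tabbed`, and `chat` are made up for illustration.

```r
# A few lines from the sample, including one with no comment.
chrvec <- c(
  "2016-07-01 02:52:07 <jane> nothing crappy has happened, not really",
  "2016-07-01 02:58:56 <jone>",
  "2016-07-01 02:52:20 <john> plane went by pretty fast, didn't sleep"
)

pat <- "^(.{10}) (.{8}) (<.+>) (.+$)"

# Lines without a comment do not match the pattern; leave them out.
has_comment <- grepl(pat, chrvec)

# Separate the four captured fields with tabs instead of commas.
tabbed <- sub(pat, "\\1\t\\2\t\\3\t\\4", chrvec[has_comment])

# quote = "" keeps apostrophes such as the one in "didn't" from being
# treated as quote characters; comment.char = "" keeps "#" as plain text.
chat <- read.table(text = tabbed, sep = "\t", quote = "", comment.char = "",
                   col.names = c("date", "time", "person", "comment"),
                   stringsAsFactors = FALSE)

print(chat)
```

A tab is only safer than a comma if the comments contain no tabs, which would need checking on the real file.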
Michael Boulineau
2019-May-16 22:53 UTC
[R] how to separate string from numbers in a large txt file
OK. So, I named the object test and then checked the 6347th item:

> test <- readLines("hangouts-conversation.txt")
> test[6347]
[1] "2016-10-21 10:56:37 <John Doe> Admit#8242"

Perhaps where it was getting screwed up is that, since the end of this is a number (8242), and there's no space between the number and what ought to be the next row, R didn't know where to draw the line. Sure enough, it looks like this when I go to the original file and Ctrl+F "#8242":

2016-10-21 10:35:36 <Jane Doe> What's your login
2016-10-21 10:56:29 <John Doe> John_Doe
2016-10-21 10:56:37 <John Doe> Admit#8242
2016-10-21 11:00:13 <Jane Doe> Okay so you have a discussion

Again, it doesn't look like that in the file. Gmail automatically formats it like that when I paste it in. More to the point, it looks like

2016-10-21 10:35:36 <Jane Doe> What's your login2016-10-21 10:56:29 <John Doe> John_Doe2016-10-21 10:56:37 <John Doe> Admit#82422016-10-21 11:00:13 <Jane Doe> Okay so you have a discussion

Notice Admit#82422016. So there's that.

Then I built object test2.

test2 <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", test)

This worked for 84 lines, then this happened.

> test2[84]
[1] "2016-06-28 21:12:43 *** John Doe ended a video chat"
> test2[85]
[1] "//1,//2,//3,//4"
> test[85]
[1] "2016-07-01 02:50:35 <John Doe> hey"

Notice how I toggled back and forth between test and test2 there. So, whatever happened with the regex, it happened in the switch from 84 to 85, I guess. It went on like

 [990] "//1,//2,//3,//4"
 [991] "//1,//2,//3,//4"
 [992] "//1,//2,//3,//4"
 [993] "//1,//2,//3,//4"
 [994] "//1,//2,//3,//4"
 [995] "//1,//2,//3,//4"
 [996] "//1,//2,//3,//4"
 [997] "//1,//2,//3,//4"
 [998] "//1,//2,//3,//4"
 [999] "//1,//2,//3,//4"
[1000] "//1,//2,//3,//4"

up until line 1000, then I reached max.print.

Michael

On Thu, May 16, 2019 at 1:05 PM David Winsemius <dwinsemius at comcast.net> wrote:
>
> On 5/16/19 12:30 PM, Michael Boulineau wrote:
> > Thanks for this tip on etiquette, David.
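
The all-identical "//1,//2,//3,//4" results reported in this message are what sub() produces when the replacement is written with forward slashes: "//1" is ordinary text, whereas a backreference to a captured group is written "\\1" in an R string. The short comparison below uses lines 84 and 85 as printed in the message; the variable names are made up for illustration. It also suggests why the first 84 lines looked fine: a line that does not match the pattern is returned unchanged, whatever the replacement says.

```r
pat <- "^(.{10}) (.{8}) (<.+>) (.+$)"

line85 <- "2016-07-01 02:50:35 <John Doe> hey"
line84 <- "2016-06-28 21:12:43 *** John Doe ended a video chat"

# Forward slashes: the whole matched line is replaced by this literal text.
wrong <- sub(pat, "//1,//2,//3,//4", line85)

# Doubled backslashes: each \\N is filled in with the N-th captured group.
right <- sub(pat, "\\1,\\2,\\3,\\4", line85)

# No "<...>" speaker tag, so the pattern does not match and nothing changes.
unmatched <- sub(pat, "\\1,\\2,\\3,\\4", line84)

print(wrong)
print(right)
print(unmatched)
```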