William Dunlap
2019-May-17 15:20 UTC
[R] how to separate string from numbers in a large txt file
Consider using readLines() and strcapture() for reading such a file. E.g., suppose readLines(files) produced a character vector like

x <- c("2016-10-21 10:35:36 <Jane Doe> What's your login",
       "2016-10-21 10:56:29 <John Doe> John_Doe",
       "2016-10-21 10:56:37 <John Doe> Admit#8242",
       "October 23, 1819 12:34 <Jane Eyre> I am not an angel")

Then you can make a data.frame with columns When, Who, and What by supplying a pattern containing three parenthesized capture expressions:

> z <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)",
+                 x, proto=data.frame(stringsAsFactors=FALSE, When="", Who="", What=""))
> str(z)
'data.frame': 4 obs. of 3 variables:
 $ When: chr "2016-10-21 10:35:36" "2016-10-21 10:56:29" "2016-10-21 10:56:37" NA
 $ Who : chr "<Jane Doe>" "<John Doe>" "<John Doe>" NA
 $ What: chr "What's your login" "John_Doe" "Admit#8242" NA

Lines that don't match the pattern result in NA's - you might make a second pass over the corresponding elements of x with a new pattern.

You can convert the When column from character to time with as.POSIXct().

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Thu, May 16, 2019 at 8:30 PM David Winsemius <dwinsemius at comcast.net> wrote:

>
> On 5/16/19 3:53 PM, Michael Boulineau wrote:
> > OK. So, I named the object test and then checked the 6347th item
> >
> >> test <- readLines ("hangouts-conversation.txt")
> >> test [6347]
> > [1] "2016-10-21 10:56:37 <John Doe> Admit#8242"
> >
> > Perhaps where it was getting screwed up is, since the end of this is a
> > number (8242), then, given that there's no space between the number
> > and what ought to be the next row, R didn't know where to draw the
> > line.
> > Sure enough, it looks like this when I go to the original file
> > and control f "#8242"
> >
> > 2016-10-21 10:35:36 <Jane Doe> What's your login
> > 2016-10-21 10:56:29 <John Doe> John_Doe
> > 2016-10-21 10:56:37 <John Doe> Admit#8242
> >
>
> An octothorpe is the default comment character in `read.table` (and so
> in `read.fwf`), so the rest of the line after it is dropped. You can
> prevent that interpretation with comment.char="". I don't understand why
> that should cause any error or a failure to match that pattern.
>
> > 2016-10-21 11:00:13 <Jane Doe> Okay so you have a discussion
> >
> > Again, it doesn't look like that in the file. Gmail automatically
> > formats it like that when I paste it in. More to the point, it looks
> > like
> >
> > 2016-10-21 10:35:36 <Jane Doe> What's your login2016-10-21 10:56:29
> > <John Doe> John_Doe2016-10-21 10:56:37 <John Doe> Admit#82422016-10-21
> > 11:00:13 <Jane Doe> Okay so you have a discussion
> >
> > Notice Admit#82422016. So there's that.
> >
> > Then I built object test2.
> >
> > test2 <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", test)
> >
> > This worked for 84 lines, then this happened.
>
> It may have done something but as you later discovered my first code for
> the pattern was incorrect. I had tested it (and pasted in the results of
> the test). The way to refer to a capture group is with back-slashes
> before the numbers, not forward-slashes.
> Try this:
>
> > newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)
> > newvec
>  [1] "2016-07-01,02:50:35,<john>,hey"
>  [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh"
>  [3] "2016-07-01,02:51:45,<john>,thinking about my boo"
>  [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened, not really"
>  [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, didn't sleep"
>  [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or where I am really"
>  [7] "2016-07-01,02:54:17,<john>,just know it's london"
>  [8] "2016-07-01,02:56:44,<jane>,you are probably asleep"
>  [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good eay"
> [10] "2016-07-01 02:58:56 <jone>"
> [11] "2016-07-01 02:59:34 <jane>"
> [12] "2016-07-01,03:02:48,<john>,British security is a little more rigorous..."
>
> I made note of the fact that the 10th and 11th lines had no commas.
>
> >
> >> test2 [84]
> > [1] "2016-06-28 21:12:43 *** John Doe ended a video chat"
>
> That line didn't have any "<" so wasn't matched.
>
> You could remove all non-matching lines for pattern of
>
> dates<space>times<space>"<"<name>">"<space><anything>
>
> with:
>
> chrvec <- chrvec[ grepl("^.{10} .{8} <.+> .+$", chrvec)]
>
> Do read:
>
> ?read.csv
>
> ?regex
>
> --
>
> David
>
> >> test2 [85]
> > [1] "//1,//2,//3,//4"
> >> test [85]
> > [1] "2016-07-01 02:50:35 <John Doe> hey"
> >
> > Notice how I toggled back and forth between test and test2 there. So,
> > whatever happened with the regex, it happened in the switch from 84 to
> > 85, I guess. It went on like
> >
> > [990] "//1,//2,//3,//4"
> > [991] "//1,//2,//3,//4"
> > [992] "//1,//2,//3,//4"
> > [993] "//1,//2,//3,//4"
> > [994] "//1,//2,//3,//4"
> > [995] "//1,//2,//3,//4"
> > [996] "//1,//2,//3,//4"
> > [997] "//1,//2,//3,//4"
> > [998] "//1,//2,//3,//4"
> > [999] "//1,//2,//3,//4"
> > [1000] "//1,//2,//3,//4"
> >
> > up until line 1000, then I reached max.print.
> >
> > Michael
> >
> > On Thu, May 16, 2019 at 1:05 PM David Winsemius <dwinsemius at comcast.net> wrote:
> >>
> >> On 5/16/19 12:30 PM, Michael Boulineau wrote:
> >>> Thanks for this tip on etiquette, David. I will be sure not to do that again.
> >>>
> >>> I tried read.fwf from the utils package, with code like this:
> >>>
> >>> d <- read.fwf("hangouts-conversation.txt",
> >>>               widths= c(10,10,20,40),
> >>>               col.names=c("date","time","person","comment"),
> >>>               strip.white=TRUE)
> >>>
> >>> But it threw this error:
> >>>
> >>> Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
> >>>   line 6347 did not have 4 elements
> >>
> >> So what does line 6347 look like? (Use `readLines` and print it out.)
> >>
> >>> Interestingly, though, the error only happened when I increased the
> >>> width size. But I had to increase the size, or else I couldn't "see"
> >>> anything. The comment was so small that nothing was being captured by
> >>> the size of the column, so to speak.
> >>>
> >>> It seems like what's throwing me is that there's no comma that
> >>> demarcates the end of the text proper. For example:
> >> Not sure why you thought there should be a comma. Lines usually end
> >> with <cr> and/or <lf>.
> >>
> >>
> >> Once you have the raw text in a character vector from `readLines` named,
> >> say, 'chrvec', then you could selectively substitute commas for spaces
> >> with regex. (Now that you no longer desire to remove the dates and times.)
> >>
> >> sub("^(.{10}) (.{8}) (<.+>) (.+$)", "//1,//2,//3,//4", chrvec)
> >>
> >> This will not do any replacements when the pattern is not matched.
> >> See this test:
> >>
> >> > newvec <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", chrvec)
> >> > newvec
> >>  [1] "2016-07-01,02:50:35,<john>,hey"
> >>  [2] "2016-07-01,02:51:26,<jane>,waiting for plane to Edinburgh"
> >>  [3] "2016-07-01,02:51:45,<john>,thinking about my boo"
> >>  [4] "2016-07-01,02:52:07,<jane>,nothing crappy has happened, not really"
> >>  [5] "2016-07-01,02:52:20,<john>,plane went by pretty fast, didn't sleep"
> >>  [6] "2016-07-01,02:54:08,<jane>,no idea what time it is or where I am really"
> >>  [7] "2016-07-01,02:54:17,<john>,just know it's london"
> >>  [8] "2016-07-01,02:56:44,<jane>,you are probably asleep"
> >>  [9] "2016-07-01,02:58:45,<jane>,I hope fish was fishy in a good eay"
> >> [10] "2016-07-01 02:58:56 <jone>"
> >> [11] "2016-07-01 02:59:34 <jane>"
> >> [12] "2016-07-01,03:02:48,<john>,British security is a little more rigorous..."
> >>
> >>
> >> You should probably remove the "empty comment" lines.
> >>
> >> --
> >>
> >> David.
> >>
> >>> 2016-07-01 15:34:30 <John Doe> Lame. We were in a starbucks2016-07-01
> >>> 15:35:02 <Jane Doe> Hmm that's interesting2016-07-01 15:35:09 <Jane
> >>> Doe> You must want coffees2016-07-01 15:35:25 <John Doe> There was
> >>> lots of Starbucks in my day2016-07-01 15:35:47
> >>>
> >>> It was interesting, too, when I pasted the text into the email, it
> >>> self-formatted into the way I wanted it to look. I had to manually
> >>> make it look like it does above, since that's the way that it looks in
> >>> the txt file. I wonder if it's being organized by XML or something.
> >>>
> >>> Anyways, there's always a space between the two sideways carets, just
> >>> like there is right now: <John Doe> See. Space. And there's always a
> >>> space between the date and time. Like this. 2016-07-01 15:34:30 See.
> >>> Space. But there's never a space between the end of the comment and
> >>> the next date. Like this: We were in a starbucks2016-07-01 15:35:02
> >>> See.
> >>> starbucks and 2016 are smooshed together.
> >>>
> >>> This code is also on the table right now too.
> >>>
> >>> a <- read.table("E:/working directory/-189/hangouts-conversation2.txt", quote="\"",
> >>>                 comment.char="", fill=TRUE)
> >>>
> >>> h<-cbind(hangouts.conversation2[,1:2],hangouts.conversation2[,3:5],hangouts.conversation2[,6:9])
> >>>
> >>> aa<-gsub("[^[:digit:]]","",h)
> >>> my.data.num <- as.numeric(str_extract(h, "[0-9]+"))
> >>>
> >>> Those last lines are a work in progress. I wish I could import a
> >>> picture of what it looks like when it's translated into a data frame.
> >>> The fill=TRUE helped to get the data in table that kind of sort of
> >>> works, but the comments keep bleeding into the date and time column.
> >>> It's like
> >>>
> >>> 2016-07-01 15:59:17 <Jane Doe> Seriously I've never been
> >>> over there
> >>> 2016-07-01 15:59:27 <Jane Doe> It confuses me :(
> >>>
> >>> And then, maybe, the "seriously" will be in a column all to itself, as
> >>> will be the "I've" and the "never" etc.
> >>>
> >>> I will use a regular expression if I have to, but it would be nice to
> >>> keep the dates and times on there. Originally, I thought they were
> >>> meaningless, but I've since changed my mind on that count. The time of
> >>> day isn't so important. But, especially since, say, Gmail itself knows
> >>> how to quickly recognize what it is, I know it can be done. I know
> >>> this data has structure to it.
> >>>
> >>> Michael
> >>>
> >>> On Wed, May 15, 2019 at 8:47 PM David Winsemius <dwinsemius at comcast.net> wrote:
> >>>> On 5/15/19 4:07 PM, Michael Boulineau wrote:
> >>>>> I have a wild and crazy text file, the head of which looks like this:
> >>>>>
> >>>>> 2016-07-01 02:50:35 <john> hey
> >>>>> 2016-07-01 02:51:26 <jane> waiting for plane to Edinburgh
> >>>>> 2016-07-01 02:51:45 <john> thinking about my boo
> >>>>> 2016-07-01 02:52:07 <jane> nothing crappy has happened, not really
> >>>>> 2016-07-01 02:52:20 <john> plane went by pretty fast, didn't sleep
> >>>>> 2016-07-01 02:54:08 <jane> no idea what time it is or where I am really
> >>>>> 2016-07-01 02:54:17 <john> just know it's london
> >>>>> 2016-07-01 02:56:44 <jane> you are probably asleep
> >>>>> 2016-07-01 02:58:45 <jane> I hope fish was fishy in a good eay
> >>>>> 2016-07-01 02:58:56 <jone>
> >>>>> 2016-07-01 02:59:34 <jane>
> >>>>> 2016-07-01 03:02:48 <john> British security is a little more rigorous...
> >>>> Looks entirely not-"crazy". Typical log file format.
> >>>>
> >>>> Two possibilities: 1) Use `read.fwf` from pkg utils; 2) Use regex
> >>>> (i.e. the sub-function) to strip everything up to the "<". Read
> >>>> `?regex`. Since that's not a metacharacter you could use a pattern
> >>>> "^[^<]+<" and replace with "<" (so the bracket itself is kept).
> >>>>
> >>>> And do read the Posting Guide. Cross-posting to StackOverflow and Rhelp,
> >>>> at least within hours of each, is considered poor manners.
> >>>>
> >>>> --
> >>>>
> >>>> David.
> >>>>
> >>>>> It goes on for a while. It's a big file. But I feel like it's going to
> >>>>> be difficult to annotate with the coreNLP library or package. I'm
> >>>>> doing natural language processing.
> >>>>> In other words, I'm curious as to
> >>>>> how I would shave off the dates, that is, to make it look like:
> >>>>>
> >>>>> <john> hey
> >>>>> <jane> waiting for plane to Edinburgh
> >>>>> <john> thinking about my boo
> >>>>> <jane> nothing crappy has happened, not really
> >>>>> <john> plane went by pretty fast, didn't sleep
> >>>>> <jane> no idea what time it is or where I am really
> >>>>> <john> just know it's london
> >>>>> <jane> you are probably asleep
> >>>>> <jane> I hope fish was fishy in a good eay
> >>>>> <jone>
> >>>>> <jane>
> >>>>> <john> British security is a little more rigorous...
> >>>>>
> >>>>> To be clear, then, I'm trying to clean a large text file by writing a
> >>>>> regular expression such that I create a new object with no numbers or
> >>>>> dates.
> >>>>>
> >>>>> Michael
> >>>>>
> >>>>> ______________________________________________
> >>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >>>>> and provide commented, minimal, self-contained, reproducible code.
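The two follow-up steps mentioned in the first message of this post (converting When with as.POSIXct() and finding the lines that need a second pass) could look roughly like the sketch below. It reuses the sample vector x from that message; the names pattern and unmatched, the use of paste0() to keep the pattern readable, and the choice of UTC as the time zone are assumptions made for illustration, not something the thread specifies.

```r
# Sample lines copied from the message above.
x <- c("2016-10-21 10:35:36 <Jane Doe> What's your login",
       "2016-10-21 10:56:29 <John Doe> John_Doe",
       "2016-10-21 10:56:37 <John Doe> Admit#8242",
       "October 23, 1819 12:34 <Jane Eyre> I am not an angel")

# paste0() is used only to keep the source lines short; the resulting
# pattern is a single line with one space between the date and the time.
pattern <- paste0("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} ",
                  "[[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2})",
                  " +(<[^>]*>) *(.*$)")

z <- strcapture(pattern, x,
                proto = data.frame(When = "", Who = "", What = "",
                                   stringsAsFactors = FALSE))

# Assumption: the timestamps are treated as UTC (the thread gives no time zone).
z$When <- as.POSIXct(z$When, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Rows that did not match come back as NA; these are the candidates
# for a second pass with a different pattern.
unmatched <- which(is.na(z$Who))
x[unmatched]
```

Keeping the indices in unmatched, rather than dropping the NA rows right away, makes it possible to go back to the original text of those lines.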
Michael Boulineau
2019-May-17 18:36 UTC
[R] how to separate string from numbers in a large txt file
This seemed to work:

> a <- readLines ("hangouts-conversation-6.csv.txt")
> b <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", a)
> b [1:84]

And the first 84 lines end like this:

[83] "2016-06-28 21:02:28 *** Jane Doe started a video chat"
[84] "2016-06-28 21:12:43 *** John Doe ended a video chat"

Then they transition to the commas:

> b [84:100]
[1] "2016-06-28 21:12:43 *** John Doe ended a video chat"
[2] "2016-07-01,02:50:35,<John Doe>,hey"
[3] "2016-07-01,02:51:26,<John Doe>,waiting for plane to Edinburgh"
[4] "2016-07-01,02:51:45,<John Doe>,thinking about my boo"

Even the strange bit on line 6347 was caught by this:

> b [6346:6348]
[1] "2016-10-21,10:56:29,<John Doe>,John_Doe"
[2] "2016-10-21,10:56:37,<John Doe>,Admit#8242"
[3] "2016-10-21,11:00:13,<Jane Doe>,Okay so you have a discussion"

Perhaps most awesomely, the code catches spaces that are interposed into the comment itself:

> b [4]
[1] "2016-01-27,09:15:20,<Jane Doe>,Hey "
> b [85]
[1] "2016-07-01,02:50:35,<John Doe>,hey"

Notice the space after "Hey" in the first result and the lack of one after "hey" in the second.

These are the first two lines:

[1] "???2016-01-27 09:14:40 *** Jane Doe started a video chat"
[2] "2016-01-27,09:15:20,<Jane Doe>,https://lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/GVqutPqbSuo/s0/be8ded30-87a6-4e80-bdfa-83ed51591dbf"

So, who knows what happened with the ??? at the beginning of [1] directly above. But notice how there are no commas in [1] but they appear in [2]. I don't see why really long ones like [2] directly above would be a problem, were they to be translated into a csv or data frame column.

Now, with the commas in there, couldn't we write this into a csv or a data.frame? Some of this data will end up being garbage, I imagine. Like in [2] directly above. Or with [83] and [84] at the top of this discussion post/email. Embarrassingly, I've been trying to convert this into a data.frame or csv but I can't manage to. I've been using the write.csv function, but I don't think I've been getting the arguments correct.

At the end of the day, I would like a data.frame and/or csv with the following four columns: date, time, person, comment.

I tried this, too:

> c <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}
+ [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)",
+ a, proto=data.frame(stringsAsFactors=FALSE, When="", Who="",
+ What=""))

But all I got was this:

> c [1:100, ]
  When  Who What
1 <NA> <NA> <NA>
2 <NA> <NA> <NA>
3 <NA> <NA> <NA>
4 <NA> <NA> <NA>
5 <NA> <NA> <NA>
6 <NA> <NA> <NA>

It seems to have caught nothing.

> unique (c)
  When  Who What
1 <NA> <NA> <NA>

But I like that it converted into columns. That's a really great format. With a little tweaking, it'd be great code for this data set.

Michael
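A minimal sketch of one way to get the four columns asked for above (date, time, person, comment) into a data.frame and a CSV file. The sample vector a, the object names chat, proto, broken and out, and the temporary output file are all made up for illustration. The sketch also includes a guess about the all-NA result: the `+` continuation prompts in the strcapture() attempt suggest the pattern string may have contained a line break where a space was needed between the date and time parts. That is only an inference from the transcript, so the last lines simply show that such a pattern matches nothing.

```r
# Hypothetical sample in the same shape as the lines quoted in the thread.
a <- c("2016-06-28 21:12:43 *** John Doe ended a video chat",
       "2016-07-01 02:50:35 <John Doe> hey",
       "2016-07-01 02:51:26 <John Doe> waiting for plane to Edinburgh",
       "2016-10-21 10:56:37 <John Doe> Admit#8242")

# One capture group per desired column. paste0() joins the pieces into a
# single-line pattern, so no newline ends up inside the string.
pattern <- paste0("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}) ",
                  "([[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) ",
                  "<([^>]*)> *(.*)$")

proto <- data.frame(date = "", time = "", person = "", comment = "",
                    stringsAsFactors = FALSE)
chat <- strcapture(pattern, a, proto = proto)

# Drop the lines that did not match (here, the "***" status line).
chat <- chat[!is.na(chat$date), ]

# write.csv() quotes character fields, so commas inside comments are safe.
out <- tempfile(fileext = ".csv")   # hypothetical output location
write.csv(chat, out, row.names = FALSE)

# For contrast: replace the space between the date and time groups with a
# newline. This pattern cannot match lines that contain a space there.
broken <- sub(") (", ")\n(", pattern, fixed = TRUE)
all(is.na(strcapture(broken, a, proto = proto)$date))
```

Capturing the name without the angle brackets is a design choice here; moving the brackets back inside the parentheses keeps them, as in the pattern from the earlier post.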
William Dunlap
2019-May-17 19:12 UTC
[R] how to separate string from numbers in a large txt file
The pattern I gave worked for the lines that you originally showed from the
data file ('a'), before you put commas into them. If the name is either of
the form "<name>" or "***", then the "(<[^>]*>)" needs to be changed to
something like "(<[^>]*>|[*]{3})".

The "ï»¿" at the start of the imported data may come from the byte order mark
that Windows apps like to put at the front of a text file in UTF-8 or UTF-16
format.

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Fri, May 17, 2019 at 11:53 AM Michael Boulineau <michael.p.boulineau at gmail.com> wrote:

> This seemed to work:
>
> > a <- readLines ("hangouts-conversation-6.csv.txt")
> > b <- sub("^(.{10}) (.{8}) (<.+>) (.+$)", "\\1,\\2,\\3,\\4", a)
> > b [1:84]
>
> And the first 85 lines looks like this:
>
> [83] "2016-06-28 21:02:28 *** Jane Doe started a video chat"
> [84] "2016-06-28 21:12:43 *** John Doe ended a video chat"
>
> Then they transition to the commas:
>
> > b [84:100]
> [1] "2016-06-28 21:12:43 *** John Doe ended a video chat"
> [2] "2016-07-01,02:50:35,<John Doe>,hey"
> [3] "2016-07-01,02:51:26,<John Doe>,waiting for plane to Edinburgh"
> [4] "2016-07-01,02:51:45,<John Doe>,thinking about my boo"
>
> Even the strange bit on line 6347 was caught by this:
>
> > b [6346:6348]
> [1] "2016-10-21,10:56:29,<John Doe>,John_Doe"
> [2] "2016-10-21,10:56:37,<John Doe>,Admit#8242"
> [3] "2016-10-21,11:00:13,<Jane Doe>,Okay so you have a discussion"
>
> Perhaps most awesomely, the code catches spaces that are interposed
> into the comment itself:
>
> > b [4]
> [1] "2016-01-27,09:15:20,<Jane Doe>,Hey "
> > b [85]
> [1] "2016-07-01,02:50:35,<John Doe>,hey"
>
> Notice whether there is a space after the "hey" or not.
>
> These are the first two lines:
>
> [1] "ï»¿2016-01-27 09:14:40 *** Jane Doe started a video chat"
> [2] "2016-01-27,09:15:20,<Jane Doe>,
> https://lh3.googleusercontent.com/-_WQF5kRcnpk/Vqj7J4aK1jI/AAAAAAAAAVA/GVqutPqbSuo/s0/be8ded30-87a6-4e80-bdfa-83ed51591dbf
> "
>
> So, who knows what happened with the ï»¿ at the beginning of [1]
> directly above. But notice how there are no commas in [1] but there
> appear in [2]. I don't see why really long ones like [2] directly
> above would be a problem, were they to be translated into a csv or
> data frame column.
>
> Now, with the commas in there, couldn't we write this into a csv or a
> data.frame? Some of this data will end up being garbage, I imagine.
> Like in [2] directly above. Or with [83] and [84] at the top of this
> discussion post/email. Embarrassingly, I've been trying to convert
> this into a data.frame or csv but I can't manage to. I've been using
> the write.csv function, but I don't think I've been getting the
> arguments correct.
>
> At the end of the day, I would like a data.frame and/or csv with the
> following four columns: date, time, person, comment.
>
> I tried this, too:
>
> > c <- strcapture("^([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}
> + [[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}) +(<[^>]*>) *(.*$)",
> + a, proto=data.frame(stringsAsFactors=FALSE, When="", Who="",
> + What=""))
>
> But all I got was this:
>
> > c [1:100, ]
>   When  Who What
> 1 <NA> <NA> <NA>
> 2 <NA> <NA> <NA>
> 3 <NA> <NA> <NA>
> 4 <NA> <NA> <NA>
> 5 <NA> <NA> <NA>
> 6 <NA> <NA> <NA>
>
> It seems to have caught nothing.
>
> > unique (c)
>   When  Who What
> 1 <NA> <NA> <NA>
>
> But I like that it converted into columns. That's a really great
> format. With a little tweaking, it'd be a great code for this data
> set.
>
> Michael
>
> On Fri, May 17, 2019 at 8:20 AM William Dunlap via R-help
> <r-help at r-project.org> wrote:
> >
> > Consider using readLines() and strcapture() for reading such a file.
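For illustration, the pattern change described in this reply could be tried along the following lines. This is only a sketch under stated assumptions: the sample vector is modelled on lines quoted in the thread rather than read from the real file, the date and time are split into two captures to get the four columns Michael asked for (date, time, person, comment), and tz = "UTC" is a guess because the thread does not say which time zone the log uses. The pattern is assembled with paste0() so that a wrapped e-mail line cannot put a line break inside the pattern string, which the "+" continuation prompts in the quoted strcapture() call suggest may have happened.

```r
# Sample lines modelled on those quoted in this thread (not the real file).
x <- c("2016-01-27 09:14:40 *** Jane Doe started a video chat",
       "2016-01-27 09:15:20 <Jane Doe> Hey ",
       "2016-10-21 10:56:37 <John Doe> Admit#8242",
       "October 23, 1819 12:34 <Jane Eyre> I am not an angel")

# Build the pattern from pieces so no line break sneaks into the string.
d <- "[[:digit:]]"
pattern <- paste0(
  "^(", d, "{4}-", d, "{2}-", d, "{2})",   # date
  " +(", d, "{2}:", d, "{2}:", d, "{2})",  # time
  " +(<[^>]*>|[*]{3})",                    # "<name>" or "***"
  " *(.*$)")                               # rest of the line

# Four captures, so the prototype needs four columns.
z <- strcapture(pattern, x,
                proto = data.frame(date = "", time = "", person = "",
                                   comment = "", stringsAsFactors = FALSE))

# Combine date and time into one POSIXct column (tz = "UTC" is an assumption);
# rows that did not match the pattern stay NA.
z$when <- as.POSIXct(paste(z$date, z$time), tz = "UTC",
                     format = "%Y-%m-%d %H:%M:%S")

# Write only the rows that matched; the output file name is arbitrary.
ok <- !is.na(z$date)
out <- tempfile(fileext = ".csv")
write.csv(z[ok, ], out, row.names = FALSE)
```

Rows where every column is NA are the lines the pattern missed; x[!ok] lists them so they can be inspected or given a second pass with another pattern.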
Ivan Krylov
2019-May-17 19:43 UTC
[R] how to separate string from numbers in a large txt file
On Fri, 17 May 2019 11:36:22 -0700
Michael Boulineau <michael.p.boulineau at gmail.com> wrote:

> So, who knows what happened with the ï»¿ at the beginning of [1]
> directly above.

perl -Mutf8 -MEncode=encode,decode -Mcharnames=:full \
 -E'say charnames::viacode ord decode utf8 => encode latin1 => "ï»¿"'
# ZERO WIDTH NO-BREAK SPACE

So the text seems to have been encoded in UTF-8, then decoded as
Latin-1. If you have multiple such artefacts and want to get rid of
them, try:

a <- readLines(con <- file("hangouts-conversation-6.csv.txt", encoding = "UTF-8")); close(con); rm(con)

-- 
Best regards,
Ivan
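As a possible follow-up, here is a small self-contained sketch of two ways to end up with lines that have no byte order mark. It is not taken from the thread: the temporary file and its single line are made up for illustration, and whether the real file needs either step is an assumption. The "UTF-8-BOM" encoding name is the one described in ?file for dropping a leading mark while reading.

```r
# Write a small UTF-8 file that starts with a byte order mark (EF BB BF).
# The content is invented for this example.
tmp <- tempfile(fileext = ".txt")
con <- file(tmp, open = "wb")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)),
           charToRaw("2016-01-27 09:14:40 *** Jane Doe started a video chat\n")),
         con)
close(con)

# Option 1: let the connection discard the mark while reading.
con <- file(tmp, encoding = "UTF-8-BOM")
a1 <- readLines(con)
close(con)

# Option 2: mark the lines as UTF-8 and strip a leading U+FEFF if one is
# still there after reading.
a2 <- readLines(tmp, encoding = "UTF-8")
a2 <- sub("^\ufeff", "", a2)
```

Either way the first element should then start with the date, so a pattern anchored with "^" can match it.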