Hi,

I downloaded http://ssc.wisc.edu/~ahanna/20_newsgroups.csv and read it using

    data <- read.csv("20_newsgroups.csv",header=TRUE)

which throws this:

    Warning message:
    In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
      EOF within quoted string

For example, the first line in the file is shown below. This column contains only such text. Is there a way to read it?

    From: cubbie at garnet.berkeley.edu () Subject: Re: Cubs behind Marlins?
    How? Article-I.D.: agate.1pt592$f9a Organization: University of California,
    Berkeley Lines: 12 NNTP-Posting-Host: garnet.berkeley.edu
    gajarsky at pilot.njin.net writes: morgan and guzman will have era's 1 run
    higher than last year, and the cubs will be idiots and not pitch harkey as
    much as hibbard. castillo won't be good (i think he's a stud pitcher)
    This season so far, Morgan and Guzman helped to lead the Cubs at
    top in ERA, even better than THE rotation at Atlanta. Cubs ERA at
    0.056 while Braves at 0.059. We know it is early in the season, we
    Cubs fans have learned how to enjoy the short triumph while it is
    still there.

Thanks,
Mohan

This e-mail and any files transmitted with it are for the sole use of the intended recipient(s) and may contain confidential and privileged information. If you are not the intended recipient(s), please reply to the sender and destroy all copies of the original message. Any unauthorized review, use, disclosure, dissemination, forwarding, printing or copying of this email, and/or any action taken in reliance on the contents of this e-mail is strictly prohibited and may be unlawful. Where permitted by applicable law, this e-mail and other e-mail communications sent to and from Cognizant e-mail addresses may be monitored.
You might want to try some of the suggestions mentioned in this post:
https://stackoverflow.com/q/17414776/2140956

Jean

On Thu, Aug 10, 2017 at 7:59 AM, <Mohan.Radhakrishnan at cognizant.com> wrote:
> data <- read.csv("20_newsgroups.csv",header=TRUE)
>
> throws this.
>
> Warning message:
> In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
> EOF within quoted string
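One commonly suggested fix for this warning is to turn off quote handling with quote = "", so that a stray double quote in the text cannot swallow the rest of the file. The sketch below illustrates the idea on a tiny invented file; the column names and rows are assumptions for illustration and are not taken from 20_newsgroups.csv.

```r
# Minimal sketch on invented data (not the real 20_newsgroups.csv).
path <- tempfile(fileext = ".csv")
writeLines(c(
  "id,text",
  "1,he said \"hello",   # unbalanced double quote inside the text
  "2,second row"
), path)

# quote = "" tells read.csv to treat quote characters as ordinary text.
df <- read.csv(path, header = TRUE, quote = "", stringsAsFactors = FALSE)
print(df)
```

With quoting disabled, any comma inside the text is treated as a field separator, so this only reads cleanly when the text column contains no commas.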
Yes, I tried that already. It is not straightforward.

    data <- read.csv("20_newsgroups.csv", fill=TRUE, as.is=TRUE, header=FALSE,
                     quote="", sep=",", encoding="UTF-8")

This line does read the file, but haphazardly. The emails in the text column are split across multiple columns, and several columns contain only 'NA'. There are 202 columns in total.

I then removed the columns that contain only NA's, concatenated all the text, and finally got it.

    # keep only the columns that are not entirely NA
    munged <- data[, unlist(lapply(data, function(x) !all(is.na(x))))]
    # drop the first row (the header, since header=FALSE was used)
    munged <- munged[-1,]
    # paste the split text columns (3 onwards) back into one string per row
    munged$text <- apply(munged[, c(3:ncol(munged))], 1, paste0, collapse = " ")
    munged <- munged[, c("V1","V2","text")]
    print(head(munged$text))

Mohan

From: Adams, Jean [mailto:jvadams at usgs.gov]
Sent: Thursday, August 10, 2017 8:03 PM
To: Radhakrishnan, Mohan (Cognizant) <Mohan.Radhakrishnan at cognizant.com>
Cc: R help <r-help at r-project.org>
Subject: Re: [R] EOF within quoted string

You might want to try some of the suggestions mentioned in this post:
https://stackoverflow.com/q/17414776/2140956

Jean
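The workaround above pastes the split pieces back together with spaces, so the commas that were originally in the text are lost. A possible alternative, sketched below, is to read the raw lines and split each one on its first two commas only. This is a sketch under stated assumptions, not a tested recipe for the real file: it assumes three fields per record (named V1, V2 and text here to match the code above), that the first two fields contain no commas, and that each record sits on a single physical line. The sample rows are invented.

```r
# Invented sample with the assumed layout: two short fields, then free text.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "id,group,text",
  "1,rec.sport.baseball,morgan and guzman, said \"the cubs won't pitch harkey",
  "2,sci.space,short, clean message"
), path)

# Read raw lines and drop the header row.
lines <- readLines(path, warn = FALSE)[-1]

# Capture the first two comma-free fields; everything after the second
# comma (including further commas and quotes) goes into the third group.
parts <- regmatches(lines, regexec("^([^,]*),([^,]*),(.*)$", lines))

munged <- data.frame(
  V1   = vapply(parts, `[`, character(1), 2),
  V2   = vapply(parts, `[`, character(1), 3),
  text = vapply(parts, `[`, character(1), 4),
  stringsAsFactors = FALSE
)
print(munged$text)
```

If a record in the real file spans several physical lines, this line-by-line split would not apply as written and the lines would need to be regrouped first.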