Hi,
I downloaded http://ssc.wisc.edu/~ahanna/20_newsgroups.csv and tried to read it with
data <- read.csv("20_newsgroups.csv", header=TRUE)
which gives this warning:
Warning message:
In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
EOF within quoted string
For example, the first line in the file is shown below. This column contains only
such text. Is there a way to read it?
From: cubbie at garnet.berkeley.edu () Subject: Re: Cubs behind Marlins? How?
Article-I.D.: agate.1pt592$f9a Organization: University of California, Berkeley
Lines: 12 NNTP-Posting-Host: garnet.berkeley.edu gajarsky at pilot.njin.net
writes: morgan and guzman will have era's 1 run higher than last year, and
the cubs will be idiots and not pitch harkey as much as hibbard. castillo
won't be good (i think he's a stud pitcher) This season so far,
Morgan and Guzman helped to lead the Cubs at top in ERA, even better than
THE rotation at Atlanta. Cubs ERA at 0.056 while Braves at 0.059. We know
it is early in the season, we Cubs fans have learned how to enjoy the
short triumph while it is still there.
Thanks,
Mohan
This e-mail and any files transmitted with it are for the sole use of the
intended recipient(s) and may contain confidential and privileged information.
If you are not the intended recipient(s), please reply to the sender and destroy
all copies of the original message. Any unauthorized review, use, disclosure,
dissemination, forwarding, printing or copying of this email, and/or any action
taken in reliance on the contents of this e-mail is strictly prohibited and may
be unlawful. Where permitted by applicable law, this e-mail and other e-mail
communications sent to and from Cognizant e-mail addresses may be monitored.
[[alternative HTML version deleted]]
You might want to try some of the suggestions mentioned in this post:
https://stackoverflow.com/q/17414776/2140956
Jean
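For readers following along, here is a minimal sketch of how this warning can arise. It assumes the cause is an unmatched double quote somewhere in the file, which has not been verified for 20_newsgroups.csv; the sample rows and the temporary file are made up for illustration.

```r
# Hypothetical sample: 8 data rows; row 6 opens a double quote that never closes.
tmp <- tempfile(fileext = ".csv")
rows <- paste0(1:8, ",text of message ", 1:8)
rows[6] <- "6,\"text of message 6 with no closing quote"
writeLines(c("id,text", rows), tmp)

# Read with the default quote handling and collect any warnings
# instead of printing them.
msgs <- character(0)
with_quotes <- withCallingHandlers(
  read.csv(tmp, header = TRUE),
  warning = function(w) {
    msgs <<- c(msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

# quote = "" switches quote handling off, so each physical line is one row.
no_quotes <- read.csv(tmp, header = TRUE, quote = "")

print(msgs)
print(c(default = nrow(with_quotes), no_quoting = nrow(no_quotes)))
```

With the default settings the parser stays inside the open quote until the end of the file, so the rows after the stray quote are swallowed into one field. Turning quoting off recovers every line, at the price of treating every comma as a separator.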
Yes. I tried that already. Not straightforward.
data <- read.csv("20_newsgroups.csv", fill=TRUE, as.is=TRUE, header=FALSE,
                 quote="", sep=",", encoding="UTF-8")
This call does read the file, but haphazardly. The emails in the column are split
into multiple columns, and several columns contain nothing but NA. There are 202
columns in total.
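One way to see why commas inside the message text produce extra columns is to count the fields on each line with and without quote handling. This is a small sketch using count.fields() from utils; the two sample lines are invented, and the three-column layout (id, label, text) is an assumption based on the V1, V2 and text columns kept in this thread.

```r
# Hypothetical two-line file: id, label, free text. The second text field
# is quoted and contains three commas.
tmp <- tempfile(fileext = ".csv")
writeLines(c("1,sports,short text",
             "2,sports,\"text, with, three commas, inside\""), tmp)

# With double-quote handling, commas inside the quoted field do not split it.
n_quoted <- count.fields(tmp, sep = ",", quote = "\"")

# With quoting switched off, every comma starts a new field.
n_unquoted <- count.fields(tmp, sep = ",", quote = "")

print(n_quoted)
print(n_unquoted)
```

Running count.fields() on the real file with the same sep and quote settings as the read.csv() call would show which lines carry the extra fields.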
I then removed the columns that were entirely NA, concatenated all the text, and
finally got it.
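# Keep only the columns that are not entirely NA
munged <- data[, unlist(lapply(data, function(x) !all(is.na(x))))]
# Drop the first row (presumably the header line, since header=FALSE was used)
munged <- munged[-1, ]
# Paste the text fragments (columns 3 onward) back into one string per row
munged$text <- apply(munged[, 3:ncol(munged)], 1, paste0, collapse = " ")
munged <- munged[, c("V1", "V2", "text")]
print(head(munged$text))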
Mohan
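A possible alternative to reading into 202 columns and pasting them back together is sketched below. It rests on two assumptions: each record sits on one physical line, and only the first two commas on a line are real separators (id, label, then free text). Neither has been checked against the actual file, and the sample lines are made up.

```r
# Hypothetical sample standing in for readLines("20_newsgroups.csv").
lines <- c("1,rec.sport.baseball,Cubs ERA at 0.056, Braves at 0.059",
           "2,sci.space,no commas here")

# Capture the first two comma-separated fields, then everything that follows.
pattern <- "^([^,]*),([^,]*),(.*)$"
parts <- regmatches(lines, regexec(pattern, lines))

# Element 1 of each match is the whole line; elements 2 to 4 are the groups.
parsed <- data.frame(
  V1   = vapply(parts, `[`, character(1), 2),
  V2   = vapply(parts, `[`, character(1), 3),
  text = vapply(parts, `[`, character(1), 4),
  stringsAsFactors = FALSE
)

print(parsed$text)
```

Unlike pasting the split columns with collapse = " ", this keeps the commas inside the text, because they are never consumed as separators.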