thr3ads.net - R help - [R] Huge Dataset Dates Span two Lines [Jan 2015]

DVL

2015-Jan-08 18:20 UTC

[R] Huge Dataset Dates Span two Lines

I'm trying to import a multi-gigabyte .txt file to analyze. It is asterisk
delimited. I'm having an issue with the date field in the dataset. In the
first 165 lines dates are listed as:
YYYY-MM-DD HH:MM:SS

Then on the 166th line and in other places the date spans two lines: 
YYYY-MM-DD
HH:MM:SS

This causes a problem because R thinks it has reached the end of a row in
the table. How can I solve this?



--
View this message in context:
http://r.789695.n4.nabble.com/Huge-Dataset-Dates-Span-two-Lines-tp4701523.html
Sent from the R help mailing list archive at Nabble.com.

Collin Lynch

2015-Jan-08 19:37 UTC

[R] Huge Dataset Dates Span two Lines

You might consider using something other than R to clean the file and even
to load it.  I regularly use python to preprocess data for R and often feed
it to R directly via the rpy2 interface.  If the dates are delimited by
some feature (e.g. ") you could potentially use the python csv library to
load it directly and then either send that or dump it in a clean form.
Alternatively you could use a simple script to iterate over the file and to
remove newlines in that case without doing any other processing.  A regular
expression of the form:

"([0-9]{4}-[0-9]{2}-[0-9]{2})( )*(\n?)([0-9]{2}:[0-9]{2}:[0-9]{2})"

should match the date/time strings even with an embedded newline.  You
could use this to detect those cases in the file and then to replace the
newline with a whitespace character.
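
As a rough illustration of that idea, here is a minimal Python sketch that streams the file line by line, so the whole multi-gigabyte file never has to sit in memory. It assumes a broken record always ends its first line with a bare YYYY-MM-DD and starts the next line with HH:MM:SS. The names join_split_dates and clean_file, the two patterns, and the file paths are hypothetical, not anything from the original post.

```python
import io
import re

# Assumed patterns: a line that ends in a bare date, and a line that starts with a time.
DATE_AT_END = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}[ ]*$")
TIME_AT_START = re.compile(r"[ ]*[0-9]{2}:[0-9]{2}:[0-9]{2}")


def join_split_dates(lines):
    """Yield lines, rejoining a date and a time that were split by a newline."""
    held = None  # a line ending in a bare date, waiting for its time part
    for raw in lines:
        line = raw.rstrip("\r\n")
        if held is not None:
            if TIME_AT_START.match(line):
                # Replace the stray newline with a single space.
                line = held.rstrip(" ") + " " + line.lstrip(" ")
            else:
                # The date really was the end of the row; emit it unchanged.
                yield held + "\n"
            held = None
        if DATE_AT_END.search(line):
            held = line
        else:
            yield line + "\n"
    if held is not None:
        yield held + "\n"


def clean_file(src_path, dst_path):
    """Write a repaired copy of src_path to dst_path, one line at a time."""
    with open(src_path, "r", newline="") as src, \
            open(dst_path, "w", newline="") as dst:
        dst.writelines(join_split_dates(src))


if __name__ == "__main__":
    # Small in-memory demonstration with made-up rows.
    sample = "1*2015-01-08 10:00:00*a\n2*2015-01-09\n11:00:00*b\n"
    for row in join_split_dates(io.StringIO(sample)):
        print(row, end="")
```

Calling clean_file("raw.txt", "clean.txt") (both paths made up) would produce a copy that should be readable in R with something like read.table("clean.txt", sep = "*"), provided the split dates are the only defect in the file.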


David Winsemius

2015-Jan-08 21:41 UTC

[R] Huge Dataset Dates Span two Lines

It would probably be easiest to edit the file in a text editor. I suppose you
could also read the file in with readLines() and do the work all in R but that
sounds a bit more painful than option 1 to my reading. If the problems are only
those exactly as you describe, this could be an untested outline of a solution:

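dat <- readLines("/pat/fil.ext")
# mark lines that hold only the 10-character date fragment (YYYY-MM-DD)
marks <- nchar(dat) == 10
# or, if the fragment ends a longer line, mark lines that end in a bare date
marks <- grepl("[0-9]{4}-[0-9]{2}-[0-9]{2} *$", dat)
marks[length(marks)] <- FALSE        # a final line has nothing to append
nxt <- c(FALSE, head(marks, -1))     # the line right after each marked line
# append the following (time) line to each broken fragment
dat[ marks ] <- paste(dat[ marks ], dat[ nxt ])
final <- dat[ !nxt ]                 # remove the lines that were appended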

> View this message in context:
> http://r.789695.n4.nabble.com/Huge-Dataset-Dates-Span-two-Lines-tp4701523.html
> Sent from the R help mailing list archive at Nabble.com.
>
Nabble is not the R-help archive, and it also suppresses this message, which you
should be sure to read:
*______________________________________________
*R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
*https://stat.ethz.ch/mailman/listinfo/r-help
*PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
*and provide commented, minimal, self-contained, reproducible code.

-- 
David Winsemius
Alameda, CA, USA
