I'm trying to import a many-gigabyte .txt file to analyze. It is asterisk delimited. I'm having an issue with the date field in the dataset. In the first 165 lines, dates are listed as:

YYYY-MM-DD HH:MM:SS

Then on the 166th line and in other places, the date spans two lines:

YYYY-MM-DD
HH:MM:SS

This causes a problem because R thinks it has reached the end of a row in the table. How can I solve this?

--
View this message in context: http://r.789695.n4.nabble.com/Huge-Dataset-Dates-Span-two-Lines-tp4701523.html
Sent from the R help mailing list archive at Nabble.com.
You might consider using something other than R to clean the file, and even to load it. I regularly use Python to preprocess data for R and often feed it to R directly via the rpy2 interface.

If the dates are delimited by some feature (e.g. "), you could potentially use the Python csv library to load the file directly and then either send the result to R or dump it in a clean form.

Alternatively, you could use a simple script to iterate over the file and remove the newlines in those cases without doing any other processing. A regular expression of the form:

"([0-9]{4,4}-[0-9]{2,2}-[0-9]{2,2})( )+(\n?)([0-9]{2,2}:[0-9]{2,2}:[0-9]{2,2})"

should match the date/time strings even with an embedded newline. Note that, as written, it requires at least one space before the optional newline; use "( )*" in place of "( )+" if the line can break immediately after the date. You could use this to detect those cases in the file and then replace the newline with a whitespace character.
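As a rough illustration of the "simple script" idea, here is a minimal Python sketch that streams the file one line at a time, so a many-gigabyte file never has to fit in memory. It is not tested against the real data. It assumes that a broken record always ends in a bare YYYY-MM-DD (possibly followed by spaces) and that its continuation starts with HH:MM:SS. The names rejoin_split_dates and clean_file, and the file paths in the usage note, are made up for this example.

```python
import re
from typing import Iterable, Iterator

# Assumption: a broken line ends with a bare date, optionally followed by blanks.
ENDS_WITH_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}[ \t]*$")
# Assumption: the continuation line starts with the time of day.
STARTS_WITH_TIME = re.compile(r"[ \t]*[0-9]{2}:[0-9]{2}:[0-9]{2}")


def rejoin_split_dates(lines: Iterable[str]) -> Iterator[str]:
    """Yield records, gluing 'YYYY-MM-DD' <newline> 'HH:MM:SS' back together."""
    pending = None  # a held-back line that ended with a bare date
    for raw in lines:
        line = raw.rstrip("\r\n")
        if pending is not None:
            if STARTS_WITH_TIME.match(line):
                # Replace the stray newline with a single space.
                line = pending.rstrip() + " " + line.lstrip()
            else:
                # False alarm: the held-back line was complete after all.
                yield pending
            pending = None
        if ENDS_WITH_DATE.search(line):
            pending = line  # wait for the next line before deciding
        else:
            yield line
    if pending is not None:
        yield pending


def clean_file(src_path: str, dst_path: str) -> None:
    """Stream src_path to dst_path, one repaired record per output line."""
    with open(src_path, "r", newline="") as src, \
            open(dst_path, "w", newline="\n") as dst:
        for record in rejoin_split_dates(src):
            dst.write(record + "\n")
```

Calling clean_file("raw.txt", "clean.txt") (both paths hypothetical) would write a repaired copy that can then be read into R as an ordinary asterisk-delimited file. The files are opened with the platform default encoding; pass an explicit encoding to open() if the data needs one.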
On Jan 8, 2015, at 10:20 AM, DVL wrote:

> I'm trying to import a many gigabyte .txt file to analyze. It is asterisk
> delimited. I'm having an issue with the date field in the dataset. In the
> first 165 lines dates are listed as :
> YYYY-MM-DD HH:MM:SS
>
> Then on the 166th line and in other places the date spans two lines:
> YYYY-MM-DD
> HH:MM:SS
>
> This causes a problem because R thinks it has reached the end of a row in
> the table. How can I solve this?

It would probably be easiest to edit the file in a text editor. I suppose you could also read the file in with readLines() and do the work all in R, but that sounds a bit more painful than option 1 to my reading. If the problems are only those exactly as you describe, this could be an untested outline of a solution:

    dat <- readLines("/pat/fil.ext")
    # flag lines cut off right after the date (line ends in a bare YYYY-MM-DD)
    marks <- grepl("[0-9]{4}-[0-9]{2}-[0-9]{2} *$", dat)
    # the continuation of each flagged line is the line that follows it
    nxt <- c(FALSE, head(marks, -1))
    # append each continuation to its broken fragment
    dat[marks] <- paste(dat[marks], dat[nxt])
    # remove the continuation lines, now merged into the preceding line
    final <- dat[!nxt]

> View this message in context: http://r.789695.n4.nabble.com/Huge-Dataset-Dates-Span-two-Lines-tp4701523.html
> Sent from the R help mailing list archive at Nabble.com.

Nabble is not the Rhelp Archive, and it also suppresses these messages, which you should be sure to read:

*______________________________________________
*R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
*https://stat.ethz.ch/mailman/listinfo/r-help
*PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
*and provide commented, minimal, self-contained, reproducible code.

--
David Winsemius
Alameda, CA, USA