algotr8der
2011-Mar-20 19:47 UTC
[R] read file part way through based on start and end date (first column)
Hello folks - I have been trying to figure this out. I have a set of very large files in this format:

, , , ,
1/4/1999,9:31:00 AM,blah, blah, blah
1/4/1999,9:32:00 AM,blah, blah, blah
1/4/1999,9:33:00 AM,blah, blah, blah

I want to write R code that reads only the data between a start and an end date (the data run from oldest at the top of the file to most recent at the bottom). I'm not sure whether there is an R function that makes this easy.

I know the read.csv function lets you skip a user-specified number of rows before the file is read, but that doesn't quite help me, as my start and end dates can fall anywhere in the file.

Appreciate the help.

--
View this message in context: http://r.789695.n4.nabble.com/read-file-part-way-through-based-on-start-and-end-date-first-column-tp3391769p3391769.html
Sent from the R help mailing list archive at Nabble.com.
jim holtman
2011-Mar-20 21:04 UTC
[R] read file part way through based on start and end date (first column)
How big is the file? Why not read the entire file in and then use 'subset' to extract only the data that you want? If the file is too large to read in, then you could put it in a database and use SQL to extract what you want. You could also create a 'perl' script to filter the data before reading it into R. So a little more specificity is needed to understand the problem you are trying to solve.

On Sun, Mar 20, 2011 at 3:47 PM, algotr8der <algotr8der at gmail.com> wrote:
> Hello folks - I have been trying to figure this out. I have a set of very
> large files that are of this format
>
> , , , ,
> 1/4/1999,9:31:00 AM,blah, blah, blah
> 1/4/1999,9:32:00 AM,blah, blah, blah
> 1/4/1999,9:33:00 AM,blah, blah, blah
>
> I want to write R code that reads only that data between a start and an end
> date (data is presented from oldest at the top of the file to the most
> recent at the bottom of the file). I'm not sure if there is an R function
> that makes this easy.
>
> I know the read.csv function enables you to skip a user specified number of
> rows before the file is read but this doesn't exactly help me as my start and
> end dates can be anywhere in between.
>
> Appreciate the help.
>
> --
> View this message in context: http://r.789695.n4.nabble.com/read-file-part-way-through-based-on-start-and-end-date-first-column-tp3391769p3391769.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
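A minimal sketch of the read-then-subset approach Jim describes, assuming the file is named "prices.csv" (a hypothetical name), the blank line of commas at the top is skipped, and the unnamed columns therefore come in as V1, V2, ...; the dates used are only examples:

## read everything, then keep only the wanted date range
dat <- read.csv("prices.csv", header = FALSE, skip = 1,
                stringsAsFactors = FALSE)

## the first column holds dates in m/d/yyyy form
dat$V1 <- as.Date(dat$V1, format = "%m/%d/%Y")

## keep only the rows between a start and an end date (inclusive)
keep <- subset(dat, V1 >= as.Date("1999-01-04") &
                    V1 <= as.Date("1999-01-29"))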
algotr8der
2011-Mar-20 21:12 UTC
[R] read file part way through based on start and end date (first column)
Thanks Jim for the reply. The file has 1,183,318 rows and there are 20 such files.

Too big for R to handle?

--
View this message in context: http://r.789695.n4.nabble.com/read-file-part-way-through-based-on-start-and-end-date-first-column-tp3391769p3392005.html
Sent from the R help mailing list archive at Nabble.com.
jim holtman
2011-Mar-21 00:45 UTC
[R] read file part way through based on start and end date (first column)
Depends on what version of R you are using. If you are running a 32-bit version and all the columns were numeric, with about 20 columns I would guess that might require 300MB for a single copy of the object, and the reading in and then subsetting might require 3-4X that space. So if you had 3GB of memory, you might be fine.

How much would you expect to read from each file (1%, 10% or 100%)? You might be better off initially putting the data into a database and then extracting what you want from there. Is it a fixed range that you want to extract from all the files, or does it vary for each run? There are a number of RDBMSs that interface to R and would make the job easier.

What you should try is to read in progressively larger sections of one of the files to see how much memory is used. If you are using read.table, remember to explicitly state what the mode of each column is. This will give you the best estimate of whether your system is capable of handling a single file at a time. It will also tell you how long it takes to read/convert the data.

I would suggest that if your system can handle a single file, you set up a script to read in each of the files and "save" the resulting object. This will allow much faster access on subsequent reads, since the data will already be converted.

On Sun, Mar 20, 2011 at 5:12 PM, algotr8der <algotr8der at gmail.com> wrote:
> Thanks Jim for the reply. The file has 1,183,318 rows and there are 20 such
> files.
>
> Too big for R to handle?
>
> --
> View this message in context: http://r.789695.n4.nabble.com/read-file-part-way-through-based-on-start-and-end-date-first-column-tp3391769p3392005.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
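A short sketch of this advice, assuming each file has 20 columns, the first two character (date and time) and the remaining 18 numeric; these column classes and the file name "prices.csv" are placeholders, not details from the thread:

## explicit column classes speed up read.csv and keep memory predictable
cls <- c("character", "character", rep("numeric", 18))
dat <- read.csv("prices.csv", header = FALSE, skip = 1, colClasses = cls)

## check how much memory one converted file actually occupies
print(object.size(dat), units = "Mb")

## cache the converted object; later runs can load() it instead of
## re-parsing the CSV, which is much faster
save(dat, file = "prices.RData")
## load("prices.RData")   # on subsequent runs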
Gabor Grothendieck
2011-Mar-21 04:16 UTC
[R] read file part way through based on start and end date (first column)
On Sun, Mar 20, 2011 at 3:47 PM, algotr8der <algotr8der at gmail.com> wrote:
> Hello folks - I have been trying to figure this out. I have a set of very
> large files that are of this format
>
> , , , ,
> 1/4/1999,9:31:00 AM,blah, blah, blah
> 1/4/1999,9:32:00 AM,blah, blah, blah
> 1/4/1999,9:33:00 AM,blah, blah, blah
>
> I want to write R code that reads only that data between a start and an end
> date (data is presented from oldest at the top of the file to the most
> recent at the bottom of the file). I'm not sure if there is an R function
> that makes this easy.
>
> I know the read.csv function enables you to skip a user specified number of
> rows before the file is read but this doesn't exactly help me as my start and
> end dates can be anywhere in between.
>

Try reading the entire file into R first to be really sure that you are not just assuming it can't be done. If it's true that it's too big to read in and subset, then try reading just the first column of the file (read about the colClasses= argument in ?read.table), figure out which rows you need from that first column, and re-read the file, this time using the skip= and nrows= arguments so that it only reads in the rows you need.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
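A sketch of this two-pass approach, assuming a hypothetical "prices.csv" with 5 columns and example dates; colClasses = "NULL" drops the columns that are not needed on the first pass:

## pass 1: read only the date column
first <- read.csv("prices.csv", header = FALSE, skip = 1,
                  colClasses = c("character", rep("NULL", 4)))
d <- as.Date(first[[1]], format = "%m/%d/%Y")

## rows (within the data portion of the file) that fall in the range
rows <- which(d >= as.Date("1999-01-04") & d <= as.Date("1999-01-29"))

## pass 2: skip the header line plus everything before the first wanted
## row, and read only as many rows as the range spans (the file is in
## date order, so the wanted rows are contiguous)
dat <- read.csv("prices.csv", header = FALSE,
                skip = 1 + min(rows) - 1,
                nrows = max(rows) - min(rows) + 1,
                stringsAsFactors = FALSE)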