algotr8der
2011-Mar-20 19:47 UTC
[R] read file part way through based on start and end date (first column)
Hello folks - I have been trying to figure this out. I have a set of very large files in this format:

, , , ,
1/4/1999,9:31:00 AM,blah, blah, blah
1/4/1999,9:32:00 AM,blah, blah, blah
1/4/1999,9:33:00 AM,blah, blah, blah

I want to write R code that reads only the data between a start and an end date (the data run from oldest at the top of the file to most recent at the bottom). I'm not sure whether there is an R function that makes this easy.

I know the read.csv function lets you skip a user-specified number of rows before the file is read, but that doesn't quite help me, as my start and end dates can fall anywhere in the file.

Appreciate the help.

--
View this message in context: http://r.789695.n4.nabble.com/read-file-part-way-through-based-on-start-and-end-date-first-column-tp3391769p3391769.html
Sent from the R help mailing list archive at Nabble.com.
jim holtman
2011-Mar-20 21:04 UTC
[R] read file part way through based on start and end date (first column)
How big is the file? Why not read the entire file in and then use 'subset' to extract only the data that you want? If the file is too large to read in, then you could put it in a database and use SQL to extract what you want. You could also create a 'perl' script to filter the data before reading it into R. So a little more specificity is needed to understand the problem you are trying to solve.

On Sun, Mar 20, 2011 at 3:47 PM, algotr8der <algotr8der at gmail.com> wrote:
> Hello folks - I have been trying to figure this out. I have a set of very
> large files that are of this format
>
> , , , ,
> 1/4/1999,9:31:00 AM,blah, blah, blah
> 1/4/1999,9:32:00 AM,blah, blah, blah
> 1/4/1999,9:33:00 AM,blah, blah, blah
>
> I want to write R code that reads only that data between a start and an end
> date (data is presented from oldest at the top of the file to the most
> recent at the bottom of the file). I'm not sure if there is an R function
> that makes this easy.
>
> I know the read.csv function enables you to skip a user specified number of
> rows before the file is read but this doesn't exactly help me as my start and
> end dates can be anywhere in between.
>
> Appreciate the help.
>
> --
> View this message in context: http://r.789695.n4.nabble.com/read-file-part-way-through-based-on-start-and-end-date-first-column-tp3391769p3391769.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
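A minimal sketch of the read-then-subset approach Jim describes, assuming the file is named "prices.csv" (a hypothetical name), the blank line of commas at the top is skipped, and the unnamed columns therefore come in as V1, V2, ...; the dates used are only examples:

## read everything, then keep only the wanted date range
dat <- read.csv("prices.csv", header = FALSE, skip = 1,
                stringsAsFactors = FALSE)

## the first column holds dates in m/d/yyyy form
dat$V1 <- as.Date(dat$V1, format = "%m/%d/%Y")

## keep only the rows between a start and an end date (inclusive)
keep <- subset(dat, V1 >= as.Date("1999-01-04") &
                    V1 <= as.Date("1999-01-29"))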
algotr8der
2011-Mar-20 21:12 UTC
[R] read file part way through based on start and end date (first column)
Thanks Jim for the reply. The file has 1,183,318 rows and there are 20 such files.

Too big for R to handle?

--
View this message in context: http://r.789695.n4.nabble.com/read-file-part-way-through-based-on-start-and-end-date-first-column-tp3391769p3392005.html
Sent from the R help mailing list archive at Nabble.com.
jim holtman
2011-Mar-21 00:45 UTC
[R] read file part way through based on start and end date (first column)
Depends on what version of R you are using. If you are running a 32-bit version and all the columns were numeric, with about 20 columns I would guess that might require 300MB for a single copy of the object, and the reading in and then subsetting might require 3-4X that space. So if you had 3GB of memory, you might be fine.

How much would you expect to read from each file (1%, 10% or 100%)? You might be better off initially putting the data into a database and then extracting what you want from there. Is it a fixed range that you want to extract from all the files, or does it vary for each run? There are a number of RDBMSs that interface to R and would make the job easier.

What you should try is to read in progressively larger sections of one of the files to see how much memory is used. If you are using read.table, remember to explicitly state what the mode of each column is. This will give you the best estimate of whether your system is capable of handling a single file at a time. It will also tell you how long it takes to read/convert the data.

I would suggest that if your system can handle a single file, you set up a script to read in each of the files and "save" the resulting object. This will allow much faster access on subsequent reads, since the data will already be converted.

On Sun, Mar 20, 2011 at 5:12 PM, algotr8der <algotr8der at gmail.com> wrote:
> Thanks Jim for the reply. The file has 1,183,318 rows and there are 20 such
> files.
>
> Too big for R to handle?
>
> --
> View this message in context: http://r.789695.n4.nabble.com/read-file-part-way-through-based-on-start-and-end-date-first-column-tp3391769p3392005.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
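A short sketch of this advice, assuming each file has 20 columns, the first two character (date and time) and the remaining 18 numeric; these column classes and the file name "prices.csv" are placeholders, not details from the thread:

## explicit column classes speed up read.csv and keep memory predictable
cls <- c("character", "character", rep("numeric", 18))
dat <- read.csv("prices.csv", header = FALSE, skip = 1, colClasses = cls)

## check how much memory one converted file actually occupies
print(object.size(dat), units = "Mb")

## cache the converted object; later runs can load() it instead of
## re-parsing the CSV, which is much faster
save(dat, file = "prices.RData")
## load("prices.RData")   # on subsequent runs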
Gabor Grothendieck
2011-Mar-21 04:16 UTC
[R] read file part way through based on start and end date (first column)
On Sun, Mar 20, 2011 at 3:47 PM, algotr8der <algotr8der at gmail.com> wrote:
> Hello folks - I have been trying to figure this out. I have a set of very
> large files that are of this format
>
> , , , ,
> 1/4/1999,9:31:00 AM,blah, blah, blah
> 1/4/1999,9:32:00 AM,blah, blah, blah
> 1/4/1999,9:33:00 AM,blah, blah, blah
>
> I want to write R code that reads only that data between a start and an end
> date (data is presented from oldest at the top of the file to the most
> recent at the bottom of the file). I'm not sure if there is an R function
> that makes this easy.
>
> I know the read.csv function enables you to skip a user specified number of
> rows before the file is read but this doesn't exactly help me as my start and
> end dates can be anywhere in between.
>

Try reading the entire file into R first to be really sure that you are not just assuming it can't be done. If it's true that it's too big to read in and subset, then try reading just the first column of the file (read about the colClasses= argument in ?read.table), figure out which rows you need from that first column, and re-read the file, this time using the skip= and nrows= arguments so that it only reads in the rows you need.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
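A sketch of this two-pass approach, assuming a hypothetical "prices.csv" with 5 columns and example dates; colClasses = "NULL" drops the columns that are not needed on the first pass:

## pass 1: read only the date column
first <- read.csv("prices.csv", header = FALSE, skip = 1,
                  colClasses = c("character", rep("NULL", 4)))
d <- as.Date(first[[1]], format = "%m/%d/%Y")

## rows (within the data portion of the file) that fall in the range
rows <- which(d >= as.Date("1999-01-04") & d <= as.Date("1999-01-29"))

## pass 2: skip the header line plus everything before the first wanted
## row, and read only as many rows as the range spans (the file is in
## date order, so the wanted rows are contiguous)
dat <- read.csv("prices.csv", header = FALSE,
                skip = 1 + min(rows) - 1,
                nrows = max(rows) - min(rows) + 1,
                stringsAsFactors = FALSE)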