thr3ads.net - R help - [R] Efficient Way to gather data from various files [Oct 2012]


Sam Asin

2012-Oct-02 23:33 UTC

[R] Efficient Way to gather data from various files

Hello,

Sorry if this question is too simple for this list.  I know I can do it, but
I keep reading online that when using R one should try to avoid loops and
use vectors instead.  I am wondering whether there is a more "R friendly"
way to do this than using for loops.

I have a dataset that has a list of "ID"s.  Let's call this dataset
"Master".

Each of these "ID"s has an associated DBF file.  The DBF files all have
the same file name, and each is located in a directory path that includes
the "ID" as one of the folder names.

These DBF files have 2 columns of interest.  One is the "run number", the
other is the "statistic".  I'm interested in the median and 90th percentile
of the "statistic" as well as their corresponding run numbers.  Ultimately,
I want a table that consists of

ID   Run_50th  Stat_50  Run_90  Stat_90
1AB  5         102010   3       144376
1AC  3         999999   6       999999999

etc.

Where I currently have a dataset that has

ID
1AB
1AC

etc.

And there are several DBF files that sit in folders, e.g.
"folder1/1AC/folder2/blah.dbf"

This dbf looks like

run   Stat

1     10
2     10
3     999999
4     100000000000
5     100000000
6     9999999999
7     100000000
8     10
9     10
10    10
11    1000000


I know I could do this with a loop, but I can't see the efficient, R way.
I was hoping that you experienced R programmers could give me some
pointers on the most efficient way to achieve this result.

Sam


Rui Barradas

2012-Oct-03 01:17 UTC


Hello,

There are more R friendly ways to do what you want, and it seems easy to
me to avoid loops, but you need to tell us how you know which rows
correspond to the 50th and 90th quantiles. Maybe this comes from the
value in some other column?

Please give a somewhat more complete description and we'll see what can
be done.

Hope this helps,

Rui Barradas

Jeff Newmiller

2012-Oct-03 03:05 UTC


File operations are not vectorizable. About the only thing you can do for the
part that iterates through the files is to use lapply instead of a for loop,
but that is mostly a style change.
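
As a rough sketch of the lapply approach, assuming the folder layout given in
the question ("folder1/<ID>/folder2/blah.dbf"): the helper names dbf_path and
read_all are hypothetical, and foreign::read.dbf is only one existing DBF
reader used here as a default. Any function that takes a path and returns a
data frame could be passed instead.

```r
# Hypothetical layout taken from the question: folder1/<ID>/folder2/blah.dbf
dbf_path <- function(id, root = ".") {
  file.path(root, "folder1", id, "folder2", "blah.dbf")
}

# Read one file per ID and stack the results, tagging each row with its ID.
# `reader` is an assumption: foreign::read.dbf is one existing DBF reader,
# but it is only evaluated if no other reader is supplied.
read_all <- function(ids, root = ".", reader = foreign::read.dbf) {
  tables <- lapply(ids, function(id) {
    dat <- reader(dbf_path(id, root))
    data.frame(ID = rep(id, nrow(dat)), dat, stringsAsFactors = FALSE)
  })
  do.call(rbind, tables)
}
```

With the IDs held in a data frame called Master, the call would look
something like read_all(as.character(Master$ID)). The loop over files is
still there; lapply just collects the results in a list for you.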

Once you have read the dbf files there will probably be vector functions you
can use (quantile). Off the top of my head I don't know a function that tells
you which row corresponds to a particular quantile, but you can probably sort
the data with order(), use which.max to find the first sorted value whose ecdf
reaches your target, and look at the row number of that value.

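x <- rnorm(11)
names(x) <- seq_along(x)   # remember each value's original row number
xs <- x[order(x)]          # sort, carrying the row numbers along as names
# first sorted position whose empirical CDF reaches 0.9
Row90 <- as.numeric(names(xs)[which.max(seq_along(xs) / length(xs) >= 0.9)])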
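
A minimal sketch of the same idea applied to the sample table from the
original question. The function name quantile_rows and its output column
names are hypothetical. Using quantile(type = 1) is an assumption made here:
that type returns an observed value (the inverse empirical CDF) rather than
interpolating, so the result can be matched back to a run number.

```r
# Sample data copied from the question (one DBF file's two columns).
dbf <- data.frame(
  run  = 1:11,
  Stat = c(10, 10, 999999, 100000000000, 100000000, 9999999999,
           100000000, 10, 10, 10, 1000000)
)

# Hypothetical helper: for each probability, return the observed Stat value
# at that quantile and the run number it came from.
quantile_rows <- function(dat, probs = c(0.5, 0.9)) {
  # type = 1 never interpolates, so every result is a value present in Stat
  q <- quantile(dat$Stat, probs = probs, type = 1, names = FALSE)
  # match() returns the first row holding each value (ties go to the lowest row)
  rows <- match(q, dat$Stat)
  data.frame(prob = probs, run = dat$run[rows], Stat = q)
}

res <- quantile_rows(dbf)
print(res)
```

On this sample the 50% row comes out as run 3 (Stat 999999) and the 90% row
as run 6 (Stat 9999999999). When several runs share the same Stat value, only
the first of them is reported.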

---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
--------------------------------------------------------------------------- 
Sent from my phone. Please excuse my brevity.

