David Wolfskill
2015-Mar-30 13:50 UTC
[R] data.frame: data-driven column selections that vary by row??
Sorry if that's confusing: I'm probably confused. :-(

I am collecting and trying to analyze data regarding performance of computer systems.

After extracting the data from its repository, I have created and used a Perl script to generate a (relatively) simple CSV, each record of which contains:
* a POSIXct timestamp
* a hostname
* a collection of metrics for the interval identified by the timestamp, and specific to the host in question, as well as some factors to group the hosts (e.g., whether it's in a "control" vs. a "test" group; a broad categorization of how the host is provisioned; which version of the software it was running at the time...). (Each metric and factor is in a uniquely-named column.)

As extracted from the repository, there were several records for each such hostname/timestamp pair -- e.g., there would be separate records for:
* Input bandwidth utilization for network interface 1
* Output bandwidth utilization for network interface 1
* Input bandwidth utilization for network interface 2
* Output bandwidth utilization for network interface 2

(And the same field would be used for each of these -- the interpretation being driven by the content of other fields in the record.)

Working with the data as described (immediately) above directly in R seemed... daunting, at best: thus the excursion into Perl.

And for some of the data, what I have works well enough.

But now I also want to analyze information from disk drives, and things get messy (as far as I can see).

First, each disk drive has a collection of 17 metrics (such as "busy_pct", "kb_per_transfer_read", and "transfers_per_second_write"), as well as a factor ("dev_type"). Each also has a device name that is unique within the host where it resides (e.g. "da1", "da2", "da3", ...). (The "dev_type" factor identifies whether the drive is a solid-state device or a spinning disk.)

I have thus made the corresponding columns unique by pasting the drive name and the name of the metric (or factor), separating the two with "_" (e.g. "da7_busy_pct"; "ada0_mb_per_second_write"; "ada4_queue_length"). I am not certain that's the best thing I could have done -- and I'm open to changing the approach.

The challenge for me is that different (classes of) machines are provisioned differently; some consequences of that:
* While da1 may be a spinning disk on host A, that has no bearing on whether or not the "da1" on host B is a spinning disk or an SSD.
* Host C may not even have a "da1" device.
* Host D may be of a type that normally has a "da1," but in this case, the drive has failed and has been disabled (so host D won't report anything about "da1").

(I'm not too bothered about the "non-reporting" case, but cite it so we all know about it.)

I expect I will want to be using groupings:
* All disk devices -- this one is easy.
* All SSD devices (excluding spinning disks).
* All spinning disks (excluding SSDs).

I'm having trouble with the latter two (though, certainly, if I solve one, the other is also solved).

Also, for some of the metrics, I will want to sum them; for others, I will want to do other things -- find minima or maxima, or average them. So pre-calculating such aggregates in the Perl script isn't something that appeals to me.

Finally (as far as complications go), I'm trying to write the code in such a way that if we deploy a new configuration of machine that has (say) twice as many drives as the biggest one we presently deploy, the code Just Works -- I shouldn't need to update the code merely to adapt to another hardware configuration.
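As an aside, here is a minimal sketch of how such composite column names could be split back into a device part and a metric part; the example names and the assumed device-name pattern ("da" or "ada" followed by digits) are illustrative only, not taken from the actual data:

cols   <- c("da7_busy_pct", "ada0_mb_per_second_write", "ada4_queue_length")
## device is the leading "da<N>" / "ada<N>" token; metric is everything after it
device <- sub("^(a?da[0-9]+)_.*$", "\\1", cols)  # "da7"  "ada0"  "ada4"
metric <- sub("^a?da[0-9]+_", "", cols)          # "busy_pct"  "mb_per_second_write"  "queue_length"
data.frame(column = cols, device = device, metric = metric)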
I have been able to write a function that takes the data.frame obtained by reading the above-cited CSV, and generates a data.frame with a row for each host, and depicts the "dev_type" for each device for that host; here's an abbreviated (and slightly redacted) copy of its output to illustrate some of the above:

       ada0 ada1 ada2 ada3 ada4 ada5 da30 da31 da32 da33 da34 da35 da36 da3
host_A  ssd  ssd  hdd  hdd  hdd  hdd  hdd  hdd  hdd  hdd  hdd  hdd  hdd hdd
host_B  ssd  ssd  hdd  hdd  hdd  hdd  hdd  hdd  hdd  hdd  hdd  hdd  hdd hdd
host_G  ssd  ssd  ssd  ssd  ssd  ssd                                    ssd
host_H  ssd  ssd  ssd  ssd  ssd  ssd                                    ssd
host_M  ssd  ssd  ssd  ssd  ssd  ssd                                    ssd
host_N  ssd  ssd  ssd  ssd  ssd  ssd                                    ssd

(That function is written with the explicit assumption(!) that for the period covered by a given set of input data, a given host's configuration remains static: we won't have drives changing type mid-stream.)

So the point of this lengthy(!) note is to ask if there's a somewhat-sane way to be able to group the metrics for the "ssd" devices (for example), given the above.

(So far, the least obnoxious way that comes to mind is to actually create 2 columns for each device metric: one for the device if it's an "ssd"; the other for "hdd" -- so instead of columns such as:
* da3_busy_pct
* da3_dev_type
* da3_kb_per_transfer_read
* da36_cam_timeouts
* da36_dev_type
* da36_mb_per_second_read

I would have:
* da3_hdd_busy_pct
* da3_ssd_busy_pct
* da3_hdd_dev_type
* da3_ssd_dev_type
* da3_hdd_kb_per_transfer_read
* da3_ssd_kb_per_transfer_read
* da36_hdd_cam_timeouts
* da36_ssd_cam_timeouts
* da36_hdd_dev_type
* da36_ssd_dev_type
* da36_hdd_mb_per_second_read
* da36_ssd_mb_per_second_read

and no more than half of those would actually be populated, depending on the content of "dev_type" when the Perl script is creating the CSV.)

That seems rather hackish, though.

Thank you in advance for any insight.

Peace,
david
--
David H. Wolfskill                              r at catwhisker.org
Those who murder in the name of God or prophet are blasphemous cowards.

See http://www.catwhisker.org/~david/publickey.gpg for my public key.
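For illustration, a minimal sketch of the kind of per-host "dev_type" table described above. This is not the actual function; it assumes a data frame `perf` read from the CSV, with a "hostname" column and one "<device>_dev_type" column per device, and it simply takes the first value observed per host (which the static-configuration assumption permits). The name `dev_types_by_host` is hypothetical.

dev_types_by_host <- function(perf) {
  ## columns such as "da2_dev_type", "ada0_dev_type", ...
  type_cols <- grep("_dev_type$", names(perf), value = TRUE)
  ## one row per host; first observed value per column (config assumed static)
  out <- aggregate(perf[type_cols],
                   by = list(hostname = perf$hostname),
                   FUN = function(x) as.character(x)[1])
  rownames(out) <- out$hostname
  out$hostname <- NULL
  ## strip the "_dev_type" suffix so the columns are just the device names
  names(out) <- sub("_dev_type$", "", names(out))
  out
}

Under those assumptions, devices a host does not have would show up as empty strings (or NA) in the resulting table.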
John Kane
2015-Mar-31 15:11 UTC
[R] data.frame: data-driven column selections that vary by row??
I think we need some data and code.

Reproducibility:
https://github.com/hadley/devtools/wiki/Reproducibility
http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example

John Kane
Kingston ON Canada
David Wolfskill
2015-Mar-31 17:22 UTC
[R] data.frame: data-driven column selections that vary by row??
On Tue, Mar 31, 2015 at 07:11:28AM -0800, John Kane wrote:
> I think we need some data and code
> Reproducibility
> https://github.com/hadley/devtools/wiki/Reproducibility
> http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
> ...

I apologize for failing to provide that.

Here is a quite small subset of the data (with a few edits to reduce excess verbosity in names of things) that still illustrates the challenge I perceive:

> dput(bw)
structure(list(timestamp = c(1426892400L, 1426892400L, 1426892400L,
1426892400L, 1426892400L, 1426892400L, 1426892460L, 1426892460L,
1426892460L, 1426892460L, 1426892460L, 1426892460L, 1426892520L,
1426892520L, 1426892520L, 1426892520L, 1426892520L, 1426892520L),
hostname = c("c001", "c002", "c021", "c022", "c041", "c051",
"c001", "c002", "c021", "c022", "c041", "c051", "c001", "c002",
"c021", "c022", "c041", "c051"), health = c(0.0549374999999983,
0.250585416666667, 1, 1, 0.577784167075767, 0.546805261621527,
0.1599375, 0.24954375, 1, 1, 0.582307554123614, 0.558298168996525,
0.2813125, 0.270877083333333, 1, 1, 0.579231349457365, 0.542973020177151),
hw = c(1.9, 1.9, 1.4, 1.4, 1.5, 1.5, 1.9, 1.9, 1.4, 1.4, 1.5,
1.5, 1.9, 1.9, 1.4, 1.4, 1.5, 1.5), fw = structure(c(1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
.Label = "2015Q1.2", class = "factor"), role = structure(c(1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L), .Label = c("control", "test"), class = "factor"),
type = structure(c(3L, 3L, 1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 2L,
2L, 3L, 3L, 1L, 1L, 2L, 2L), .Label = c("D", "F", "H"), class = "factor"),
da20_busy_pct = c(79.1, 62.8, NA, NA, NA, NA, 75, 64.8, NA, NA,
NA, NA, 72.2, 74.5, NA, NA, NA, NA),
da20_dev_type = structure(c(2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L,
1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("", "hdd"), class = "factor"),
da20_kb_per_xfer_read = c(727.23, 665.81, NA, NA, NA, NA, 737.04,
691.38, NA, NA, NA, NA, 721.71, 668.96, NA, NA, NA, NA),
da20_kb_per_xfer_write = c(0, 0, NA, NA, NA, NA, 0, 0, NA, NA,
NA, NA, 0, 0, NA, NA, NA, NA),
da20_mb_per_sec_read = c(39.77, 31.21, NA, NA, NA, NA, 36.71,
32.41, NA, NA, NA, NA, 35.94, 37.24, NA, NA, NA, NA),
da20_mb_per_sec_write = c(0, 0, NA, NA, NA, NA, 0, 0, NA, NA,
NA, NA, 0, 0, NA, NA, NA, NA),
da20_ms_per_xactn_read = c(43.5, 31.6, NA, NA, NA, NA, 35.7,
30.2, NA, NA, NA, NA, 32.7, 34.6, NA, NA, NA, NA),
da20_ms_per_xactn_write = c(0, 0, NA, NA, NA, NA, 0, 0, NA, NA,
NA, NA, 0, 0, NA, NA, NA, NA),
da20_Q_length = c(0, 0, NA, NA, NA, NA, 2, 0, NA, NA, NA, NA,
1, 1, NA, NA, NA, NA),
da20_xfers_per_sec_other = c(0, 0, NA, NA, NA, NA, 0, 0, NA,
NA, NA, NA, 0, 0, NA, NA, NA, NA),
da20_xfers_per_sec_read = c(56, 48, NA, NA, NA, NA, 51, 48, NA,
NA, NA, NA, 51, 57, NA, NA, NA, NA),
da20_xfers_per_sec_write = c(0, 0, NA, NA, NA, NA, 0, 0, NA,
NA, NA, NA, 0, 0, NA, NA, NA, NA),
da2_busy_pct = c(84.5, 81.8, 29.5, 26.7, 55.5, 50.9, 80.6, 79.7,
29.2, 27.3, 58.8, 50.2, 74.6, 79.3, 29.4, 26.6, 55.4, 50.1),
da2_dev_type = structure(c(2L, 2L, 3L, 3L, 3L, 3L, 2L, 2L, 3L,
3L, 3L, 3L, 2L, 2L, 3L, 3L, 3L, 3L), .Label = c("", "hdd", "ssd"),
class = "factor"),
da2_kb_per_xfer_read = c(690.67, 686.63, 613.78, 587, 571.64,
553.27, 692.26, 660.05, 612.01, 594.28, 560.16, 566.41, 672.68,
670.25, 604.64, 592.16, 565.02, 564.43),
da2_kb_per_xfer_write = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0),
da2_mb_per_sec_read = c(44.52, 41.57, 134.26, 120.38, 252.88,
229.09, 41.24, 39.96, 132.68, 123.61, 268.04, 227.34, 37.44,
39.93, 133.45, 120.28, 251.06, 225.99),
da2_mb_per_sec_write = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0),
da2_ms_per_xactn_read = c(49.1, 47.8, 2, 1.8, 2.6, 2.4, 40.3,
43.9, 2, 1.8, 2.8, 2.4, 37.1, 40.9, 1.9, 1.8, 2.6, 2.4),
da2_ms_per_xactn_write = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0),
da2_Q_length = c(0, 2, 0, 1, 3, 0, 3, 0, 0, 0, 0, 0, 2, 2, 0,
0, 0, 3),
da2_xfers_per_sec_other = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0),
da2_xfers_per_sec_read = c(66, 62, 224, 210, 453, 424, 61, 62,
222, 213, 490, 411, 57, 61, 226, 208, 455, 410),
da2_xfers_per_sec_write = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0)),
.Names = c("timestamp", "hostname", "health", "hw", "fw", "role",
"type", "da20_busy_pct", "da20_dev_type", "da20_kb_per_xfer_read",
"da20_kb_per_xfer_write", "da20_mb_per_sec_read", "da20_mb_per_sec_write",
"da20_ms_per_xactn_read", "da20_ms_per_xactn_write", "da20_Q_length",
"da20_xfers_per_sec_other", "da20_xfers_per_sec_read",
"da20_xfers_per_sec_write", "da2_busy_pct", "da2_dev_type",
"da2_kb_per_xfer_read", "da2_kb_per_xfer_write", "da2_mb_per_sec_read",
"da2_mb_per_sec_write", "da2_ms_per_xactn_read", "da2_ms_per_xactn_write",
"da2_Q_length", "da2_xfers_per_sec_other", "da2_xfers_per_sec_read",
"da2_xfers_per_sec_write"),
class = "data.frame",
row.names = c(1L, 2L, 7L, 8L, 13L, 16L, 19L, 20L, 25L, 26L, 31L,
34L, 37L, 38L, 43L, 44L, 49L, 52L))

> dim(bw)
[1] 18 31

(In the current case, there are a few more columns per device, as well as about 40 more devices -- and thousands of rows -- represented in the data.)

For reference (as well):

> version
               _
platform       i386-portbld-freebsd10.1
arch           i386
os             freebsd10.1
system         i386, freebsd10.1
status         Patched
major          3
minor          0.2
year           2013
month          11
day            12
svn rev        64207
language       R
version.string R version 3.0.2 Patched (2013-11-12 r64207)
nickname       Frisbee Sailing

[BTW: the first link cited (above) is now a redirect to
<http://adv-r.had.co.nz/Reproducibility.html>.]

Peace,
david
--
David H. Wolfskill                              r at catwhisker.org
Those who murder in the name of God or prophet are blasphemous cowards.

See http://www.catwhisker.org/~david/publickey.gpg for my public key.
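As a closing illustration, a minimal base-R sketch (no add-on packages assumed, given the R 3.0.2 setup shown above) of one way the `bw` object above could be reshaped so that metrics can be grouped by "dev_type": stack each device's block of columns into long form, drop rows for devices a host does not report, then aggregate however is wanted. The helper names (`devices`, `long`) and the choice of metric are illustrative only.

## device names, recovered from the "<device>_dev_type" columns
devices <- sub("_dev_type$", "", grep("_dev_type$", names(bw), value = TRUE))

## one block of rows per device: timestamp, hostname, device, and its metrics
long <- do.call(rbind, lapply(devices, function(dev) {
  dev_cols <- grep(paste0("^", dev, "_"), names(bw), value = TRUE)
  block <- bw[c("timestamp", "hostname", dev_cols)]
  names(block) <- sub(paste0("^", dev, "_"), "", names(block))
  block$device <- dev
  block
}))

## keep only rows where the device actually reported (dev_type non-empty)
long <- long[long$dev_type != "", ]

## e.g., total read bandwidth over all SSD devices, per timestamp
aggregate(mb_per_sec_read ~ timestamp,
          data = long[long$dev_type == "ssd", ], FUN = sum)

Swapping FUN (min, max, mean, ...) or adding hostname to the formula gives the other kinds of aggregation mentioned earlier, and a new hardware configuration with more drives only adds rows, not code.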