David Wolfskill
2015-Mar-30 13:50 UTC
[R] data.frame: data-driven column selections that vary by row??
Sorry if that's confusing: I'm probably confused. :-(

I am collecting and trying to analyze data regarding performance of computer systems.

After extracting the data from its repository, I have created and used a Perl script to generate a (relatively) simple CSV, each record of which contains:
* a POSIXct timestamp
* a hostname
* a collection of metrics for the interval identified by the timestamp, and specific to the host in question, as well as some factors to group the hosts (e.g., whether it's in a "control" vs. a "test" group; a broad categorization of how the host is provisioned; which version of the software it was running at the time...). (Each metric and factor is in a uniquely-named column.)

As extracted from the repository, there were several records for each such hostname/timestamp pair -- e.g., there would be separate records for:
* Input bandwidth utilization for network interface 1
* Output bandwidth utilization for network interface 1
* Input bandwidth utilization for network interface 2
* Output bandwidth utilization for network interface 2

(And the same field would be used for each of these -- the interpretation being driven by the content of other fields in the record.)

Working with the data as described (immediately) above directly in R seemed... daunting, at best: thus the excursion into Perl.

And for some of the data, what I have works well enough.

But now I also want to analyze information from disk drives, and things get messy (as far as I can see).

First, each disk drive has a collection of 17 metrics (such as "busy_pct", "kb_per_transfer_read", and "transfers_per_second_write"), as well as a factor ("dev_type"). Each also has a device name that is unique within the host where it resides (e.g. "da1", "da2", "da3", ...). (The "dev_type" factor identifies whether the drive is a solid-state device or a spinning disk.)

I have thus made the corresponding columns unique by pasting the drive name and the name of the metric (or factor), separating the two with "_" (e.g. "da7_busy_pct"; "ada0_mb_per_second_write"; "ada4_queue_length"). I am not certain that's the best thing I could have done -- and I'm open to changing the approach.

The challenge for me is that different (classes of) machines are provisioned differently; some consequences of that:
* While da1 may be a spinning disk on host A, that has no bearing on whether or not the "da1" on host B is a spinning disk or an SSD.
* Host C may not even have a "da1" device.
* Host D may be of a type that normally has a "da1," but in this case, the drive has failed and has been disabled (so host D won't report anything about "da1").

(I'm not too bothered about the "non-reporting" case, but cite it so we all know about it.)

I expect I will want to be using groupings:
* All disk devices -- this one is easy.
* All SSD devices (excluding spinning disks).
* All spinning disks (excluding SSDs).

I'm having trouble with the latter two (though, certainly, if I solve one, the other is also solved).

Also, for some of the metrics, I will want to sum them; for others, I will want to do other things -- find minima or maxima, or average them. So pre-calculating such aggregates in the Perl script isn't something that appeals to me.

Finally (as far as complications go), I'm trying to write the code in such a way that if we deploy a new configuration of machine that has (say) twice as many drives as the biggest one we presently deploy, the code Just Works -- I shouldn't need to update the code merely to adapt to another hardware configuration.
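As an aside, here is a minimal sketch of how such composite column names could be split back into a device part and a metric part; the example names and the assumed device-name pattern ("da" or "ada" followed by digits) are illustrative only, not taken from the actual data:

cols   <- c("da7_busy_pct", "ada0_mb_per_second_write", "ada4_queue_length")
## device is the leading "da<N>" / "ada<N>" token; metric is everything after it
device <- sub("^(a?da[0-9]+)_.*$", "\\1", cols)  # "da7"  "ada0"  "ada4"
metric <- sub("^a?da[0-9]+_", "", cols)          # "busy_pct"  "mb_per_second_write"  "queue_length"
data.frame(column = cols, device = device, metric = metric)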
I have been able to write a function that takes the data.frame obtained by reading the above-cited CSV, and generates a data.frame with a row for each host, and depicts the "dev_type" for each device for that host; here's an abbreviated (and slightly redacted) copy of its output to illustrate some of the above:

       ada0 ada1 ada2 ada3 ada4 ada5 da30 da31 da32 da33 da34 da35 da36 da3
host_A  ssd  ssd  hdd  hdd  hdd  hdd  hdd  hdd  hdd  hdd  hdd  hdd  hdd hdd
host_B  ssd  ssd  hdd  hdd  hdd  hdd  hdd  hdd  hdd  hdd  hdd  hdd  hdd hdd
host_G  ssd  ssd  ssd  ssd  ssd  ssd                                    ssd
host_H  ssd  ssd  ssd  ssd  ssd  ssd                                    ssd
host_M  ssd  ssd  ssd  ssd  ssd  ssd                                    ssd
host_N  ssd  ssd  ssd  ssd  ssd  ssd                                    ssd

(That function is written with the explicit assumption(!) that for the period covered by a given set of input data, a given host's configuration remains static: we won't have drives changing type mid-stream.)

So the point of this lengthy(!) note is to ask if there's a somewhat-sane way to be able to group the metrics for the "ssd" devices (for example), given the above.

(So far, the least obnoxious way that comes to mind is to actually create 2 columns for each device metric: one for the device if it's an "ssd"; the other for "hdd" -- so instead of columns such as:
* da3_busy_pct
* da3_dev_type
* da3_kb_per_transfer_read
* da36_cam_timeouts
* da36_dev_type
* da36_mb_per_second_read

I would have:
* da3_hdd_busy_pct
* da3_ssd_busy_pct
* da3_hdd_dev_type
* da3_ssd_dev_type
* da3_hdd_kb_per_transfer_read
* da3_ssd_kb_per_transfer_read
* da36_hdd_cam_timeouts
* da36_ssd_cam_timeouts
* da36_hdd_dev_type
* da36_ssd_dev_type
* da36_hdd_mb_per_second_read
* da36_ssd_mb_per_second_read

and no more than half of those would actually be populated, depending on the content of "dev_type" when the Perl script is creating the CSV.)

That seems rather hackish, though.

Thank you in advance for any insight.

Peace,
david
--
David H. Wolfskill                              r at catwhisker.org
Those who murder in the name of God or prophet are blasphemous cowards.

See http://www.catwhisker.org/~david/publickey.gpg for my public key.
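For illustration, a minimal sketch of the kind of per-host "dev_type" table described above. This is not the actual function; it assumes a data frame `perf` read from the CSV, with a "hostname" column and one "<device>_dev_type" column per device, and it simply takes the first value observed per host (which the static-configuration assumption permits). The name `dev_types_by_host` is hypothetical.

dev_types_by_host <- function(perf) {
  ## columns such as "da2_dev_type", "ada0_dev_type", ...
  type_cols <- grep("_dev_type$", names(perf), value = TRUE)
  ## one row per host; first observed value per column (config assumed static)
  out <- aggregate(perf[type_cols],
                   by = list(hostname = perf$hostname),
                   FUN = function(x) as.character(x)[1])
  rownames(out) <- out$hostname
  out$hostname <- NULL
  ## strip the "_dev_type" suffix so the columns are just the device names
  names(out) <- sub("_dev_type$", "", names(out))
  out
}

Under those assumptions, devices a host does not have would show up as empty strings (or NA) in the resulting table.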
John Kane
2015-Mar-31 15:11 UTC
[R] data.frame: data-driven column selections that vary by row??
I think we need some data and code.

Reproducibility:
https://github.com/hadley/devtools/wiki/Reproducibility
http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example

John Kane
Kingston ON Canada
David Wolfskill
2015-Mar-31 17:22 UTC
[R] data.frame: data-driven column selections that vary by row??
On Tue, Mar 31, 2015 at 07:11:28AM -0800, John Kane wrote:
> I think we need some data and code
> Reproducibility
> https://github.com/hadley/devtools/wiki/Reproducibility
> http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
> ...

I apologize for failing to provide that.

Here is a quite small subset of the data (with a few edits to reduce excess verbosity in names of things) that still illustrates the challenge I perceive:

> dput(bw)
structure(list(timestamp = c(1426892400L, 1426892400L, 1426892400L,
1426892400L, 1426892400L, 1426892400L, 1426892460L, 1426892460L,
1426892460L, 1426892460L, 1426892460L, 1426892460L, 1426892520L,
1426892520L, 1426892520L, 1426892520L, 1426892520L, 1426892520L),
hostname = c("c001", "c002", "c021", "c022", "c041", "c051",
"c001", "c002", "c021", "c022", "c041", "c051", "c001", "c002",
"c021", "c022", "c041", "c051"), health = c(0.0549374999999983,
0.250585416666667, 1, 1, 0.577784167075767, 0.546805261621527,
0.1599375, 0.24954375, 1, 1, 0.582307554123614, 0.558298168996525,
0.2813125, 0.270877083333333, 1, 1, 0.579231349457365, 0.542973020177151),
hw = c(1.9, 1.9, 1.4, 1.4, 1.5, 1.5, 1.9, 1.9, 1.4, 1.4, 1.5,
1.5, 1.9, 1.9, 1.4, 1.4, 1.5, 1.5), fw = structure(c(1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
.Label = "2015Q1.2", class = "factor"), role = structure(c(1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L), .Label = c("control", "test"), class = "factor"),
type = structure(c(3L, 3L, 1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 2L,
2L, 3L, 3L, 1L, 1L, 2L, 2L), .Label = c("D", "F", "H"), class = "factor"),
da20_busy_pct = c(79.1, 62.8, NA, NA, NA, NA, 75, 64.8, NA, NA,
NA, NA, 72.2, 74.5, NA, NA, NA, NA),
da20_dev_type = structure(c(2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L,
1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("", "hdd"), class = "factor"),
da20_kb_per_xfer_read = c(727.23, 665.81, NA, NA, NA, NA, 737.04,
691.38, NA, NA, NA, NA, 721.71, 668.96, NA, NA, NA, NA),
da20_kb_per_xfer_write = c(0, 0, NA, NA, NA, NA, 0, 0, NA, NA,
NA, NA, 0, 0, NA, NA, NA, NA),
da20_mb_per_sec_read = c(39.77, 31.21, NA, NA, NA, NA, 36.71,
32.41, NA, NA, NA, NA, 35.94, 37.24, NA, NA, NA, NA),
da20_mb_per_sec_write = c(0, 0, NA, NA, NA, NA, 0, 0, NA, NA,
NA, NA, 0, 0, NA, NA, NA, NA),
da20_ms_per_xactn_read = c(43.5, 31.6, NA, NA, NA, NA, 35.7,
30.2, NA, NA, NA, NA, 32.7, 34.6, NA, NA, NA, NA),
da20_ms_per_xactn_write = c(0, 0, NA, NA, NA, NA, 0, 0, NA, NA,
NA, NA, 0, 0, NA, NA, NA, NA),
da20_Q_length = c(0, 0, NA, NA, NA, NA, 2, 0, NA, NA, NA, NA,
1, 1, NA, NA, NA, NA),
da20_xfers_per_sec_other = c(0, 0, NA, NA, NA, NA, 0, 0, NA,
NA, NA, NA, 0, 0, NA, NA, NA, NA),
da20_xfers_per_sec_read = c(56, 48, NA, NA, NA, NA, 51, 48, NA,
NA, NA, NA, 51, 57, NA, NA, NA, NA),
da20_xfers_per_sec_write = c(0, 0, NA, NA, NA, NA, 0, 0, NA,
NA, NA, NA, 0, 0, NA, NA, NA, NA),
da2_busy_pct = c(84.5, 81.8, 29.5, 26.7, 55.5, 50.9, 80.6, 79.7,
29.2, 27.3, 58.8, 50.2, 74.6, 79.3, 29.4, 26.6, 55.4, 50.1),
da2_dev_type = structure(c(2L, 2L, 3L, 3L, 3L, 3L, 2L, 2L, 3L,
3L, 3L, 3L, 2L, 2L, 3L, 3L, 3L, 3L), .Label = c("", "hdd", "ssd"),
class = "factor"),
da2_kb_per_xfer_read = c(690.67, 686.63, 613.78, 587, 571.64,
553.27, 692.26, 660.05, 612.01, 594.28, 560.16, 566.41, 672.68,
670.25, 604.64, 592.16, 565.02, 564.43),
da2_kb_per_xfer_write = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0),
da2_mb_per_sec_read = c(44.52, 41.57, 134.26, 120.38, 252.88,
229.09, 41.24, 39.96, 132.68, 123.61, 268.04, 227.34, 37.44,
39.93, 133.45, 120.28, 251.06, 225.99),
da2_mb_per_sec_write = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0),
da2_ms_per_xactn_read = c(49.1, 47.8, 2, 1.8, 2.6, 2.4, 40.3,
43.9, 2, 1.8, 2.8, 2.4, 37.1, 40.9, 1.9, 1.8, 2.6, 2.4),
da2_ms_per_xactn_write = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0),
da2_Q_length = c(0, 2, 0, 1, 3, 0, 3, 0, 0, 0, 0, 0, 2, 2, 0,
0, 0, 3),
da2_xfers_per_sec_other = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0),
da2_xfers_per_sec_read = c(66, 62, 224, 210, 453, 424, 61, 62,
222, 213, 490, 411, 57, 61, 226, 208, 455, 410),
da2_xfers_per_sec_write = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0)),
.Names = c("timestamp", "hostname", "health", "hw", "fw", "role",
"type", "da20_busy_pct", "da20_dev_type", "da20_kb_per_xfer_read",
"da20_kb_per_xfer_write", "da20_mb_per_sec_read", "da20_mb_per_sec_write",
"da20_ms_per_xactn_read", "da20_ms_per_xactn_write", "da20_Q_length",
"da20_xfers_per_sec_other", "da20_xfers_per_sec_read",
"da20_xfers_per_sec_write", "da2_busy_pct", "da2_dev_type",
"da2_kb_per_xfer_read", "da2_kb_per_xfer_write", "da2_mb_per_sec_read",
"da2_mb_per_sec_write", "da2_ms_per_xactn_read", "da2_ms_per_xactn_write",
"da2_Q_length", "da2_xfers_per_sec_other", "da2_xfers_per_sec_read",
"da2_xfers_per_sec_write"),
class = "data.frame",
row.names = c(1L, 2L, 7L, 8L, 13L, 16L, 19L, 20L, 25L, 26L, 31L,
34L, 37L, 38L, 43L, 44L, 49L, 52L))

> dim(bw)
[1] 18 31

(In the current case, there are a few more columns per device, as well as about 40 more devices -- and thousands of rows -- represented in the data.)

For reference (as well):

> version
               _
platform       i386-portbld-freebsd10.1
arch           i386
os             freebsd10.1
system         i386, freebsd10.1
status         Patched
major          3
minor          0.2
year           2013
month          11
day            12
svn rev        64207
language       R
version.string R version 3.0.2 Patched (2013-11-12 r64207)
nickname       Frisbee Sailing

[BTW: the first link cited (above) is now a redirect to
<http://adv-r.had.co.nz/Reproducibility.html>.]

Peace,
david
--
David H. Wolfskill                              r at catwhisker.org
Those who murder in the name of God or prophet are blasphemous cowards.

See http://www.catwhisker.org/~david/publickey.gpg for my public key.
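As a closing illustration, a minimal base-R sketch (no add-on packages assumed, given the R 3.0.2 setup shown above) of one way the `bw` object above could be reshaped so that metrics can be grouped by "dev_type": stack each device's block of columns into long form, drop rows for devices a host does not report, then aggregate however is wanted. The helper names (`devices`, `long`) and the choice of metric are illustrative only.

## device names, recovered from the "<device>_dev_type" columns
devices <- sub("_dev_type$", "", grep("_dev_type$", names(bw), value = TRUE))

## one block of rows per device: timestamp, hostname, device, and its metrics
long <- do.call(rbind, lapply(devices, function(dev) {
  dev_cols <- grep(paste0("^", dev, "_"), names(bw), value = TRUE)
  block <- bw[c("timestamp", "hostname", dev_cols)]
  names(block) <- sub(paste0("^", dev, "_"), "", names(block))
  block$device <- dev
  block
}))

## keep only rows where the device actually reported (dev_type non-empty)
long <- long[long$dev_type != "", ]

## e.g., total read bandwidth over all SSD devices, per timestamp
aggregate(mb_per_sec_read ~ timestamp,
          data = long[long$dev_type == "ssd", ], FUN = sum)

Swapping FUN (min, max, mean, ...) or adding hostname to the formula gives the other kinds of aggregation mentioned earlier, and a new hardware configuration with more drives only adds rows, not code.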