Matthew Trunnell
2007-Jun-19 01:07 UTC
[R] Histograms with strings, grouped by repeat count (w/ data)
Hello R gurus, I just spent my first weekend wrestling with R, but so far have come up empty handed. I have a dataset that represents file downloads; it has 4 dimensions: date, filename, email, and country. (sample data below) My first goal is to get an idea of the frequency of repeated downloads. Let me explain that. Some people tend to download multiple times, e.g. if the download fails they keep trying over and over. I'm trying to build a histogram that shows the repeat count along the x-axis, that is, how many people downloaded once, twice, three times, etc. I plan to compare the median of that before and after we switched ISPs. To accomplish this, I'm assuming that I'll first need to combine the email and filename columns so as to represent a single download attempt by an individual. Does that sound right? Later, it would be nice to limit the histogram to a single filename, country, or company. I can probably figure that out myself after I understand how to write this funky histogram expression. With the help of Verzani's introductory text, I've learned how to read in the CSV data and do some simple tables, like this: hist(table(d$filename)) hist(table(d$filename[substring(d$filename, 1, 5)=="file1"])) hist(sort(table(d$filename[substring(d$filename, 1, 5)=="file1"]))) Obviously, these commands count the frequency of the files. What I'd like to see are the repeats grouped along the x-axis; I'd like to find, for all files, the distribution of retries. I hope that makes sense. :) Can someone point me in the right direction? I'm very new to R and to statistics, but I write code for a living. At this point I'd almost be better off writing a program do this kind of simple counting... but I have a feeling R would be so useful if I could just get past the initial learning curve. Thank you in advance, Matt Here's some real data, with the private info replaced :) d<-read.table(file="C:\\users\\trunnellm\\downloads\\statistics\\downloads.csv", sep=",", quote="\"", header=TRUE) filename,last_modified,email_addr,country_residence file1,3/4/2006 13:54,email1,Korea (South) file2,3/4/2006 14:33,email2,United States file2,3/4/2006 16:03,email2,United States file2,3/4/2006 16:17,email3,United States file2,3/4/2006 16:28,email3,United States file3,3/4/2006 19:13,email4,United States file2,3/4/2006 21:22,email5,India file4,3/4/2006 21:46,email6,United States file1,3/4/2006 22:04,email7,Japan file2,3/4/2006 22:09,email8,Croatia file1,3/4/2006 22:22,email7,Japan file1,3/4/2006 22:29,email9,India file1,3/4/2006 23:06,email6,United States file1,3/4/2006 23:33,email6,United States file5,3/4/2006 23:44,email10,China file1,3/5/2006 0:13,email9,India file2,3/5/2006 0:52,email8,Croatia file2,3/5/2006 0:54,email8,Croatia file2,3/5/2006 1:10,email5,India file6,3/5/2006 2:17,email9,India file2,3/5/2006 2:24,email11,Italy file7,3/5/2006 2:36,email12,Italy file8,3/5/2006 2:52,email12,Italy file2,3/5/2006 3:09,email13,United Kingdom file2,3/5/2006 4:02,email14,India file2,3/5/2006 4:07,email14,India file2,3/5/2006 4:14,email14,India file2,3/5/2006 4:37,email5,India file2,3/5/2006 4:44,email15,Belgium file1,3/5/2006 5:02,email9,India file1,3/5/2006 5:24,email16,Taiwan file2,3/5/2006 6:06,email17,Saudi Arabia file2,3/5/2006 7:32,email17,Saudi Arabia file2,3/5/2006 8:12,email18,Brazil file2,3/5/2006 8:26,email18,Brazil file2,3/5/2006 9:49,email19,United Kingdom file1,3/5/2006 10:49,email11,Italy file1,3/5/2006 11:16,email13,United Kingdom file1,3/5/2006 11:16,email13,United Kingdom file1,3/5/2006 11:45,email13,United Kingdom file1,3/5/2006 14:34,email20,Australia file9,3/5/2006 14:56,email20,Australia file9,3/5/2006 14:56,email20,Australia file5,3/5/2006 16:43,email21,United States file1,3/5/2006 17:17,email7,Japan file2,3/5/2006 17:26,email22,Japan file2,3/5/2006 17:27,email22,Japan file2,3/5/2006 17:33,email23,China file1,3/5/2006 17:45,email22,Japan file2,3/5/2006 17:45,email22,Japan file2,3/5/2006 17:59,email23,China file1,3/5/2006 18:27,email24,Japan file1,3/5/2006 18:47,email25,Taiwan file2,3/5/2006 18:48,email26,New Zealand file2,3/5/2006 19:15,email27,Canada file2,3/5/2006 19:23,email28,Canada file2,3/5/2006 19:24,email28,Canada file10,3/5/2006 19:49,email29,Japan file10,3/5/2006 19:52,email29,Japan file10,3/5/2006 19:57,email29,Japan file2,3/5/2006 20:01,email29,Japan file2,3/5/2006 20:02,email29,Japan file2,3/5/2006 20:06,email29,Japan
jim holtman
2007-Jun-19 01:43 UTC
[R] Histograms with strings, grouped by repeat count (w/ data)
You should be using barplot and not hist. I think this produces what you want: x <- "filename,last_modified,email_addr,country_residence file1,3/4/2006 13:54,email1,Korea (South) file2,3/4/2006 14:33,email2,United States file2,3/4/2006 16:03,email2,United States file2,3/4/2006 16:17,email3,United States file2,3/4/2006 16:28,email3,United States file3,3/4/2006 19:13,email4,United States file2,3/4/2006 21:22,email5,India file4,3/4/2006 21:46,email6,United States file1,3/4/2006 22:04,email7,Japan file2,3/4/2006 22:09,email8,Croatia file1,3/4/2006 22:22,email7,Japan file1,3/4/2006 22:29,email9,India file1,3/4/2006 23:06,email6,United States file1,3/4/2006 23:33,email6,United States file5,3/4/2006 23:44,email10,China file1,3/5/2006 0:13,email9,India file2,3/5/2006 0:52,email8,Croatia file2,3/5/2006 0:54,email8,Croatia file2,3/5/2006 1:10,email5,India file6,3/5/2006 2:17,email9,India file2,3/5/2006 2:24,email11,Italy file7,3/5/2006 2:36,email12,Italy file8,3/5/2006 2:52,email12,Italy file2,3/5/2006 3:09,email13,United Kingdom file2,3/5/2006 4:02,email14,India file2,3/5/2006 4:07,email14,India file2,3/5/2006 4:14,email14,India file2,3/5/2006 4:37,email5,India file2,3/5/2006 4:44,email15,Belgium file1,3/5/2006 5:02,email9,India file1,3/5/2006 5:24,email16,Taiwan file2,3/5/2006 6:06,email17,Saudi Arabia file2,3/5/2006 7:32,email17,Saudi Arabia file2,3/5/2006 8:12,email18,Brazil file2,3/5/2006 8:26,email18,Brazil file2,3/5/2006 9:49,email19,United Kingdom file1,3/5/2006 10:49,email11,Italy file1,3/5/2006 11:16,email13,United Kingdom file1,3/5/2006 11:16,email13,United Kingdom file1,3/5/2006 11:45,email13,United Kingdom file1,3/5/2006 14:34,email20,Australia file9,3/5/2006 14:56,email20,Australia file9,3/5/2006 14:56,email20,Australia file5,3/5/2006 16:43,email21,United States file1,3/5/2006 17:17,email7,Japan file2,3/5/2006 17:26,email22,Japan file2,3/5/2006 17:27,email22,Japan file2,3/5/2006 17:33,email23,China file1,3/5/2006 17:45,email22,Japan file2,3/5/2006 17:45,email22,Japan file2,3/5/2006 17:59,email23,China file1,3/5/2006 18:27,email24,Japan file1,3/5/2006 18:47,email25,Taiwan file2,3/5/2006 18:48,email26,New Zealand file2,3/5/2006 19:15,email27,Canada file2,3/5/2006 19:23,email28,Canada file2,3/5/2006 19:24,email28,Canada file10,3/5/2006 19:49,email29,Japan file10,3/5/2006 19:52,email29,Japan file10,3/5/2006 19:57,email29,Japan file2,3/5/2006 20:01,email29,Japan file2,3/5/2006 20:02,email29,Japan file2,3/5/2006 20:06,email29,Japan" d <- read.csv(textConnection(x)) barplot(table(d$filename), main="All Files", las=2) # plot counts for all the files # generate plots for each file name showing which emails used them counts <- table(d$filename, d$email_addr) for (i in seq(nrow(counts))){ .index <- which(counts[i,] > 0) barplot(counts[i, .index], las=2, names.arg=colnames(counts)[.index], main=rownames(counts)[i]) } On 6/18/07, Matthew Trunnell <trunnell@cognix.net> wrote:> > Hello R gurus, > > I just spent my first weekend wrestling with R, but so far have come > up empty handed. > > I have a dataset that represents file downloads; it has 4 dimensions: > date, filename, email, and country. (sample data below) > > My first goal is to get an idea of the frequency of repeated > downloads. Let me explain that. Some people tend to download > multiple times, e.g. if the download fails they keep trying over and > over. I'm trying to build a histogram that shows the repeat count > along the x-axis, that is, how many people downloaded once, twice, > three times, etc. I plan to compare the median of that before and > after we switched ISPs. > > To accomplish this, I'm assuming that I'll first need to combine the > email and filename columns so as to represent a single download > attempt by an individual. Does that sound right? Later, it would be > nice to limit the histogram to a single filename, country, or company. > I can probably figure that out myself after I understand how to write > this funky histogram expression. > > With the help of Verzani's introductory text, I've learned how to read > in the CSV data and do some simple tables, like this: > > hist(table(d$filename)) > hist(table(d$filename[substring(d$filename, 1, 5)=="file1"])) > hist(sort(table(d$filename[substring(d$filename, 1, 5)=="file1"]))) > > Obviously, these commands count the frequency of the files. What I'd > like to see are the repeats grouped along the x-axis; I'd like to > find, for all files, the distribution of retries. I hope that makes > sense. :) > > Can someone point me in the right direction? I'm very new to R and to > statistics, but I write code for a living. At this point I'd almost > be better off writing a program do this kind of simple counting... but > I have a feeling R would be so useful if I could just get past the > initial learning curve. > > Thank you in advance, > Matt > > Here's some real data, with the private info replaced :) > > d<-read.table > (file="C:\\users\\trunnellm\\downloads\\statistics\\downloads.csv", > sep=",", quote="\"", header=TRUE) > > filename,last_modified,email_addr,country_residence > file1,3/4/2006 13:54,email1,Korea (South) > file2,3/4/2006 14:33,email2,United States > file2,3/4/2006 16:03,email2,United States > file2,3/4/2006 16:17,email3,United States > file2,3/4/2006 16:28,email3,United States > file3,3/4/2006 19:13,email4,United States > file2,3/4/2006 21:22,email5,India > file4,3/4/2006 21:46,email6,United States > file1,3/4/2006 22:04,email7,Japan > file2,3/4/2006 22:09,email8,Croatia > file1,3/4/2006 22:22,email7,Japan > file1,3/4/2006 22:29,email9,India > file1,3/4/2006 23:06,email6,United States > file1,3/4/2006 23:33,email6,United States > file5,3/4/2006 23:44,email10,China > file1,3/5/2006 0:13,email9,India > file2,3/5/2006 0:52,email8,Croatia > file2,3/5/2006 0:54,email8,Croatia > file2,3/5/2006 1:10,email5,India > file6,3/5/2006 2:17,email9,India > file2,3/5/2006 2:24,email11,Italy > file7,3/5/2006 2:36,email12,Italy > file8,3/5/2006 2:52,email12,Italy > file2,3/5/2006 3:09,email13,United Kingdom > file2,3/5/2006 4:02,email14,India > file2,3/5/2006 4:07,email14,India > file2,3/5/2006 4:14,email14,India > file2,3/5/2006 4:37,email5,India > file2,3/5/2006 4:44,email15,Belgium > file1,3/5/2006 5:02,email9,India > file1,3/5/2006 5:24,email16,Taiwan > file2,3/5/2006 6:06,email17,Saudi Arabia > file2,3/5/2006 7:32,email17,Saudi Arabia > file2,3/5/2006 8:12,email18,Brazil > file2,3/5/2006 8:26,email18,Brazil > file2,3/5/2006 9:49,email19,United Kingdom > file1,3/5/2006 10:49,email11,Italy > file1,3/5/2006 11:16,email13,United Kingdom > file1,3/5/2006 11:16,email13,United Kingdom > file1,3/5/2006 11:45,email13,United Kingdom > file1,3/5/2006 14:34,email20,Australia > file9,3/5/2006 14:56,email20,Australia > file9,3/5/2006 14:56,email20,Australia > file5,3/5/2006 16:43,email21,United States > file1,3/5/2006 17:17,email7,Japan > file2,3/5/2006 17:26,email22,Japan > file2,3/5/2006 17:27,email22,Japan > file2,3/5/2006 17:33,email23,China > file1,3/5/2006 17:45,email22,Japan > file2,3/5/2006 17:45,email22,Japan > file2,3/5/2006 17:59,email23,China > file1,3/5/2006 18:27,email24,Japan > file1,3/5/2006 18:47,email25,Taiwan > file2,3/5/2006 18:48,email26,New Zealand > file2,3/5/2006 19:15,email27,Canada > file2,3/5/2006 19:23,email28,Canada > file2,3/5/2006 19:24,email28,Canada > file10,3/5/2006 19:49,email29,Japan > file10,3/5/2006 19:52,email29,Japan > file10,3/5/2006 19:57,email29,Japan > file2,3/5/2006 20:01,email29,Japan > file2,3/5/2006 20:02,email29,Japan > file2,3/5/2006 20:06,email29,Japan > > ______________________________________________ > R-help@stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >-- Jim Holtman Cincinnati, OH +1 513 646 9390 What is the problem you are trying to solve? [[alternative HTML version deleted]]