Marcelo Perlin
2016-Jun-10 13:27 UTC
[R] About identification of CRAN CHECK machines in logs
I don't know, Hadley. But you can see evidence of "something" systematically installing the packages in the log data. For my two CRAN packages I noticed a high correlation in the number of downloads.

Try the following script, which picks 5 random packages from CRAN and calculates the correlation matrix between their differenced numbers of downloads. To avoid spurious correlations, I removed the weekends, since we can expect some seasonality, and also the zero entries. It's crude, I know, but it does show some positive associations between the numbers of installations of the packages.

If not CRAN, who/what is downloading these packages, and how can I set those downloads apart from the actual user installations?

Many thanks!

____

# get the list of packages available on CRAN
df <- as.data.frame(available.packages())

# choose 5 at random
idx <- sample(nrow(df), 5)
df <- df[idx, ]

my.pkgs <- as.character(df$Package)
#my.pkgs <- c('RndTexExams','GetTDData')

# daily download counts, one row per package and date
dl.df <- cranlogs::cran_downloads(my.pkgs, from = '2015-01-01', to = Sys.Date())

# remove zero entries
dl.df$count[dl.df$count == 0] <- NA

# remove weekends (wday: 0 = Sunday, 6 = Saturday)
dl.df$sat.sun <- as.POSIXlt(dl.df$date)$wday
dl.df <- dplyr::filter(dl.df, sat.sun != 0, sat.sun != 6)

# to wide format, one column per package (for cor)
dl.df <- tidyr::spread(dl.df, key = package, value = count)

# remove rows with NA
dl.df <- dl.df[complete.cases(dl.df), ]

# columns 1 and 2 are date and sat.sun; the rest are the packages
diff.mat <- diff(as.matrix(dl.df[, 3:ncol(dl.df)]))
cor(diff.mat)

___

On Thu, Jun 9, 2016 at 6:18 PM, Hadley Wickham <h.wickham at gmail.com> wrote:

> On Thu, Jun 9, 2016 at 9:24 AM, Marcelo Perlin <marceloperlin at gmail.com> wrote:
> > Hi,
> >
> > I recently released two packages (RndTexExams and GetTDData) in CRAN and
> > I'm trying to track the number of downloads and location of users.
> >
> > I wrote a simple script to download and analyze the log files in
> > http://cran-logs.rstudio.com.
> >
> > I realized, however, that during the release of a new version of the
> > packages there is a spike in the number of downloads. I believe that the
> > CRAN checks are included in the number of installations of the package in
> > the log files.
>
> I don't think that's true. Why would CRAN be installing the package
> from a mirror?
>
> Hadley
>
> --
> http://hadley.nz

--
Marcelo Perlin
Professor Adjunto | Escola de Administração
Universidade Federal do Rio Grande do Sul
Rua Washington Luiz, 855 | 90010-460 | Porto Alegre RS | Brasil
Tel.: (51) 3308-3303 | www.ea.ufrgs.br
http://lattes.cnpq.br/3262699324398819
https://sites.google.com/site/marceloperlin/
Hadley Wickham
2016-Jun-10 13:32 UTC
[R] About identification of CRAN CHECK machines in logs
On Fri, Jun 10, 2016 at 8:27 AM, Marcelo Perlin <marceloperlin at gmail.com> wrote:

> I don't know, Hadley. But you can see evidence of "something" systematically
> installing the packages in the log data. For my two CRAN packages I noticed
> a high correlation in the number of downloads.
>
> Try the following script, which picks 5 random packages from CRAN and
> calculates the correlation matrix between their differenced numbers of
> downloads. To avoid spurious correlations, I removed the weekends, since we
> can expect some seasonality, and also the zero entries. It's crude, I know,
> but it does show some positive associations between the numbers of
> installations of the packages.

Which is not at all surprising:

* there are very strong seasonal patterns
* there are big jumps after releases of new versions of R
* some people like to have all packages installed locally

This is an intrinsic problem with download data. There's no way to tell if a downloader is really using your package or not.

Hadley

--
http://hadley.nz
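
A small simulation may help illustrate the point about shared drivers. The sketch below uses made-up data only, not real CRAN logs. It assumes a hypothetical common factor (standing in for seasonality or overall CRAN traffic) that every package loads on, plus a noisy proxy for that factor called `total` (in practice this might be something like total daily CRAN downloads, which is an assumption here). All names and parameter values are illustrative. Under those assumptions, unrelated packages show clearly positive correlations in their differenced downloads, and most of that correlation goes away once the shared movement is regressed out.

```r
# Simulated illustration only: none of these numbers come from real CRAN logs.
set.seed(42)
n_days <- 300
n_pkgs <- 5

# Hypothetical common factor shared by all packages (e.g. overall traffic).
common <- cumsum(rnorm(n_days))

# Hypothetical noisy proxy for the common factor (e.g. total downloads).
total <- 10 * common + rnorm(n_days, sd = 1)

# Each package = its own loading on the common factor + independent noise.
loadings <- runif(n_pkgs, 0.5, 1.5)
downloads <- sapply(loadings, function(b) b * common + rnorm(n_days, sd = 0.5))
colnames(downloads) <- paste0("pkg", seq_len(n_pkgs))

# Same step as in the script from the thread: correlate first differences.
diff_mat <- diff(downloads)
raw_cor <- cor(diff_mat)

# Regress each differenced series on the differenced proxy, keep residuals.
diff_total <- diff(total)
resid_mat <- apply(diff_mat, 2, function(d) residuals(lm(d ~ diff_total)))
adj_cor <- cor(resid_mat)

# Average pairwise correlation before and after removing the shared movement.
mean_offdiag <- function(m) mean(m[upper.tri(m)])
raw_mean <- mean_offdiag(raw_cor)
adj_mean <- mean_offdiag(adj_cor)
print(c(raw = raw_mean, adjusted = adj_mean))
```

In this toy setup the simulated packages have nothing in common except the shared factor, so a positive `raw` value on its own says nothing about who is downloading them. It does not identify the downloaders in real logs either; it only shows that correlated differences are what one would expect whenever a common driver exists.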