Dr Gregory Jefferis
2014-Nov-10 18:35 UTC
[Rd] subscripting a data.frame (without changing row order) changes internal row.names
Dear R-devel,

Can anyone help me to understand this? It seems that subscripting the rows of a data.frame without actually changing their order somehow changes an internal representation of row.names that is revealed by e.g. dput/dump/serialize.

I have read the docs and inspected the (R) code for data.frame, rownames, row.names and dput without enlightenment.

df=data.frame(a=1:10, b=1)
dput(df)
df2=df[1:nrow(df), ]
# R thinks they are equal (so do I!)
all.equal(df, df2)
dput(df2)

Looking at the output of the dputs

> dput(df)
structure(list(a = 1:10, b = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)), .Names = c("a", "b"), row.names = c(NA, -10L), class = "data.frame")
> dput(df2)
structure(list(a = 1:10, b = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)), .Names = c("a", "b"), row.names = c(NA, 10L), class = "data.frame")

we have row.names = c(NA, -10L) in the first case and row.names = c(NA, 10L) in the second, so somehow these objects have a different representation.

Can anyone explain why? This has come up because

> library(digest)
> digest(df)==digest(df2)
[1] FALSE

digest uses serialize under the hood, but serialize, dput and dump all show the same effect (I've pasted an example below using dump, md5sum from base R).

Many thanks for any enlightenment! More generally, is there any way to calculate a digest of a data.frame that could get round this issue, or is that not possible?

Best wishes,

Greg.
A digest using base R:

library(tools)
td=tempfile()
dir.create(td)
tempfiles=file.path(td,c("df", "df2"))
dump("df",tempfiles[1])
dump("df2",tempfiles[2])
md5sum(tempfiles)

# different md5sum

> sessionInfo() # for my laptop but also observed on R 3.1.2
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] tools stats graphics grDevices utils datasets methods base

other attached packages:
[1] nat_1.5.14 nat.utils_0.4.2 digest_0.6.4 Rvcg_0.9 devtools_1.6.1 igraph_0.7.1
[7] testthat_0.9.1 rgl_0.93.1098

loaded via a namespace (and not attached):
[1] codetools_0.2-9 filehash_2.2-2 nabor_0.4.3 parallel_3.1.1 plyr_1.8.1
[6] Rcpp_0.11.3 rstudio_0.98.1062 rstudioapi_0.1 XML_3.98-1.1 yaml_2.1.13

--
Gregory Jefferis, PhD
Division of Neurobiology
MRC Laboratory of Molecular Biology
Francis Crick Avenue
Cambridge Biomedical Campus
Cambridge, CB2 OQH, UK

http://www2.mrc-lmb.cam.ac.uk/group-leaders/h-to-m/g-jefferis
http://jefferislab.org
http://flybrain.stanford.edu
Joshua Ulrich
2014-Nov-10 20:05 UTC
[Rd] subscripting a data.frame (without changing row order) changes internal row.names
On Mon, Nov 10, 2014 at 12:35 PM, Dr Gregory Jefferis <jefferis at mrc-lmb.cam.ac.uk> wrote:
> Dear R-devel,
>
> Can anyone help me to understand this? It seems that subscripting the rows
> of a data.frame without actually changing their order, somehow changes an
> internal representation of row.names that is revealed by e.g.
> dput/dump/serialize
>
> I have read the docs and inspected the (R) code for data.frame, rownames,
> row.names and dput without enlightenment.
>

Look at ?.row_names_info (which is mentioned in the See Also section of ?row.names) and its type argument. Also see the discussion here: http://stackoverflow.com/q/26468746/271616

> df=data.frame(a=1:10, b=1)
> dput(df)
> df2=df[1:nrow(df), ]
> # R thinks they are equal (so do I!)
> all.equal(df, df2)
> dput(df2)
>
> Looking at the output of the dputs
>
>> dput(df)
>
> structure(list(a = 1:10, b = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)), .Names = c("a", "b"), row.names = c(NA, -10L), class = "data.frame")
>
>> dput(df2)
>
> structure(list(a = 1:10, b = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)), .Names = c("a", "b"), row.names = c(NA, 10L), class = "data.frame")
>
> we have row.names = c(NA, -10L) in the first case and row.names = c(NA, 10L)
> in the second, so somehow these objects have a different representation
>
> Can anyone explain why? This has come up because
>

The first are "automatic". The second are a compact form of 1:10, as mentioned in ?row.names. I'm not certain of the root cause/reason, but the second object will not have "automatic" rownames because you have subset it with a non-missing 'i'.

>> library(digest)
>> digest(df)==digest(df2)
>
> [1] FALSE
>
> digest uses serialize under the hood, but serialize, dput and dump all show
> the same effect (I've pasted an example below using dump, md5sum from base
> R).
>
> Many thanks for any enlightenment! More generally is there any way to
> calculate a digest of a data.frame that could get round this issue or is
> that not possible?
--
Joshua Ulrich | about.me/joshuaulrich
FOSS Trading | www.fosstrading.com
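
For anyone following along, here is a minimal sketch of how the two representations discussed above can be inspected with .row_names_info, plus one possible way to put both objects on the same footing before comparing them. The helper name normalise_row_names is hypothetical (it is not part of base R or digest), and the sketch assumes the row names are nothing more than the integer sequence 1:n. It only addresses the row.names attribute; whether serialize() or digest() then give identical results may also depend on other details of the R version in use, which this sketch does not cover.

```r
df  <- data.frame(a = 1:10, b = 1)
df2 <- df[1:nrow(df), ]

# .row_names_info() exposes the internal representation:
#   type = 0L  the "row.names" attribute exactly as stored
#   type = 1L  the number of rows, negated when row names are "automatic"
#   type = 2L  the number of rows implied by the attribute
.row_names_info(df,  0L)
.row_names_info(df,  1L)
.row_names_info(df2, 0L)
.row_names_info(df2, 1L)
.row_names_info(df2, 2L)

# Hypothetical helper (an assumption, not an existing API): if the row
# names are exactly the integers 1:n, store them in the "automatic"
# compact form; any other row names are left untouched.
normalise_row_names <- function(x) {
  stopifnot(is.data.frame(x))
  rn <- attr(x, "row.names")  # attr() returns the expanded vector
  if (is.integer(rn) && identical(rn, seq_len(nrow(x)))) {
    attr(x, "row.names") <- .set_row_names(nrow(x))
  }
  x
}

n1 <- normalise_row_names(df)
n2 <- normalise_row_names(df2)

# Both objects now carry the same internal row.names attribute.
identical(n1, n2)
```

Character row names, or integer row names that are not 1:n, pass through unchanged, so the helper should be safe to apply before hashing data frames whose row names carry real information.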