Hi All,

I am learning R and having a little trouble with the usage and proper definitions of data.frames vs. matrices vs. vectors. I have read many R tutorials and looked over umpteen 'cheat' sheets, and have found that no one has articulated a really good definition of the differences between 'data.frames', 'matrix', and 'arrays', or even 'factors'. I realize that I might have missed someone's R tutorial, and would actually like to receive 'your' most concise or most useful tutorial. Any help would be appreciated.

My particular favorite explanation and helpful hint is from the 'R-Inferno'. Don't get me wrong... I think this pdf is great and some tables are excellent. Overall it is a very good primer, but this one section leaves me puzzled. This quote shows the lack of hard and fast rules for when to use 'data.frames', 'matrix', and 'arrays'. It discusses ways in which to simplify your work.

Here are a few possibilities for simplifying:
• Don’t use a list when an atomic vector will do.
• Don’t use a data frame when a matrix will do.
• Don’t try to use an atomic vector when a list is needed.
• Don’t try to use a matrix when a data frame is needed.

Cheers,
Matt C
Gabor Grothendieck
2010-Oct-27 00:49 UTC
[R] Data.frame Vs Matrix Vs Array: Definitions Please
On Tue, Oct 26, 2010 at 8:37 PM, Matt Curcio <matt.curcio.ri at gmail.com> wrote:
> I am learning R and having a little trouble with the usage and proper
> definitions of data.frames vs. matrix vs vectors. [...]

Look at their internal representations and it will become clearer. v, a vector, has length 6. m, a matrix, is actually the same as the vector v except that it has dimensions too. Since m is just a vector with dimensions, m has length 6 as well. L is a list and has length 2 because it's a vector each of whose components is itself a vector. DF is a data frame and is the same as L except that its 2 components must each have the same length and it must have row and column names. If you don't assign the row and column names they are automatically generated, as we can see.
Note that row.names = c(NA, -3L) is a short form for row names of 1:3 and .Names internally refers to column names.

> v <- 1:6 # vector
> dput(v)
1:6
>
> m <- v; dim(m) <- 2:3 # m is a matrix since we added dimensions
> dput(m)
structure(1:6, .Dim = 2:3)
>
> L <- list(1:3, 4:6)
> dput(L)
list(1:3, 4:6)
>
> DF <- data.frame(1:3, 4:6)
> dput(DF)
structure(list(X1.3 = 1:3, X4.6 = 4:6), .Names = c("X1.3", "X4.6"
), row.names = c(NA, -3L), class = "data.frame")
>

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
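The lengths and attributes Gabor describes can also be checked directly rather than read off the dput() output. The following is only a sketch: it rebuilds the same four objects, and the name m_flat is made up here for illustration.

```r
# Rebuild the four objects from the message above.
v  <- 1:6
m  <- v; dim(m) <- 2:3
L  <- list(1:3, 4:6)
DF <- data.frame(1:3, 4:6)

# A matrix keeps the length of the vector it is built on.
length(v)              # 6
length(m)              # 6
dim(m)                 # 2 3

# A list, and therefore a data frame, counts components, not elements.
length(L)              # 2
length(DF)             # 2
is.list(DF)            # TRUE

# Removing the dim attribute turns the matrix back into the plain vector.
# (m_flat is a hypothetical name used only in this sketch.)
m_flat <- m
dim(m_flat) <- NULL
identical(m_flat, v)   # TRUE

# unclass() exposes the named list underneath the data frame.
names(unclass(DF))     # "X1.3" "X4.6"
```

Seen this way, the difference between the objects is only in their attributes: dim for the matrix, and names, row.names and class for the data frame.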
Hi:

I'm going to take a different tack from Gabor and Ivan and be strictly qualitative on the distinctions among vectors, matrices, arrays, data frames and lists.

As Ivan mentioned, a vector has a single (atomic) mode - i.e., all elements of a vector must be of the same type. A numeric vector consists strictly of numbers, a character vector is composed of character entries, and every element of a logical vector is TRUE or FALSE. Mixtures of types can produce surprises for the unwary - for example, a single character element in an otherwise numeric vector coerces the whole vector to character. Vectors are one-dimensional by definition. Example:

> x1 <- 1:5
> x2 <- c(x1, 'a')
> x1
[1] 1 2 3 4 5
> x2
[1] "1" "2" "3" "4" "5" "a"
> class(x1)
[1] "integer"
> class(x2)
[1] "character"

A matrix is a vector with a two-dimensional dim attribute, as Gabor noted and showed by example. Thus, the elements of a matrix must also be of the same type (numeric, character, logical, etc.).

A data frame is a rectangular object whose columns consist of vectors of the same length. Columns can be (and usually are) of different types, but all of the elements within a column are of the same type. The restriction on data frames is that all columns must have the same length, which is natural for data sets where each row represents an observation and each column represents a different variable. Example:

> d <- data.frame(a = LETTERS[1:3], x = 1:3, y = rnorm(3))
> d
  a x          y
1 A 1 -1.3417463
2 B 2 -0.7032052
3 C 3 -0.7099726
> str(d)
'data.frame':   3 obs. of  3 variables:
 $ a: Factor w/ 3 levels "A","B","C": 1 2 3
 $ x: int  1 2 3
 $ y: num  -1.342 -0.703 -0.71

Lists are the most general type of data object. Each list contains one or more components, and each component may contain subcomponents, which in turn may contain sub-subcomponents, etc. As with the columns of a data frame, each (sub)component can have a different type, but components can also have different lengths, so in this sense lists generalize data frames.
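The type rules described so far (one type per vector or matrix, one type per data frame column, anything goes in a list) can be checked with a short sketch. The object names below are invented for illustration, and stringsAsFactors is set explicitly so the result does not depend on the R version in use.

```r
# Atomic vectors and matrices hold one type; mixing types coerces everything.
m_num <- matrix(1:4, nrow = 2)
m_chr <- matrix(c(1, "a", 3, 4), nrow = 2)
typeof(m_num)      # "integer"
typeof(m_chr)      # "character"

# A data frame keeps a separate type for each column.
d <- data.frame(a = c("A", "B", "C"), x = 1:3, y = c(0.5, 1.5, 2.5),
                stringsAsFactors = FALSE)
col_classes <- sapply(d, class)
col_classes        # a: "character", x: "integer", y: "numeric"

# Forcing the same data into a matrix collapses it to a single type.
typeof(as.matrix(d))   # "character"

# A list accepts components of different types and different lengths.
l <- list(a = c("A", "B", "C"), x = 1:3, y = c(0.5, 1.5, 2.5, 3.5))
comp_lengths <- lengths(l)
comp_lengths       # a: 3, x: 3, y: 4
```

This is one way to read the R-Inferno advice quoted in the original question: a matrix is enough when every value has the same type, and a data frame is needed as soon as columns differ in type.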
The capacity to nest lists within lists is a further generalization of data frames. For example, a modeling function (e.g., lm(), glm()) returns a list, which provides an instructive example for learning how lists work and behave. Lists are difficult to 'get' at first, but it gets easier with experience. Example: extend the data frame example above to four random normal deviates.

> dd <- data.frame(a = LETTERS[1:3], x = 1:3, y = rnorm(4))
Error in data.frame(a = LETTERS[1:3], x = 1:3, y = rnorm(4)) :
  arguments imply differing number of rows: 3, 4
> dd <- list(a = LETTERS[1:3], x = 1:3, y = rnorm(4))
> dd
$a
[1] "A" "B" "C"

$x
[1] 1 2 3

$y
[1] -0.02635882  0.50764973  2.02707087  0.01845697

Data frames are special cases of lists where each column is a list component and each component is (typically) an atomic vector of the same length. Matrices are generalizations of vectors (vectors with a dim attribute), but they can also be thought of as a special case of a data frame in the sense that every column is of the same type. However, matrices are not list objects, so the analogy is limited. The function as.data.frame(matrix) converts a matrix to a data frame.

Arrays are also vectors with a dim attribute. A one-dimensional array behaves much like a vector and a two-dimensional array is a matrix, but arrays can have more than two dimensions, as Gabor pointed out. The length of the dim vector determines the number of dimensions of an array. Since an array is a generalization of a vector, all elements of an array of any dimension must have the same type.

I'm glad that Ivan described factors for you - these objects are likely to give you more headaches than any other. Be particularly careful when reading in data from a file - make sure you know what is being input and what you want for output, and code accordingly.
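One common source of the factor headaches mentioned above is that a factor stores integer codes plus a set of labels, so converting it straight to a number returns the codes rather than the values. The sketch below uses made-up data and names. Note also that from R 4.0.0 onward data.frame() and read.table() default to stringsAsFactors = FALSE, so the automatic conversion discussed in this thread happens only when requested.

```r
# A factor stores integer codes plus a table of labels (its levels).
f <- factor(c("10", "20", "10", "30"))
levels(f)                     # "10" "20" "30"

# Direct conversion returns the internal codes, not the original numbers.
codes <- as.integer(f)
codes                         # 1 2 1 3

# Going through character first recovers the values.
values <- as.numeric(as.character(f))
values                        # 10 20 10 30

# Requesting the conversion explicitly works the same in old and new R.
d_fac <- data.frame(a = c("A", "B", "C"), stringsAsFactors = TRUE)
is.factor(d_fac$a)            # TRUE
```

Running str() on a freshly read data set, as suggested later in this message, is the quickest way to see which columns ended up as factors.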
Example: the first call stores the character variable a as a factor (the default behavior); the second overrides the default.

> d <- data.frame(a = LETTERS[1:3], x = 1:3, y = rnorm(3))
> str(d)
'data.frame':   3 obs. of  3 variables:
 $ a: Factor w/ 3 levels "A","B","C": 1 2 3   # <<====
 $ x: int  1 2 3
 $ y: num  0.926 -1.103 0.554
> d <- data.frame(a = LETTERS[1:3], x = 1:3, y = rnorm(3),
+                 stringsAsFactors = FALSE)
> str(d)
'data.frame':   3 obs. of  3 variables:
 $ a: chr  "A" "B" "C"   # <<====
 $ x: int  1 2 3
 $ y: num  0.495 0.956 0.628

I'd suggest learning to use the function str() routinely to elucidate the contents of a particular (class of) object (and its elements). It is certainly one of the most useful functions in R and a great way to improve your understanding of the various types of objects you'll encounter in the language.

This is a general description of the types of objects you wanted to know about, but special cases arise where an object of one type turns into another silently. You need to learn these exceptions, sometimes the hard way. Gabor's list -> vector example is one; another is that extracting a single row or column from a matrix or array silently returns a plain vector unless you explicitly ask to keep the dimensions with drop = FALSE. Here's a small example to illustrate (notice the differences in how the objects are printed - it provides a clue):

> m <- matrix(1:9, nrow = 3)
> m
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
> class(m)
[1] "matrix"
> m2 <- m[1, ]
> m2
[1] 1 4 7
> class(m2)
[1] "integer"
> is.matrix(m2)
[1] FALSE
> is.vector(m2)
[1] TRUE
# How to create a (row) vector but keep matrix class:
> m3 <- m[1, , drop = FALSE]
> m3
     [,1] [,2] [,3]
[1,]    1    4    7
> class(m3)
[1] "matrix"
# Pay attention to the dimensions
> str(m)
 int [1:3, 1:3] 1 2 3 4 5 6 7 8 9
> str(m2)
 int [1:3] 1 4 7
> str(m3)
 int [1, 1:3] 1 4 7

HTH,
Dennis

On Tue, Oct 26, 2010 at 5:37 PM, Matt Curcio <matt.curcio.ri@gmail.com> wrote:
> Hi All,
> I am learning R and having a little trouble with the usage and proper
> definitions of data.frames vs. matrix vs vectors. [...]
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
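The dimension dropping shown in Dennis's matrix example applies to higher-dimensional arrays as well, which the thread describes but does not demonstrate. The following is only a sketch with invented object names. It uses is.matrix() and dim() rather than class(), because in R 4.0.0 and later class() of a matrix returns both "matrix" and "array" instead of the single "matrix" shown in the 2010 transcript.

```r
# A three-dimensional array: the length of dim gives the number of dimensions.
a <- array(1:24, dim = c(2, 3, 4))
length(dim(a))                 # 3

# Fixing one index drops that dimension, leaving a 3 x 4 matrix.
s1 <- a[1, , ]
dim(s1)                        # 3 4
is.matrix(s1)                  # TRUE

# Fixing two indices drops two dimensions, leaving a plain vector.
s2 <- a[1, 2, ]
is.null(dim(s2))               # TRUE
length(s2)                     # 4

# drop = FALSE keeps all three dimensions, with extent 1 where an index was fixed.
s3 <- a[1, , , drop = FALSE]
dim(s3)                        # 1 3 4
```

Code that must work for any number of selected rows or slices is usually safer with drop = FALSE, since the result then has the same number of dimensions regardless of how many indices were selected.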