softImpute is a new package for matrix completion - i.e. for imputing missing values in matrices. It was written by Rahul Mazumder and me.

softImpute uses squared-error loss with nuclear-norm regularization - one can think of it as the "lasso" for matrix approximation - to find a low-rank approximation to the observed entries in the matrix. This low-rank approximation is then used to impute the missing entries.

softImpute works in a kind of "EM" fashion. Given a current guess, it fills in the missing entries. It then computes a soft-thresholded SVD of this complete matrix, which yields the next guess. These steps are iterated until convergence to the solution of the convex-optimization problem.

The algorithm can work with large matrices, such as the "netflix" matrix (400K x 20K), by making heavy use of sparse-matrix methods in the Matrix package. It creates new S4 classes such as "Incomplete" for storing the large data matrix, and "SparseplusLowRank" for representing the completed matrix. SVD computations are done using a specially built block-alternating algorithm, svd.als, that exploits these structures and uses warm starts.

Some of the methods used are described in

Rahul Mazumder, Trevor Hastie and Rob Tibshirani: Spectral Regularization Algorithms for Learning Large Incomplete Matrices. JMLR 11 (2010), 2287-2322.

Other, newer and more efficient methods that interweave the alternating block-algorithm steps with imputation steps will be described in a forthcoming article.

Some of the features of softImpute are:

* works with large matrices using sparse-matrix methods, or with smaller matrices using standard svd methods.
* one can control the maximum rank of the solution, to avoid overly expensive operations.
* warm starts can be used to move from one solution to a new solution with a different value of the nuclear-norm regularization parameter lambda (and/or a different rank).
* with lambda=0 and a specified rank, one automatically gets an implementation of "hardImpute" - iterative svd imputation.
* softImpute has an option "type", which can be "svd" or "als" (alternating least squares), for specifying which of the two approaches above should be used.
* included in the package is svd.als, an efficient rank-restricted svd algorithm that can exploit sparsity and other special structure, and accepts warm starts.
* a function biScale is provided, for centering and scaling both rows and columns of a matrix to have means zero and variance 1. The centering and scaling constants are stored on the object. For sparse matrices with centering, the centered object is stored in "SparseplusLowRank" form to preserve its special structure.
* prediction functions impute and complete are provided.

Trevor Hastie

----------------------------------------------------------------------------------------
Trevor Hastie                          hastie at stanford.edu
Professor, Department of Statistics, Stanford University
Phone: (650) 725-2231                  Fax: (650) 725-8977
URL: http://www.stanford.edu/~hastie
address: room 104, Department of Statistics, Sequoia Hall
390 Serra Mall, Stanford University, CA 94305-4065

_______________________________________________
R-packages mailing list
R-packages at r-project.org
https://stat.ethz.ch/mailman/listinfo/r-packages
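P.S. For the curious, the "EM"-style fill-and-threshold iteration described above can be sketched in a few lines of plain R. This toy version (the name soft_impute_sketch and its arguments are illustrative, not the package's API) works on a small dense matrix with full SVDs; the package's actual implementation instead uses sparse structure, rank restriction, and warm starts.

```r
# Toy soft-impute iteration: fill missing entries with the current guess,
# soft-threshold the SVD of the completed matrix, repeat until convergence.
soft_impute_sketch <- function(X, lambda, max_iter = 200, tol = 1e-7) {
  miss <- is.na(X)
  Z <- ifelse(miss, 0, X)           # complete matrix: observed values, 0 as initial guess
  for (it in seq_len(max_iter)) {
    s <- svd(Z)
    d <- pmax(s$d - lambda, 0)      # soft-threshold the singular values
    M <- s$u %*% (d * t(s$v))       # low-rank approximation u diag(d) v'
    Znew <- ifelse(miss, M, X)      # keep observed entries; refill the missing ones
    if (sum((Znew - Z)^2) <= tol * sum(Z^2)) { Z <- Znew; break }
    Z <- Znew
  }
  Z                                 # observed entries plus low-rank imputations
}
```

For example, soft_impute_sketch(Xm, lambda = 0.5) on a nearly low-rank matrix Xm with a few NAs returns Xm with those NAs replaced by entries of the regularized low-rank fit.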
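P.P.S. What biScale computes can also be illustrated with a toy alternating scheme in plain R (again, bi_scale_sketch is an illustrative name, not the package function): rows and columns are standardized in turn until both have means zero and variance 1. The real biScale additionally handles missing entries and keeps centered sparse matrices in "SparseplusLowRank" form, which this dense version does not.

```r
# Toy row-and-column standardization by alternation: each sweep standardizes
# the rows, then the columns, and stops once the rows have also settled.
bi_scale_sketch <- function(X, max_iter = 500, tol = 1e-6) {
  for (it in seq_len(max_iter)) {
    X <- t(scale(t(X)))             # center/scale each row (mean 0, variance 1)
    X <- scale(X)                   # center/scale each column
    # columns are now exact by construction; check whether rows still are too
    if (max(abs(rowMeans(X))) < tol &&
        max(abs(apply(X, 1, sd) - 1)) < tol) break
  }
  X
}
```

The design point here is the same one the package exploits: standardizing rows disturbs the columns and vice versa, so a single pass is not enough, but alternating the two steps drives both sets of constraints toward satisfaction simultaneously.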