Chris Howden
2010-Sep-01 13:11 UTC
[R] how to replace NA with a specific score that is dependant on another indicator variable
Hi everyone,

I'm looking for a clever bit of code to replace NAs with a specific score depending on an indicator variable. I can see how to do it using lots of if statements, but I'm sure there must be a neater, better way of doing it. Any ideas at all will be much appreciated; I'm dreading coding up all those if statements!

My problem is as follows. I have a data set with lots of missing data:

EG Raw Data Set

Category  variable1  variable2  variable3
1         5          NA         NA
1         NA         3          4
2         NA         7          NA
etc

Now I want to replace the NAs with the average for each category, so if these averages were:

EG Averages

Category  variable1  variable2  variable3
1         4.5        3.2        2.5
2         3.5        7.4        5.9

then I'd like my data set to look like the following once I've replaced the NAs with the appropriate category average:

EG Imputed Data Set

Category  variable1  variable2  variable3
1         5          3.2        2.5
1         4.5        3          4
2         3.5        7          5.9
etc

Any ideas would be very much appreciated!

Thank you,

Chris Howden
Founding Partner
Tricky Solutions
Tricky Solutions 4 Tricky Problems
Evidence Based Strategic Development, IP development, Data Analysis, Modelling, and Training
(mobile) 0410 689 945
(fax / office) (+618) 8952 7878
chris@trickysolutions.com.au
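[Editorial note: one compact way to do this without any if statements, not taken from the thread itself, is R's ave(), which returns a per-group statistic aligned with the original rows. The data frame name dat and the extra fourth row are made up for illustration so that every category has at least one non-missing value per variable.]

```r
# Toy data in the shape of the post (hypothetical object name 'dat');
# a fourth row is added so Category 2 has data in every column
dat <- data.frame(Category  = c(1, 1, 2, 2),
                  variable1 = c(5, NA, NA, 3.5),
                  variable2 = c(NA, 3, 7, NA),
                  variable3 = c(NA, 4, NA, 5.9))

# For each variable column, replace each NA with the mean of the
# non-missing values in the same Category
for (v in c("variable1", "variable2", "variable3")) {
  grp.mean <- ave(dat[[v]], dat$Category,
                  FUN = function(x) mean(x, na.rm = TRUE))
  dat[[v]][is.na(dat[[v]])] <- grp.mean[is.na(dat[[v]])]
}

dat
```

Note that if every value of a variable within a category is NA, mean(x, na.rm = TRUE) is NaN, so that case needs handling separately.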
David Winsemius
2010-Sep-01 13:55 UTC
[R] how to replace NA with a specific score that is dependant on another indicator variable
On Sep 1, 2010, at 9:20 AM, Chris Howden wrote:

> I'm looking for a clever bit of code to replace NAs with a specific score
> depending on an indicator variable.
> ...
> EG Raw Data Set
>
> Category variable1 variable2 variable3
> 1        5         NA        NA
> 1        NA        3         4
> 2        NA        7         NA

This does not do its work by category (since I got tired of fixing mangled htmlized datasets), but it seems to me that a tapply "wrap" could do either of these operations within categories:

> egraw
  Category variable1 variable2 variable3
1        1         5        NA        NA
2        1        NA         3         4
3        2        NA         7        NA

> lapply(egraw, function(x) {
+     mnx <- mean(x, na.rm=TRUE)
+     sapply(x, function(z) if (is.na(z)) {mnx} else {z})
+ })
$Category
[1] 1 1 2

$variable1
[1] 5 5 5

$variable2
[1] 5 3 7

$variable3
[1] 4 4 4

> sapply(egraw, function(x) {
+     mnx <- mean(x, na.rm=TRUE)
+     sapply(x, function(z) if (is.na(z)) {mnx} else {z})
+ })
     Category variable1 variable2 variable3
[1,]        1         5         5         4
[2,]        1         5         3         4
[3,]        2         5         7         4

> Now I want to replace the NAs with the average for each category ...
> Any ideas would be very much appreciated!!!!!

You might add reading the Posting Guide and setting up your mailer to post in plain text to your TODO list.

> thankyou
>
> Chris Howden

-- 
David Winsemius, MD
West Hartford, CT
On Aug 9, 2011, at 11:38 PM, Chris Howden wrote:

> Hi,
>
> I'm trying to do a hierarchical cluster analysis in R with a Big Data set.
> I'm running into problems using the dist() function.
>
> I've been looking at a few threads about R's memory and have read the
> memory limits section in R help. However I'm no computer expert, so I'm
> hoping I've misunderstood something and R can handle my Big Data set
> somehow. Although at the moment I think my dataset is simply too big and
> there is no way around it, but I'd like to be proved wrong!
>
> My data set has 90523 rows of data and 24 columns.
>
> My understanding is that this means the distance matrix has a min of
> 90523^2 elements, which is 8194413529. Which roughly translates as 8GB of
> memory being required (if I assume each entry requires 1 bit). I only have
> 4GB on a 32-bit build of Windows and R. So there is no way that's going to
> work.
>
> So then I thought of getting access to a more powerful computer, and maybe
> using cloud computing.
>
> However the R memory limit help mentions "On all builds of R, the maximum
> length (number of elements) of a vector is 2^31 - 1 ~ 2*10^9". Now as the
> distance matrix I require has more elements than this, does this mean it's
> too big for R no matter what I do?

Yes. Vector indexing is done with 4-byte integers.

-- 
David Winsemius, MD
West Hartford, CT
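[Editorial note: the arithmetic behind the question and this reply can be checked in R itself. The full matrix has 90523^2 elements, the lower triangle that dist() actually stores has n*(n-1)/2, and both exceed the 2^31 - 1 vector-length limit that applied to all builds of R at the time of this thread.]

```r
n <- 90523
full  <- n^2                # full distance matrix: ~8.19e9 elements
lower <- n * (n - 1) / 2    # what dist() stores:   ~4.10e9 elements
limit <- 2^31 - 1           # max vector length in R at the time

full > limit                # TRUE
lower > limit               # TRUE: even the symmetric half is too long

# at 8 bytes per double, the dist object alone would need roughly:
lower * 8 / 2^30            # about 30.5 GiB
```

So the 8 GB estimate in the question is actually optimistic: dist() stores doubles (8 bytes each), not bits, although it only stores half the matrix.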
On Wed, 10 Aug 2011, David Winsemius wrote:

> On Aug 9, 2011, at 11:38 PM, Chris Howden wrote:
>
>> My understanding is that this means the distance matrix has a min of
>> 90523^2 elements which is 8194413529. Which roughly translates as 8GB of

A bit less than half that: it is symmetric.

>> memory being required (if I assume each entry requires 1 bit).

Hmm, that would be a 0/1 distance: there are simpler methods to cluster such distances.

>> I only have 4GB on a 32bit build of windows and R. So there is no
>> way that's going to work.
>>
>> So then I thought of getting access to a more powerful computer, and maybe
>> using cloud computing.
>>
>> However the R memory limit help mentions "On all builds of R, the maximum
>> length (number of elements) of a vector is 2^31 - 1 ~ 2*10^9". Now as the
>> distance matrix I require has more elements than this does this mean it's
>> too big for R no matter what I do?
>
> Yes. Vector indexing is done with 4 byte integers.

Assuming you need the full distance matrix at one time (which you do not for hierarchical clustering, itself a highly dubious method for more than a few hundred points).

> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford,     Tel: +44 1865 272861 (self)
1 South Parks Road,            +44 1865 272866 (PA)
Oxford OX1 3TG, UK        Fax: +44 1865 272595
Sorry if this is a duplicate... my email is giving me trouble this evening...

On Tue, Aug 9, 2011 at 8:38 PM, Chris Howden <chris at trickysolutions.com.au> wrote:

> I'm trying to do a hierarchical cluster analysis in R with a Big Data set.
> I'm running into problems using the dist() function.
> ...
> However the R memory limit help mentions "On all builds of R, the maximum
> length (number of elements) of a vector is 2^31 - 1 ~ 2*10^9". Now as the
> distance matrix I require has more elements than this, does this mean it's
> too big for R no matter what I do?

You have understood correctly.

> Any ideas would be welcome.

You have a couple of options, some more involved than others. If you want to stick with R, I would suggest using a two-step clustering approach in which you first use k-means (assuming your distance is Euclidean) or a modification (for example, for correlation-based distances, the WGCNA package contains a function called projectiveKMeans) to pre-cluster your 90k+ variables into "blocks" of about 8-10k each (that's about as much as your computer will handle). The k-means algorithm only requires memory storage of order n*k, where k is the number of clusters (or blocks), which can be small, say 500, and n is the number of your variables. Then you do hierarchical clustering in each block separately. Make sure you install and load the package flashClust or fastcluster to make the hierarchical clustering run reasonably fast (the stock R implementation of hclust is horribly slow with large data sets). The mentioned WGCNA package contains a function called blockwiseModules that does just such a procedure, but there the distance is based on correlations, which may or may not suit your problem.

HTH,
Peter
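[Editorial note: the two-step procedure Peter describes can be sketched as below. This is a minimal illustration on made-up data, not code from the thread; the matrix size, block count k, and use of stats::kmeans/hclust are arbitrary choices made small enough to run quickly.]

```r
set.seed(1)
# stand-in for the poster's 90523 x 24 matrix (smaller so it runs quickly)
x <- matrix(rnorm(5000 * 24), ncol = 24)

# Step 1: k-means pre-clustering into a manageable number of blocks
# (in practice k would be a few hundred, per Peter's note)
k  <- 10
km <- kmeans(x, centers = k, iter.max = 50)

# Step 2: hierarchical clustering within each block separately,
# so no single dist() call ever sees all 5000 rows at once
trees <- lapply(seq_len(k), function(b) {
  block <- x[km$cluster == b, , drop = FALSE]
  hclust(dist(block))   # flashClust::hclust is a faster drop-in here
})

length(trees)           # one dendrogram per block
```

The memory saving comes from Step 2: each dist() call is over roughly n/k rows, so its storage is about (n/k)^2/2 doubles per block instead of n^2/2 for the whole data set.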
> Assuming you need the full distance matrix at one time (which you do not
> for hierarchical clustering, itself a highly dubious method for more than
> a few hundred points).

Apologies if this hijacks the thread, but why is hierarchical clustering "highly dubious for more than a few hundred points"?

Peter
On Tue, 9 Aug 2011, Peter Langfelder wrote:

>> Assuming you need the full distance matrix at one time (which you do not
>> for hierarchical clustering, itself a highly dubious method for more than
>> a few hundred points).
>
> Apologies if this hijacks the thread, but why is hierarchical
> clustering "highly dubious for more than a few hundred points"?

That is off-topic for R-help: see the posting guide.

-- 
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford,     Tel: +44 1865 272861 (self)
1 South Parks Road,            +44 1865 272866 (PA)
Oxford OX1 3TG, UK        Fax: +44 1865 272595
Łukasz Ręcławowicz
2011-Aug-12 05:16 UTC
[R] Can R handle a matrix with 8 billion entries?
On 12 August 2011 at 05:19, Chris Howden <chris@trickysolutions.com.au> wrote:

> Thanks for the suggestion, I'll look into it

It seems to work! :)

library(multiv)
data(iris)
iris <- as.matrix(iris[,1:4])
h  <- hierclust(iris, method=2)
d  <- dist(iris)
hk <- hclust(d)

> str(hk)
List of 7
 $ merge      : int [1:149, 1:2] -102 -8 -1 -10 -129 -11 -5 -20 -30 -58 ...
 $ height     : num [1:149] 0 0.1 0.1 0.1 0.1 ...
 $ order      : int [1:150] 108 131 103 126 130 119 106 123 118 132 ...
 $ labels     : NULL
 $ method     : chr "complete"
 $ call       : language hclust(d = d)
 $ dist.method: chr "euclidean"
 - attr(*, "class")= chr "hclust"

> str(h)
List of 3
 $ merge : int [1:149, 1:2] -102 -8 -1 -10 -129 -11 -41 -5 -20 7 ...
 $ height: num [1:149] 0 0.01 0.01 0.01 0.01 ...
 $ order : int [1:150] 42 23 15 16 45 34 33 17 21 32 ...

test.mat <- matrix(rnorm(90523*24), ncol=24)
out <- hierclust(test.mat, method = 1, bign = TRUE)

> print(object.size(out), u="Mb")
1.7 Mb

> str(out)
List of 3
 $ merge : int [1:90522, 1:2] -35562 -19476 -60344 -66060 -38949 -14537 -3322 -20248 -19464 -78693 ...
 $ height: num [1:90522] 1.93 1.94 1.96 1.98 2 ...
 $ order : int [1:90523] 24026 61915 71685 16317 85828 11577 36034 37324 65754 55381 ...

> R.version$os
[1] "mingw32"

-- 
Have a nice day