edmund jones
2010-Jun-08 22:06 UTC
[R] how to ignore rows missing arguments of a function when creating a function?
Hi, I am relatively new to R; when creating functions, I run into problems with missing values. I would like my functions to ignore rows with missing values for arguments of my function) in the analysis (as for example is the case in STATA). Note that I don't want my function to drop rows if there are missing arguments elsewhere in a row, ie for variables that are not arguments of my function. As an example: here is a clustering function I wrote: cl <- function(dat, na.rm = TRUE, fm, cluster){ attach( dat , warn.conflicts = F) library(sandwich) library(lmtest) M <- length(unique(cluster)) N <- length(cluster) K <- fm$rank dfc <- (M/(M-1))*((N-1)/(N-K)) uj <- data.frame(apply(estfun(fm),2, function(x) data.frame(tapply(x, cluster, sum)) ) ); vcovCL <- dfc*sandwich(fm, meat=crossprod(uj)/N) coeftest(fm, vcovCL) } When I run my function, I get the message: Error in tapply(x, cluster, sum) : arguments must have same length If I specify instead attach(na.omit(dat), warn.conflicts = F) and don't have the "na.rm=TRUE" argument, then my function runs; but only for the rows where there are no missing values AT ALL; however, I don't care if there are missing values for variables on which I am not applying my function. For example, I have information on children's size; if I want regress scores on age and parents' education, clustering on class, I would like missing values in size not to interfere (ie if I have scores, age, parents' education, and class, but not size, I don't want to drop this observation). I tried to look at the code of "lm" to see how the na.action part works, but I couldn't figure it out... This is exactly how I would like to deal with missing values. I tried to write cl <- function(dat, fm, cluster, na.action){ attach( dat , warn.conflicts = F) library(sandwich) library(lmtest) M <- length(unique(cluster)) N <- length(cluster) K <- fm$rank dfc <- (M/(M-1))*((N-1)/(N-K)) uj <- data.frame(apply(estfun(fm),2, function(x) data.frame(tapply(x, cluster, sum)) ) ); vcovCL <- dfc*sandwich(fm, meat=crossprod(uj)/N) coeftest(fm, vcovCL) } attr(cl,"na.action") <- na.exclude but it still didn't work... Any ideas of how to deal with this issue? Thank you for your answers! Edmund [[alternative HTML version deleted]]
Joris Meys
2010-Jun-09 09:33 UTC
[R] how to ignore rows missing arguments of a function when creating a function?
It's difficult to help you if we don't know what the data looks like. Two more tips : - look at ?with instead of attach(). The latter causes a lot of trouble further on. - use require() instead of library(). See also the help files on this. I also wonder what you're doing with the na.rm=TRUE. I don't see how it affects the function, as it is used nowhere. Does it really remove NA? This said: estfun takes the estimation function from your model object. Your model is essentially fit on all observations that are not NA, whereas I presume your cluster contains all observations, including NA. Now I don't know what kind of model fm represents, but normally there is information in the model object about the removed observations. I illustrate this using lm :> x <- rnorm(100) > y <- c(rnorm(90),NA,rnorm(9)) > test <- lm(x~y) > str(test)List of 13 ... (tons of information) $ model :'data.frame': 99 obs. of 2 variables: ... (more tons of information) ..- attr(*, "na.action")=Class 'omit' Named int 91 .. .. ..- attr(*, "names")= chr "91" - attr(*, "class")= chr "lm" Now we know that we can do :> not <-attr(test$model,"na.action") > y[-not]So try this ### NOT TESTED cl <- function(dat, na.rm = TRUE, fm, cluster){ require(sandwich) require(lmtest) not <- attr(fm$model,"na.action") cluster <- cluster[-not] with( dat ,{ M <- length(unique(cluster)) N <- length(cluster) K <- fm$rank dfc <- (M/(M-1))*((N-1)/(N-K)) uj <- data.frame(apply(estfun(fm),2, function(x) data.frame(tapply(x, cluster, sum)) ) ); vcovCL <- dfc*sandwich(fm, meat=crossprod(uj)/N) coeftest(fm, vcovCL) } ) } Cheers Joris On Wed, Jun 9, 2010 at 12:06 AM, edmund jones <edmund.j.jones at gmail.com> wrote:> Hi, > > I am relatively new to R; when creating functions, I run into problems with > missing values. I would like my functions to ignore rows with missing values > for arguments of my function) in the analysis (as for example is the case in > STATA). Note that I don't want my function to drop rows if there are missing > arguments elsewhere in a row, ie for variables that are not arguments of my > function. > > As an example: here is a clustering function I wrote: > > > cl <- function(dat, na.rm = TRUE, fm, cluster){ > > attach( dat , warn.conflicts = F) > > library(sandwich) > > library(lmtest) > > M <- length(unique(cluster)) > > N <- length(cluster) > > K <- fm$rank > > dfc <- (M/(M-1))*((N-1)/(N-K)) > > uj <- data.frame(apply(estfun(fm),2, function(x) data.frame(tapply(x, > cluster, sum)) ) ); > > vcovCL <- dfc*sandwich(fm, meat=crossprod(uj)/N) > > coeftest(fm, vcovCL) > > } > > > When I run my function, I get the message: > > > Error in tapply(x, cluster, sum) : arguments must have same length > > > If I specify instead attach(na.omit(dat), warn.conflicts = F) ?and don't > have the "na.rm=TRUE" argument, then my function runs; but only for the rows > where there are no missing values AT ALL; however, I don't care if there are > missing values for variables on which I am not applying my function. > > > For example, I have information on children's size; if I want regress scores > on age and parents' education, clustering on class, I would like missing > values in size not to interfere (ie if I have scores, age, parents' > education, and class, but not size, I don't want to drop this observation). > > > I tried to look at the code of "lm" to see how the na.action part works, but > I couldn't figure it out... This is exactly how I would like to deal with > missing values. > > > I tried to write > > cl <- function(dat, fm, cluster, na.action){ > > attach( dat , warn.conflicts = F) > > library(sandwich) > > library(lmtest) > > ?M <- length(unique(cluster)) > > ?N <- length(cluster) > > ?K <- fm$rank > > ?dfc <- (M/(M-1))*((N-1)/(N-K)) > > uj <- data.frame(apply(estfun(fm),2, function(x) data.frame(tapply(x, > cluster, sum)) ) ); > > vcovCL <- dfc*sandwich(fm, meat=crossprod(uj)/N) > > ?coeftest(fm, vcovCL) > > } > > ?attr(cl,"na.action") <- na.exclude > > > but it still didn't work... > > > Any ideas of how to deal with this issue? > > Thank you for your answers! > > Edmund > > ? ? ? ?[[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >-- Joris Meys Statistical consultant Ghent University Faculty of Bioscience Engineering Department of Applied mathematics, biometrics and process control tel : +32 9 264 59 87 Joris.Meys at Ugent.be ------------------------------- Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
Joris Meys
2010-Jun-14 14:36 UTC
[R] how to ignore rows missing arguments of a function when creating a function?
Ah, I overlooked that possibility. You can do following : .... not <- attr(fm$model,"na.action") if( ! is.null(not)){ # only drop the NA values if there are any left out of the model cluster <- cluster[-not] dat <- dat[-not,] } with(dat,{ .... On Mon, Jun 14, 2010 at 4:30 PM, edmund jones <edmund.j.jones at gmail.com> wrote:> Thanks a lot! > > So, what you propose works perfectly (the "cluster" variable is indeed a > vector), except in the case where I have no missing values in my regression. > With the following function: > > cl2 <- function(dat,fm, cluster){ > attach(dat, warn.conflicts = F) > require(sandwich) > require(lmtest) > not <- attr(fm$model,"na.action") > cluster <- cluster[-not] > with( dat[-not,] ,{ > M <- length(unique(cluster)) > N <- length(cluster) > K <- fm$rank > dfc <- (M/(M-1))*((N-1)/(N-K)) > uj <- apply(estfun(fm),2, function(x) tapply(x, cluster, sum)); > vcovCL <- dfc*sandwich(fm, meat=crossprod(uj)/N) > coeftest(fm, vcovCL) > } > ) > } > > If I have no missing values in the arguments of fm, get the message > "error in -not: invalid argument to unary operator" > > In your example: > > >> x <- rnorm(100) >> y <- c(rnorm(90),NA,rnorm(9)) >> test <- lm(x~y) >> str(test) > List of 13 > ?... (tons of information) > ?$ model ? ? ? ?:'data.frame': ?99 obs. of ?2 variables: > ?... (more tons of information) > ?..- attr(*, "na.action")=Class 'omit' ?Named int 91 > ?.. .. ..- attr(*, "names")= chr "91" > ?- attr(*, "class")= chr "lm" > > Now we know that we can do : >> not <-attr(test$model,"na.action") >> y[-not] > > If I have, for example, y <- rnorm(100), I also get the error: > "error in -not: invalid argument to unary operator" > > > In my database: > > ???? female??? income transport lunch dist reg_f > 4900????? 0 18.405990???????? 0???? 0?? 75?? 750 > 4901????? 0??????? NA???????? 0??? NA?? 75?? 753 > 4902????? 1??????? NA???????? 0???? 1?? 75?? 752 > 4903????? 1??????? NA???????? 0???? 1?? 75?? 751 > 4904????? 1 69.678340???????? 1???? 0?? 74?? 740 > 4905????? 0 57.953230???????? 1???? 0?? 73?? 730 > 4906????? 1 85.835130???????? 0???? 1?? 68?? 680 > 4907????? 0 81.952980???????? 0???? 0?? 75?? 750 > 4908????? 1 46.837490???????? 1???? 0?? 74?? 740 > 4909????? 0??????? NA???????? 1???? 0??? 5??? 52 > 4910????? 1 65.041360???????? 0???? 1?? 75?? 750 > 4911????? 0 77.451870???????? 1???? 0?? 75?? 750 > 4912????? 0 96.148590???????? 1???? 0?? 75?? 750 > 4913????? 0 64.510410???????? 0???? 0?? 74?? 740 > 4914????? 0 69.391230???????? 0???? 0?? 75?? 750 > 4915????? 0? 4.804243???????? 0???? 1?? 65?? 650 > 4916????? 0??????? NA???????? 0???? 0?? 75?? 751 > 4917????? 1??????? NA???????? 0???? 0?? 75?? 751 > 4918????? 1??????? NA???????? 0???? 1?? 40?? 401 > 4919????? 1 49.920750???????? 0???? 1?? 76?? 760 > 4920????? 0??????? NA???????? 0???? 1?? 76?? 763 > 4921????? 0 10.187910???????? 0???? 0?? 77?? 770 > 4922????? 0 14.839710???????? 0???? 1?? 77?? 770 > 4923????? 1 32.041000???????? 0???? 0?? 77?? 770 > 4924????? 0 85.639440???????? 0???? 0?? 77?? 770 > 4925????? 1 86.308410???????? 0???? 0?? 68?? 680 > 4926????? 0 79.223910???????? 0???? 0??? 7??? 70 > 4927????? 0 81.825800???????? 0???? 0?? 78?? 780 > 4928????? 0 31.931000???????? 0???? 1?? 37?? 370 > 4929????? 0 53.282310???????? 0???? 1?? 41?? 410 > 4930????? 1 31.312910???????? 1???? 1?? 25?? 250 > 4931????? 1 50.478870???????? 0???? 1?? 25?? 250 > 4932????? 0??????? NA???????? 0???? 0?? 66?? 662 > 4933????? 1 58.156940???????? 0???? 1?? 31?? 310 > 4934????? 0??????? NA???????? 0???? 1??? 1??? 13 > 4935????? 1??????? NA???????? 0???? 1??? 1??? 12 > 4936????? 1 59.149180???????? 0???? 0??? 3??? 30 > 4937????? 1? 5.400807???????? 0???? 1??? 5??? 50 > 4938????? 1 76.828630???????? 0???? 0??? 6??? 60 > 4939????? 1 73.488300???????? 0???? 1?? 63?? 630 > 4940????? 0? 6.529074???????? 0???? 1??? 6??? 60 > 4941????? 0??????? NA???????? 0???? 0??? 6??? 61 > 4942????? 1 70.128530???????? 0???? 0??? 3??? 30 > 4943????? 0??????? NA???????? 0???? 1?? 53?? 531 > 4944????? 1 75.715350???????? 1???? 0??? 6??? 60 > 4945????? 0? 8.623850???????? 0???? 1?? 24?? 240 > 4946????? 1 79.062470???????? 0???? 1?? 62?? 620 > 4947????? 0 83.863370???????? 1???? 1?? 11?? 110 > 4948????? 0 58.904450???????? 0???? 1?? 62?? 620 > 4949????? 0 88.500290???????? 0???? 0??? 9??? 90 > 4950????? 0??????? NA???????? 1???? 1??? 9??? 90 > 4951???? NA??????? NA???????? 0???? 1?? 15?? 151 > > It works perfectly if I do > cl2(testdata, lm(income ~ transport + reg_f ), female) > > but not for cl2(testdata, lm(dist ~ transport + reg_f ), female) > Or any other case about when the arguments of the "lm" function have no > "missing" values. > > How can I tell R to do this only if there is a "missing" problem? > > Thanks a lot for your help! Working on your previous replies has been very > helpful in understanding how R works. > > Cheers, > Edmund. > > > 2010/6/13 Joris Meys <jorismeys at gmail.com> >> >> ?Next to that, Off course you have to use the right ?indices for >> "cluster", but as I have no clue what the "dist" is, I just put there >> something. So if it is a matrix, naturally you'll get errors. If I >> make a vector "cluster" with the same length as the dataframe, giving >> the clustering, then it works perfectly fine. >> >> >> >> >> On Sat, Jun 12, 2010 at 4:20 PM, edmund jones <edmund.j.jones at gmail.com> >> wrote: >> > Hi, >> > Thank you for your reply; >> > I tried what you suggested, but it still didn't work. >> > I see that my question was not precise enough; I have tried to specify >> > it >> > this time and I have also added data. >> >> I still can't run your code, because I don't have the cluster argument >> "dist". >> >> > In other words: how can I tell R to do this line >> > >> > y <- testdata[which(apply(testdata[,c()],1, function(x) >> > is.element(NA,x))==F >> > ),] >> I showed you how to "do that line". The line "not <- >> attr(fm$model,"na.action")" gives you the row numbers of deleted >> observations in the fit. Now you just have to use those as an index to >> clear up any environment. ?Try with the "dat" maybe : >> >> cl <- function(dat, fm, cluster){ >> require(sandwich) >> require(lmtest) >> ?not <- attr(fm$model,"na.action") >> >> with( dat[-not,] ,{ >> >> ? M <- length(unique(cluster)) >> ? N <- length(cluster) >> ? K <- fm$rank >> >> ? dfc <- (M/(M-1))*((N-1)/(N-K)) >> >> ? uj <- data.frame(apply(estfun(fm),2, function(x) data.frame(tapply(x, >> ? cluster, sum)) ) ); >> >> ? vcovCL <- dfc*sandwich(fm, meat=crossprod(uj)/N) >> >> ? coeftest(fm, vcovCL) >> ? } ) >> } >> >> Cheers >> Joris >> >> >> > >> > for the columns that will be the arguments of my "cl" function, so I >> > don't >> > have to manually ignore the observations for which I have some missing >> > variables of interest for this function? >> > >> > Thank you very much. >> > >> > Cheers, >> > Edmund >> > >> > 2010/6/9 Joris Meys <jorismeys at gmail.com> >> >> >> >> It's difficult to help you if we don't know what the data looks like. >> >> Two more tips : >> >> - look at ?with instead of attach(). The latter causes a lot of >> >> trouble further on. >> >> - use require() instead of library(). See also the help files on this. >> >> >> >> I also wonder what you're doing with the na.rm=TRUE. I don't see how >> >> it affects the function, as it is used nowhere. Does it really remove >> >> NA? >> >> >> >> This said: estfun takes the estimation function from your model >> >> object. Your model is essentially fit on all observations that are not >> >> NA, whereas I presume your cluster contains all observations, >> >> including NA. Now I don't know what kind of model fm represents, but >> >> normally there is information in the model object about the removed >> >> observations. I illustrate this using lm : >> >> >> >> > x <- rnorm(100) >> >> > y <- c(rnorm(90),NA,rnorm(9)) >> >> > test <- lm(x~y) >> >> > str(test) >> >> List of 13 >> >> ?... (tons of information) >> >> ?$ model ? ? ? ?:'data.frame': ?99 obs. of ?2 variables: >> >> ?... (more tons of information) >> >> ?..- attr(*, "na.action")=Class 'omit' ?Named int 91 >> >> ?.. .. ..- attr(*, "names")= chr "91" >> >> ?- attr(*, "class")= chr "lm" >> >> >> >> Now we know that we can do : >> >> > not <-attr(test$model,"na.action") >> >> > y[-not] >> >> >> >> So try this >> >> ### NOT TESTED >> >> cl <- function(dat, na.rm = TRUE, fm, cluster){ >> >> require(sandwich) >> >> require(lmtest) >> >> >> >> not <- attr(fm$model,"na.action") >> >> cluster <- cluster[-not] >> >> >> >> with( dat ,{ >> >> ? ?M <- length(unique(cluster)) >> >> ? ?N <- length(cluster) >> >> ? ?K <- fm$rank >> >> >> >> ? ?dfc <- (M/(M-1))*((N-1)/(N-K)) >> >> >> >> ? ?uj <- data.frame(apply(estfun(fm),2, function(x) >> >> data.frame(tapply(x, >> >> ? ?cluster, sum)) ) ); >> >> >> >> ? ?vcovCL <- dfc*sandwich(fm, meat=crossprod(uj)/N) >> >> >> >> ? ?coeftest(fm, vcovCL) >> >> ? ?} ) >> >> } >> >> >> >> Cheers >> >> Joris >> >> >> >> On Wed, Jun 9, 2010 at 12:06 AM, edmund jones >> >> <edmund.j.jones at gmail.com> >> >> wrote: >> >> > Hi, >> >> > >> >> > I am relatively new to R; when creating functions, I run into >> >> > problems >> >> > with >> >> > missing values. I would like my functions to ignore rows with missing >> >> > values >> >> > for arguments of my function) in the analysis (as for example is the >> >> > case in >> >> > STATA). Note that I don't want my function to drop rows if there are >> >> > missing >> >> > arguments elsewhere in a row, ie for variables that are not arguments >> >> > of >> >> > my >> >> > function. >> >> > >> >> > As an example: here is a clustering function I wrote: >> >> > >> >> > >> >> > cl <- function(dat, na.rm = TRUE, fm, cluster){ >> >> > >> >> > attach( dat , warn.conflicts = F) >> >> > >> >> > library(sandwich) >> >> > >> >> > library(lmtest) >> >> > >> >> > M <- length(unique(cluster)) >> >> > >> >> > N <- length(cluster) >> >> > >> >> > K <- fm$rank >> >> > >> >> > dfc <- (M/(M-1))*((N-1)/(N-K)) >> >> > >> >> > uj <- data.frame(apply(estfun(fm),2, function(x) data.frame(tapply(x, >> >> > cluster, sum)) ) ); >> >> > >> >> > vcovCL <- dfc*sandwich(fm, meat=crossprod(uj)/N) >> >> > >> >> > coeftest(fm, vcovCL) >> >> > >> >> > } >> >> > >> >> > >> >> > When I run my function, I get the message: >> >> > >> >> > >> >> > Error in tapply(x, cluster, sum) : arguments must have same length >> >> > >> >> > >> >> > If I specify instead attach(na.omit(dat), warn.conflicts = F) ?and >> >> > don't >> >> > have the "na.rm=TRUE" argument, then my function runs; but only for >> >> > the >> >> > rows >> >> > where there are no missing values AT ALL; however, I don't care if >> >> > there >> >> > are >> >> > missing values for variables on which I am not applying my function. >> >> > >> >> > >> >> > For example, I have information on children's size; if I want regress >> >> > scores >> >> > on age and parents' education, clustering on class, I would like >> >> > missing >> >> > values in size not to interfere (ie if I have scores, age, parents' >> >> > education, and class, but not size, I don't want to drop this >> >> > observation). >> >> > >> >> > >> >> > I tried to look at the code of "lm" to see how the na.action part >> >> > works, >> >> > but >> >> > I couldn't figure it out... This is exactly how I would like to deal >> >> > with >> >> > missing values. >> >> > >> >> > >> >> > I tried to write >> >> > >> >> > cl <- function(dat, fm, cluster, na.action){ >> >> > >> >> > attach( dat , warn.conflicts = F) >> >> > >> >> > library(sandwich) >> >> > >> >> > library(lmtest) >> >> > >> >> > ?M <- length(unique(cluster)) >> >> > >> >> > ?N <- length(cluster) >> >> > >> >> > ?K <- fm$rank >> >> > >> >> > ?dfc <- (M/(M-1))*((N-1)/(N-K)) >> >> > >> >> > uj <- data.frame(apply(estfun(fm),2, function(x) data.frame(tapply(x, >> >> > cluster, sum)) ) ); >> >> > >> >> > vcovCL <- dfc*sandwich(fm, meat=crossprod(uj)/N) >> >> > >> >> > ?coeftest(fm, vcovCL) >> >> > >> >> > } >> >> > >> >> > ?attr(cl,"na.action") <- na.exclude >> >> > >> >> > >> >> > but it still didn't work... >> >> > >> >> > >> >> > Any ideas of how to deal with this issue? >> >> > >> >> > Thank you for your answers! >> >> > >> >> > Edmund >> >> > >> >> > ? ? ? ?[[alternative HTML version deleted]] >> >> > >> >> > ______________________________________________ >> >> > R-help at r-project.org mailing list >> >> > https://stat.ethz.ch/mailman/listinfo/r-help >> >> > PLEASE do read the posting guide >> >> > http://www.R-project.org/posting-guide.html >> >> > and provide commented, minimal, self-contained, reproducible code. >> >> > >> >> >> >> >> >> >> >> -- >> >> Joris Meys >> >> Statistical consultant >> >> >> >> Ghent University >> >> Faculty of Bioscience Engineering >> >> Department of Applied mathematics, biometrics and process control >> >> >> >> tel : +32 9 264 59 87 >> >> Joris.Meys at Ugent.be >> >> ------------------------------- >> >> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php >> > >> > >> >> >> >> -- >> Joris Meys >> Statistical consultant >> >> Ghent University >> Faculty of Bioscience Engineering >> Department of Applied mathematics, biometrics and process control >> >> tel : +32 9 264 59 87 >> Joris.Meys at Ugent.be >> ------------------------------- >> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php > >-- Joris Meys Statistical consultant Ghent University Faculty of Bioscience Engineering Department of Applied mathematics, biometrics and process control tel : +32 9 264 59 87 Joris.Meys at Ugent.be ------------------------------- Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php