edmund jones
2010-Jun-08 22:06 UTC
[R] how to ignore rows missing arguments of a function when creating a function?
Hi,
I am relatively new to R; when creating functions, I run into problems with
missing values. I would like my functions to ignore rows with missing values
for arguments of my function) in the analysis (as for example is the case in
STATA). Note that I don't want my function to drop rows if there are missing
arguments elsewhere in a row, ie for variables that are not arguments of my
function.
As an example: here is a clustering function I wrote:
cl <- function(dat, na.rm = TRUE, fm, cluster){
attach( dat , warn.conflicts = F)
library(sandwich)
library(lmtest)
M <- length(unique(cluster))
N <- length(cluster)
K <- fm$rank
dfc <- (M/(M-1))*((N-1)/(N-K))
uj <- data.frame(apply(estfun(fm),2, function(x) data.frame(tapply(x,
cluster, sum)) ) );
vcovCL <- dfc*sandwich(fm, meat=crossprod(uj)/N)
coeftest(fm, vcovCL)
}
When I run my function, I get the message:
Error in tapply(x, cluster, sum) : arguments must have same length
If I specify instead attach(na.omit(dat), warn.conflicts = F) and don't
have the "na.rm=TRUE" argument, then my function runs; but only for
the rows
where there are no missing values AT ALL; however, I don't care if there are
missing values for variables on which I am not applying my function.
For example, I have information on children's size; if I want regress scores
on age and parents' education, clustering on class, I would like missing
values in size not to interfere (ie if I have scores, age, parents'
education, and class, but not size, I don't want to drop this observation).
I tried to look at the code of "lm" to see how the na.action part
works, but
I couldn't figure it out... This is exactly how I would like to deal with
missing values.
I tried to write
cl <- function(dat, fm, cluster, na.action){
attach( dat , warn.conflicts = F)
library(sandwich)
library(lmtest)
M <- length(unique(cluster))
N <- length(cluster)
K <- fm$rank
dfc <- (M/(M-1))*((N-1)/(N-K))
uj <- data.frame(apply(estfun(fm),2, function(x) data.frame(tapply(x,
cluster, sum)) ) );
vcovCL <- dfc*sandwich(fm, meat=crossprod(uj)/N)
coeftest(fm, vcovCL)
}
attr(cl,"na.action") <- na.exclude
but it still didn't work...
Any ideas of how to deal with this issue?
Thank you for your answers!
Edmund
[[alternative HTML version deleted]]
Joris Meys
2010-Jun-09 09:33 UTC
[R] how to ignore rows missing arguments of a function when creating a function?
It's difficult to help you if we don't know what the data looks like. Two more tips : - look at ?with instead of attach(). The latter causes a lot of trouble further on. - use require() instead of library(). See also the help files on this. I also wonder what you're doing with the na.rm=TRUE. I don't see how it affects the function, as it is used nowhere. Does it really remove NA? This said: estfun takes the estimation function from your model object. Your model is essentially fit on all observations that are not NA, whereas I presume your cluster contains all observations, including NA. Now I don't know what kind of model fm represents, but normally there is information in the model object about the removed observations. I illustrate this using lm :> x <- rnorm(100) > y <- c(rnorm(90),NA,rnorm(9)) > test <- lm(x~y) > str(test)List of 13 ... (tons of information) $ model :'data.frame': 99 obs. of 2 variables: ... (more tons of information) ..- attr(*, "na.action")=Class 'omit' Named int 91 .. .. ..- attr(*, "names")= chr "91" - attr(*, "class")= chr "lm" Now we know that we can do :> not <-attr(test$model,"na.action") > y[-not]So try this ### NOT TESTED cl <- function(dat, na.rm = TRUE, fm, cluster){ require(sandwich) require(lmtest) not <- attr(fm$model,"na.action") cluster <- cluster[-not] with( dat ,{ M <- length(unique(cluster)) N <- length(cluster) K <- fm$rank dfc <- (M/(M-1))*((N-1)/(N-K)) uj <- data.frame(apply(estfun(fm),2, function(x) data.frame(tapply(x, cluster, sum)) ) ); vcovCL <- dfc*sandwich(fm, meat=crossprod(uj)/N) coeftest(fm, vcovCL) } ) } Cheers Joris On Wed, Jun 9, 2010 at 12:06 AM, edmund jones <edmund.j.jones at gmail.com> wrote:> Hi, > > I am relatively new to R; when creating functions, I run into problems with > missing values. I would like my functions to ignore rows with missing values > for arguments of my function) in the analysis (as for example is the case in > STATA). Note that I don't want my function to drop rows if there are missing > arguments elsewhere in a row, ie for variables that are not arguments of my > function. > > As an example: here is a clustering function I wrote: > > > cl <- function(dat, na.rm = TRUE, fm, cluster){ > > attach( dat , warn.conflicts = F) > > library(sandwich) > > library(lmtest) > > M <- length(unique(cluster)) > > N <- length(cluster) > > K <- fm$rank > > dfc <- (M/(M-1))*((N-1)/(N-K)) > > uj <- data.frame(apply(estfun(fm),2, function(x) data.frame(tapply(x, > cluster, sum)) ) ); > > vcovCL <- dfc*sandwich(fm, meat=crossprod(uj)/N) > > coeftest(fm, vcovCL) > > } > > > When I run my function, I get the message: > > > Error in tapply(x, cluster, sum) : arguments must have same length > > > If I specify instead attach(na.omit(dat), warn.conflicts = F) ?and don't > have the "na.rm=TRUE" argument, then my function runs; but only for the rows > where there are no missing values AT ALL; however, I don't care if there are > missing values for variables on which I am not applying my function. > > > For example, I have information on children's size; if I want regress scores > on age and parents' education, clustering on class, I would like missing > values in size not to interfere (ie if I have scores, age, parents' > education, and class, but not size, I don't want to drop this observation). > > > I tried to look at the code of "lm" to see how the na.action part works, but > I couldn't figure it out... This is exactly how I would like to deal with > missing values. > > > I tried to write > > cl <- function(dat, fm, cluster, na.action){ > > attach( dat , warn.conflicts = F) > > library(sandwich) > > library(lmtest) > > ?M <- length(unique(cluster)) > > ?N <- length(cluster) > > ?K <- fm$rank > > ?dfc <- (M/(M-1))*((N-1)/(N-K)) > > uj <- data.frame(apply(estfun(fm),2, function(x) data.frame(tapply(x, > cluster, sum)) ) ); > > vcovCL <- dfc*sandwich(fm, meat=crossprod(uj)/N) > > ?coeftest(fm, vcovCL) > > } > > ?attr(cl,"na.action") <- na.exclude > > > but it still didn't work... > > > Any ideas of how to deal with this issue? > > Thank you for your answers! > > Edmund > > ? ? ? ?[[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >-- Joris Meys Statistical consultant Ghent University Faculty of Bioscience Engineering Department of Applied mathematics, biometrics and process control tel : +32 9 264 59 87 Joris.Meys at Ugent.be ------------------------------- Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
Joris Meys
2010-Jun-14 14:36 UTC
[R] how to ignore rows missing arguments of a function when creating a function?
Ah, I overlooked that possibility.
You can do following :
....
not <- attr(fm$model,"na.action")
if( ! is.null(not)){ # only drop the NA values if there are any left
out of the model
cluster <- cluster[-not]
dat <- dat[-not,]
}
with(dat,{
....
On Mon, Jun 14, 2010 at 4:30 PM, edmund jones <edmund.j.jones at
gmail.com> wrote:> Thanks a lot!
>
> So, what you propose works perfectly (the "cluster" variable is
indeed a
> vector), except in the case where I have no missing values in my
regression.
> With the following function:
>
> cl2 <- function(dat,fm, cluster){
> attach(dat, warn.conflicts = F)
> require(sandwich)
> require(lmtest)
> not <- attr(fm$model,"na.action")
> cluster <- cluster[-not]
> with( dat[-not,] ,{
> M <- length(unique(cluster))
> N <- length(cluster)
> K <- fm$rank
> dfc <- (M/(M-1))*((N-1)/(N-K))
> uj <- apply(estfun(fm),2, function(x) tapply(x, cluster, sum));
> vcovCL <- dfc*sandwich(fm, meat=crossprod(uj)/N)
> coeftest(fm, vcovCL)
> }
> )
> }
>
> If I have no missing values in the arguments of fm, get the message
> "error in -not: invalid argument to unary operator"
>
> In your example:
>
>
>> x <- rnorm(100)
>> y <- c(rnorm(90),NA,rnorm(9))
>> test <- lm(x~y)
>> str(test)
> List of 13
> ?... (tons of information)
> ?$ model ? ? ? ?:'data.frame': ?99 obs. of ?2 variables:
> ?... (more tons of information)
> ?..- attr(*, "na.action")=Class 'omit' ?Named int 91
> ?.. .. ..- attr(*, "names")= chr "91"
> ?- attr(*, "class")= chr "lm"
>
> Now we know that we can do :
>> not <-attr(test$model,"na.action")
>> y[-not]
>
> If I have, for example, y <- rnorm(100), I also get the error:
> "error in -not: invalid argument to unary operator"
>
>
> In my database:
>
> ???? female??? income transport lunch dist reg_f
> 4900????? 0 18.405990???????? 0???? 0?? 75?? 750
> 4901????? 0??????? NA???????? 0??? NA?? 75?? 753
> 4902????? 1??????? NA???????? 0???? 1?? 75?? 752
> 4903????? 1??????? NA???????? 0???? 1?? 75?? 751
> 4904????? 1 69.678340???????? 1???? 0?? 74?? 740
> 4905????? 0 57.953230???????? 1???? 0?? 73?? 730
> 4906????? 1 85.835130???????? 0???? 1?? 68?? 680
> 4907????? 0 81.952980???????? 0???? 0?? 75?? 750
> 4908????? 1 46.837490???????? 1???? 0?? 74?? 740
> 4909????? 0??????? NA???????? 1???? 0??? 5??? 52
> 4910????? 1 65.041360???????? 0???? 1?? 75?? 750
> 4911????? 0 77.451870???????? 1???? 0?? 75?? 750
> 4912????? 0 96.148590???????? 1???? 0?? 75?? 750
> 4913????? 0 64.510410???????? 0???? 0?? 74?? 740
> 4914????? 0 69.391230???????? 0???? 0?? 75?? 750
> 4915????? 0? 4.804243???????? 0???? 1?? 65?? 650
> 4916????? 0??????? NA???????? 0???? 0?? 75?? 751
> 4917????? 1??????? NA???????? 0???? 0?? 75?? 751
> 4918????? 1??????? NA???????? 0???? 1?? 40?? 401
> 4919????? 1 49.920750???????? 0???? 1?? 76?? 760
> 4920????? 0??????? NA???????? 0???? 1?? 76?? 763
> 4921????? 0 10.187910???????? 0???? 0?? 77?? 770
> 4922????? 0 14.839710???????? 0???? 1?? 77?? 770
> 4923????? 1 32.041000???????? 0???? 0?? 77?? 770
> 4924????? 0 85.639440???????? 0???? 0?? 77?? 770
> 4925????? 1 86.308410???????? 0???? 0?? 68?? 680
> 4926????? 0 79.223910???????? 0???? 0??? 7??? 70
> 4927????? 0 81.825800???????? 0???? 0?? 78?? 780
> 4928????? 0 31.931000???????? 0???? 1?? 37?? 370
> 4929????? 0 53.282310???????? 0???? 1?? 41?? 410
> 4930????? 1 31.312910???????? 1???? 1?? 25?? 250
> 4931????? 1 50.478870???????? 0???? 1?? 25?? 250
> 4932????? 0??????? NA???????? 0???? 0?? 66?? 662
> 4933????? 1 58.156940???????? 0???? 1?? 31?? 310
> 4934????? 0??????? NA???????? 0???? 1??? 1??? 13
> 4935????? 1??????? NA???????? 0???? 1??? 1??? 12
> 4936????? 1 59.149180???????? 0???? 0??? 3??? 30
> 4937????? 1? 5.400807???????? 0???? 1??? 5??? 50
> 4938????? 1 76.828630???????? 0???? 0??? 6??? 60
> 4939????? 1 73.488300???????? 0???? 1?? 63?? 630
> 4940????? 0? 6.529074???????? 0???? 1??? 6??? 60
> 4941????? 0??????? NA???????? 0???? 0??? 6??? 61
> 4942????? 1 70.128530???????? 0???? 0??? 3??? 30
> 4943????? 0??????? NA???????? 0???? 1?? 53?? 531
> 4944????? 1 75.715350???????? 1???? 0??? 6??? 60
> 4945????? 0? 8.623850???????? 0???? 1?? 24?? 240
> 4946????? 1 79.062470???????? 0???? 1?? 62?? 620
> 4947????? 0 83.863370???????? 1???? 1?? 11?? 110
> 4948????? 0 58.904450???????? 0???? 1?? 62?? 620
> 4949????? 0 88.500290???????? 0???? 0??? 9??? 90
> 4950????? 0??????? NA???????? 1???? 1??? 9??? 90
> 4951???? NA??????? NA???????? 0???? 1?? 15?? 151
>
> It works perfectly if I do
> cl2(testdata, lm(income ~ transport + reg_f ), female)
>
> but not for cl2(testdata, lm(dist ~ transport + reg_f ), female)
> Or any other case about when the arguments of the "lm" function
have no
> "missing" values.
>
> How can I tell R to do this only if there is a "missing" problem?
>
> Thanks a lot for your help! Working on your previous replies has been very
> helpful in understanding how R works.
>
> Cheers,
> Edmund.
>
>
> 2010/6/13 Joris Meys <jorismeys at gmail.com>
>>
>> ?Next to that, Off course you have to use the right ?indices for
>> "cluster", but as I have no clue what the "dist"
is, I just put there
>> something. So if it is a matrix, naturally you'll get errors. If I
>> make a vector "cluster" with the same length as the
dataframe, giving
>> the clustering, then it works perfectly fine.
>>
>>
>>
>>
>> On Sat, Jun 12, 2010 at 4:20 PM, edmund jones <edmund.j.jones at
gmail.com>
>> wrote:
>> > Hi,
>> > Thank you for your reply;
>> > I tried what you suggested, but it still didn't work.
>> > I see that my question was not precise enough; I have tried to
specify
>> > it
>> > this time and I have also added data.
>>
>> I still can't run your code, because I don't have the cluster
argument
>> "dist".
>>
>> > In other words: how can I tell R to do this line
>> >
>> > y <- testdata[which(apply(testdata[,c()],1, function(x)
>> > is.element(NA,x))==F
>> > ),]
>> I showed you how to "do that line". The line "not <-
>> attr(fm$model,"na.action")" gives you the row numbers of
deleted
>> observations in the fit. Now you just have to use those as an index to
>> clear up any environment. ?Try with the "dat" maybe :
>>
>> cl <- function(dat, fm, cluster){
>> require(sandwich)
>> require(lmtest)
>> ?not <- attr(fm$model,"na.action")
>>
>> with( dat[-not,] ,{
>>
>> ? M <- length(unique(cluster))
>> ? N <- length(cluster)
>> ? K <- fm$rank
>>
>> ? dfc <- (M/(M-1))*((N-1)/(N-K))
>>
>> ? uj <- data.frame(apply(estfun(fm),2, function(x)
data.frame(tapply(x,
>> ? cluster, sum)) ) );
>>
>> ? vcovCL <- dfc*sandwich(fm, meat=crossprod(uj)/N)
>>
>> ? coeftest(fm, vcovCL)
>> ? } )
>> }
>>
>> Cheers
>> Joris
>>
>>
>> >
>> > for the columns that will be the arguments of my "cl"
function, so I
>> > don't
>> > have to manually ignore the observations for which I have some
missing
>> > variables of interest for this function?
>> >
>> > Thank you very much.
>> >
>> > Cheers,
>> > Edmund
>> >
>> > 2010/6/9 Joris Meys <jorismeys at gmail.com>
>> >>
>> >> It's difficult to help you if we don't know what the
data looks like.
>> >> Two more tips :
>> >> - look at ?with instead of attach(). The latter causes a lot
of
>> >> trouble further on.
>> >> - use require() instead of library(). See also the help files
on this.
>> >>
>> >> I also wonder what you're doing with the na.rm=TRUE. I
don't see how
>> >> it affects the function, as it is used nowhere. Does it really
remove
>> >> NA?
>> >>
>> >> This said: estfun takes the estimation function from your
model
>> >> object. Your model is essentially fit on all observations that
are not
>> >> NA, whereas I presume your cluster contains all observations,
>> >> including NA. Now I don't know what kind of model fm
represents, but
>> >> normally there is information in the model object about the
removed
>> >> observations. I illustrate this using lm :
>> >>
>> >> > x <- rnorm(100)
>> >> > y <- c(rnorm(90),NA,rnorm(9))
>> >> > test <- lm(x~y)
>> >> > str(test)
>> >> List of 13
>> >> ?... (tons of information)
>> >> ?$ model ? ? ? ?:'data.frame': ?99 obs. of ?2
variables:
>> >> ?... (more tons of information)
>> >> ?..- attr(*, "na.action")=Class 'omit'
?Named int 91
>> >> ?.. .. ..- attr(*, "names")= chr "91"
>> >> ?- attr(*, "class")= chr "lm"
>> >>
>> >> Now we know that we can do :
>> >> > not <-attr(test$model,"na.action")
>> >> > y[-not]
>> >>
>> >> So try this
>> >> ### NOT TESTED
>> >> cl <- function(dat, na.rm = TRUE, fm, cluster){
>> >> require(sandwich)
>> >> require(lmtest)
>> >>
>> >> not <- attr(fm$model,"na.action")
>> >> cluster <- cluster[-not]
>> >>
>> >> with( dat ,{
>> >> ? ?M <- length(unique(cluster))
>> >> ? ?N <- length(cluster)
>> >> ? ?K <- fm$rank
>> >>
>> >> ? ?dfc <- (M/(M-1))*((N-1)/(N-K))
>> >>
>> >> ? ?uj <- data.frame(apply(estfun(fm),2, function(x)
>> >> data.frame(tapply(x,
>> >> ? ?cluster, sum)) ) );
>> >>
>> >> ? ?vcovCL <- dfc*sandwich(fm, meat=crossprod(uj)/N)
>> >>
>> >> ? ?coeftest(fm, vcovCL)
>> >> ? ?} )
>> >> }
>> >>
>> >> Cheers
>> >> Joris
>> >>
>> >> On Wed, Jun 9, 2010 at 12:06 AM, edmund jones
>> >> <edmund.j.jones at gmail.com>
>> >> wrote:
>> >> > Hi,
>> >> >
>> >> > I am relatively new to R; when creating functions, I run
into
>> >> > problems
>> >> > with
>> >> > missing values. I would like my functions to ignore rows
with missing
>> >> > values
>> >> > for arguments of my function) in the analysis (as for
example is the
>> >> > case in
>> >> > STATA). Note that I don't want my function to drop
rows if there are
>> >> > missing
>> >> > arguments elsewhere in a row, ie for variables that are
not arguments
>> >> > of
>> >> > my
>> >> > function.
>> >> >
>> >> > As an example: here is a clustering function I wrote:
>> >> >
>> >> >
>> >> > cl <- function(dat, na.rm = TRUE, fm, cluster){
>> >> >
>> >> > attach( dat , warn.conflicts = F)
>> >> >
>> >> > library(sandwich)
>> >> >
>> >> > library(lmtest)
>> >> >
>> >> > M <- length(unique(cluster))
>> >> >
>> >> > N <- length(cluster)
>> >> >
>> >> > K <- fm$rank
>> >> >
>> >> > dfc <- (M/(M-1))*((N-1)/(N-K))
>> >> >
>> >> > uj <- data.frame(apply(estfun(fm),2, function(x)
data.frame(tapply(x,
>> >> > cluster, sum)) ) );
>> >> >
>> >> > vcovCL <- dfc*sandwich(fm, meat=crossprod(uj)/N)
>> >> >
>> >> > coeftest(fm, vcovCL)
>> >> >
>> >> > }
>> >> >
>> >> >
>> >> > When I run my function, I get the message:
>> >> >
>> >> >
>> >> > Error in tapply(x, cluster, sum) : arguments must have
same length
>> >> >
>> >> >
>> >> > If I specify instead attach(na.omit(dat), warn.conflicts
= F) ?and
>> >> > don't
>> >> > have the "na.rm=TRUE" argument, then my
function runs; but only for
>> >> > the
>> >> > rows
>> >> > where there are no missing values AT ALL; however, I
don't care if
>> >> > there
>> >> > are
>> >> > missing values for variables on which I am not applying
my function.
>> >> >
>> >> >
>> >> > For example, I have information on children's size;
if I want regress
>> >> > scores
>> >> > on age and parents' education, clustering on class, I
would like
>> >> > missing
>> >> > values in size not to interfere (ie if I have scores,
age, parents'
>> >> > education, and class, but not size, I don't want to
drop this
>> >> > observation).
>> >> >
>> >> >
>> >> > I tried to look at the code of "lm" to see how
the na.action part
>> >> > works,
>> >> > but
>> >> > I couldn't figure it out... This is exactly how I
would like to deal
>> >> > with
>> >> > missing values.
>> >> >
>> >> >
>> >> > I tried to write
>> >> >
>> >> > cl <- function(dat, fm, cluster, na.action){
>> >> >
>> >> > attach( dat , warn.conflicts = F)
>> >> >
>> >> > library(sandwich)
>> >> >
>> >> > library(lmtest)
>> >> >
>> >> > ?M <- length(unique(cluster))
>> >> >
>> >> > ?N <- length(cluster)
>> >> >
>> >> > ?K <- fm$rank
>> >> >
>> >> > ?dfc <- (M/(M-1))*((N-1)/(N-K))
>> >> >
>> >> > uj <- data.frame(apply(estfun(fm),2, function(x)
data.frame(tapply(x,
>> >> > cluster, sum)) ) );
>> >> >
>> >> > vcovCL <- dfc*sandwich(fm, meat=crossprod(uj)/N)
>> >> >
>> >> > ?coeftest(fm, vcovCL)
>> >> >
>> >> > }
>> >> >
>> >> > ?attr(cl,"na.action") <- na.exclude
>> >> >
>> >> >
>> >> > but it still didn't work...
>> >> >
>> >> >
>> >> > Any ideas of how to deal with this issue?
>> >> >
>> >> > Thank you for your answers!
>> >> >
>> >> > Edmund
>> >> >
>> >> > ? ? ? ?[[alternative HTML version deleted]]
>> >> >
>> >> > ______________________________________________
>> >> > R-help at r-project.org mailing list
>> >> > https://stat.ethz.ch/mailman/listinfo/r-help
>> >> > PLEASE do read the posting guide
>> >> > http://www.R-project.org/posting-guide.html
>> >> > and provide commented, minimal, self-contained,
reproducible code.
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Joris Meys
>> >> Statistical consultant
>> >>
>> >> Ghent University
>> >> Faculty of Bioscience Engineering
>> >> Department of Applied mathematics, biometrics and process
control
>> >>
>> >> tel : +32 9 264 59 87
>> >> Joris.Meys at Ugent.be
>> >> -------------------------------
>> >> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
>> >
>> >
>>
>>
>>
>> --
>> Joris Meys
>> Statistical consultant
>>
>> Ghent University
>> Faculty of Bioscience Engineering
>> Department of Applied mathematics, biometrics and process control
>>
>> tel : +32 9 264 59 87
>> Joris.Meys at Ugent.be
>> -------------------------------
>> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
>
>
--
Joris Meys
Statistical consultant
Ghent University
Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control
tel : +32 9 264 59 87
Joris.Meys at Ugent.be
-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php