thr3ads.net - R help - [R] Re: Re: Find Closest 5 Cases? [Feb 2004]

dsheuman@rogers.com

2004-Feb-13 20:35 UTC

[R] Re: Re: Find Closest 5 Cases?

Art (and group),

I'm doing this as a form of missing value analysis.  Approximately 30% of
the cases are missing data for one variable.  To impute values for those cases,
I'd like to match each case that is missing the variable to the other cases
most similar to it, and then take an average of those to infill.

I realize there are many methods for imputing data.  I'm not well versed in
any in particular (except regression and cluster analysis).  That said, given
that I have an extensive data set already with most variables populated, I can
find the closest observations in N-dimensional space and impute the value that
way - by focusing on the best matches.
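
A minimal base R sketch of the matching idea described above, for illustration
only.  It assumes a numeric matrix in which only the target column has missing
values, standardizes the other columns, and uses plain Euclidean distance.  The
helper name knn_mean_impute and the simulated data are hypothetical, not
something from the thread.

```r
# Hypothetical helper (not from the thread): fill one column with the mean of
# the k nearest complete cases. Assumes all other columns are fully observed.
knn_mean_impute <- function(x, target, k = 5) {
  miss <- is.na(x[, target])
  stopifnot(is.matrix(x), any(!miss), k <= sum(!miss))
  # Standardize the matching variables so none dominates the distance.
  z <- scale(x[, -target, drop = FALSE])
  donors_t <- t(z[!miss, , drop = FALSE])  # variables in rows, donors in columns
  donors_y <- x[!miss, target]
  out <- x
  for (i in which(miss)) {
    # Squared Euclidean distance from case i to every donor case.
    d2 <- colSums((donors_t - z[i, ])^2)
    out[i, target] <- mean(donors_y[order(d2)[seq_len(k)]])
  }
  out
}

# Small simulated example: 200 cases, 6 variables, 30% of column 3 missing.
set.seed(1)
x <- matrix(rnorm(200 * 6), ncol = 6)
x[sample(200, 60), 3] <- NA
filled <- knn_mean_impute(x, target = 3, k = 5)
anyNA(filled)  # FALSE: every missing value has been filled
```

This is a brute-force search: each incomplete case is compared with every
complete case, so on 50000 cases by 50 variables it would be slow, though it
never builds the full 50000 x 50000 distance matrix.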

If there are any other thoughts on how to do this (relatively easily), I'm
open to suggestions and being educated.

Thanks,

Danny

> From: Art Kendall <Art at DrKendall.org>
> Date: 2004/02/13 Fri PM 02:47:00 EST
> To: Danny Heuman <a0079454 at airnews.net>
> Subject: Re: Find Closest 5 Cases?
>
> This would be extremely compute intensive.
> Why are you trying to do this?
> Do the 5 percentages sum to a constant total?
>
> If you tell us more about the problem and its context perhaps we can make
> some suggestions.
>
> E.g., if you could live with groups of any size that are close
> you might try transforming the percentages to z's and applying a TWOSTEP
> procedure.
>
> If you really, really need 5, the use of cluster membership variables
> and distances from cluster centers could be used to limit searches, but
> I wouldn't want to try to work it out without more info, especially since
> I do not presently have SPSS on my system so I could verify my
> recommendations.
>
> Hope this helps.
>
> Art
> Art at DrKendall.org
> Social Research Consultants
> University Park, MD USA
> (301) 864-5570
>
> Danny Heuman wrote:
>
> > I have a need to identify for each CASE the closest (or most similar) 5
> > other CASES (not including itself as it is automatically the closest).  I
> > have a fairly large matrix (50000 cases by 50 vars).
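
The quoted suggestion to convert the variables to z-scores and use cluster
membership to limit the search could be sketched roughly as follows in base R.
This is only an illustration under assumptions: it swaps in kmeans() for the
SPSS TWOSTEP procedure mentioned in the quote, the helper name
approx_knn_by_cluster and the number of clusters are made up for the example,
and neighbours that fall in a different cluster are missed, so the result is
approximate.

```r
# Hypothetical helper: approximate k nearest neighbours, searching only within
# each case's own k-means cluster (cross-cluster neighbours are missed).
approx_knn_by_cluster <- function(x, k = 5, centers = 10) {
  z <- scale(x)  # convert each variable to z-scores
  cl <- kmeans(z, centers = centers, nstart = 5)$cluster
  nn <- matrix(NA_integer_, nrow(z), k)  # row i holds the neighbours of case i
  for (g in unique(cl)) {
    idx <- which(cl == g)
    d <- as.matrix(dist(z[idx, , drop = FALSE]))  # distances within this cluster
    diag(d) <- Inf  # a case is not its own neighbour
    take <- min(k, length(idx) - 1)  # small clusters yield fewer neighbours
    if (take > 0) {
      for (j in seq_along(idx)) {
        nn[idx[j], seq_len(take)] <- idx[order(d[j, ])[seq_len(take)]]
      }
    }
  }
  nn
}

# Simulated example: 500 cases, 4 variables, 5 clusters.
set.seed(2)
x <- matrix(rnorm(500 * 4), ncol = 4)
nn <- approx_knn_by_cluster(x, k = 5, centers = 5)
head(nn)
```

Each cluster needs its own distance matrix, so for 50000 cases the number of
clusters would have to be large enough to keep the clusters small; entries are
left as NA when a cluster has fewer than k other members.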

Art Kendall

2004-Feb-13 21:09 UTC

[R] Re: Find Closest 5 Cases?

Dealing with missing data can be very complex.  A lot depends on the 
actual research area under study.  Giving reasonable suggestions would 
take a lot more understanding of the context in which the question is 
being asked, the nature of the data, and the review procedures the 
results would undergo.  How much effort it would take to justify a novel 
way of dealing with missing data also needs to be considered.

Are there variables for each case outside the 5 that are measured as 
percentages?  
Why was the data gathered in the first place? 
What questions is it being used to answer?

Why are the values missing for these particular cases?  Is there any 
reason to believe that missingness is related to what the "true value" is?
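
Whether missingness depends on the unobserved "true value" itself cannot be
checked from the data alone, but whether it is related to the observed
variables can be.  One possible check, sketched here in base R under the
assumption that the other columns are complete, is a logistic regression of a
missing-value indicator on those columns.  The helper name missingness_check
and the simulated data are hypothetical.

```r
# Hypothetical check: is missingness in one column related to the observed columns?
missingness_check <- function(x, target) {
  d <- as.data.frame(x[, -target, drop = FALSE])
  d$is_missing <- as.integer(is.na(x[, target]))
  full <- glm(is_missing ~ ., data = d, family = binomial)
  null <- glm(is_missing ~ 1, data = d, family = binomial)
  # Likelihood-ratio test of the full model against an intercept-only model.
  anova(null, full, test = "Chisq")
}

# Simulated example in which values are removed completely at random.
set.seed(3)
x <- matrix(rnorm(300 * 4), ncol = 4)
x[sample(300, 90), 2] <- NA
res <- missingness_check(x, target = 2)
print(res)
```

A small p-value would suggest that the cases with missing values differ
systematically from the complete cases on the observed variables.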

Art


Sean Davis

2004-Feb-13 21:45 UTC

[R] Re: Re: Find Closest 5 Cases?

Danny,

In the bioconductor suite (www.bioconductor.org) in the pamr package there
is a function called pamr.knnimpute that will probably do something at least
close to what you would like to do.

Sean
-- 
Sean Davis, M.D., Ph.D.

Clinical Fellow
National Institutes of Health
National Cancer Institute
National Human Genome Research Institute

Clinical Fellow, Johns Hopkins
Department of Pediatric Oncology
-- 




Chuck Cleland

2004-Feb-13 22:02 UTC

[R] Re: Re: Find Closest 5 Cases?

dsheuman at rogers.com wrote:

> I'm doing this as a form of missing value analysis.  Approximately 30% of
> the cases are missing data for one variable.  To impute values for those
> cases, I'd like to match each case that is missing the variable to the other
> cases most similar to it, and then take an average of those to infill.
>
> I realize there are many methods for imputing data.  I'm not well versed in
> any in particular (except regression and cluster analysis).  That said,
> given that I have an extensive data set already with most variables
> populated, I can find the closest observations in N-dimensional space and
> impute the value that way - by focusing on the best matches.
>
> If there are any other thoughts on how to do this (relatively easily),
> I'm open to suggestions and being educated.

You might have a look at impute.knn() in the impute package on CRAN.

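library(impute)  # provides impute.knn()

# 50000 cases by 20 binary variables, with 30% of column 5 set to missing
mymat <- matrix(rbinom(50000 * 20, 1, 0.5), ncol = 20)
mymat[sample(50000, 50000 * 0.30), 5] <- NA
summary(mymat)
summary(impute.knn(mymat, k = 5)$data)  # same summary after 5-nearest-neighbour imputation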

hope this helps,

Chuck Cleland

-- 
Chuck Cleland, Ph.D.
NDRI, Inc.
71 West 23rd Street, 8th floor
New York, NY 10010
tel: (212) 845-4495 (Tu, Th)
tel: (732) 452-1424 (M, W, F)
fax: (917) 438-0894
