Leif Kirschenbaum
2013-Jun-25 01:47 UTC
[R] Unique matching of two sets of multidimensional data
Dear list,

I've searched the archives and tried some code, but would appreciate some input - even a pointer in the direction of the correct function to use.

Given N samples, each of which is measured for characteristics x1, x2, x3,... (m ≈ 6), where each characteristic is a roughly normally distributed numeric, but with a different center and scale. Then the N samples are measured again for the same characteristics x1, x2, x3,..., but this time the identity of the samples is unknown.

Is there a function which will assign the unique identities from the first measurement to the second measurement?

I've tried scaling by the pooled variance of each characteristic (i.e. using all 2N values to estimate the variance of x1, of x2, etc.) to construct the normalized distance from one sample's second measurement x1, x2, x3,... to each of the first measurements, and then picking the minimum distance to assign an identity to the second measurement. I then loop over all the second measurements to find the first measurement "closest" to each. However, I end up with one sample ID from the first measurement being assigned to multiple second measurements.

How could I minimize the total distance between the second measurements and the first while assigning each sample ID only once?

Example:
Measure the height, weight, and blood pressure of 100 people with their names recorded (scale and ruler both have some random unknown error). Measure the height, weight, and blood pressure of those 100 people again, but you forgot to write down their names (assume that the scale and ruler errors have not changed since the first measurement).

How to assign the second set of measurements to the first?
Leif Kirschenbaum, Ph.D., PMP
Principal Reliability Engineer, SSL
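One reading of the nearest-match approach described in the question can be sketched in base R as follows. Everything here is an assumption made for illustration and is not taken from the original post: the simulated data, the names `first`, `second` and `scaled_dist`, and the choice to scale each characteristic by its standard deviation over all 2N values. The sketch shows how picking the minimum distance independently for each second measurement can hand the same ID to more than one of them.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

n <- 100
# Hypothetical first measurements: height, weight, blood pressure
first <- cbind(ht = rnorm(n, 170, 10),
               wt = rnorm(n, 75, 12),
               bp = rnorm(n, 120, 10))

# Second measurements: the same samples plus measurement error, shuffled
true_id <- sample(n)
second <- first[true_id, ] +
  cbind(rnorm(n, sd = 1), rnorm(n, sd = 1.5), rnorm(n, sd = 2))

# Scale of each characteristic, estimated from all 2N values
s <- apply(rbind(first, second), 2, sd)

# Returns a matrix whose [i, j] entry is the scaled Euclidean distance
# from second measurement i to first measurement j
scaled_dist <- function(second, first, s) {
  a <- sweep(second, 2, s, "/")
  b <- sweep(first, 2, s, "/")
  d <- as.matrix(dist(rbind(a, b)))
  d[seq_len(nrow(a)), nrow(a) + seq_len(nrow(b)), drop = FALSE]
}
d <- scaled_dist(second, first, s)

# Each second measurement independently picks its closest first measurement
nearest <- apply(d, 1, which.min)

# First-measurement IDs claimed by more than one second measurement
dup_ids <- unique(nearest[duplicated(nearest)])
length(dup_ids)
```

Because every row of `d` is minimised on its own, nothing stops two rows from choosing the same column, which is the duplicate assignment reported above.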
Adams, Jean
2013-Jun-25 13:34 UTC
[R] Unique matching of two sets of multidimensional data
You could give 1-nearest neighbor classification a try. For example,

# first measurements, with known identities
a <- data.frame(person=1:10, ht=rnorm(10, mean=5, sd=1),
    wt=rnorm(10, mean=180, sd=30), bp=rnorm(10, mean=120, sd=10))
# measurement error added to the second set
meas.err <- data.frame(ht=rnorm(10, sd=0.1), wt=rnorm(10, sd=3),
    bp=rnorm(10, sd=1))
# second measurements, in shuffled order and without identities
b <- (a[, -1] + meas.err)[sample(10), ]
library(class)
# label each row of b with the person whose row in a is closest
b$person <- knn1(a[, -1], b, a$person)
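1-nearest-neighbor classification labels each second measurement independently, so by itself it can still give one person to several rows. One way to force a one-to-one match is to treat the problem as a linear assignment problem over the matrix of scaled distances. The sketch below is a suggestion rather than something proposed in this thread: it assumes the `clue` package is installed and uses its `solve_LSAP()` function, and the names `a.s`, `b.s` and `cost` are hypothetical.

```r
library(clue)  # assumed to be installed; provides solve_LSAP()

set.seed(42)  # arbitrary seed, only for reproducibility
n <- 10
a <- data.frame(person = 1:n,
                ht = rnorm(n, mean = 5, sd = 1),
                wt = rnorm(n, mean = 180, sd = 30),
                bp = rnorm(n, mean = 120, sd = 10))
meas.err <- data.frame(ht = rnorm(n, sd = 0.1),
                       wt = rnorm(n, sd = 3),
                       bp = rnorm(n, sd = 1))
b <- (a[, -1] + meas.err)[sample(n), ]

# Scale every column by its standard deviation over both measurement sets
a.m <- as.matrix(a[, -1])
b.m <- as.matrix(b)
s <- apply(rbind(a.m, b.m), 2, sd)
a.s <- sweep(a.m, 2, s, "/")
b.s <- sweep(b.m, 2, s, "/")

# cost[i, j]: scaled Euclidean distance between row i of b and row j of a
d <- unname(as.matrix(dist(rbind(b.s, a.s))))
cost <- d[seq_len(n), (n + 1):(2 * n)]

# One-to-one assignment of rows of b to rows of a with minimal total cost
assignment <- solve_LSAP(cost)
b$person <- a$person[as.integer(assignment)]

# Every person is used exactly once
stopifnot(anyDuplicated(b$person) == 0)
```

The assignment minimises the sum of distances over all pairs, so an individual row is not always matched to its own nearest neighbor; that is the price of keeping the IDs unique.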