Leif Kirschenbaum
2013-Jun-25 01:47 UTC
[R] Unique matching of two sets of multidimensional data
Dear list,
I've searched the archives and tried some code, but would appreciate some
input, even just a pointer toward the correct function to use.
I have N samples, each measured on m characteristics x1, x2, x3, ... (m ~ 6).
Each characteristic is a roughly normally distributed numeric, but each has a
different center and scale.
The N samples are then measured again on the same characteristics x1, x2,
x3, ..., but this time the identity of each sample is unknown.
Is there a function which will assign the unique identities from the first
measurement to the second measurement?
I've tried scaling each characteristic by its pooled variance (i.e. using all
2N values to estimate the variance of characteristic x1, of characteristic x2,
etc.), computing the normalized distance from one sample's second measurement
(x1, x2, x3, ...) to each of the first measurements, and picking the minimum
distance to assign an identity to that second measurement. I then loop over
all the second measurements, finding the first measurement "closest" to each.
However, I end up with one sample ID from the first measurement being assigned
to multiple second measurements.
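
For illustration, the procedure described above can be sketched in base R on
simulated data. Everything here is an assumption made for the sketch: the
simulated centers, scales and error sizes, the seed, and the names first,
second, d and nearest. It is not the poster's actual code or data.

```r
set.seed(1)
N <- 100   # number of samples
m <- 6     # number of characteristics

# Simulated first measurements: each characteristic has its own center and scale
centers <- c(5, 180, 120, 10, 50, 1)
scales  <- c(1, 30, 10, 2, 5, 0.1)
first <- sapply(seq_len(m), function(j) rnorm(N, centers[j], scales[j]))

# Second measurements: same samples plus measurement error, in unknown order
err    <- sapply(seq_len(m), function(j) rnorm(N, 0, 0.1 * scales[j]))
second <- (first + err)[sample(N), ]

# Scale each characteristic by its SD estimated from all 2N values
s <- apply(rbind(first, second), 2, sd)
f <- sweep(first,  2, s, "/")
g <- sweep(second, 2, s, "/")

# d[i, j] = distance from second measurement i to first measurement j
d <- as.matrix(dist(rbind(g, f)))[seq_len(N), N + seq_len(N)]

# Pick the closest first measurement for each second measurement
nearest <- apply(d, 1, which.min)

# Nothing stops two rows from picking the same column
anyDuplicated(nearest) > 0   # TRUE when some ID is assigned more than once
```

Because each second measurement is matched independently of the others,
duplicate assignments are possible whenever two second measurements share the
same closest first measurement.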
How could I minimize the overall distance between the matched second and first
measurements while assigning each sample ID only once?
Example:
Measure the height, weight, and blood pressure of 100 people with their names
recorded (the scale and ruler both have some random unknown error).
Measure the height, weight, and blood pressure of those 100 people again, but
forget to write down their names (assume that the scale and ruler errors
have not changed since the first measurement).
How to assign the second set of measurements to the first?
Leif Kirschenbaum, Ph.D., PMP
Principal Reliability Engineer
Parts Engineering
Design Reliability
Product Reliability
SSL
3825 Fabian Way M/S H-21
Palo Alto, CA 94303
Tel: +1-650-852-6580
Facsimile: +1-650-852-7832
www.ssloral.com
Adams, Jean
2013-Jun-25 13:34 UTC
[R] Unique matching of two sets of multidimensional data
You could give 1-nearest neighbor classification a try. For example,
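# First measurements, with known identities
a <- data.frame(person=1:10, ht=rnorm(10, mean=5, sd=1),
    wt=rnorm(10, mean=180, sd=30), bp=rnorm(10, mean=120, sd=10))
# Measurement error for the second round
meas.err <- data.frame(ht=rnorm(10, sd=0.1),
    wt=rnorm(10, sd=3), bp=rnorm(10, sd=1))
# Second measurements: first plus error, rows shuffled, identities dropped
b <- (a[, -1] + meas.err)[sample(10), ]
# Label each second measurement with its nearest first measurement
library(class)
b$person <- knn1(a[, -1], b, a$person)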
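
Nearest-neighbour classification labels each second measurement
independently, so it does not by itself guarantee that every person is used
only once. One possible way to force a one-to-one matching is to treat the
task as a linear assignment problem. The sketch below assumes the clue
package is installed and uses its solve_LSAP() function; the seed, the scaling
by SDs over all 2n values, the squared-distance cost, and the names a.sc, b.sc
and cost are illustrative choices, not part of the suggestion above.

```r
library(clue)  # assumption: clue is installed; it provides solve_LSAP()

set.seed(42)
n <- 10
# Same kind of simulated data as above
a <- data.frame(person = 1:n, ht = rnorm(n, mean = 5, sd = 1),
                wt = rnorm(n, mean = 180, sd = 30),
                bp = rnorm(n, mean = 120, sd = 10))
meas.err <- data.frame(ht = rnorm(n, sd = 0.1),
                       wt = rnorm(n, sd = 3), bp = rnorm(n, sd = 1))
b <- (a[, -1] + meas.err)[sample(n), ]

# Put the characteristics on a common scale (SD over all 2n values)
am <- as.matrix(a[, -1])
bm <- as.matrix(b)
s  <- apply(rbind(am, bm), 2, sd)
a.sc <- sweep(am, 2, s, "/")
b.sc <- sweep(bm, 2, s, "/")

# cost[i, j] = squared scaled distance from second measurement i
# to first measurement j
d    <- as.matrix(dist(rbind(b.sc, a.sc)))[seq_len(n), n + seq_len(n)]
cost <- d^2

# One-to-one assignment minimizing the total cost
assignment <- solve_LSAP(cost)
b$person <- a$person[as.integer(assignment)]

anyDuplicated(b$person)  # 0: each person is assigned exactly once
```

The result is a permutation of the first-measurement rows, so no ID can be
assigned twice. Whether the matching is also correct depends on how large the
measurement error is relative to the spread between people.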