Michael Friendly
2006-Apr-06 16:16 UTC
[R] calculating similarity/distance among hierarchically classified items
This is a question about how to calculate similarities/distances among items that are classified by hierarchical attributes, for the purpose of visualizing the relations among items by means of clustering, MDS, self-organizing maps, and so forth.

I have a set of ~260 items that have been classified using two sets of hierarchically organized codes, on the basis of form and content. The data look like the sample below, where the last two variables (ITEMFORM and ITEMCONTENT) are each a ';'-separated list of codes assigned to each item. The items are identified by the KEY variable. (Other fields are ignored here.)

KEY,YEAR,WHERE,CONTENT,FORM,ITEMFORM,ITEMCONTENT
1782Fourcroy,1782,Eur,Hdem,Stats,F5;F5K;F5N;F5N1,C8;C82
1785Crome,1785,Eur,Pdem,Stats,F5;F5N;F5N1,C7
1786Playfair,1786,Eur,Hdem,Stats,F6;F68;F69;F61;F62,C3;C32;C321;C323
1787Chladni,1787,Eur,Other,Other,F5;F55;FH;FD;FD3,C9;C95
1794Buxton,1794,Eur,Other,Tech,F3;F31;F7;F72;F722,C9;C9A
1795Pouchet,1795,Eur,Math,Stats,F5;F5G;FG;FG7,C2
1796Watt,1796,Eur,Pdem,Tech,FGB,C7;C9;C9A
1798Senefelder,1798,Eur,Other,Tech,FB;F5,C9;C97
...

The codes are hierarchical in the sense that, e.g.,

   C321 corresponds to the levels in a tree, Commerce (C3) > Internal (C32) > Labour (C321)
   F5G corresponds to Diagram (F5) > Nomogram (F5G)

so the number of characters in a code is the level in the tree. There are about 290 distinct codes, with varying frequency of use, from 1 to ~40, so the data could be rearranged to a 260 x 290 incidence matrix of items x codes.

In computing similarities between items, all measures I know of for binary attribute data treat the attributes as nominal, and so ignore the hierarchical nature of the codes. To take that into account, the 0/1 values could be replaced by the tree level values (0=NA, 1..5) of the codes in each column. Then some measure of similarity could be computed based on the profiles for each pair of items. But I don't know what measure (Gower's, Euclidean, etc.) would be (most, or arguably) appropriate here.

Is this a situation that anyone recognizes? Or, maybe there is another way to approach this. I'd appreciate any suggestions.

-- 
Michael Friendly     Email: friendly at yorku.ca
Professor, Psychology Dept.
York University      Voice: 416 736-5115 x66249 Fax: 416 736-5814
4700 Keele Street    http://www.math.yorku.ca/SCS/friendly.html
Toronto, ONT  M3J 1P3 CANADA
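
For concreteness, here is a minimal sketch of the rearrangement described in the post: an items x codes incidence matrix, and a second matrix in which each 1 is replaced by the tree level of its code. It is written in Python using only the standard library, and the same steps should carry over to R. Several things here are assumptions rather than part of the original post: the function names (code_level, item_codes, build_matrices) are hypothetical, only four of the sample rows are used, and the level of a code is taken to be its length minus one, on the assumption that the leading letter (F or C) only names the coding scheme, so that C3 is level 1 and C321 is level 3. The sketch does not answer the question of which similarity measure is appropriate; it only builds the inputs.

```python
import csv
import io

# Four of the sample rows from the post; the full data set has ~260 items.
SAMPLE = """\
KEY,YEAR,WHERE,CONTENT,FORM,ITEMFORM,ITEMCONTENT
1782Fourcroy,1782,Eur,Hdem,Stats,F5;F5K;F5N;F5N1,C8;C82
1785Crome,1785,Eur,Pdem,Stats,F5;F5N;F5N1,C7
1786Playfair,1786,Eur,Hdem,Stats,F6;F68;F69;F61;F62,C3;C32;C321;C323
1795Pouchet,1795,Eur,Math,Stats,F5;F5G;FG;FG7,C2
"""


def code_level(code):
    """Tree level of a code.

    ASSUMPTION: the first character (F or C) names the scheme and every
    further character adds one level, so C3 -> 1, C32 -> 2, C321 -> 3.
    """
    return len(code) - 1


def item_codes(row):
    """All form and content codes assigned to one item, as a set."""
    form = row["ITEMFORM"].split(";")
    content = row["ITEMCONTENT"].split(";")
    return set(form) | set(content)


def build_matrices(text):
    """Return (keys, codes, incidence, levels) for CSV text.

    incidence[i][j] is 1 if item i carries code j, else 0.
    levels[i][j] is the tree level of code j if item i carries it, else 0.
    """
    rows = list(csv.DictReader(io.StringIO(text)))
    keys = [r["KEY"] for r in rows]
    codes_by_item = [item_codes(r) for r in rows]
    codes = sorted(set().union(*codes_by_item))
    incidence = [[1 if c in s else 0 for c in codes] for s in codes_by_item]
    levels = [[code_level(c) if c in s else 0 for c in codes]
              for s in codes_by_item]
    return keys, codes, incidence, levels


keys, codes, incidence, levels = build_matrices(SAMPLE)
for key, row in zip(keys, levels):
    print(key, row)
```

Either matrix could then be handed to a distance routine. The choice between the 0/1 coding and the level coding, and the measure applied to it, remains the open question of the post.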