Hello,

Please advise on encoding data for the following clustering problem.

I have a dataset with car usage info. The dataset has the following fields:

1. Car model (Toyota Celica, BMW, Nissan X-Trail, Mazda Cosmo, etc.)
2. Year built
3. Country where the car runs
4. Distance run by the car before major repairs

Important: The above dataset is sparse. In most cases "Distance" is not known for all countries for a given car.

Problem: For a given car, predict the "Distance" it will run before major repairs in a country for which "Distance" is unknown.

My approach: I want to represent each record in the dataset as a sparse vector with the following components:

1. Binary (1/0) car model components. The number of these components equals the number of distinct models in the dataset.
2. Binary (1/0) country components. The number of these components equals the number of distinct countries in the dataset.
3. Distance. A single integer component equal to the distance run by the car.

Next I want to cluster these vectors with k-means and analyze the resulting groups.

Questions:

1) My vectors mix components of a different nature: binary (model, country) and continuous (distance). How should I calculate the distance between such vectors? Cosine similarity?
2) Are there other ways to encode components with a finite set of values (model, country) so that they work well with continuous components (such as distance)?

Thanks!
Anton
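
P.S. To make the encoding concrete, here is a minimal sketch of the vectors described above, in plain Python. The records are made-up toy values, and the names (records, build_index, encode) are hypothetical. Rescaling the distance to [0, 1] is an assumption of this sketch and not part of the approach above; without some rescaling, the raw distance would dominate the binary components in any Euclidean-style comparison.

```python
# Toy records, made up for illustration only: (model, country, distance)
records = [
    ("Toyota Celica", "Germany", 180000),
    ("BMW", "Germany", 150000),
    ("Nissan X-Trail", "Japan", 210000),
    ("Toyota Celica", "Japan", 200000),
]

def build_index(values):
    """Map each distinct value to a fixed vector position (sorted order)."""
    return {v: i for i, v in enumerate(sorted(set(values)))}

model_index = build_index(r[0] for r in records)
country_index = build_index(r[1] for r in records)

distances = [r[2] for r in records]
d_min, d_max = min(distances), max(distances)

def encode(model, country, distance):
    """Binary model components + binary country components + one distance component."""
    model_part = [0.0] * len(model_index)
    model_part[model_index[model]] = 1.0
    country_part = [0.0] * len(country_index)
    country_part[country_index[country]] = 1.0
    # Assumption of this sketch: min-max scale the distance to [0, 1]
    # so that it is on the same scale as the 0/1 components.
    scaled = (distance - d_min) / (d_max - d_min) if d_max > d_min else 0.0
    return model_part + country_part + [scaled]

vectors = [encode(*r) for r in records]
for record, vector in zip(records, vectors):
    print(record, vector)
```

Each vector has one component per distinct model, one per distinct country, and a final distance component. Whether min-max scaling is the right choice for the distance is exactly what question 2 is asking.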