thr3ads.net - R help - [R] Clustering newbie question [Dec 2012]

Anton Ashanin

2012-Dec-18 22:24 UTC

[R] Clustering newbie question

Hello,
Please advise on encoding data for the following clustering problem.
I have a dataset with car usage info. The dataset has the following fields:
1. Car model  (Toyota Celica, BMW, Nissan X-Trail, Mazda Cosmo, etc.)
2. Year built 
3. Country where the car runs 
4. Distance run by car before major repairs 

Important: The above dataset is sparse.
For most cars, "Distance" is known for only some of the countries.

Problem: 
For a given car, predict the "Distance" it will run before major
repairs in a country for which "Distance" is unknown.

My approach:
I want to represent each record in the dataset as a sparse vector with the
following components:
1. Binary (1/0) car model components. Number of these components equals the
number of all possible models in the dataset.
2. Binary (1/0) country where the car runs. Number of these components equals
the number of all possible countries in the dataset.
3. Distance. A single integer component equal to the distance run by the car.
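
To make the encoding above concrete, here is a minimal sketch in plain Python. The records, the field order (model, country, distance) and the helper name one_hot_encode are all made up for illustration; they are not taken from the real dataset.

```python
# Hypothetical records: (model, country, distance). Values are invented.
records = [
    ("Toyota Celica", "Japan", 180000),
    ("BMW", "Germany", 220000),
    ("Nissan X-Trail", "Japan", 150000),
]

def one_hot_encode(records):
    """Return (models, countries, vectors) for a list of records.

    Each vector has one 0/1 component per distinct model, one 0/1
    component per distinct country, and a final distance component.
    """
    models = sorted({r[0] for r in records})
    countries = sorted({r[1] for r in records})
    vectors = []
    for model, country, distance in records:
        model_part = [1 if model == m else 0 for m in models]
        country_part = [1 if country == c else 0 for c in countries]
        vectors.append(model_part + country_part + [distance])
    return models, countries, vectors

models, countries, vectors = one_hot_encode(records)
for v in vectors:
    print(v)
```

Note that in such a vector the raw distance component is several orders of magnitude larger than the 0/1 components, so it would dominate any Euclidean distance unless it is rescaled first.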

Next, I want to cluster these vectors with k-means and analyze the resulting groups.

Questions:
1) My vectors mix components of different kinds: binary (model,
country) and continuous (distance). How should I calculate the distance
between such vectors? Cosine similarity?
2) Are there other ways to encode components with a finite set of values
(model, country) so that they work well with continuous components (such as
distance)?
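
For reference on question 1, one measure commonly used for mixed categorical and continuous data is a Gower-style dissimilarity: a categorical field contributes 0 if the two values match and 1 otherwise, a numeric field contributes the absolute difference divided by the range of that field, and the contributions are averaged. The sketch below is only an illustration under that assumption; the data and the name gower_distance are hypothetical. Such a dissimilarity is usually paired with k-medoids or hierarchical clustering rather than k-means, since k-means needs to compute means of the vectors.

```python
def gower_distance(a, b, numeric_ranges):
    """Gower-style dissimilarity between two records of mixed type.

    a, b: tuples of equal length, e.g. (model, country, distance).
    numeric_ranges: dict mapping the index of each numeric field to
    the range (max - min) of that field over the whole dataset.
    Fields not listed in numeric_ranges are treated as categorical.
    """
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in numeric_ranges:
            span = numeric_ranges[i]
            # Range-normalised absolute difference, in [0, 1].
            total += abs(x - y) / span if span > 0 else 0.0
        else:
            # Simple matching: 0 if equal, 1 if different.
            total += 0.0 if x == y else 1.0
    return total / len(a)

# Hypothetical records: (model, country, distance). Values are invented.
cars = [
    ("Toyota Celica", "Japan", 180000),
    ("BMW", "Germany", 220000),
    ("Toyota Celica", "Germany", 200000),
]
distances = [c[2] for c in cars]
ranges = {2: max(distances) - min(distances)}

d01 = gower_distance(cars[0], cars[1], ranges)
d02 = gower_distance(cars[0], cars[2], ranges)
print(d01, d02)
```

With these made-up records, the first and third cars share a model and have closer distances, so they come out as more similar than the first and second.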

Thanks!
Anton
