Topic: Cluster Analysis
Audience: MCA + IT III Year, JPNCE Mahboobnagar, 18-04-2008
What is Cluster Analysis?
Cluster : Collection of data objects
Intraclass similarity: objects are similar to other objects in the same cluster
Interclass dissimilarity: objects are dissimilar to objects in other clusters
Cluster analysis
Statistical method for grouping a set of data objects into clusters
A good clustering method produces high quality clusters with high intraclass similarity and low interclass similarity
Clustering is unsupervised classification
Examples of Clustering Applications
Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
City-planning: Identifying groups of houses according to their house type, value, and geographical location
Earthquake studies: observed earthquake epicenters should be clustered along continental faults
Data Representation
Data matrix (two-mode): n objects with p attributes

$$\begin{bmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{bmatrix}$$

Dissimilarity matrix (one-mode): d(i,j) is the dissimilarity between objects i and j

$$\begin{bmatrix} 0 & & & \\ d(2,1) & 0 & & \\ \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & 0 \end{bmatrix}$$
Types of Data in Cluster Analysis
Interval-Scaled Variables
Binary Variables
Nominal, Ordinal, and Ratio-Scaled Variables
Variables of Mixed Types
Interval-Scaled Variables
Continuous measurements on a roughly linear scale
E.g. weight, height, temperature, etc.
Using Interval-Scaled Values
Step 1: Standardize the data (a sketch follows these steps)
To ensure the attributes all have equal weight
To map different measurement scales onto a single, uniform scale
Not always needed! Sometimes we require unequal weights for certain attributes
Step 2: Compute the dissimilarity between records
Use Euclidean, Manhattan, or Minkowski distance
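The lecture does not pin down a particular standardization; one common choice in the data mining literature is the mean absolute deviation, $s_f = \frac{1}{n}\sum_i |x_{if} - m_f|$, with z-score $z_{if} = (x_{if} - m_f)/s_f$. A minimal Python sketch under that assumption (the function name and sample data are mine):

```python
import numpy as np

def standardize(X):
    """Standardize each attribute (column) of an n-by-p data matrix
    using the mean absolute deviation, which is less sensitive to
    outliers than the standard deviation."""
    m = X.mean(axis=0)                 # per-attribute mean m_f
    s = np.abs(X - m).mean(axis=0)     # mean absolute deviation s_f
    return (X - m) / s                 # z-scores z_if = (x_if - m_f) / s_f

# Hypothetical records: height (m) and weight (kg) on very different scales
heights_weights = np.array([[1.6, 55.0], [1.7, 70.0], [1.8, 85.0]])
print(standardize(heights_weights))
```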
Data Types and Distance Metrics
Distances are normally used to measure the similarity or dissimilarity between two data objects
Minkowski distance:

$$d(i,j) = \left( |x_{i1}-x_{j1}|^q + |x_{i2}-x_{j2}|^q + \cdots + |x_{ip}-x_{jp}|^q \right)^{1/q}$$

where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are two p-dimensional data objects, and q is a positive integer
Data Types and Distance Metrics (Cont’d)
If q = 1, d is Manhattan distance
If q = 2, d is Euclidean distance
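A minimal Python sketch of the Minkowski distance and its two special cases (the function name and sample points are mine):

```python
import numpy as np

def minkowski(x, y, q):
    """Minkowski distance between two p-dimensional points.
    q = 1 gives Manhattan distance, q = 2 gives Euclidean distance."""
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

x, y = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(minkowski(x, y, 1))  # Manhattan: |1-4| + |2-6| = 7.0
print(minkowski(x, y, 2))  # Euclidean: sqrt(9 + 16) = 5.0
```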
Data Types and Distance Metrics (Cont’d)
Properties
d(i,j) ≥ 0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) ≤ d(i,k) + d(k,j)
Can also use weighted distance, or other dissimilarity measures.
Binary Attributes
A contingency table for binary data, counting attribute agreements between objects i and j:

              object j = 1    object j = 0
object i = 1        q               r
object i = 0        s               t

Simple matching coefficient (if the binary attributes are symmetric):

$$d(i,j) = \frac{r + s}{q + r + s + t}$$

Jaccard coefficient (if the binary attributes are asymmetric; 0-0 matches are ignored):

$$d(i,j) = \frac{r + s}{q + r + s}$$
Dissimilarity between Binary Attributes: Example (a sketch follows below)
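Since the worked example itself is not reproduced above, here is a hedged sketch: the patient vectors below are hypothetical, chosen only to exercise the two coefficients.

```python
import numpy as np

def binary_dissimilarity(x, y, asymmetric=False):
    """Dissimilarity between two binary vectors by counting the
    contingency-table cells:
      q: both 1, r: x=1 and y=0, s: x=0 and y=1, t: both 0."""
    q = np.sum((x == 1) & (y == 1))
    r = np.sum((x == 1) & (y == 0))
    s = np.sum((x == 0) & (y == 1))
    t = np.sum((x == 0) & (y == 0))
    if asymmetric:                       # Jaccard: ignore the 0-0 matches
        return (r + s) / (q + r + s)
    return (r + s) / (q + r + s + t)     # simple matching coefficient

# Hypothetical test results for two patients (1 = positive, 0 = negative)
jack = np.array([1, 0, 1, 0, 0, 0])
mary = np.array([1, 0, 1, 0, 1, 0])
print(binary_dissimilarity(jack, mary, asymmetric=True))   # 1/3
```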
Nominal Attributes
A generalization of the binary attribute in that it can take more than 2 states, e.g., red, yellow, blue, green
Method 1: Simple matching

$$d(i,j) = \frac{p - m}{p}$$

m: # of attributes that match for both records, p: total # of attributes
Method 2: rewrite the database and create a new binary attribute for each of the M states
For an object with color yellow, the yellow attribute is set to 1, while the remaining attributes are set to 0.
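A minimal sketch of Method 2 (the function name is mine): one new binary attribute per state.

```python
def one_hot(value, states):
    """Rewrite a nominal value as one binary attribute per state."""
    return [1 if s == value else 0 for s in states]

states = ["red", "yellow", "blue", "green"]
print(one_hot("yellow", states))  # [0, 1, 0, 0]
```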
Ordinal Attributes
An ordinal attribute can be discrete or continuous
Order is important (e.g., rank)
Can be treated like interval-scaled
replace x_if by its rank r_if ∈ {1, …, M_f}
map the range of each attribute onto [0, 1] by replacing the i-th object in the f-th attribute by

$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$

compute the dissimilarity using methods for interval-scaled attributes
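A minimal sketch of the rank-and-rescale step (the function name and sample data are mine):

```python
def ordinal_to_interval(values, order):
    """Map ordinal values onto [0, 1]: replace each value by its rank
    r_if in {1, ..., M_f}, then rescale to z_if = (r_if - 1)/(M_f - 1)."""
    rank = {state: r + 1 for r, state in enumerate(order)}  # 1-based ranks
    m = len(order)                                          # M_f states
    return [(rank[v] - 1) / (m - 1) for v in values]

grades = ordinal_to_interval(["fair", "good", "excellent", "good"],
                             order=["fair", "good", "excellent"])
print(grades)  # [0.0, 0.5, 1.0, 0.5]
```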
Ratio-Scaled Attributes
Ratio-scaled attribute: a positive measurement on a nonlinear scale, approximately at exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$
Methods:
treat them like interval-scaled attributes — not a good choice because scales may be distorted
apply logarithmic transformation
$y_{if} = \log(x_{if})$
treat them as continuous ordinal data and treat their rank as interval-scaled.
Attributes of Mixed Types
A database may contain all six types of attributes: symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio.
Use a weighted formula to combine their effects:

$$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$

where the indicator $\delta_{ij}^{(f)}$ is 0 if attribute f is missing for object i or j (and 1 otherwise)
If f is binary or nominal: d_ij^(f) = 0 if x_if = x_jf, and d_ij^(f) = 1 otherwise
If f is interval-based: use the normalized distance
If f is ordinal or ratio-scaled: compute the ranks r_if, set z_if = (r_if − 1)/(M_f − 1), and treat z_if as interval-scaled
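A minimal sketch of the combined formula (the function, the per-type handling, and the sample records are mine, following the definitions above; ordinal and ratio attributes are assumed to have already been mapped onto [0, 1]):

```python
def mixed_dissimilarity(x, y, types):
    """Weighted dissimilarity over attributes of mixed types.
    types[f] is 'nominal' or 'interval'; interval attributes are
    assumed already normalized onto [0, 1]."""
    num, den = 0.0, 0.0
    for xf, yf, t in zip(x, y, types):
        if xf is None or yf is None:     # delta_ij^(f) = 0: skip missing
            continue
        if t == "nominal":
            d = 0.0 if xf == yf else 1.0
        else:                            # interval, already on [0, 1]
            d = abs(xf - yf)
        num += d                         # delta_ij^(f) = 1 here
        den += 1.0
    return num / den

x = ["red", 0.2, 0.5]
y = ["blue", 0.4, 0.5]
print(mixed_dissimilarity(x, y, ["nominal", "interval", "interval"]))  # 0.4
```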
Introduction
Clustering is an unsupervised method of data analysis
Data instances grouped according to some notion of similarity
Access only to the set of features describing each object
No information as to where each instance should be placed within the partition
However, there might be background knowledge about the domain or data set that could be useful to the algorithm
In this paper the authors try to integrate this background knowledge into clustering algorithms.
K-means Clustering
Used to partition a data set into k groups
Instances are grouped into the k clusters based on their attributes
High intra-cluster similarity; Low inter-cluster similarity
Cluster similarity is measured with respect to the mean value of the objects in the cluster.
How does K-means work ?
First, select K random instances from the data – initial cluster centers
Second, each instance is assigned to its closest (most similar) cluster center
Third, each cluster center is updated to the mean of its constituent instances
Repeat steps two and three until there is no further change in the assignment of instances to clusters (a sketch follows below)
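A minimal sketch of these four steps (assuming NumPy; an illustration, not the paper's implementation):

```python
import numpy as np

def kmeans(X, k, seed=0):
    """Plain k-means, following the steps above."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: select k random instances as initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    assignment = None
    while True:
        # Step 2: assign each instance to its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        if assignment is not None and np.array_equal(new_assignment, assignment):
            return centers, assignment        # no change: converged
        assignment = new_assignment
        # Step 3: update each center to the mean of its instances
        for c in range(k):
            if np.any(assignment == c):
                centers[c] = X[assignment == c].mean(axis=0)
```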
Constrained K-means Clustering
Two kinds of pairwise constraints
Must-link: constraints which specify that two instances have to be in the same cluster
Cannot-link: constraints which specify that two instances must not be placed in the same cluster
When using a set of constraints, we have to take the transitive closure of the must-link constraints: if a must link with b and b must link with c, then a must also link with c
Constraints may be derived from
Partially labeled data
Background knowledge about the domain or data set
Constrained Algorithm
First, select K random instances from the data – initial cluster centers
Second, each instance is assigned to its closest (most similar) cluster center such that VIOLATE-CONSTRAINT(i, K, M, C) is false. If no such cluster exists, fail
Third, each cluster center is updated to the mean of its constituent instances
Repeat steps two and three until there is no further change in the assignment of instances to clusters
VIOLATE-CONSTRAINT(instance i, cluster K, must-link constraints M, cannot-link constraints C):
For each must-link pair (i, i=) in M: if i= is not in K, return true
For each cannot-link pair (i, i≠) in C: if i≠ is in K, return true
Otherwise, return false
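A minimal sketch of the constraint check and the constrained assignment step (my reading of the pseudocode above, not the paper's actual code):

```python
def violates_constraints(i, cluster, assignment, must_link, cannot_link):
    """True if placing instance i into `cluster` breaks any constraint.
    assignment[j] is j's current cluster index, or None if unassigned."""
    for a, b in must_link:
        if i == a and assignment[b] is not None and assignment[b] != cluster:
            return True
        if i == b and assignment[a] is not None and assignment[a] != cluster:
            return True
    for a, b in cannot_link:
        if i == a and assignment[b] == cluster:
            return True
        if i == b and assignment[a] == cluster:
            return True
    return False

def assign_instance(i, dists_to_centers, assignment, must_link, cannot_link):
    """Try centers from closest to farthest; fail if none is valid."""
    for c in sorted(range(len(dists_to_centers)),
                    key=dists_to_centers.__getitem__):
        if not violates_constraints(i, c, assignment, must_link, cannot_link):
            assignment[i] = c
            return c
    raise ValueError(f"no valid cluster for instance {i}")
```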
Experimental Results on GPS Lane Finding
Large database of digital road maps available
These maps contain only coarse information about the location of the road
By refining maps down to the lane level we can enable a host of more sophisticated applications such as lane departure detection
Approach
Based on the observation that drivers tend to drive within lane boundaries
Lanes should correspond to “densely traveled” regions in contrast to the lane boundaries
Possible to collect data about the location of cars and then cluster that data to automatically determine where the individual lanes are located
GPS Lane Finding (cont’d)
Collect data about the location of cars as they drive along a given road
Collect data once per second from several drivers using GPS receivers affixed to the top of their vehicles
Each data instance has two features:
1. Distance along the road segment
2. Perpendicular offset from the road centerline
For evaluation purposes, drivers were asked to indicate which lane they occupied and any lane changes
GPS Lane Finding (cont’d)
For the problem of automatic lane detection, two domain-specific heuristics were used for generating constraints (a sketch follows the two heuristics):
Trace contiguity means that, in the absence of lane changes, all of the points generated from the same vehicle in a single pass over a road segment should end up in the same lane.
Maximum separation refers to a limit on how far apart two points can be (perpendicular to the centerline) while still being in the same lane. If two points are separated by at least four meters, then we generate a constraint that will prevent those two points from being placed in the same cluster.
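A minimal sketch of how the two heuristics could generate constraint pairs (the point layout and function name are mine; real traces would also need lane-change handling, which this sketch ignores):

```python
def generate_constraints(points, max_separation=4.0):
    """Build constraint pairs from the two domain heuristics.
    Each point is (trace_id, distance_along_road, offset_from_centerline);
    this field layout is an assumption for illustration."""
    must_link, cannot_link = [], []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if points[i][0] == points[j][0]:
                must_link.append((i, j))        # trace contiguity
            elif abs(points[i][2] - points[j][2]) >= max_separation:
                cannot_link.append((i, j))      # maximum separation
    return must_link, cannot_link
```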
To better analyze performance in this domain, the authors modified the cluster center representation
Conclusion
Measurable improvement in accuracy
The use of constraints while clustering means that, unlike the regular k-means algorithm, the assignment of instances to clusters can be order-sensitive.
If a poor decision is made early on, the algorithm may later encounter an instance i that has no possible valid cluster
Ideally, the algorithm would be able to backtrack, rearranging some of the instances so that i could then be validly assigned to a cluster.
Could be extended to hierarchical algorithms