Friday, April 18, 2008

Cluster Analysis

Topic: Cluster Analysis
Audience: MCA + IT III Year JPNCE Mahboobnagar 18-04-2008

What is Cluster Analysis?
 Cluster: a collection of data objects
 (Intraclass similarity) - objects are similar to objects in the same cluster
 (Interclass dissimilarity) - objects are dissimilar to objects in other clusters

 Cluster analysis
 Statistical method for grouping a set of data objects into clusters
 A good clustering method produces high quality clusters with high intraclass similarity and low interclass similarity

 Clustering is unsupervised classification

Examples of Clustering Applications
 Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs

 Insurance: Identifying groups of motor insurance policy holders with a high average claim cost

 City-planning: Identifying groups of houses according to their house type, value, and geographical location

 Earthquake studies: Observed earthquake epicenters should be clustered along continent faults

Data Representation

 Data matrix (two-mode)
 n objects with p attributes, stored as an n-by-p matrix:

   x11 ... x1f ... x1p
   ...
   xi1 ... xif ... xip
   ...
   xn1 ... xnf ... xnp

 Dissimilarity matrix (one-mode)
 d(i,j): dissimilarity between objects i and j
 an n-by-n table with d(i,i) = 0 and d(i,j) = d(j,i), so only the lower triangle is stored:

   0
   d(2,1)  0
   d(3,1)  d(3,2)  0
   ...
   d(n,1)  d(n,2)  ...     0

Types of Data in Cluster Analysis

 Interval-Scaled Variables

 Binary Variables

 Nominal, Ordinal, and Ratio-Scaled Variables

 Variables of Mixed Types
Interval-Scaled Variables
 Continuous measurements on a roughly linear scale
 E.g. weight, height, temperature, etc.
Using Interval-Scaled Values

 Step 1: Standardize the data
 To ensure all attributes get equal weight
 To map different scales onto a uniform, single scale
 Not always needed! Sometimes we require unequal weights for an attribute
 A common choice: replace each value by zif = (xif - mf) / sf, where mf is the mean and sf the mean absolute deviation of attribute f (see the sketch after this list)

 Step 2: Compute the dissimilarity between records
 Use Euclidean, Manhattan or Minkowski distance
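
A minimal Python sketch of Step 1, assuming the mean-absolute-deviation standardization (the function name is illustrative):

    def standardize(column):
        # Standardize one interval-scaled attribute using the mean
        # absolute deviation (a sketch; assumes the deviation is nonzero).
        n = len(column)
        m = sum(column) / n                         # mean mf
        s = sum(abs(x - m) for x in column) / n     # mean absolute deviation sf
        return [(x - m) / s for x in column]        # standardized values zif

    # standardize([10.0, 20.0, 30.0]) -> [-1.5, 0.0, 1.5]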
Data Types and Distance Metrics

 Distances are normally used to measure the similarity or dissimilarity between two data objects

 Minkowski distance:
 d(i,j) = (|xi1 - xj1|^q + |xi2 - xj2|^q + … + |xip - xjp|^q)^(1/q)

where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and q is a positive integer


Data Types and Distance Metrics (Cont’d)

 If q = 1, d is the Manhattan distance:
 d(i,j) = |xi1 - xj1| + |xi2 - xj2| + … + |xip - xjp|

 If q = 2, d is the Euclidean distance:
 d(i,j) = sqrt(|xi1 - xj1|^2 + |xi2 - xj2|^2 + … + |xip - xjp|^2)
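
A minimal Python sketch of these distances (the function name is illustrative):

    def minkowski(x, y, q):
        # Minkowski distance between two p-dimensional points (sketch).
        # q = 1 gives the Manhattan distance, q = 2 the Euclidean distance.
        return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

    # minkowski([1, 2], [4, 6], 1) -> 7.0 (Manhattan)
    # minkowski([1, 2], [4, 6], 2) -> 5.0 (Euclidean)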
Data Types and Distance Metrics (Cont’d)

 Properties
 d(i,j) ≥ 0
 d(i,i) = 0
 d(i,j) = d(j,i)
 d(i,j) ≤ d(i,k) + d(k,j)

 Can also use weighted distance, or other dissimilarity measures.
Binary Attributes
 A contingency table for binary data: for objects i and j, count over their p binary attributes
   a = # of attributes that are 1 for both i and j
   b = # that are 1 for i and 0 for j
   c = # that are 0 for i and 1 for j
   d = # that are 0 for both (a + b + c + d = p)

 Simple matching coefficient (if the binary attribute is symmetric):
 d(i,j) = (b + c) / (a + b + c + d)

 Jaccard coefficient (if the binary attribute is asymmetric; the negative matches d are ignored):
 d(i,j) = (b + c) / (a + b + c)
Dissimilarity between Binary Attributes
 Example: a worked sketch follows below
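As a worked illustration (hypothetical binary records), a small Python sketch computing both coefficients:

    def binary_dissim(x, y, asymmetric=False):
        # Dissimilarity between two 0/1 vectors (sketch).
        # asymmetric=True gives the Jaccard coefficient (0-0 matches ignored).
        a = sum(1 for u, v in zip(x, y) if (u, v) == (1, 1))
        b = sum(1 for u, v in zip(x, y) if (u, v) == (1, 0))
        c = sum(1 for u, v in zip(x, y) if (u, v) == (0, 1))
        d = sum(1 for u, v in zip(x, y) if (u, v) == (0, 0))
        return (b + c) / float(a + b + c if asymmetric else a + b + c + d)

    # binary_dissim([1, 0, 1, 0], [1, 1, 0, 0])                  -> 0.5
    # binary_dissim([1, 0, 1, 0], [1, 1, 0, 0], asymmetric=True) -> 0.666...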

Nominal Attributes
 A generalization of the binary attribute in that it can take more than 2 states, e.g., red, yellow, blue, green

 Method 1: Simple matching
 m: # of attributes that match for both records, p: total # of attributes
 d(i,j) = (p - m) / p

 Method 2: rewrite the database, creating a new asymmetric binary attribute for each of the M states
 For an object with color yellow, the yellow attribute is set to 1, while the remaining attributes are set to 0 (see the sketch below)
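
A one-line Python sketch of Method 2 (names are illustrative):

    def one_hot(value, states):
        # Encode a nominal value as M asymmetric binary attributes (sketch).
        return [1 if value == s else 0 for s in states]

    # one_hot("yellow", ["red", "yellow", "blue", "green"]) -> [0, 1, 0, 0]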

Ordinal Attributes
 An ordinal attribute can be discrete or continuous

 Order is important (e.g., rank)

 Can be treated like interval-scaled:
 replace xif by its rank rif in {1, …, Mf}

 map the range of each attribute onto [0, 1] by replacing the rank of the i-th object in the f-th attribute by
 zif = (rif - 1) / (Mf - 1)

 compute the dissimilarity using the methods for interval-scaled attributes (see the sketch below)
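
A small Python sketch of the rank normalization (names are illustrative):

    def normalize_ranks(ranks, M):
        # Map ranks 1..M onto [0, 1] so an ordinal attribute can be
        # treated as interval-scaled (sketch).
        return [(r - 1) / float(M - 1) for r in ranks]

    # For a 3-state attribute {fair < good < excellent}:
    # normalize_ranks([1, 2, 3], 3) -> [0.0, 0.5, 1.0]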


Ratio-Scaled Attributes
 Ratio-scaled attribute: a positive measurement on a nonlinear, approximately exponential scale, such as Ae^(Bt) or Ae^(-Bt)

 Methods:
 treat them like interval-scaled attributes — not a good choice because scales may be distorted
 apply logarithmic transformation
yif = log(xif)
 treat them as continuous ordinal data and treat their rank as interval-scaled.

Attributes of Mixed Types
 A database may contain all six types of attributes:
 symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio

 Use a weighted formula to combine their effects:
 d(i,j) = Σf δij(f) dij(f) / Σf δij(f)
 where the indicator δij(f) = 0 if xif or xjf is missing (or both are 0 for an asymmetric binary attribute), and 1 otherwise

 If f is binary or nominal:
 dij(f) = 0 if xif = xjf, and dij(f) = 1 otherwise

 If f is interval-based: use the normalized distance
 dij(f) = |xif - xjf| / (maxh xhf - minh xhf)

 If f is ordinal or ratio-scaled:
 compute the rank rif and zif = (rif - 1) / (Mf - 1)
 and treat zif as interval-scaled (a combined sketch follows below)
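
A compact Python sketch of the combined formula, assuming no missing values (so every δij(f) = 1; names are illustrative):

    def mixed_dissim(x, y, types, ranges):
        # Dissimilarity between two records of mixed attribute types (sketch).
        # types[f] is "nominal" or "interval"; ranges[f] is max - min of
        # attribute f. Ordinal/ratio values are assumed already converted
        # to zif in [0, 1] (so their range is 1).
        total = 0.0
        for f in range(len(x)):
            if types[f] == "nominal":
                total += 0.0 if x[f] == y[f] else 1.0
            else:
                total += abs(x[f] - y[f]) / float(ranges[f])
        return total / len(x)

    # mixed_dissim(("red", 0.5), ("blue", 1.0), ("nominal", "interval"), (1, 1)) -> 0.75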

Introduction
 Clustering is an unsupervised method of data analysis
 Data instances grouped according to some notion of similarity
 Access only to the set of features describing each object
 No information as to where each instance should be placed within the partition
 However, there might be background knowledge about the domain or data set that could be useful to the algorithm
 In this paper, the authors try to integrate this background knowledge into clustering algorithms.

K-means Clustering
 Used to partition a data set into k groups

 Group instances based on attributes into k groups
 High intra-cluster similarity; Low inter-cluster similarity
 Cluster similarity is measured with regard to the mean value of the objects in the cluster.

 How does K-means work ?
 First, select K random instances from the data – initial cluster centers
 Second, each instance is assigned to its closest (most similar) cluster center
 Third, each cluster center is updated to the mean of its constituent instances
 Repeat steps two and three until there is no further change in the assignment of instances to clusters (a minimal sketch follows below)
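
A minimal Python sketch of these three steps (an illustrative sketch, with random initialization and squared Euclidean distance):

    import random

    def dist2(p, q):
        # Squared Euclidean distance between two points (tuples).
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def kmeans(points, k, iters=100):
        centers = random.sample(points, k)        # step 1: random initial centers
        assign = [0] * len(points)
        for _ in range(iters):
            changed = False
            for i, p in enumerate(points):        # step 2: assign to nearest center
                c = min(range(k), key=lambda j: dist2(p, centers[j]))
                if c != assign[i]:
                    assign[i], changed = c, True
            for j in range(k):                    # step 3: recompute cluster means
                members = [p for i, p in enumerate(points) if assign[i] == j]
                if members:
                    centers[j] = tuple(sum(vs) / len(members) for vs in zip(*members))
            if not changed:                       # stop when assignments are stable
                break
        return assign, centers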
Constrained K-means Clustering
 Two pair-wise constraints
 Must-link: constraints which specify that two instances have to be in the same cluster
 Cannot-link: constraints which specify that two instances must not be placed in the same cluster
 When using a set of constraints we have to take their transitive closure (e.g., must-link(a,b) and must-link(b,c) imply must-link(a,c))
 Constraints may be derived from
 Partially labeled data
 Background knowledge about the domain or data set

Constrained Algorithm
 First, select K random instances from the data – initial cluster centers

 Second, each instance is assigned to its closest (most similar) cluster center such that VIOLATE-CONSTRAINT(I, K, M, C) is false. If no such cluster exists, the algorithm fails

 Third, each cluster center is updated to the mean of its constituent instances

 Repeat steps two and three until there is no further change in the assignment of instances to clusters

 VIOLATE-CONSTRAINT(instance I, cluster K, must-link constraints M, cannot-link constraints C):
 For each must-link pair (I, I=) in M: if I= is not in K, return true
 For each cannot-link pair (I, I≠) in C: if I≠ is in K, return true
 Otherwise return false
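
A Python sketch of the same check (variable names here are illustrative; "assign" maps already-placed instances to their clusters):

    def violates_constraints(i, cluster, must_link, cannot_link, assign):
        # Would placing instance i into cluster break a pairwise constraint?
        for a, b in must_link:
            other = b if a == i else a if b == i else None
            if other is not None and assign.get(other) not in (None, cluster):
                return True                 # must-link partner is in another cluster
        for a, b in cannot_link:
            other = b if a == i else a if b == i else None
            if other is not None and assign.get(other) == cluster:
                return True                 # cannot-link partner is in this cluster
        return False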

Experimental Results on GPS Lane Finding
 Large database of digital road maps available
 These maps contain only coarse information about the location of the road
 By refining maps down to the lane level we can enable a host of more sophisticated applications such as lane departure detection

 Approach
 Based on the observation that drivers tend to drive within lane boundaries
 Lanes should correspond to “densely traveled” regions in contrast to the lane boundaries
 Possible to collect data about the location of cars and then cluster that data to automatically determine where the individual lanes are located
GPS Lane Finding (cont’d)
 Collect data about the location of cars as they drive along a given road

 Collect data once per second from several drivers, using GPS receivers affixed to the tops of their vehicles

 Each data instance has two features:
 1. Distance along the road segment
 2. Perpendicular offset from the road centerline

 For evaluation purposes drivers were asked to indicate which lane they occupied and any lane changes

GPS Lane Finding (cont’d)
 For the problem of automatic lane detection, the authors used two domain-specific heuristics for generating constraints (a constraint-generation sketch follows after the list):

 Trace contiguity means that, in the absence of lane changes, all of the points generated from the same vehicle in a single pass over a road segment should end up in the same lane.
 Maximum separation refers to a limit on how far apart two points can be (perpendicular to the centerline) while still being in the same lane. If two points are separated by at least four meters, then we generate a constraint that will prevent those two points from being placed in the same cluster.
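
A Python sketch of how such constraints could be generated (names are illustrative; assumes traces contain no lane changes):

    def generate_constraints(trace_ids, offsets, max_sep=4.0):
        # trace_ids[i] identifies the vehicle pass that produced point i;
        # offsets[i] is its perpendicular offset from the road centerline.
        must_link, cannot_link = [], []
        n = len(trace_ids)
        for i in range(n):
            for j in range(i + 1, n):
                if trace_ids[i] == trace_ids[j]:
                    must_link.append((i, j))        # trace contiguity
                elif abs(offsets[i] - offsets[j]) > max_sep:
                    cannot_link.append((i, j))      # maximum separation
        return must_link, cannot_link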

 To better analyze performance in this domain, the authors modified the cluster-center representation

Conclusion
 Measurable improvement in accuracy

 The use of constraints while clustering means that, unlike the regular k-means algorithm, the assignment of instances to clusters can be order-sensitive.
 If a poor decision is made early on, the algorithm may later encounter an instance i that has no possible valid cluster
 Ideally, the algorithm would be able to backtrack, rearranging some of the instances so that i could then be validly assigned to a cluster.

 Could be extended to hierarchical algorithms

sockets!!!

Topic: Sockets
Audience: MCA students, Osmania PG Campus Mahboobnagar, 18-04-2008, time: 2:30 to 4:30 pm

Started the discussion with addressing issues: physical addresses (Ethernet addresses), IP addresses (classes, header format, functionality), and port addresses (well-known, ephemeral, registered); then the definition of a socket (IP + port), the client-server paradigm, and connection-oriented vs. connectionless services. Then explained the connection-oriented concurrent server and client program. Finally discussed the socket system calls (socket, bind, listen, accept, connect, socketpair), the byte-ordering routines (htonl, htons, ntohl, ntohs), the bzero function, and the Unix network I/O APIs. A sketch of the connection-oriented flow follows below.
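
The class program was in C; here is a minimal Python sketch of the same connection-oriented call sequence (Python's socket module wraps the same system calls; the address and port are illustrative):

    import socket

    HOST, PORT = "127.0.0.1", 5000

    def server():
        # Connection-oriented (TCP) server: socket -> bind -> listen -> accept.
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.bind((HOST, PORT))
        s.listen(5)                    # backlog of pending connections
        conn, addr = s.accept()        # blocks until a client connects
        conn.sendall(conn.recv(1024))  # echo the request back
        conn.close()
        s.close()

    def client():
        # Client side: socket -> connect -> send/recv.
        c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        c.connect((HOST, PORT))
        c.sendall(b"hello")
        print(c.recv(1024))            # b'hello'
        c.close()

    # Byte-ordering helpers mirror the C routines:
    # socket.htonl, socket.htons, socket.ntohl, socket.ntohs
    # A concurrent server would fork (os.fork) or spawn a thread per
    # accepted connection, as in the class example.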