BASIC DATA MINING METHODS
In general, Data Mining relies on a set of standard methods or
techniques: prediction, clustering, classification, association
discovery and sequence discovery. Besides these standard
methods, modified algorithms are also used to discover
profiles, patterns or interesting behavior in large data
sets. Before modifying a standard Data Mining method or
developing a new Data Mining algorithm, you should first
understand the basic methods of Data Mining.
PREDICTION
Prediction is a statement or a claim that a particular
event will occur in the future in more certain terms than
a forecast [Wikipedia, 2007]. Figure 1 explains how a Data
Mining predictive algorithm is trained on data sets and
used to predict an outcome.
Stage 1. Algorithm training/Algorithm learning. The Data
Mining predictive algorithm analyses the data sets and
learns why and when an event (also known as the target)
occurs. The algorithm also learns why and when the event
does not occur. At the end of this stage the algorithm is
trained and ready to be tested on a data set for which the
target is unknown.
Stage 2. Algorithm testing/Algorithm validating. The trained
Data Mining algorithm is tested and validated in order to
measure the quality and stability of the algorithm.
Stage 3. Algorithm application. The trained, tested and
validated Data Mining algorithm is applied to the data set
with the unknown target. At the end of this stage the
algorithm produces a predictive score or a predicted outcome
value for each instance (also called a case) in the data
set. For example, for the instance with ID 10000, the chance
that the event will occur in the future is 87%.

Figure 1. Prediction.
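The three stages can be sketched in a few lines of Python. This
is only an illustration under stated assumptions: the source
does not name a particular predictive algorithm, so a simple
k-nearest-neighbour scorer is swapped in, and the data, the
function names (train, score, accuracy) and the second case ID
are hypothetical.

```python
import math

def train(rows):
    """Stage 1: for k-NN, 'training' just stores the labelled cases."""
    return list(rows)

def score(model, features, k=3):
    """Share of the k nearest training cases in which the event occurred."""
    nearest = sorted(model, key=lambda case: math.dist(case[0], features))[:k]
    return sum(target for _, target in nearest) / len(nearest)

def accuracy(model, labelled_rows, k=3):
    """Stage 2: compare thresholded scores with the known targets."""
    hits = sum((score(model, x, k) >= 0.5) == bool(y) for x, y in labelled_rows)
    return hits / len(labelled_rows)

# Hypothetical data: (features, target); target 1 means the event occurred.
training_set = [((1.0, 1.0), 0), ((1.2, 0.8), 0), ((0.9, 1.1), 0),
                ((3.0, 3.2), 1), ((3.1, 2.9), 1), ((2.8, 3.0), 1)]
test_set = [((1.1, 0.9), 0), ((3.0, 3.0), 1)]
unknown = {10000: (2.9, 3.1), 10001: (1.0, 0.9)}  # target not known

model = train(training_set)                  # Stage 1
test_accuracy = accuracy(model, test_set)    # Stage 2
scores = {case_id: score(model, x) for case_id, x in unknown.items()}  # Stage 3
print(test_accuracy, scores)
```

Each value in scores plays the role of the predictive score
per instance described in Stage 3. A real project would use
far more data and a held-out validation procedure.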
CLUSTERING
Clustering is defined as a common descriptive task where
one seeks to identify a finite set of categories or clusters
to describe the data. The categories may be mutually exclusive
and exhaustive, or consist of a richer representation such
as hierarchical or overlapping categories.
Cluster analysis groups together items that have similar
characteristics. The challenge is to find groups of items
(without any predefinition) that naturally fall together,
to assign instances to these groups and to be able to assign
new instances to the groups or clusters. The values of attributes
that measure different aspects of the instance characterize
the instance [B Hay. Sequence Alignment Methods in Web Usage Mining,
LUC, 2003].
Figure 2 shows some examples of clustering output
presentations. Every cluster is identified by a number (first
column). The distance between each cluster and the other
clusters (second column) gives you an idea of how much the
clusters differ. For example, clusters 1 and 3 are much
closer to each other (distance measure = 0.2) than clusters
1 and 2 (distance measure = 0.8). Finally, clusters are
described by means of queries (third column), which are used
to build profiles. For example, cases in cluster 1 are 30-35
year old, good-paying customers who buy products x, y, q and
r. Cases in cluster 2 are bad-paying customers who buy
products x and y. This explains the high distance measure of
0.8 between clusters 1 and 2. Cases in cluster 3 are 30-35
year old, good-paying customers who buy products x, y and r.
This explains the low distance measure of 0.2 between
clusters 1 and 3.

Figure 2. Some examples of clustering output presentations.
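To make the idea of finding groups, measuring the distance
between them and assigning new instances more concrete, here
is a minimal sketch. It assumes k-means with Euclidean
distance, which the source does not prescribe; the customer
data, the attribute choice (age, purchases per month) and the
function names are hypothetical.

```python
import math

def assign(point, centroids):
    """Return the index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points and moving centroids to cluster means."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            groups[assign(p, centroids)].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids

# Hypothetical customers: (age, purchases per month).
customers = [(31, 9), (33, 10), (35, 8), (52, 2), (55, 1), (58, 3)]
centroids = k_means(customers, centroids=[customers[0], customers[3]])

labels = [assign(c, centroids) for c in customers]  # cluster per existing case
new_label = assign((32, 9), centroids)              # assign a new instance
separation = math.dist(centroids[0], centroids[1])  # distance between clusters
print(labels, new_label, round(separation, 2))
```

The initial centroids are fixed here to keep the run
deterministic; in practice they are usually chosen at random
and attributes are scaled before distances are computed.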
CLASSIFICATION
Classification is defined as the process of learning a function
that maps (classifies) a data item into one of several predefined
classes. Classification rules group items into a predefined
profile according to their common attributes. New data items
that are added to the database are classified into one of
these profiles. In classification learning, a learning
scheme takes a set of classified examples from which it
is expected to learn a way of classifying unseen examples
[B Hay. Sequence Alignment Methods in Web Usage Mining,
LUC, 2003].
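As an illustration of learning from classified examples and
then classifying an unseen one, the sketch below uses a
one-attribute rule learner (often called OneR). This is an
assumption chosen for brevity, not a method named in the
source; the attributes, class labels and function names are
hypothetical.

```python
from collections import Counter, defaultdict

def learn_one_rule(examples, attributes):
    """Pick the attribute whose value -> majority-class rule makes the fewest errors."""
    best = None
    for attr in attributes:
        by_value = defaultdict(Counter)
        for features, label in examples:
            by_value[features[attr]][label] += 1
        rule = {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}
        errors = sum(label != rule[features[attr]] for features, label in examples)
        if best is None or errors < best[0]:
            best = (errors, attr, rule)
    return best[1], best[2]

def classify(model, features, default=None):
    """Apply the learned rule; unseen attribute values fall back to `default`."""
    attr, rule = model
    return rule.get(features[attr], default)

# Hypothetical classified examples: (attributes, predefined class).
examples = [
    ({"payer": "good", "age": "30-35"}, "premium"),
    ({"payer": "good", "age": "36-50"}, "premium"),
    ({"payer": "bad", "age": "30-35"}, "basic"),
    ({"payer": "bad", "age": "36-50"}, "basic"),
    ({"payer": "good", "age": "30-35"}, "premium"),
]

model = learn_one_rule(examples, ["payer", "age"])
prediction = classify(model, {"payer": "bad", "age": "30-35"})  # unseen item
print(model, prediction)
```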
ASSOCIATION DISCOVERY
Association discovery finds interesting connections,
combinations, relations or correlations among elements (also
known as attributes or variables) in data sets (or
databases). More specifically, association rules present
unordered associations and correlations among data items
where the presence of one set of items in a transaction
implies the presence of other items [B Hay. Sequence
Alignment Methods in Web Usage Mining, LUC, 2003].
Figure 3 shows some examples of association results. The top
figure illustrates how customer buying behavior may be
explained by means of associations among variables in data
sets from different industries. For example, association
discovery on supermarket data uncovers an interesting
pattern: on Friday evenings, men who buy beer also buy
diapers. The data analysed is collected through customer
loyalty cards and records information such as customer sex,
date of purchase, time of purchase, product(s) purchased
etc. Other interesting information results from analysing
data files stored in a financial organisation. Customers can
be targeted more precisely in micro marketing campaigns for
mortgage services if you know that married couples who both
hold a university degree and have no children take out a
mortgage after age 33. Likewise, analysing data files in the
automotive industry reveals a new profile of retired couples
(living in the countryside) who, surprisingly, buy expensive
cars for short-distance travel. The bottom figure
illustrates association results for patient illness. With
regard to cost management, association discovery reveals an
unexpected pattern: younger patients stay longer than older
patients for "fragments of torsion dystonia".


Figure 3. Association discovery.
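How strongly one set of items implies another is commonly
expressed with support and confidence. The sketch below
computes both for a beer-and-diapers style rule. Applying
these two measures here is an assumption, and the baskets and
function names are hypothetical, not data from the source.

```python
def support(transactions, items):
    """Share of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Share of transactions with the antecedent that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical Friday-evening baskets, one set of products per transaction.
baskets = [
    {"beer", "diapers", "crisps"},
    {"beer", "diapers"},
    {"beer", "bread"},
    {"milk", "bread"},
    {"diapers", "milk"},
]

rule_support = support(baskets, {"beer", "diapers"})          # 2 of 5 baskets
rule_confidence = confidence(baskets, {"beer"}, {"diapers"})  # 2 of 3 beer baskets
print(round(rule_support, 2), round(rule_confidence, 2))
```

A rule is usually reported only when both values exceed
thresholds chosen by the analyst.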
SEQUENCE DISCOVERY
In mathematics, a sequence is an ordered list of objects (or
events). It contains members (also called elements or
terms), and the number of terms (possibly infinite) is
called the length of the sequence. Unlike in a set, order
matters, and the same element can appear multiple times at
different positions in the sequence. For example, (C,R,Y) is
a sequence of letters that differs from (Y,C,R), as the
ordering matters. Sequences can be finite, as in this
example, or infinite, such as the sequence of all even
positive integers (2,4,6,...) [Wikipedia, 2008].
In Data Mining, sequence discovery produces ordered lists of
events. A user-specified minimum support and minimum
confidence may be used to express the interestingness of
sequential patterns, which are measured over time intervals
(day, week, month, year...). The type of event depends on
the business case. For example, in Web Usage Mining an event
is defined as a visited web page. In Customer Buy Behavior
an event is defined as a purchased product (or basket of
products). In the Health Care industry, an event may be
defined as a medical examination test.
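The role of a user-specified minimum support can be sketched
as follows for the Web Usage Mining case. This is a
simplified illustration: it assumes a pattern is supported by
a session when its events occur in the same order with gaps
allowed, and the sessions, threshold and function names are
hypothetical.

```python
def contains_in_order(sequence, pattern):
    """True if the events of `pattern` occur in `sequence` in order (gaps allowed)."""
    events = iter(sequence)
    # `event in events` consumes the iterator, so later events must come later.
    return all(event in events for event in pattern)

def sequence_support(sequences, pattern):
    """Share of sequences that contain the pattern as an ordered subsequence."""
    return sum(contains_in_order(s, pattern) for s in sequences) / len(sequences)

# Hypothetical web sessions: each event is a visited page.
sessions = [
    ["home", "products", "cart", "checkout"],
    ["home", "cart", "products"],
    ["products", "home", "cart"],
    ["home", "products", "contact", "cart"],
]

min_support = 0.5  # user-specified threshold (assumed value)
pattern_support = sequence_support(sessions, ("products", "cart"))
reverse_support = sequence_support(sessions, ("cart", "products"))
is_frequent = pattern_support >= min_support
print(pattern_support, reverse_support, is_frequent)
```

The two patterns contain the same pages but get different
support values, which is exactly where sequence discovery
differs from unordered association discovery.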