BASIC DATA MINING METHODS

In general, Data Mining is based on a set of typical methods or techniques: prediction, clustering, classification, association discovery and sequence discovery. Besides these typical methods, modified algorithms are also used to discover profiles, patterns or interesting behavior in large data sets. To modify a typical Data Mining method or to develop a new Data Mining algorithm, you should first understand the basic methods of Data Mining.

PREDICTION

Prediction is a statement or a claim that a particular event will occur in the future in more certain terms than a forecast [Wikipedia, 2007]. Figure 1 explains how a Data Mining predictive algorithm is trained on data sets and used to predict an outcome.

Stage 1. Algorithm training/Algorithm learning. The Data Mining predictive algorithm analyses the data sets and learns why and when an event (also known as the target) occurs. The algorithm also learns why and when the event does not occur. At the end of this stage the algorithm is trained and ready to be tested before it is applied to a data set whose target is unknown.

Stage 2. Algorithm testing/Algorithm validating. The trained Data Mining algorithm is tested and validated, in order to measure the quality and stability of the algorithm.

Stage 3. Algorithm application. The trained, tested and validated Data Mining algorithm is applied to a data set whose target is unknown. At the end of this stage the algorithm produces a predictive score or a predicted outcome value for each instance (also called a case) in the data set. For example, for the instance with ID 10000, the chance that the event will occur in the future is 87%.


Figure 1. Prediction.
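
The three stages can be sketched in a few lines of Python. This is only a minimal illustration, not a production algorithm: the "model" simply learns how often the target occurred for each value of a single attribute. The attribute values ("frequent", "occasional"), the data rows and the function names are all hypothetical and chosen for the example.

```python
from collections import defaultdict

def train(rows):
    """Stage 1: learn how often the target occurs for each attribute value."""
    counts = defaultdict(lambda: [0, 0])  # value -> [times target occurred, total cases]
    for value, target in rows:
        counts[value][0] += target
        counts[value][1] += 1
    return {value: hits / total for value, (hits, total) in counts.items()}

def validate(model, rows, threshold=0.5):
    """Stage 2: share of held-out cases (known target) that are predicted correctly."""
    correct = sum(
        (model.get(value, 0.0) >= threshold) == bool(target)
        for value, target in rows
    )
    return correct / len(rows)

def apply_model(model, cases):
    """Stage 3: produce a predictive score for each case with unknown target."""
    return {case_id: model.get(value, 0.0) for case_id, value in cases}

# Hypothetical data: (customer segment, target), where target 1 = event occurred.
training_rows = [
    ("frequent", 1), ("frequent", 1), ("frequent", 1), ("frequent", 0),
    ("occasional", 0), ("occasional", 0), ("occasional", 0), ("occasional", 1),
]
holdout_rows = [("frequent", 1), ("occasional", 0), ("frequent", 0), ("occasional", 0)]
unknown_cases = [(10000, "frequent"), (10001, "occasional")]

model = train(training_rows)
accuracy = validate(model, holdout_rows)
scores = apply_model(model, unknown_cases)

print(model)
print(accuracy)
print(scores)
```

Keeping the held-out rows separate from the training rows is what makes Stage 2 a meaningful check: the algorithm is measured on cases it has not seen, but whose target is still known.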

CLUSTERING

Clustering is defined as a common descriptive task where one seeks to identify a finite set of categories or clusters to describe the data. The categories may be mutually exclusive and exhaustive, or consist of a richer representation such as hierarchical or overlapping categories.

Cluster analysis groups together items that have similar characteristics. The challenge is to find groups of items (without any predefinition) that naturally fall together, to assign instances to these groups and to be able to assign new instances to the groups or clusters. The values of attributes that measure different aspects of the instance characterize the instance [B Hay. Sequence Alignment Methods in Web Usage Mining, LUC, 2003].
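
As an illustration of grouping instances by their attribute values and then assigning new instances to the groups found, here is a minimal k-means sketch in Python. K-means is only one of many clustering techniques and is used here as an assumption. It also requires the number of clusters and the starting centroids to be chosen in advance. The data points and function names are hypothetical.

```python
import math

def assign(point, centroids):
    """Return the index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to clusters and recomputing centroids."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for point in points:
            groups[assign(point, centroids)].append(point)
        centroids = [
            # Mean of each attribute over the group; keep the old centroid if the group is empty.
            tuple(sum(column) / len(group) for column in zip(*group)) if group else centroids[i]
            for i, group in enumerate(groups)
        ]
    return centroids

# Hypothetical instances described by two numeric attributes.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids = k_means(points, centroids=[points[0], points[3]])

# New instances are assigned to the nearest existing cluster.
cluster_of_a = assign((2, 2), centroids)
cluster_of_b = assign((7, 9), centroids)
print(centroids, cluster_of_a, cluster_of_b)
```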

Figure 2 shows some examples of clustering output presentations. Every cluster is identified by a number (first column). The distance between each cluster and the other clusters (second column) gives you an idea of the magnitude of the differences between clusters. For example, clusters 1 and 3 are much closer to each other (distance measure = 0.2) than clusters 1 and 2 (distance measure = 0.8). Finally, clusters are described by means of queries (third column), which are used to build profiles. For example, cases in cluster 1 are 30-35 year old, good-paying customers who buy products x, y, q and r. Cases in cluster 2 are bad-paying customers who buy products x and y. This explains the high distance measure of 0.8 between clusters 1 and 2. Cases in cluster 3 are 30-35 year old, good-paying customers who buy products x, y and r. This explains the low distance measure of 0.2 between clusters 1 and 3.


Figure 2. Some examples of clustering output presentations.

CLASSIFICATION

Classification is defined as the process of learning a function that maps (classifies) a data item into one of several predefined classes. Classification rules group items into a predefined profile according to their common attributes. New data items that are added to the database are then classified into one of these profiles. In classification learning, a learning scheme takes a set of classified examples from which it is expected to learn a way of classifying unseen examples [B Hay. Sequence Alignment Methods in Web Usage Mining, LUC, 2003].
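
The idea of learning from classified examples and then classifying unseen items can be sketched with a k-nearest-neighbour classifier. This is just one possible learning scheme, chosen here for brevity. The two attributes (number of late payments, orders per year), the class labels and the example data are hypothetical.

```python
import math
from collections import Counter

def classify(item, examples, k=3):
    """Assign the majority class among the k classified examples closest to the item."""
    nearest = sorted(examples, key=lambda example: math.dist(item, example[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical classified examples: ((late payments, orders per year), predefined class).
examples = [
    ((0, 12), "good payer"), ((1, 10), "good payer"), ((0, 9), "good payer"),
    ((5, 2), "bad payer"), ((6, 1), "bad payer"), ((4, 3), "bad payer"),
]

# Unseen items are mapped to one of the predefined classes.
first = classify((1, 11), examples)
second = classify((5, 1), examples)
print(first, second)
```

Unlike clustering, the classes here are fixed in advance; the classifier only decides which of them a new item belongs to.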

ASSOCIATION DISCOVERY

Data Mining associations discover interesting connections, combinations, relations or correlations among elements (also known as attributes or variables) in data sets (or databases). More specifically, association rules present unordered associations and correlations among data items, where the presence of one set of items in a transaction implies the presence of other items [B Hay. Sequence Alignment Methods in Web Usage Mining, LUC, 2003].

Figure 3 shows some examples of association results. The top figure illustrates how customer buying behavior may be explained by means of associations among variables in data sets from different industries. For example, association discovery on supermarket data uncovers an interesting pattern: on Friday evenings, men who buy beer also buy diapers. The data analysed is collected through customer loyalty cards and records information such as customer sex, date of purchase, time of purchase and product(s) purchased. Other interesting information results from analysing data files stored in a financial organisation. Micro marketing campaigns for mortgage services can target customers more precisely if you know that married couples who both hold a university degree and have no children take out a mortgage after age 33. Likewise, analysing data files in the automotive industry reveals a new profile of retired couples (living in the countryside) who, surprisingly, buy expensive cars for short-distance travel. The bottom figure illustrates association results for patient illness. With regard to cost management, association discovery reveals an unexpected pattern: younger patients stay longer than older patients for "fragments of torsion dystonia".



Figure 3. Association discovery.
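
How strongly the presence of one set of items implies the presence of another is commonly quantified with two measures: support (the share of transactions containing an itemset) and confidence (the share of transactions containing the antecedent that also contain the consequent). The sketch below computes both for a single rule. The transactions are hypothetical and loosely inspired by the supermarket example; the function names are illustrative only.

```python
def support(itemset, transactions):
    """Share of transactions that contain every item in the itemset."""
    return sum(itemset <= basket for basket in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Share of transactions containing the antecedent that also contain the consequent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Hypothetical transactions (one set of purchased items per basket).
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"beer", "bread"},
    {"milk", "bread"},
    {"diapers", "milk"},
]

rule_support = support({"beer", "diapers"}, transactions)
rule_confidence = confidence({"beer"}, {"diapers"}, transactions)
print(rule_support, rule_confidence)
```

Because the baskets are sets, the order in which items were purchased plays no role, which matches the unordered nature of association rules.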

SEQUENCE DISCOVERY

In mathematics, a sequence is an ordered list of objects (or events). Its members are also called elements or terms, and the number of terms (possibly infinite) is called the length of the sequence. Unlike in a set, order matters, and the exact same elements can appear multiple times at different positions in the sequence. For example, (C,R,Y) is a sequence of letters that differs from (Y,C,R), as the ordering matters. Sequences can be finite, as in this example, or infinite, such as the sequence of all even positive integers (2,4,6,...) [Wikipedia, 2008].

In Data Mining, sequence discovery provides an ordered list of events. User-specified minimum support and minimum confidence thresholds may represent the interestingness of sequential patterns, which are measured over time intervals (day, week, month, year...). The type of event depends on the business case. For example, in Web Usage Mining an event is defined as a visited web page. In Customer Buy Behavior an event is defined as a purchased product (or basket of products). In the Health Care industry, an event may be defined as a medical examination test.
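
To make the role of ordering and minimum support concrete, the following sketch counts how many sequences contain a given pattern of events in the same order (other events may occur in between). It is a simplified illustration that ignores time intervals and confidence. The web sessions, page names and the threshold value are hypothetical.

```python
def contains(sequence, pattern):
    """True if the pattern's events occur in the sequence in the same order."""
    remaining = iter(sequence)
    # Each membership test consumes the iterator up to the match, which enforces the order.
    return all(event in remaining for event in pattern)

def sequence_support(pattern, sequences):
    """Share of sequences that contain the pattern as an ordered subsequence."""
    return sum(contains(sequence, pattern) for sequence in sequences) / len(sequences)

# Hypothetical web sessions: each event is a visited page.
sessions = [
    ["home", "products", "cart", "checkout"],
    ["home", "cart", "products"],
    ["products", "home", "cart"],
    ["home", "products", "checkout"],
]

min_support = 0.5  # hypothetical user-specified threshold
products_then_cart = sequence_support(["products", "cart"], sessions)
cart_then_products = sequence_support(["cart", "products"], sessions)
print(products_then_cart >= min_support, cart_then_products >= min_support)
```

The two patterns contain the same events but receive different support values, which is exactly what distinguishes sequence discovery from unordered association discovery.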

Let your Data speak!