#### Transcript Knowledge Discovery in Databases

Brief introduction to lectures Lecture 5: Automatic cluster detection Lecture 6: Artificial neural networks Lecture 7: Evaluation of discovered knowledge Transparencies prepared by Ho Tu Bao [JAIST] 1 Lecture 5: Automatic Cluster Detection • One of the most widely used KDD classification techniques for unsupervised data. • Content of the lecture 1. 2. 3. 4. Introduction Partitioning Clustering Hierarchical Clustering Software and case-studies • Prerequisite: Nothing special 2 Partitioning Clustering A partition of a set of n objects X {x1 , x2 ,..., xn } is a collection of K disjoint non - empty subsets P1 , P2 ,..., PK of X (K n), often called clusters , satisfying the following conditions : (1) they are disjoint : Pi Pj 0 for all Pi and Pj, i j (2) their union is X : P1 P2 ... PK X. Denote the partition P {P1, P2,..., PK }, Pi are called components of P • Each cluster must contain at least one object • Each object must belong to exactly one group 3 Partitioning Clustering What is a “good” partitioning clustering? Key ideas: Objects in each group are similar and objects between different groups are dissimilar. P {{x1 , x4 , x7 , x9 , x10},{x2 , x7 },{x3 , x5 , x6 }} P1 P2 P3 Minimize the within-group distance and Maximize the between-group distance. Notice: Many ways to define the “within-group distance” (the average of distance to the group’s center or the average of distance between all pairs of objects, etc.) and to define the “between-group distance”. It is in general impossible to find the optimal clustering. 4 Hierarchical Clustering Partition Q is nested into partition P if every component of Q is a subset of a component of P. P {x1 , x4 , x7 , x9 , x10},{x2 , x8},{x3 , x5 , x6 } Q {x1 , x4 , x9 },{x7 , x10},{x2 , x8},{x5},{x3 , x6 } A hierarchical clustering is a sequence of partitions in which each partition is nested into the next partition in the sequence. (This definition is for bottom-up hierarchical clustering. In case of top-down hierarchical clustering, “next” becomes “previous”). 5 Bottom-up Hierarchical Clustering x1 , x2 , x3 , x4 , x5 , x6 {x1 , x2 , x3 , x4},{x5 , x6} {x1 , x2 , x3},{x4},{x5 , x6} {x1},{x2 , x3},{x4},{x5},{x6} {x1},{x2},{x3},{x4},{x5},{x6} x1 x2 x3 x4 x5 x6 6 Top-Down Hierarchical Clustering x1 , x2 , x3 , x4 , x5 , x6 {x1 , x2 , x3 , x4},{x5 , x6} {x1 , x2 , x3},{x4},{x5 , x6} {x1},{x2 , x3},{x4},{x5},{x6} {x1},{x2},{x3},{x4},{x5},{x6} x1 x2 x3 x4 x5 x6 7 OSHAM: Hybrid Model Brief Description of Concepts Concept Hierarchy Discovered Concepts Multiple Inheritance Concepts Wisconsin Breast Cancer Data Attributes 8 Brief introduction to lectures Lecture 1: Overview of KDD Lecture 2: Preparing data Lecture 3: Decision tree induction Lecture 4: Mining association rules Lecture 5: Automatic cluster detection Lecture 6: Artificial neural networks Lecture 7: Evaluation of discovered knowledge 9 Lecture 6: Neural networks • One of the most widely used KDD classification techniques. • Content of the lecture 1. Neural network representation 2. Feed-forward neural networks 3. Using back-propagation algorithm 4. Case-studies • Prerequisite: Nothing special 10 Brief introduction to lectures Lecture 1: Overview of KDD Lecture 2: Preparing data Lecture 3: Decision tree induction Lecture 4: Mining association rules Lecture 5: Automatic cluster detection Lecture 6: Artificial neural networks Lecture 7: Evaluation of discovered knowledge 11 Lecture 7 Evaluation of discovered knowledge • One of the most widely used KDD classification techniques. • Content of the lecture 1. Cross validation 2. Bootstrapping 3. Case-studies • Prerequisite: Nothing special 12 Out-of-sample testing Training data Induction method 2/3 Historical Data (warehouse) Sampling method Sample data Sampling method Model 1/3 Testing data Error estimation error The quality of the test sample estimate is dependent on the number of test cases and the validity of the independent assumption 13 Cross Validation Historical Data (warehouse) Sampling method 10-fold cross validation appears adequate (n = 10) iterate Sample 1 Sample data Sampling method Sample 2 ... Sample n - Mutually exclusive - Equal size Induction method Model Error estimation Run’s error Error estimation 14 Evaluation: k-fold cross validation (k=3) 1 1 2 to be evaluated 2 A method 3 1 3 2 3 A data set randomly split the data set into 3 subsets of equal size run on each 2 subsets as training data to find knowledge 3 2 1 test on the rest subset as testing data to evaluate the accuracy average the accuracies as final evaluation 15 Outline of the presentation Objectives, Brief Discussion Prerequisite Introduction and and Content to Lectures Conclusion This presentation summarizes the content and organization of lectures in module “Knowledge Discovery and Data Mining” 16