Orient BlackSwan

About the Book

This book covers the major and most recent techniques of data mining. It deals in detail with algorithms for discovering association rules, for clustering and for building decision trees, and with techniques such as neural networks, genetic algorithms, rough set theory and support vector machines as used in data mining. It covers the algorithmic details of Apriori, Pincer-search, Dynamic Itemset Counting, FP-Tree growth, SLIQ, SPRINT, BOAT, CART, RainForest, BIRCH, CURE, BUBBLE, ROCK, STIRR, PAM, CLARANS, DBSCAN, GSP, SPADE and SPIRIT. The book also discusses the mining of web, spatial, temporal and text data. In the third edition, the chapter on data warehousing concepts was thoroughly revised to include multidimensional data modelling and cube computation, and the discussion of genetic algorithms was expanded into a separate chapter. The fourth edition adds a chapter on the ROC curve for visualizing the performance of a binary classifier, including methods for computing AUC and its uses.

Students of computer science, mathematical science and management will find this introductory textbook beneficial for a first course on the subject; the exposition of concepts with supporting illustrative examples and exercises makes it suitable for self-study as well.

About the Author

Arun K Pujari, faculty member and Dean of the School of Computer and Information Sciences, University of Hyderabad (UoH), is currently serving as the vice-chancellor of the Central University of Rajasthan. He obtained his postgraduate degree in mathematics from Sambalpur University (1974) and his PhD from IIT Kanpur (1980). He joined UoH in 1985 as a reader and became a professor in 1990. Professor Pujari has wide experience as an administrator. He has served as a member of UGC, DST, DRDO, ISRO and AICTE, and as vice-chancellor of Sambalpur University (November 2008 to November 2011). He has also held visiting assignments at several institutions, including the Institute of Industrial Sciences, University of Tokyo; International Institute of Software Technology, United Nations University, Macau; University of Memphis, USA; and Griffith University, Australia.

Table of Contents

Foreword
Prologue
Preface to the Fourth Edition
Preface to the First Edition
Acknowledgements
1. INTRODUCTION
1.1 Introduction
1.2 Data Mining as a Subject
1.3 Guide to this Book
2. DATA WAREHOUSING
2.1 Introduction
2.2 Data Warehouse Architecture
2.3 Dimensional Modelling
2.4 Categorisation of Hierarchies
2.5 Aggregate Function
2.6 Summarisability
2.7 Fact–Dimension Relationships
2.8 OLAP Operations
2.9 Lattice of Cuboids
2.10 OLAP Server
2.11 ROLAP
2.12 MOLAP
2.13 Cube Computation
2.14 Multiway Simultaneous Aggregation (ArrayCube)
2.15 BUC - Bottom-Up Cubing Algorithm
2.16 Condensed Cube
2.17 Coalescing
2.18 Dwarf
2.19 Other Cubing Techniques
2.20 Skycube
2.21 View Selection - Partial Materialisation
2.22 Data Marting
2.23 ETL
2.24 Data Cleaning
2.25 ELT vs. ETL
2.26 Cloud Data Warehousing
Further Reading
Exercises
Bibliography
3. DATA MINING
3.1 Introduction
3.2 What is Data Mining?
3.3 Data Mining: Definitions
3.4 KDD vs. Data Mining
3.5 DBMS vs. DM
3.6 Other Related Areas
3.7 DM Techniques
3.8 Other Mining Problems
3.9 Issues and Challenges in DM
3.10 DM Application Areas
3.11 DM Applications—Case Studies
3.12 Conclusions
Further Reading
Exercises
Bibliography
4. ASSOCIATION RULES
4.1 Introduction
4.2 What is an Association Rule?
4.3 Methods to Discover Association Rules
4.4 Apriori Algorithm
4.5 Partition Algorithm
4.6 Pincer-Search Algorithm
4.7 Dynamic Itemset Counting Algorithm
4.8 FP-tree Growth Algorithm
4.9 Eclat and dEclat
4.10 Rapid Association Rule Mining (RARM)
4.11 Discussion on Different Algorithms
4.12 Incremental Algorithm
4.13 Border Algorithm
4.14 Generalised Association Rule
4.15 Association Rules with Item Constraints
4.16 Summary
Further Reading
Exercises
Bibliography
5. CLUSTERING TECHNIQUES
5.1 Introduction
5.2 Clustering Paradigms
5.3 Partitioning Algorithms
5.4 k-Medoid Algorithms
5.5 CLARA
5.6 CLARANS
5.7 Hierarchical Clustering
5.8 DBSCAN
5.9 BIRCH
5.10 CURE
5.11 Categorical Clustering Algorithms
5.12 STIRR
5.13 ROCK
5.14 CACTUS
5.15 Conclusions
Further Reading
Exercises
Bibliography
6. DECISION TREES
6.1 Introduction
6.2 What is a Decision Tree?
6.3 Tree Construction Principle
6.4 Best Split
6.5 Splitting Indices
6.6 Splitting Criteria
6.7 Decision Tree Construction Algorithms
6.8 CART
6.9 ID3
6.10 C4.5
6.11 CHAID
6.12 Summary
6.13 Decision Tree Construction with Presorting
6.14 RainForest
6.15 Approximate Methods
6.16 CLOUDS
6.17 BOAT
6.18 Pruning Technique
6.19 Integration of Pruning and Construction
6.20 Summary: An Ideal Algorithm
6.21 Other Topics
6.22 Conclusions
Further Reading
Exercises
Bibliography
7. ROUGH SET THEORY
7.1 Introduction
7.2 Definitions
7.3 Example
7.4 Reduct
7.5 Propositional Reasoning and PIAP to Compute Reducts
7.6 Types of Reducts
7.7 Rule Extraction
7.8 Decision Tree
7.9 Rough Sets and Fuzzy Sets
7.10 Granular Computing
Further Reading
Exercises
Bibliography
8. GENETIC ALGORITHM
8.1 Introduction
8.2 Basic Steps of GA
8.3 Selection
8.4 Crossover
8.5 Mutation
8.6 Data Mining Using GA
8.7 GA for Rule Discovery
8.8 GA and Decision Tree
8.9 Clustering Using GA
Conclusions
Further Reading
Exercises
Bibliography
9. OTHER TECHNIQUES
9.1 Introduction
9.2 What is a Neural Network?
9.3 Learning in NN
9.4 Unsupervised Learning
9.5 Data Mining Using NN: A Case Study
9.6 Support Vector Machines
9.7 Conclusions
Further Reading
Exercises
Bibliography

10. PERFORMANCE EVALUATION - ROC CURVE
10.1 Introduction
10.2 Classification Accuracy
10.3 ROC Space
10.4 ROC Curves
10.5 ROC Curves and Class Distribution
10.6 ROC Convex Hull (ROCCH)
10.7 Method to Find the Optimal Threshold Point
10.8 Combining Classifiers
10.9 Area Under the ROC Curve (AUC)
10.10 Methods to Compute AUC
10.11 Averaging ROC Curves
10.12 ROC for Multi-class Classifiers
10.13 Precision–Recall Graph
10.14 DET Curves
10.15 Cost Curves
Further Reading
Exercises
Bibliography
11. WEB MINING
11.1 Introduction
11.2 Web Mining
11.3 Web Content Mining
11.4 Web Structure Mining
11.5 Web Usage Mining
11.6 Text Mining
11.7 Unstructured Text
11.8 Episode Rule Discovery for Texts
11.9 Hierarchy of Categories
11.10 Text Clustering
11.11 Conclusions
Further Reading
Exercises
Bibliography
12. TEMPORAL AND SPATIAL DATA MINING
12.1 Introduction
12.2 What is Temporal Data Mining?
12.3 Temporal Association Rules
12.4 Sequence Mining
12.5 The GSP Algorithm
12.6 SPADE
12.7 SPIRIT
12.8 WUM
12.9 Episode Discovery
12.10 Event Prediction Problem
12.11 Time-series Analysis
12.12 Spatial Mining
12.13 Spatial Mining Tasks
12.14 Spatial Clustering
12.15 Spatial Trends
12.16 Conclusions
Further Reading
Exercises
Bibliography
Index