Äîêóìåíò âçÿò èç êýøà ïîèñêîâîé ìàøèíû. Àäðåñ îðèãèíàëüíîãî äîêóìåíòà : http://lnfm1.sai.msu.ru/~rastor/Books/Abbass_&_Sarker_&_Newton-Data_Mining_Heuristic_Approach.pdf
Äàòà èçìåíåíèÿ: Tue Nov 11 17:36:30 2008
Äàòà èíäåêñèðîâàíèÿ: Mon Oct 1 23:09:30 2012
Êîäèðîâêà:

Ïîèñêîâûå ñëîâà: ï ï ï ï ï ï ï ï ï ï ï ï ï ï ï ï ï ï ï ï ï ï ï ï ï ï ï ï ï ï ï ï ï ï ï ï ï ï

Data Mining: A Heuristic Approach
Hussein A. Abbass Ruhul A. Sarker Charles S. Newton University of New South Wales, Australia

Idea Group Publishing

Information Science Publishing

Hershey · London · Melbourne · Singapore · Beijing


Acquisitions Editor: Managing Editor: Development Editor: Copy Editor: Typesetter: Cover Design: Printed at:

Mehdi Khosrowpour Jan Travers Michele Rossi Maria Boyer Tamara Gillis Debra Andree Integrated Book Technology

Published in the United States of America by Idea Group Publishing 1331 E. Chocolate Avenue Hershey PA 17033-1117 Tel: 717-533-8845 Fax: 717-533-8661 E-mail: cust@idea-group.com Web site: http://www.idea-group.com and in the United Kingdom by Idea Group Publishing 3 Henrietta Street Covent Garden London WC2E 8LU Tel: 44 20 7240 0856 Fax: 44 20 7379 3313 Web site: http://www.eurospan.co.uk Copyright © 2002 by Idea Group Publishing. All rights reserved. No part of this book may be reproduced in any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher. Library of Congress Cataloging-in-Publication Data Data mining : a heuristic approach / [edited by] Hussein Aly Abbass, Ruhul Amin Sarker, Charles S. Newton. p. cm. Includes index. ISBN 1-930708-25-4 1. Data mining. 2. Database searching. 3. Heuristic programming. I. Abbass, Hussein. II. Sarker, Ruhul. III. Newton, Charles, 1942QA76.9.D343 D36 2001 006.31--dc21 2001039775

British Cataloguing in Publication Data A Cataloguing in Publication record for this book is available from the British Library.


NEW from Idea Group Publishing
· · · · · · · · · · · · · · · · · · · · · · · · · · · Data Mining: A Heuristic Approach Hussein Aly Abbass, Ruhul Amin Sarker and Charles S. Newton/ 1-930708-25-4 Managing Information Technology in Small Business: Challenges and Solutions Stephen Burgess/ 1-930708-35-1 Managing Web Usage in the Workplace: A Social, Ethical and Legal Perspective Murugan Anandarajan and Claire A. Simmers/ 1-930708-18-1 Challenges of Information Technology Education in the 21st Century Eli Cohen/ 1-930708-34-3 Social Responsibility in the Information Age: Issues and Controversies Gurpreet Dhillon/ 1-930708-11-4 Database Integrity: Challenges and Solutions Jorge H. Doorn and Laura Rivero/ 1-930708-38-6 Managing Virtual Web Organizations in the 21st Century: Issues and Challenges Ulrich Franke/ 1-930708-24-6 Managing Business with Electronic Commerce: Issues and Trends Aryya Gangopadhyay/ 1-930708-12-2 Electronic Government: Design, Applications and Management åke GrÆnlund/ 1-930708-19-X Knowledge Media in Health Care: Opportunities and Challenges Rolf Grutter/ 1-930708-13-0 Internet Management Issues: A Global Perspective John D. Haynes/ 1-930708-21-1 Enterprise Resource Planning: Global Opportunities and Challenges Liaquat Hossain, Jon David Patrick and M. A. Rashid/ 1-930708-36-X The Design and Management of Effective Distance Learning Programs Richard Discenza, Caroline Howard, and Karen Schenk/ 1-930708-20-3 Multirate Systems: Design and Applications Gordana Jovanovic-Dolecek/ 1-930708-30-0 Managing IT/Community Partnerships in the 21st Century Jonathan Lazar/ 1-930708-33-5 Multimedia Networking: Technology, Management and Applications Syed Mahbubur Rahman/ 1-930708-14-9 Cases on Worldwide E-Commerce: Theory in Action Mahesh Raisinghani/ 1-930708-27-0 Designing Instruction for Technology-Enhanced Learning Patricia L. Rogers/ 1-930708-28-9 Heuristic and Optimization for Knowledge Discovery Ruhul Amin Sarker, Hussein Aly Abbass and Charles Newton/ 1-930708-26-2 Distributed Multimedia Databases: Techniques and Applications Timothy K. Shih/ 1-930708-29-7 Neural Networks in Business: Techniques and Applications Kate Smith and Jatinder Gupta/ 1-930708-31-9 Information Technology and Collective Obligations: Topics and Debate Robert Skovira/ 1-930708-37-8 Managing the Human Side of Information Technology: Challenges and Solutions Edward Szewczak and Coral Snodgrass/ 1-930708-32-7 Cases on Global IT Applications and Management: Successes and Pitfalls Felix B. Tan/ 1-930708-16-5 Enterprise Networking: Multilayer Switching and Applications Vasilis Theoharakis and Dimitrios Serpanos/ 1-930708-17-3 Measuring the Value of Information Technology Han T. M. van der Zee/ 1-930708-08-4 Business to Business Electronic Commerce: Challenges and Solutions Merrill Warkentin/ 1-930708-09-2

Excellent additions to your library!
Receive the Idea Group Publishing catalog with descriptions of these books by calling, toll free 1/800-345-4332 or visit the IGP Online Bookstore at: http://www.idea-group.com!


Data Mining: A Heuristic Approach Table of Contents
Preface ............................................................................................................................ vi

Part One: General Heuristics
Chapter 1: From Evolution to Immune to Swarm to ...? A Simple Introduction to Modern Heuristics ....................................................... 1 Hussein A. Abbass, University of New South Wales, Australia Chapter 2: Approximating Proximity for Fast and Robust Distance-Based Clustering ................................................................................ 2 2 Vladimir Estivill-Castro, University of Newcastle, Australia Michael Houle, University of Sydney, Australia

Part Two: Evolutionary Algorithms
Chapter 3: On the Use of Evolutionary Algorithms in Data Mining .......................... 4 8 Erick CantÇ-Paz, Lawrence Livermore National Laboratory, USA Chandrika Kamath, Lawrence Livermore National Laboratory, USA Chapter 4: The discovery of interesting nuggets using heuristic techniques .......... 7 2 Beatriz de la Iglesia, University of East Anglia, UK Victor J. Rayward-Smith, University of East Anglia, UK Chapter 5: Estimation of Distribution Algorithms for Feature Subset Selection in Large Dimensionality Domains ..................................................... 9 7 Ißaki Inza, University of the Basque Country, Spain Pedro Larraßaga, University of the Basque Country, Spain Basilio Sierra, University of the Basque Country, Spain Chapter 6: Towards the Cross-Fertilization of Multiple Heuristics: Evolving Teams of Local Bayesian Learners ................................................... 1 1 7 Jorge MuruzÀbal, Universidad Rey Juan Carlos, Spain Chapter 7: Evolution of Spatial Data Templates for Object Classification .............. 1 4 3 Neil Dunstan, University of New England, Australia Michael de Raadt, University of Southern Queensland, Australia

Part Three: Genetic Programming
Chapter 8: Genetic Programming as a Data-Mining Tool ....................................... 1 5 7 Peter W.H. Smith, City University, UK


Chapter 9: A Building Block Approach to Genetic Programming for Rule Discovery ............................................................................................. 1 7 4 A.P. Engelbrecht, University of Pretoria, South Africa Sonja Rouwhorst, Vrije Universiteit Amsterdam, The Netherlands L. Schoeman, University of Pretoria, South Africa

Part Four: Ant Colony Optimization and Immune Systems
Chapter 10: An Ant Colony Algorithm for Classification Rule Discovery ............. 1 9 1 Rafael S. Parpinelli, Centro Federal de Educacao Tecnologica do Parana, Brazil Heitor S. Lopes, Centro Federal de Educacao Tecnologica do Parana, Brazil Alex A. Freitas, Pontificia Universidade Catolica do Parana, Brazil Chapter 11: Artificial Immune Systems: Using the Immune System as Inspiration for Data Mining ......................................................................... 2 0 9 Jon Timmis, University of Kent at Canterbury, UK Thomas Knight, University of Kent at Canterbury, UK Chapter 12: aiNet: An Artificial Immune Network for Data Analysis .................... 2 3 1 Leandro Nunes de Castro, State University of Campinas, Brazil Fernando J. Von Zuben, State University of Campinas, Brazil

Part Five: Parallel Data Mining
Chapter 13: Parallel Data Mining ............................................................................. 2 6 1 David Taniar, Monash University, Australia J. Wenny Rahayu, La Trobe University, Australia About the Authors ...................................................................................................... 2 9 0 Index ........................................................................................................................... 2 9 7


vi

Preface
The last decade has witnessed a revolution in interdisciplinary research where the boundaries of different areas have overlapped or even disappeared. New fields of research emerge each day where two or more fields have integrated to form a new identity. Examples of these emerging areas include bioinformatics (synthesizing biology with computer and information systems), data mining (combining statistics, optimization, machine learning, artificial intelligence, and databases), and modern heuristics (integrating ideas from tens of fields such as biology, forest, immunology, statistical mechanics, and physics to inspire search techniques). These integrations have proved useful in substantiating problemsolving approaches with reliable and robust techniques to handle the increasing demand from practitioners to solve real-life problems. With the revolution in genetics, databases, automation, and robotics, problems are no longer those that can be solved analytically in a feasible time. Complexity arises because of new discoveries about the genome, path planning, changing environments, chaotic systems, and many others, and has contributed to the increased demand to find search techniques that are capable of getting a good enough solution in a reasonable time. This has directed research into heuristics. During the same period of time, databases have grown exponentially in large stores and companies. In the old days, system analysts faced many difficulties in finding enough data to feed into their models. The picture has changed and now the reverse picture is a daily problem­how to understand the large amount of data we have accumulated over the years. Simultaneously, investors have realized that data is a hidden treasure in their companies. With data, one can analyze the behavior of competitors, understand the system better, and diagnose the faults in strategies and systems. Research into statistics, machine learning, and data analysis has been resurrected. Unfortunately, with the amount of data and the complexity of the underlying models, traditional approaches in statistics, machine learning, and traditional data analysis fail to cope with this level of complexity. The need therefore arises for better approaches that are able to handle complex models in a reasonable amount of time. These approaches have been named data mining (sometimes data farming) to distinguish them from traditional statistics, machine learning, and other data analysis techniques. In addition, decision makers were not interested in techniques that rely too much on the underlying assumptions in statistical models. The challenge is to not have any assumptions about the model and try to come up with something new, something that is not obvious or predictable (at least from the decision makers' point of view). Some unobvious thing may have significant values to the decision maker. Identifying a hidden trend in the data or a buried fault in the system is by all accounts a treasure for the investor who knows that avoiding loss results in profit and that knowledge in a complex market is a key criterion for success and continuity. Notwithstanding, models that are free from assumptions­or at least have minimum assumptions­are expensive to use. The dramatic search space cannot be navigated using traditional search techniques. This has highlighted a natural demand for the use of heuristic search methods in data mining. This book is a repository of research papers describing the applications of modern


vii
heuristics to data mining. This is a unique­and as far as we know, the first­book that provides up-to-date research in coupling these two topics of modern heuristics and data mining. Although it is by all means an incomplete coverage, it does provide some leading research in this area. This book contains open-solicited and invited chapters written by leading researchers in the field. All chapters were peer reviewed by at least two recognized researchers in the field in addition to one of the editors. Contributors come from almost all the continents and therefore, the book presents a global approach to the discipline. The book contains 13 chapters divided into five parts as follows: · Part 1: General Heuristics · Part 2: Evolutionary Algorithms · Part 3: Genetic Programming · Part 4: Ant Colony Optimization and Immune Systems · Part 5: Parallel Data Mining Part 1 gives an introduction to modern heuristics as presented in the first chapter. The chapter serves as a textbook-like introduction for readers without a background in heuristics or those who would like to refresh their knowledge. Chapter 2 is an excellent example of the use of hill climbing for clustering. In this chapter, Vladimir Estivill-Castro and Michael E. Houle from the University of Newcastle and the University of Sydney, respectively, provide a methodical overview of clustering and hill climbing methods to clustering. They detail the use of proximity information to assess the scalability and robustness of clustering. Part 2 covers the well-known evolutionary algorithms. After almost three decades of continuous research in this area, the vast amount of papers in the literature is beyond a single survey paper. However, in Chapter 3, Erick CantÇ-Paz and Chandrika Kamath from Lawrence Livermore National Laboratory, USA, provide a brave and very successful attempt to survey the literature describing the use of evolutionary algorithms in data mining. With over 75 references, they scrutinize the data mining process and the role of evolutionary algorithms in each stage of the process. In Chapter 4, Beatriz de la Iglesia and Victor J. Rayward-Smith, from the University of East Anglia, UK, provide a superb paper on the application of Simulated Annealing, Tabu Search, and Genetic Algorithms (GA) to nugget discovery or classification where an important class is under-represented in the database. They summarize in their chapter different measures of performance for the classification problem in general and compare their results against 12 classification algorithms. Ißaki Inza, Pedro Larraßaga, and Basilio Sierra from the University of the Basque Country, Spain, follow, in Chapter 5, with an outstanding piece of work on feature subset selection using a different type of evolutionary algorithms, the Estimation of Distribution Algorithms (EDA). In EDA, a probability distribution of the best individuals in the population is maintained to sample the individuals in subsequent generations. Traditional crossover and mutation operators are replaced by the re-sampling process. They applied EDA to the Feature Subset Selection problem and showed that it significantly improves the prediction accuracy. In Chapter 6, Jorge MuruzÀbal from the University of Rey Juan Carlos, Spain, presents the brilliant idea of evolving teams of local Bayesian learners. Bayes theorem was resurrected as a result of the revolution in computer science. Nevertheless, Bayesian approaches, such as


viii
Bayesian Networks, require large amounts of computational effort, and the search algorithm can easily become stuck in a local minimum. Dr. MuruzÀbal combined the power of the Bayesian approach with the ability of Evolutionary Algorithms and Learning Classifier Systems for the classification process. Neil Dunstan from the University of New England, and Michael de Raadt from the University of Southern Queensland, Australia, provide an interesting application of the use of evolutionary algorithms for the classification and detection of Unexploded Ordnance present on military sites in Chapter 7. Part 3 covers the area of Genetic Programming (GP). GP is very similar to the traditional GA in its use of selection and recombination as the means of evolution. Different from GA, GP represents the solution as a tree, and therefore the crossover and mutation operators are adopted to handle tree structures. This part starts with Chapter 8 by Peter W.H. Smith from City University, UK, who provides an interesting introduction to the use of GP for data mining and the problems facing GP in this domain. Before discarding GP as a useful tool for data mining, A.P. Engelbrecht and L Schoeman from the University of Pretoria, South Africa along with Sonja Rouwhorst from the University of Vrije, The Netherlands, provide a building block approach to genetic programming for rule discovery in Chapter 9. They show that their proposed GP methodology is comparable to the famous C4.5 decision tree classifier­a famous decision tree classifier. Part 4 covers the increasingly growing areas of Ant Colony Optimization and Immune Systems. Rafael S. Parpinelli and Heitor S. Lopes from Centro Federal de Educacao Tecnologica do Parana, and Alex A. Freitas from Pontificia Universidade Catolica do Parana, Brazil, present a pioneer attempt, in Chapter 10, to apply ant colony optimization to rule discovery. Their results are very promising and through an extremely interesting approach, they present their techniques. Jon Timmis and Thomas Knight, from the University of Kent at Canterbury, UK, introduce Artificial Immune Systems (AIS) in Chapter 11. In a notable presentation, they present the AIS domain and how can it be used for data mining. Leandro Nunes de Castro and Fernando J. Von Zuben, from the State University of Campinas, Brazil, follow in Chapter 12 with the use of AIS for clustering. The chapter presents a remarkable metaphor for the use of AIS with an outstanding potential for the proposed algorithm. In general, the data mining task is very expensive, whether we are using heuristics or any other technique. It was therefore impossible not to present this book without discussing parallel data mining. This is the task carried out by David Taniar from Monash University and J. Wenny Rahayu from La Trobe University, Australia, in Part 5, Chapter 13. They both have written a self-contained and detailed chapter in an exhilarating style, thereby bringing the book to a close. It is hoped that this book will trigger great interest into data mining and heuristics, leading to many more articles and books!


ix

Acknowledgments
We would like to express our gratitude to the contributors without whose submissions this book would not have been born. We owe a great deal to the reviewers who reviewed entire chapters and gave the authors and editors much needed guidance. Also, we would like to thank those dedicated reviewers, who did not contribute through authoring chapters to the current book or to our second book Heuristics and Optimization for Knowledge Discovery­ Paul Darwen, Ross Hayward, and Joarder Kamruzzaman. A further special note of thanks must go also to all the staff at Idea Group Publishing, whose contributions throughout the whole process from the conception of the idea to final publication have been invaluable. In closing, we wish to thank all the authors for their insights and excellent contributions to this book. In addition, this book would not have been possible without the ongoing professional support from Senior Editor Dr. Mehdi Khosrowpour, Managing Editor Ms. Jan Travers and Development Editor Ms. Michele Rossi at Idea Group Publishing. Finally, we want to thank our families for their love, support, and patience throughout this project. Hussein A. Abbass, Ruhul Sarker, and Charles Newton Editors (2001)


PART ONE: GENERAL HEURISTICS


2 Abbass

Chapter I

From Evolution to Immune to Swarm to ...? A Simple Introduction to Modern Heuristics
Hussein A. Abbass University of New South Wales, Australia

The definition of heuristic search has evolved over the last two decades. With the continuous success of modern heuristics in solving many combinatorial problems, it is imperative to scrutinize the success of these methods applied to data mining. This book provides a repository for the applications of heuristics to data mining. In this chapter, however, we present a textbook-like simple introduction to heuristics. It is apparent that the limited space of this chapter will not be enough to elucidate each of the discussed techniques. Notwithstanding, our emphasis will be conceptual. We will familiarize the reader with the different heuristics effortlessly, together with a list of references that should allow the researcher to find his/her own way in this large area of research. The heuristics that will be covered in this chapter are simulated annealing (SA), tabu search (TS), genetic algorithms (GA), immune systems (IS), and ant colony optimization (ACO).

Copyright © 2002, Idea Group Publishing.


From Evolution to Immune to Swarm to ... 3

INTRODUCTION
Problem solving is the core of many disciplines. To solve a problem properly, we need first to represent it. Problem representation is a critical step in problem solving as it can help in finding good solutions quickly and it can make it almost impossible not to find a solution at all. In practice, there are many different ways to represent a problem. For example, operations research (OR) is a field that represents a problem quantitatively. In artificial intelligence (AI), a problem is usually represented by a graph, whether this graph is a network, tree, or any other graph representation. In computer science and engineering, tools such as system charts are used to assist in the problem representation. In general, deciding on an appropriate representation of a problem influences the choice of the appropriate approach to solve it. Therefore, we need somehow to choose the problem solving approach before representing the problem. However, it is often difficult to decide on the problem solving approach before completing the representation. For example, we may choose to represent a problem using an optimization model, then we find out that this is not suitable because there are some qualitative aspects that also need to be captured in our representation. Once a problem is represented, the need arises for a search algorithm to explore the different alternatives (solutions) to solve the problem and to choose one or more good possible solutions. If there are no means of evaluating the solutions' quality, we are usually just interested in finding any solution. If there is a criterion that we can use to differentiate between different solutions, we are usually interested in finding the best or optimal solution. Two types of optimality are generally distinguished: local and global. A local optimal solution is the best solution found within a region (neighborhood) of the search space, but not necessarily the best solution in the overall search space. A global optimal solution is the best solution in the overall search space. To formally define these concepts, we need first to introduce one of the definitions of a neighborhood. A neighborhood B(x) in the search space (X) defined on X Rn and centered on a solution x is defined by the Euclidean distance ; that is B(x) = {x Rn | ||x ­ x|| <, >0}. Now, we can define local and global optimality as follows: Definition 1: Local optimality A solution x(X) is said to be a local minimum of the problem iff >0 such that f(x) f(x)x (B(x) (X)). Definition 2: Global optimality A solution x(X) is said to be a global minimum of the problem iff >0 such that f(x) f(x)x (X). Finding a global optimal solution in most real-life applications is difficult. The number of alternatives that exist in the search space is usually enormous and cannot be searched in a reasonable amount of time. However, we are usually interested in good enough solutions--or what we will call from now on, satisfactory solutions. To search for a local, global, or satisfactory solution, we need to use a search mechanism. Search is an important field of research, not only because it serves all


4 Abbass

disciplines, but also because problems are getting larger and more complex; therefore, more efficient search techniques need to be developed every day. This is true whether a problem is solved quantitatively or qualitatively. In the literature, there exist three types of search mechanisms (Turban, 1990), analytical, blind, and heuristic search techniques. These are discussed below. · Analytical Search: An analytical search algorithm is guided using some mathematical function. In optimization, for example, some search algorithms are guided using the gradient, whereas others the Hessian. These types of algorithms guarantee to find the optimal solution if it exists. However, in most cases they only guarantee to find a local optimal solution and not the global one. · Blind Search: Blind search--sometimes called unguided search - is usually categorized into two classes: complete and incomplete. A complete search technique simply enumerates the search space and exhaustively searches for the optimal solution. An incomplete search technique keeps generating a set of solutions until an optimal one is found. Incomplete search techniques do not guarantee to find the optimal solution since they are usually biased in the way they search the problem space. · Heuristic Search: It is a guided search, widely used in practice, but does not guarantee to find the optimal solution. However, in most cases it works and produces high quality (satisfactory) solutions. To be concise in our description, we need to distinguish between a general purpose search technique (such as all the techniques covered in this chapter), which can be applied to a wide range of problems, and a special purpose search technique which is domain specific (such as GSAT for the propositional satisfiability problem and back-propagation for training artificial neural networks) which will not be addressed in this chapter. A general search algorithm has three main phases: initial start, a method for generating solutions, and a criterion to terminate the search. Logically, to search a space, we need to find a starting point. The choice of a starting point is very critical in most search algorithms as it usually biases the search towards some area of the search space. This is the first type of bias introduced into the search algorithm, and to overcome this bias, we usually need to run the algorithm many times with different starting points. The second stage in a search algorithm is to define how a new solution can be generated, another type of bias. An algorithm, which is guided by the gradient, may become stuck in a saddle point. Finally, the choice of a stopping criterion depends on the problem on hand. If we have a large-scale problem, the decision maker may not be willing to wait for years to get a solution. In this case, we may end the search even before the algorithm stabilizes. From some researchers' points of view, this is unacceptable. However in practice, it is necessary. An important issue that needs to be considered in the design of a search algorithm is whether it is population based or not. Most traditional OR and AI methods maintain a single solution at a time. Therefore, the algorithm starts with a


From Evolution to Immune to Swarm to ... 5

solution and then moves from it to another. Some heuristic search methods, however, use a population(s) of solutions. In this case, we try to improve the population as a whole, rather than improving a single solution at a time. Other heuristics maintain a probability distribution of the population instead of storing a large number of individuals (solutions) in the memory. Another issue when designing a search algorithm is the balance between intensification and exploration of the search. Early intensification of the search increases the probability that the algorithm will return a local optimal solution. Late intensification of the search may result in a waste of resources. The last issue which should be considered in designing a search algorithm is the type of knowledge used by the algorithm and the type of search strategy. Positive knowledge means that the algorithm rewards good solutions and negative knowledge means that the algorithm penalizes bad solutions. By rewarding or penalizing some solutions in the search space, an algorithm generates some belief about the good or bad areas in the search. A positive search strategy biases the search towards a good area of the search space, and a negative search strategy avoids an already explored area to explore those areas in the search space that have not been previously covered. Keeping these issues of designing a search algorithm in mind, we can now introduce heuristic search. The word heuristic originated from the Greek root , or to discover. In problem solving, a heuristic is a rule of thumb approach. In artificial intelligence, a heuristic is a procedure that may lack a proof. In optimization, a heuristic is an approach which may not be guaranteed to converge. In all previous fields, a heuristic is a type of search that may not be guaranteed to find a solution, but put simply "it works". About heuristics, Newell and Simon wrote (Simon 1960): "We now have the elements of a theory of heuristic (as contrasted with algorithmic) problem solving; and we can use this theory both to understand human heuristic processes and to simulate such processes with digital computers." Th