Recently, our capabilities of both generating and collecting data have been increasing rapidly. The widespread use of bar codes for most commercial products, the computerization of many business and government transactions, and the advances in the data collection tools have provided us with huge amounts of data. Millions of databases have been used in business management, government administration, scientific and engineering data management, and many other applications. It is noted that the number of such databases keeps growing rapidly because of the availability of powerful and affordable database systems. This explosive growth in data and databases has generated an urgent need for new techniques and tools that can intelligently and automatically transform the processed data into useful information and knowledge. Consequently, data mining has become a research area with increasing importance.
Data mining, which is also referred to as knowledge discovery in databases, means a process of nontrivial extraction of implicit, previously unknown and potentially useful information (such as knowledge rules, constraints, regularities) from data in databases.
- Data Mining: An Overview from the Database Perspective.
Our project deals with mining for classification rules: rules that predict the value (the class) of a user-specified goal attribute from the values of other attributes, called predicting attributes. Traditional methods for discovering such rules include decision trees, neural networks, rule induction, and Bayesian learning. We propose a genetic program (GP) that evolves function trees whose output is a boolean value representing membership in a goal class. Like decision trees, this method holds a distinct advantage over the other methods: it produces human-readable rules. However, the GP has more potential than a decision tree because it performs a global search, it better represents interactions between attributes, and it can be tuned to user needs by modifying its fitness function.
For our experiment we have collected a large amount of data about the counties in Georgia (about 1400 attributes). The data consist of census data and agricultural production data for each county. Our goal is to use a subset of these attributes to predict other attributes; for example, to predict whether a county has a high number of lottery ticket sales. We will also provide visualizations of the learned rules, both as two-dimensional plots and spatially on a county map of Georgia.
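The function-tree rules described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the tuple encoding of trees, the function set, the attribute names (`population`, `median_income`, `high_lottery_sales`), the sample records, and the use of plain accuracy as the fitness function are all assumptions made for the example.

```python
import operator

# Hypothetical function set; a real GP would choose its own.
FUNCTIONS = {
    "and": lambda a, b: a and b,
    "or": lambda a, b: a or b,
    "not": lambda a: not a,
    ">": operator.gt,
    "<": operator.lt,
}

def evaluate(tree, record):
    """Evaluate a function tree against one record.

    A tree is a tuple (function_name, child, ...), a string naming an
    attribute of the record, or a numeric constant.
    """
    if isinstance(tree, tuple):
        name, *children = tree
        return FUNCTIONS[name](*(evaluate(c, record) for c in children))
    if isinstance(tree, str):
        return record[tree]
    return tree

def fitness(tree, records, goal):
    """Fraction of records whose goal attribute the tree predicts correctly."""
    hits = sum(bool(evaluate(tree, r)) == r[goal] for r in records)
    return hits / len(records)

# Hypothetical county records and a hand-written candidate rule.
records = [
    {"population": 90000, "median_income": 35000, "high_lottery_sales": True},
    {"population": 20000, "median_income": 30000, "high_lottery_sales": False},
    {"population": 120000, "median_income": 60000, "high_lottery_sales": True},
]
rule = ("and", (">", "population", 50000), ("<", "median_income", 40000))
score = fitness(rule, records, "high_lottery_sales")
```

Because the rule is an explicit tree of comparisons and boolean operators, it can be printed and read directly, and swapping `fitness` for a different scoring function is one way the search could be tuned to user needs.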
Overview
and Association Rules (4/22)
Section 20.6.4 of Database Systems: The Complete Book will help:
1. Trace the ID3 algorithm on the following data, where the task
is to predict the value of EnjoySport.
| Example | Sky | AirTemp | Humidity | Wind | Water | Forecast | EnjoySport |
| 1 | Sunny | Warm | Normal | Strong | Warm | Same | Yes |
| 2 | Sunny | Warm | High | Strong | Warm | Same | Yes |
| 3 | Rainy | Cold | High | Strong | Warm | Change | No |
| 4 | Sunny | Warm | High | Strong | Warm | Change | Yes |
| 5 | Sunny | Warm | Normal | Weak | Warm | Same | No |
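ID3 picks the attribute to split on by comparing information gain. The sketch below computes the entropy of EnjoySport and the gain of each attribute for the five examples in the table; the function and variable names are illustrative, and it covers only the attribute-selection step at the root, not the full recursive tree construction.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["Sky", "AirTemp", "Humidity", "Wind", "Water", "Forecast"]
# Each row lists the attribute values in the order above, then EnjoySport.
ROWS = [
    ("Sunny", "Warm", "Normal", "Strong", "Warm", "Same", "Yes"),
    ("Sunny", "Warm", "High", "Strong", "Warm", "Same", "Yes"),
    ("Rainy", "Cold", "High", "Strong", "Warm", "Change", "No"),
    ("Sunny", "Warm", "High", "Strong", "Warm", "Change", "Yes"),
    ("Sunny", "Warm", "Normal", "Weak", "Warm", "Same", "No"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, column):
    """Entropy of the class minus the weighted entropy after splitting on a column."""
    labels = [r[-1] for r in rows]
    remainder = 0.0
    for value in set(r[column] for r in rows):
        subset = [r[-1] for r in rows if r[column] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
for name, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{name:10s} {gain:.3f}")
```

To continue the trace, the same `information_gain` call would be repeated on each subset of rows produced by the chosen split, until every subset contains a single class.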
a) Apply one iteration of the K-means partitional-clustering algorithm, and find a new distribution of samples in clusters. What are the new centroids?
b) How can you prove that the new distribution of samples is better than
the initial one?
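The sample data for this question is not reproduced here, so the sketch below uses hypothetical two-dimensional points and a hypothetical initial assignment. It shows one K-means iteration (compute centroids, then reassign each sample to its nearest centroid) and one common way to compare two distributions: the total within-cluster sum of squared errors, which should not increase after an iteration.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def total_sse(clusters):
    """Sum of squared distances from each sample to its own cluster centroid."""
    return sum(sq_dist(p, centroid(c)) for c in clusters for p in c)

def kmeans_iteration(clusters):
    """Reassign every sample to the nearest centroid of the current clusters."""
    centroids = [centroid(c) for c in clusters]
    new_clusters = [[] for _ in centroids]
    for cluster in clusters:
        for p in cluster:
            nearest = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            new_clusters[nearest].append(p)
    return [c for c in new_clusters if c]  # drop clusters that lost all samples

# Hypothetical samples and initial distribution (not the assignment's data).
initial = [[(1, 1), (2, 1), (8, 8)], [(1, 2), (9, 8), (8, 9)]]
updated = kmeans_iteration(initial)
new_centroids = [centroid(c) for c in updated]
print(updated, new_centroids)
print(total_sse(initial), total_sse(updated))
```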
4. Explain why the A-priori algorithm is an efficient implementation
of an Association Search.
5. Given the following eight market baskets:
B1 = {milk, coke, beer}
B2 = {milk, pepsi, juice}
B3 = {milk, beer}
B4 = {coke, juice}
B5 = {milk, pepsi, beer}
B6 = {milk, beer, juice, pepsi}
B7 = {coke, beer, juice}
B8 = {beer, pepsi}
a) As a percentage of the baskets, what is the support of the set {beer,
juice}?
b) What is the support of the set {coke, pepsi}?
c) What is the confidence of the association rule for milk given beer (milk
=> beer)?
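Support and confidence can be checked mechanically. The sketch below assumes the usual definitions: the support of an itemset is the fraction of baskets that contain all of its items, and the confidence of X => Y is support(X ∪ Y) / support(X). The function names are illustrative.

```python
BASKETS = [
    {"milk", "coke", "beer"},
    {"milk", "pepsi", "juice"},
    {"milk", "beer"},
    {"coke", "juice"},
    {"milk", "pepsi", "beer"},
    {"milk", "beer", "juice", "pepsi"},
    {"coke", "beer", "juice"},
    {"beer", "pepsi"},
]

def support(itemset, baskets=BASKETS):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets=BASKETS):
    """Confidence of the rule antecedent => consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

print(f"{support({'beer', 'juice'}):.0%}")
print(f"{support({'coke', 'pepsi'}):.0%}")
print(confidence({"milk"}, {"beer"}), confidence({"beer"}, {"milk"}))
```

Note that confidence is not symmetric: `confidence({"milk"}, {"beer"})` and `confidence({"beer"}, {"milk"})` divide the same joint support by different antecedent supports, so the direction of the rule matters.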
6. From the baskets in question 5, if the support threshold
is 35%, what is the set of candidate triples?
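The level-wise pruning that A-priori relies on can be sketched as follows: an itemset of size k is a candidate only if every one of its subsets of size k-1 is frequent. This is a simplified illustration that assumes "frequent" means support at or above the threshold; the function names are hypothetical, and a production implementation would count candidates in a single pass per level rather than rescanning the baskets for each itemset.

```python
from itertools import combinations

BASKETS = [
    {"milk", "coke", "beer"},
    {"milk", "pepsi", "juice"},
    {"milk", "beer"},
    {"coke", "juice"},
    {"milk", "pepsi", "beer"},
    {"milk", "beer", "juice", "pepsi"},
    {"coke", "beer", "juice"},
    {"beer", "pepsi"},
]

def count(itemset, baskets):
    """Number of baskets containing every item in itemset."""
    return sum(itemset <= basket for basket in baskets)

def apriori_level(baskets, min_support, size):
    """Return (candidates, frequent) itemsets of the given size (size >= 2)."""
    min_count = min_support * len(baskets)
    items = sorted(set().union(*baskets))
    frequent = {frozenset([i]) for i in items if count({i}, baskets) >= min_count}
    candidates = set()
    for k in range(2, size + 1):
        # Only items that survive in some frequent (k-1)-itemset can appear.
        pool = sorted(set().union(*frequent))
        candidates = {
            frozenset(c)
            for c in combinations(pool, k)
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        }
        frequent = {c for c in candidates if count(c, baskets) >= min_count}
    return candidates, frequent

candidate_triples, frequent_triples = apriori_level(BASKETS, 0.35, 3)
print(candidate_triples, frequent_triples)
```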
7. Part A
The Weather Channel wants you to show temperature values on a
visualization as follows:
-10 to 40 degrees with shades of blue with rgb values from 255 to
55
41 to 81 degrees with shades of green with rgb values from 55 to 255
82 to 107 degrees with shades of red with rgb values from 55 to 255
The final rgb value is red*256*256 + green*256 + blue.
Write the mapping function to compute the color values.
What does it return for 30 degrees?
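One possible reading of this specification is sketched below. It assumes linear interpolation within each temperature range, rounding to the nearest integer shade, and a value of 0 for the two channels not named in a range; none of these details are stated in the question, and the function name is illustrative.

```python
# (low, high, channel, shade at low, shade at high), taken from the ranges above.
RANGES = [
    (-10, 40, "blue", 255, 55),
    (41, 81, "green", 55, 255),
    (82, 107, "red", 55, 255),
]

def temperature_to_rgb(temp):
    """Map a temperature in degrees to a packed rgb integer."""
    for low, high, channel, start, end in RANGES:
        if low <= temp <= high:
            # Assumed: shade varies linearly across the range.
            shade = round(start + (temp - low) * (end - start) / (high - low))
            red = shade if channel == "red" else 0
            green = shade if channel == "green" else 0
            blue = shade if channel == "blue" else 0
            return red * 256 * 256 + green * 256 + blue
    raise ValueError(f"temperature {temp} is outside the supported ranges")

print(temperature_to_rgb(30))
```

With integer ranges as given, a fractional temperature such as 40.5 falls between two ranges and raises `ValueError` in this sketch; a real mapping would need to decide how to handle such values.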
Part B
What is the main problem that affects dynamic, interactive queries?
Briefly describe one strategy for overcoming this problem.
IEEE International Conference on Data Mining 2003
References
A Survey of Evolutionary Algorithms for Data Mining and Knowledge Discovery
Data Mining with Decision Trees and Decision Rules
Data Mining: An Overview from the Database Perspective
Information Visualization and Visual Data Mining
J3DV: A Java-Based 3D Database Visualization Tool
Visualization of Historical Wildfire Data: Application of a DX-Oracle
Viz Links and Packages
RESEARCH
Information Visualization Course - John Stasko at Georgia Tech
http://www.cc.gatech.edu/classes/AY2003/cs7450_spring/
The Detailed Syllabus has excellent notes and links
Dynamic Queries at Univ. of Maryland
http://www.cs.umd.edu/hcil/spotfire/
Daniel Keim - Data Mining And Knowledge Discovery Research
http://dbvis.fmi.uni-konstanz.de/group/get_member.php?member=keim
Information Visualization Links
http://www.cwi.nl/InfoVisu/links.html
OPEN SOURCE SYSTEMS
OpenDX C++ Visualization from IBM - Open Source
http://www.opendx.org/
OpenDX Tutorial
http://www.phys.ocean.dal.ca/docs/DX_tutorial.html
WEKA - Java Machine Learning/Data Mining Open Source System
http://www.cs.waikato.ac.nz/~ml/weka/
WEKA Java Visualization System
http://bioinf.man.ac.uk/microarray/maxd/maxdView/
XmdvTool for visual exploration of multivariate data sets
http://davis.wpi.edu/~xmdv/
The Visualization ToolKit (VTK) - open source C++, with Java interface
http://public.kitware.com/VTK/index.php
Jazz and Piccolo Java toolkits for Exploration Interfaces- Univ. Maryland
http://www.cs.umd.edu/hcil/jazz/
Grappa - A Java Graph Package
http://www.research.att.com/sw/tools/graphviz/packages/grappa.html
COMMERCIAL SYSTEMS
Java market map viz
http://www.smartmoney.com/marketmap/instructions.html
http://www.smartmoney.com/marketmap/
Commercial version of Real Estate Dynamic Query System
http://www.dq.com/homefind.shtml
Click on Evaluation copy link for demo
ADVIZOR - Their white papers are very good
www.advizor.com
Spotfire
http://www.spotfire.com/