Data Mining and Visualization

Matt Perry, Eric Stiles, Ashley Taylor

Introduction

Recently, our capabilities of both generating and collecting data have been increasing rapidly. The widespread use of bar codes for most commercial products, the computerization of many business and government transactions, and the advances in the data collection tools have provided us with huge amounts of data. Millions of databases have been used in business management, government administration, scientific and engineering data management, and many other applications. It is noted that the number of such databases keeps growing rapidly because of the availability of powerful and affordable database systems. This explosive growth in data and databases has generated an urgent need for new techniques and tools that can intelligently and automatically transform the processed data into useful information and knowledge. Consequently, data mining has become a research area with increasing importance.

Data mining, which is also referred to as knowledge discovery in databases, means a process of nontrivial extraction of implicit, previously unknown and potentially useful information (such as knowledge rules, constraints, regularities) from data in databases.

- Data Mining: An Overview from the Database Perspective.

Project Description

Our project deals with mining for classification rules: rules that predict the value (the class) of a user-specified goal attribute from the values of other attributes, called predicting attributes. Traditional methods for discovering such rules include decision trees, neural networks, rule induction, and Bayesian learning. We propose a genetic program (GP) that evolves function trees whose output is a Boolean value representing membership in a goal class. Like decision trees, and unlike the other methods, this approach has a distinct advantage: it produces human-readable rules. However, the GP has more potential than a decision tree because it performs a global search, better represents interactions between attributes, and can be tuned to user needs by modifying its fitness function.
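
As a rough illustration of what such a function tree might look like, the Python sketch below evaluates a small Boolean rule tree against one record and scores it by plain accuracy. The nested-tuple encoding, the attribute names (median_income, population, high_lottery_sales), the thresholds, the three toy records, and the accuracy-based fitness are all hypothetical choices made for this example; they are not the representation, data, or fitness function used in the project.

```python
import operator

# Hypothetical encoding: internal nodes are (operator, left, right) tuples,
# leaves are attribute names (strings) or numeric constants.
OPS = {
    "and": lambda a, b: a and b,
    "or": lambda a, b: a or b,
    ">": operator.gt,
    "<": operator.lt,
}

def evaluate(tree, record):
    """Recursively evaluate a rule tree against one record (a dict)."""
    if isinstance(tree, str):        # attribute leaf: look up its value
        return record[tree]
    if not isinstance(tree, tuple):  # constant leaf
        return tree
    op, left, right = tree
    return OPS[op](evaluate(left, record), evaluate(right, record))

def fitness(tree, records, goal):
    """Fraction of records whose predicted membership matches the goal attribute."""
    hits = sum(bool(evaluate(tree, r)) == r[goal] for r in records)
    return hits / len(records)

# Made-up rule and records, for illustration only.
rule = ("and", (">", "median_income", 30000), ("<", "population", 50000))
counties = [
    {"median_income": 42000, "population": 20000, "high_lottery_sales": True},
    {"median_income": 25000, "population": 10000, "high_lottery_sales": False},
    {"median_income": 38000, "population": 90000, "high_lottery_sales": True},
]
print(fitness(rule, counties, "high_lottery_sales"))
```

Because the fitness function is an ordinary function of the tree and the data, swapping accuracy for another measure would be the natural place to tune the search to user needs.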

For our experiment, we have collected a large amount of data about the counties of Georgia (1400 attributes' worth). It consists of census data and agricultural production data for each county. Our goal is to use a subset of these attributes to predict other attributes, for example, whether a county has a high number of lottery ticket sales. We will also visualize the learned rules, both in two-dimensional plots and spatially on a county map of Georgia.


Final Report

.doc        .html

Presentation Links


Overview and Association Rules (4/22)

Data Mining Notes 1

ID3 Algorithm

ID3 sample data

K-Means Partitional Algorithm Trace

Visualization Lecture



Sample Questions

Section 20.6.4 of Database Systems: The Complete Book will help:


1. Trace the ID3 algorithm on the following data, where the task is to predict the value of EnjoySport.

Example  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
1        Sunny  Warm     Normal    Strong  Warm   Same      Yes
2        Sunny  Warm     High      Strong  Warm   Same      Yes
3        Rainy  Cold     High      Strong  Warm   Change    No
4        Sunny  Warm     High      Strong  Warm   Change    Yes
5        Sunny  Warm     Normal    Weak    Warm   Same      No
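
The first step of such a trace can be checked with a short Python sketch that computes the entropy of EnjoySport and the information gain of each attribute over the five examples. It assumes the usual entropy-based attribute selection for ID3; the function and variable names are illustrative only.

```python
from collections import Counter
from math import log2

ATTRS = ["Sky", "AirTemp", "Humidity", "Wind", "Water", "Forecast"]
# One tuple per example; the last field is the class label EnjoySport.
ROWS = [
    ("Sunny", "Warm", "Normal", "Strong", "Warm", "Same", "Yes"),
    ("Sunny", "Warm", "High", "Strong", "Warm", "Same", "Yes"),
    ("Rainy", "Cold", "High", "Strong", "Warm", "Change", "No"),
    ("Sunny", "Warm", "High", "Strong", "Warm", "Change", "Yes"),
    ("Sunny", "Warm", "Normal", "Weak", "Warm", "Same", "No"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, col):
    """Entropy of the labels minus the weighted entropy after splitting on col."""
    labels = [r[-1] for r in rows]
    remainder = 0.0
    for value in set(r[col] for r in rows):
        subset = [r[-1] for r in rows if r[col] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRS)}
for name, gain in gains.items():
    print(f"{name:9s} gain = {gain:.3f}")
```

On this data Sky, AirTemp, and Wind tie for the highest gain (about 0.32) and Water has zero gain, so the attribute chosen for the root depends on how ties are broken. The rest of the trace repeats the same computation on each resulting subset.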

2. When developing decision trees, why would we prefer a shorter decision tree over a longer one? (Give at least two reasons.)

3. Given the samples X1 = {1, 0}, X2 = {0, 1}, X3 = {2, 1}, and X4 = {3, 3}, suppose that the samples are randomly clustered into two clusters C1 = {X1, X3} and C2 = {X2, X4}.

a) Apply one iteration of the K-means partitional-clustering algorithm, and find a new distribution of samples in clusters. What are the new centroids?

b) How can you prove that the new distribution of samples is better than the initial one?
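
One way to check a hand trace of this question is the Python sketch below. It assumes squared Euclidean distance, treats each sample as a 2-D point, and uses the total within-cluster squared error as the quality measure for part b); the helper names are illustrative.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def total_sse(clusters, samples):
    """Sum of squared distances from each sample to its own cluster centroid."""
    total = 0.0
    for names in clusters:
        center = centroid([samples[n] for n in names])
        total += sum(sq_dist(samples[n], center) for n in names)
    return total

samples = {"X1": (1, 0), "X2": (0, 1), "X3": (2, 1), "X4": (3, 3)}
initial = [["X1", "X3"], ["X2", "X4"]]

# Step 1: centroids of the initial (random) clusters.
old_centroids = [centroid([samples[n] for n in names]) for names in initial]

# Step 2: reassign every sample to its nearest centroid.
new = [[] for _ in initial]
for name, point in samples.items():
    nearest = min(range(len(old_centroids)),
                  key=lambda k: sq_dist(point, old_centroids[k]))
    new[nearest].append(name)

# Step 3: recompute the centroids (no cluster becomes empty on this data).
new_centroids = [centroid([samples[n] for n in names]) for names in new]

print("new clusters:", new)
print("new centroids:", new_centroids)
print("SSE before:", total_sse(initial, samples))
print("SSE after: ", total_sse(new, samples))
```

Comparing the two squared-error totals is one common argument for part b): a lower total means the samples sit closer to their cluster centroids.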

4. Explain why the A-priori algorithm is an efficient implementation of an Association Search.

5. Given the eight market-basket sets
B1 = {milk, coke, beer}
B2 = {milk, pepsi, juice}
B3 = {milk, beer}
B4 = {coke, juice}
B5 = {milk, pepsi, beer}
B6 = {milk, beer, juice, pepsi}
B7 = {coke, beer, juice}
B8 = {beer, pepsi}


a) As a percentage of the baskets, what is the support of the set {beer, juice}?

b) What is the support of the set {coke, pepsi}?

c) What is the confidence of the association rule for milk given beer (milk => beer)?

6. From the baskets in question 5, if the support threshold is 35%, what is the set of candidate triples?
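
Questions 4 to 6 can be explored with a small Python sketch of support, confidence, and A-priori candidate generation over the eight baskets. It assumes that support is the fraction of baskets containing an itemset and that a triple is a candidate only if all three of its pairs are frequent; the function names are illustrative. Because the wording of question 5c can be read in either direction, the sketch prints the confidence of both milk => beer and beer => milk.

```python
from itertools import combinations

baskets = [
    {"milk", "coke", "beer"},
    {"milk", "pepsi", "juice"},
    {"milk", "beer"},
    {"coke", "juice"},
    {"milk", "pepsi", "beer"},
    {"milk", "beer", "juice", "pepsi"},
    {"coke", "beer", "juice"},
    {"beer", "pepsi"},
]

def support(itemset):
    """Fraction of baskets that contain every item in itemset."""
    return sum(set(itemset) <= b for b in baskets) / len(baskets)

def confidence(lhs, rhs):
    """Confidence of the rule lhs => rhs."""
    return support(set(lhs) | set(rhs)) / support(lhs)

def frequent(candidates, threshold):
    """Keep only the candidates whose support meets the threshold."""
    return [c for c in candidates if support(c) >= threshold]

threshold = 0.35
items = sorted(set().union(*baskets))

# Pass 1: frequent single items.
f1 = frequent([(i,) for i in items], threshold)
# Pass 2: only pairs of frequent items need to be counted (A-priori pruning).
f2 = frequent(list(combinations([c[0] for c in f1], 2)), threshold)
# Pass 3 candidates: triples whose three pairs are all frequent.
pair_items = sorted({i for pair in f2 for i in pair})
candidate_triples = [
    t for t in combinations(pair_items, 3)
    if all(p in set(f2) for p in combinations(t, 2))
]

print("support {beer, juice}:", support({"beer", "juice"}))
print("support {coke, pepsi}:", support({"coke", "pepsi"}))
print("confidence milk => beer:", confidence({"milk"}, {"beer"}))
print("confidence beer => milk:", confidence({"beer"}, {"milk"}))
print("frequent pairs:", f2)
print("candidate triples:", candidate_triples)
```

The pruning in passes 2 and 3 is the point of the A-priori approach: an itemset cannot be frequent unless all of its subsets are, so most pairs and triples never need to be counted.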

7. Part A
The Weather Channel wants you to show temperature values on a
visualization as follows:
-10 to 40 degrees with shades of blue with rgb values from 255 to 55
41 to 81 degrees with shades of green with rgb values from 55 to 255
82 to 107 degrees with shades of red with rgb values from 55 to 255

The final rgb value is red*256*256 + green*256 + blue.
Write the mapping function to compute the color values.
What does it return for 30 degrees?
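
One possible reading of Part A is sketched below in Python. It assumes linear interpolation within each band, integer temperatures, that the first intensity listed applies to the low end of each range, and that the two unused channels are 0; the names BANDS and temperature_color are illustrative.

```python
# (low temp, high temp, intensity at low, intensity at high, channel bit shift)
BANDS = [
    (-10, 40, 255, 55, 0),    # blue:  255 down to 55, weight 1
    (41, 81, 55, 255, 8),     # green: 55 up to 255, weight 256
    (82, 107, 55, 255, 16),   # red:   55 up to 255, weight 256*256
]

def temperature_color(temp):
    """Map a temperature to red*256*256 + green*256 + blue."""
    for low, high, start, end, shift in BANDS:
        if low <= temp <= high:
            # Linear interpolation between the band's two intensities.
            intensity = round(start + (temp - low) * (end - start) / (high - low))
            return intensity << shift
    raise ValueError(f"temperature {temp} is outside the range -10 to 107")

print(temperature_color(30))
```

Under these assumptions, 30 degrees falls in the blue band, 40/50 of the way from 255 down to 55, so the blue intensity is 95 and the function returns 95.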

Part B
What is the main problem that affects dynamic, interactive queries?
Briefly describe one strategy for overcoming this problem.

Target Conference

IEEE International Conference on Data Mining 2003

References

A Survey of Evolutionary Algorithms for Data Mining and Knowledge Discovery

Data Mining with Decision Trees and Decision Rules

Data Mining: An Overview from the Database Perspective

Information Visualization and Visual Data Mining

Visual Data Exploration

J3DV: A Java-Based 3D Database Visualization Tool

Visualization of Historical Wildfire Data: Application of a DX-Oracle

Georgia Statistics System

Viz Links and Packages

RESEARCH

Information Visualization Course - John Stasko at Georgia Tech
http://www.cc.gatech.edu/classes/AY2003/cs7450_spring/
The Detailed Syllabus has excellent notes and links

Dynamic Queries at Univ. of Maryland
http://www.cs.umd.edu/hcil/spotfire/

Daniel Keim - Data Mining And Knowledge Discovery Research
http://dbvis.fmi.uni-konstanz.de/group/get_member.php?member=keim

Information Visualization Links
http://www.cwi.nl/InfoVisu/links.html



OPEN SOURCE SYSTEMS

OpenDX C++ Visualization from IBM - Open Source
http://www.opendx.org/

OpenDX Tutorial
http://www.phys.ocean.dal.ca/docs/DX_tutorial.html

WEKA - Java Machine Learning/Data Mining Open Source System
http://www.cs.waikato.ac.nz/~ml/weka/

WEKA Java Visualization System
http://bioinf.man.ac.uk/microarray/maxd/maxdView/

XmdvTool for visual exploration of multivariate data sets
http://davis.wpi.edu/~xmdv/

The Visualization ToolKit (VTK) - open source C++, with Java interface
http://public.kitware.com/VTK/index.php

Jazz and Piccolo Java toolkits for Exploration Interfaces- Univ. Maryland
http://www.cs.umd.edu/hcil/jazz/

Grappa - A Java Graph Package
http://www.research.att.com/sw/tools/graphviz/packages/grappa.html



COMMERCIAL SYSTEMS

Java market map viz
http://www.smartmoney.com/marketmap/instructions.html
http://www.smartmoney.com/marketmap/

Commercial version of Real Estate Dynamic Query System
http://www.dq.com/homefind.shtml
Click on Evaluation copy link for demo

ADVIZOR - Their white papers are very good
www.advizor.com

Spotfire
http://www.spotfire.com/