Recently, our capabilities of both generating and collecting data have been increasing rapidly. The widespread use of bar codes for most commercial products, the computerization of many business and government transactions, and the advances in the data collection tools have provided us with huge amounts of data. Millions of databases have been used in business management, government administration, scientific and engineering data management, and many other applications. It is noted that the number of such databases keeps growing rapidly because of the availability of powerful and affordable database systems. This explosive growth in data and databases has generated an urgent need for new techniques and tools that can intelligently and automatically transform the processed data into useful information and knowledge. Consequently, data mining has become a research area with increasing importance.
Data mining, which is also referred to as knowledge discovery in databases, means a process of nontrivial extraction of implicit, previously unknown and potentially useful information (such as knowledge rules, constraints, regularities) from data in databases.
- Data Mining: An Overview from the Database Perspective.
Project Demo Code
This file contains the gpminer package and a derived example, lottery. The lottery example tries to predict whether a county will have high lottery spending based on ten other attributes.
Presentation Links
Overview
and Association Rules (4/22)
Section 20.6.4 of Database Systems: The Complete Book will help with the following questions:
1. Trace the ID3 algorithm on the following data, where the task is to predict the value of EnjoySport.
| Example | Sky | AirTemp | Humidity | Wind | Water | Forecast | EnjoySport |
| 1 | Sunny | Warm | Normal | Strong | Warm | Same | Yes |
| 2 | Sunny | Warm | High | Strong | Warm | Same | Yes |
| 3 | Rainy | Cold | High | Strong | Warm | Change | No |
| 4 | Sunny | Warm | High | Strong | Warm | Change | Yes |
| 5 | Sunny | Warm | Normal | Weak | Warm | Same | No |
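The first step of an ID3 trace is to compute the information gain of every attribute and split on the largest one. The following is a minimal sketch of that step for the five examples above, assuming the standard entropy-based gain. The names `ROWS`, `entropy`, and `information_gain` are illustrative and do not come from the course material.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["Sky", "AirTemp", "Humidity", "Wind", "Water", "Forecast"]

# The five training examples; the last field is the class EnjoySport.
ROWS = [
    ("Sunny", "Warm", "Normal", "Strong", "Warm", "Same", "Yes"),
    ("Sunny", "Warm", "High", "Strong", "Warm", "Same", "Yes"),
    ("Rainy", "Cold", "High", "Strong", "Warm", "Change", "No"),
    ("Sunny", "Warm", "High", "Strong", "Warm", "Change", "Yes"),
    ("Sunny", "Warm", "Normal", "Weak", "Warm", "Same", "No"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, column):
    """Entropy of the class minus the weighted entropy after splitting on column."""
    labels = [row[-1] for row in rows]
    remainder = 0.0
    for value in {row[column] for row in rows}:
        subset = [row[-1] for row in rows if row[column] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
for name, gain in gains.items():
    print(f"{name:9s} gain = {gain:.3f}")
```

On this data Sky, AirTemp, and Wind tie for the largest gain, so the trace has to state a tie-breaking rule before choosing the root. Water has zero gain because it takes a single value.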
a) Apply one iteration of the K-means partitional-clustering algorithm, and find a new distribution of samples in clusters. What are the new centroids?
b) How can you prove that the new distribution of samples is better than the
initial one?
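One K-means iteration can be sketched as follows: compute the centroid of each current cluster, reassign every sample to its nearest centroid, then recompute the centroids. The sample points and initial clusters below are hypothetical (they are not the data from the exercise), and comparing the within-cluster sum of squared errors (SSE) is only one common way to argue that the new distribution is better.

```python
def centroid(points):
    """Mean of a non-empty list of points, returned as a tuple."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def sse(clusters):
    """Total squared distance from each sample to its own cluster centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        total += sum(sq_dist(p, c) for p in points)
    return total


def kmeans_step(clusters):
    """Reassign every sample to the nearest centroid of the current clusters.

    Assumes no cluster ends up empty, which holds for the data below.
    """
    centroids = [centroid(c) for c in clusters]
    new_clusters = [[] for _ in clusters]
    for cluster in clusters:
        for p in cluster:
            nearest = min(range(len(centroids)),
                          key=lambda i: sq_dist(p, centroids[i]))
            new_clusters[nearest].append(p)
    return new_clusters


# Hypothetical initial distribution of four 2-D samples in two clusters.
initial = [[(0, 0), (0, 2), (4, 0)], [(4, 2)]]
updated = kmeans_step(initial)
new_centroids = [centroid(c) for c in updated]

print("new clusters: ", updated)
print("new centroids:", new_centroids)
print("SSE before:", round(sse(initial), 3), "SSE after:", round(sse(updated), 3))
```

For these hypothetical points the SSE drops after the reassignment, which is the kind of evidence part b) asks for.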
4. Explain why the A-priori algorithm is an efficient implementation of an Association Search.
5. Given the eight market-basket sets
B1 = {milk, coke, beer}
B2 = {milk, pepsi, juice}
B3 = {milk, beer}
B4 = {coke, juice}
B5 = {milk, pepsi, beer}
B6 = {milk, beer, juice, pepsi}
B7 = {coke, beer, juice}
B8 = {beer, pepsi}
a) As a percentage of the baskets, what is the support of the set {beer, juice}?
b) What is the support of the set {coke, pepsi}?
c) What is the confidence of the association rule for milk given beer (milk => beer)?
6. From the baskets in the previous question, if the support threshold is 35%, what is the set of candidate triples?
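The quantities in questions 5 and 6 can be checked with a short sketch. It assumes that support is the fraction of baskets containing every item of a set, that confidence of X => Y is support(X and Y) / support(X), and that a candidate triple is one whose three pairs are all frequent (the A-priori pruning rule). The function names are illustrative. Because the rule in 5c can be read in either direction, both are printed.

```python
from itertools import combinations

BASKETS = [
    {"milk", "coke", "beer"},
    {"milk", "pepsi", "juice"},
    {"milk", "beer"},
    {"coke", "juice"},
    {"milk", "pepsi", "beer"},
    {"milk", "beer", "juice", "pepsi"},
    {"coke", "beer", "juice"},
    {"beer", "pepsi"},
]


def support(itemset):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in BASKETS) / len(BASKETS)


def confidence(antecedent, consequent):
    """Confidence of the rule antecedent => consequent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)


def candidate_triples(threshold):
    """Triples whose three 2-item subsets all meet the support threshold."""
    items = sorted(set().union(*BASKETS))
    frequent_items = [i for i in items if support({i}) >= threshold]
    frequent_pairs = {
        frozenset(pair)
        for pair in combinations(frequent_items, 2)
        if support(pair) >= threshold
    }
    return [
        triple
        for triple in combinations(frequent_items, 3)
        if all(frozenset(pair) in frequent_pairs for pair in combinations(triple, 2))
    ]


print("support {beer, juice}:", support({"beer", "juice"}))
print("support {coke, pepsi}:", support({"coke", "pepsi"}))
print("confidence milk => beer:", confidence({"milk"}, {"beer"}))
print("confidence beer => milk:", confidence({"beer"}, {"milk"}))
print("candidate triples at 35%:", candidate_triples(0.35))
```

The pruning in `candidate_triples` is what makes A-priori efficient: a triple is counted only if all of its pairs were already found frequent, so most triples are never examined.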
7. Part A
The Weather Channel wants you to show temperature values on a visualization as follows:
-10 to 40 degrees with shades of blue with rgb values from 255 to 55
41 to 81 degrees with shades of green with rgb values from 55 to 255
82 to 107 degrees with shades of red with rgb values from 55 to 255
The final rgb value is red*256*256 + green*256 + blue.
Write the mapping function to compute the color values.
What does it return for 30 degrees?
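One possible mapping function is sketched below. It assumes linear interpolation within each band, leaves the two unused channels at zero, and raises an error for temperatures outside the three bands (including the gaps between 40 and 41 and between 81 and 82). The name `temperature_to_rgb` is illustrative.

```python
def temperature_to_rgb(temp):
    """Map a temperature in degrees to a packed rgb value."""
    # (low, high, channel value at low, channel value at high, channel weight)
    bands = [
        (-10, 40, 255, 55, 1),           # blue channel
        (41, 81, 55, 255, 256),          # green channel
        (82, 107, 55, 255, 256 * 256),   # red channel
    ]
    for low, high, start, end, weight in bands:
        if low <= temp <= high:
            fraction = (temp - low) / (high - low)
            level = round(start + fraction * (end - start))
            return level * weight
    raise ValueError(f"temperature {temp} is outside the supported bands")


print(temperature_to_rgb(30))  # blue = 255 - 4 * (30 + 10) = 95
```

Under these assumptions, 30 degrees falls in the blue band and maps to a blue level of 95 with red and green at 0, so the packed value is 95.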
Part B
What is the main problem that affects dynamic, interactive queries?
Briefly describe one strategy for overcoming this problem.
Project Description
Our project deals with mining for classification rules: rules predicting the value (the class) of a user specified goal attribute based on the values of other attributes, called predicting attributes. Traditional methods for discovering such rules are Decision Trees, Neural Networks, Rule Induction, Bayesian Learning, etc. We propose a Genetic Program which evolves function trees whose output is a boolean value representing membership in a goal class. This method, along with decision trees, holds a distinct advantage over the other methods: it produces human-readable rules. However, the GP has more potential than the Decision Tree because it performs a global search, it better represents interaction between attributes, and it can be tuned to user needs by modifying its fitness function.
For our experiment we have collected large amounts of data about the counties in Georgia (1400 attributes worth). It consists of census data and agricultural production data for each county. The goal of our project is to use this data to learn classification rules which predict whether or not a county has a high rate of some chronic disease, such as heart disease or lung cancer. We will also provide visualization of the learned rules, in 2-Dimensional and 3-Dimensional plots, and spatially with a county map representation of Georgia.
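To make the idea of a Boolean function tree concrete, here is a minimal sketch of how one candidate rule could be represented, evaluated, and scored. It is not the gpminer implementation: the tuple encoding, the attribute names (`pct_over_65`, `median_income`, `high_rate`), the sample records, and the use of plain accuracy as the fitness function are all assumptions made for illustration.

```python
import operator

# A tree is a tuple (op, left, right), an attribute name (str), or a constant.
OPS = {
    "and": lambda a, b: a and b,
    "or": lambda a, b: a or b,
    ">": operator.gt,
    "<": operator.lt,
}


def evaluate(tree, record):
    """Evaluate a function tree against one record (a dict of attributes)."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, record), evaluate(right, record))
    if isinstance(tree, str):
        return record[tree]  # attribute lookup
    return tree  # numeric constant


def fitness(tree, records, goal):
    """Fraction of records whose predicted membership matches the goal attribute."""
    hits = sum(bool(evaluate(tree, r)) == r[goal] for r in records)
    return hits / len(records)


# Hypothetical rule: counties with many residents over 65 and low income.
rule = ("and", (">", "pct_over_65", 15), ("<", "median_income", 30000))

# Hypothetical county records; high_rate is the goal attribute.
counties = [
    {"pct_over_65": 18, "median_income": 25000, "high_rate": True},
    {"pct_over_65": 12, "median_income": 45000, "high_rate": False},
    {"pct_over_65": 10, "median_income": 28000, "high_rate": False},
    {"pct_over_65": 20, "median_income": 32000, "high_rate": True},
]

print(fitness(rule, counties, "high_rate"))
```

A genetic program would evolve a population of such trees through crossover and mutation, keeping those with higher fitness. Swapping the body of `fitness` (for example, to weight false negatives more heavily) is how the search could be tuned to user needs.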
Target Conference
IEEE International Conference on Data Mining 2003
References
A Survey of Evolutionary Algorithms for Data Mining and Knowledge Discovery
Data Mining with Decision Trees and Decision Rules
Data Mining: An Overview from the Database Perspective
Information Visualization and Visual Data Mining
J3DV: A Java-Based 3D Database Visualization Tool
Visualization of Historical Wildfire Data: Application of a DX-Oracle
Viz Links and Packages
RESEARCH
Information Visualization Course - John Stasko at Georgia Tech
http://www.cc.gatech.edu/classes/AY2003/cs7450_spring/
The Detailed Syllabus has excellent notes and links
Dynamic Queries at Univ. of Maryland
http://www.cs.umd.edu/hcil/spotfire/
Daniel Keim - Data Mining And Knowledge Discovery Research
http://dbvis.fmi.uni-konstanz.de/group/get_member.php?member=keim
Information Visualization Links
http://www.cwi.nl/InfoVisu/links.html
OPEN SOURCE SYSTEMS
OpenDX C++ Visualization from IBM - Open Source
http://www.opendx.org/
OpenDX Tutorial
http://www.phys.ocean.dal.ca/docs/DX_tutorial.html
WEKA - Java Machine Learning/Data Mining Open Source System
http://www.cs.waikato.ac.nz/~ml/weka/
WEKA Java Visualization System
http://bioinf.man.ac.uk/microarray/maxd/maxdView/
XmdvTool for visual exploration of multivariate data sets
http://davis.wpi.edu/~xmdv/
The Visualization ToolKit (VTK) - open source C++, with Java interface
http://public.kitware.com/VTK/index.php
Jazz and Piccolo Java toolkits for Exploration Interfaces- Univ. Maryland
http://www.cs.umd.edu/hcil/jazz/
Grappa - A Java Graph Package
http://www.research.att.com/sw/tools/graphviz/packages/grappa.html
COMMERCIAL SYSTEMS
Java market map viz
http://www.smartmoney.com/marketmap/instructions.html
http://www.smartmoney.com/marketmap/
Commercial version of Real Estate Dynamic Query System
http://www.dq.com/homefind.shtml
Click on Evaluation copy link for demo
ADVIZOR - Their white papers are very good
www.advizor.com
Spotfire
http://www.spotfire.com/