CSCE 420 Lecture 28
Friday, April 26, 2013
Honors Lecture
Machine Learning
"Learning from experience" (in true light of "intelligence")
→ Adapt and Improve performance
- Knowledge Base Repair: correct errors in KB rules
- Neural Networks: a graph that feeds sensor inputs through a "network" of nodes to outputs
Supervised learning
Labeled examples:
- Labels
  - finite set of classes/categories
  - usually small, maybe only positive and negative classes
- Examples
  - described as feature vectors
  - the learned function maps a state of the features to an action (label)
Driving Example
Labels:
- c1 = accelerate
- c2 = do nothing
- c3 = brake
Features:
- F1 = light (red, green, yellow)
- F2 = pedestrians
- F3 = distance to car in front
- F4 = surface wet?
Training Examples:
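As a sketch of what labeled examples look like in this domain, here is a hypothetical Python encoding; the specific feature values and labels are made up for illustration, not taken from the lecture:

    # Hypothetical labeled examples for the driving domain.
    # Each example pairs a feature vector (F1..F4) with a class label.
    training_examples = [
        # (light,   pedestrians, dist_to_car_m, surface_wet)    label
        (("green",  False,       50.0,          False),         "accelerate"),  # c1
        (("green",  False,       10.0,          True),          "do nothing"),  # c2
        (("red",    True,        5.0,           False),         "brake"),       # c3
    ]

    for features, label in training_examples:
        print(features, "->", label)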
Unsupervised learning
Unlabeled examples: the learner must find structure in the data (e.g., clusters of similar examples) on its own
Reinforcement learning
Indirect feedback: rewards arrive only after a sequence of actions, rather than as a label on each example
Credit assignment: deciding which earlier actions deserve credit (or blame) for the eventual reward
Analyzing Learning
Usually plotted on learning curve graphs:
- X axis → number of examples
- Y axis → accuracy
- Usually plateaus before reaching 100% accuracy
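A minimal sketch of how such a curve might be drawn; the accuracy values here are synthetic, chosen only to illustrate the typical plateau shape:

    import matplotlib.pyplot as plt

    # Synthetic accuracies showing the typical shape: rapid early gains,
    # then a plateau below 100% accuracy.
    num_examples = [10, 25, 50, 100, 200, 400, 800]
    accuracy = [0.55, 0.68, 0.78, 0.85, 0.89, 0.91, 0.915]

    plt.plot(num_examples, accuracy, marker="o")
    plt.xlabel("number of training examples")
    plt.ylabel("accuracy")
    plt.ylim(0, 1)
    plt.show()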
Decision Trees
Each node represents a feature, and branches represent that feature's values:
    F1
    |-- True: F2
    |   |-- True:  (+)
    |   `-- False: (-)
    `-- False: (+)
How does one construct a decision tree?
ID3 Algorithm
At each node, ask: which feature gains the most information?
    ID3(examples) returns Tree:
        if examples are pure, make a leaf
        else
            for each feature Fi:
                divide examples into two subsets A and B
                calculate information gain
                    IG = H(examples) - sum(weight[b] * entropy[b], b in branches)
            for the Fi that maximizes IG:
                return Tree(Fi, ID3(A), ID3(B))
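A minimal Python sketch of the same idea, assuming discrete feature values and examples given as (feature dict, label) pairs; the function and variable names are mine, not from the lecture:

    from collections import Counter
    from math import log2

    def entropy(labels):
        # H(S) = -sum over classes c of p_c * log2(p_c)
        total = len(labels)
        return -sum((n / total) * log2(n / total)
                    for n in Counter(labels).values())

    def id3(examples, features):
        # examples: list of (feature_dict, label); features: list of feature names
        labels = [label for _, label in examples]
        if len(set(labels)) == 1:              # pure node -> leaf
            return labels[0]
        if not features:                       # no features left -> majority-vote leaf
            return Counter(labels).most_common(1)[0][0]

        def info_gain(f):
            # IG = H(examples) - sum(weight[b] * entropy[b] for branches b)
            gain = entropy(labels)
            for value in {ex[f] for ex, _ in examples}:
                branch = [label for ex, label in examples if ex[f] == value]
                gain -= (len(branch) / len(examples)) * entropy(branch)
            return gain

        best = max(features, key=info_gain)    # feature with maximum information gain
        remaining = [f for f in features if f != best]
        return {best: {value: id3([(ex, label) for ex, label in examples
                                   if ex[best] == value], remaining)
                       for value in {ex[best] for ex, _ in examples}}}

Calling, say, id3([({"F1": True, "F2": True}, "+"), ...], ["F1", "F2"]) returns a nested dict mirroring the tree drawn above.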
Entropy minimization
Over a set of n examples drawn from two classes, if we plot entropy (Y axis) against how many of the examples are positive (X axis), we get an inverted-U curve (roughly parabolic) with roots at 0 and n and a maximum at (n/2, 1).
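The shape follows from the standard two-class entropy formula (a standard definition; the notes leave it implicit):

    H(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-

where p_+ and p_- = 1 - p_+ are the fractions of positive and negative examples in S; H is 0 when all examples share one class and 1 when the split is even.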
Perceptrons: Linear Classifiers
Labeled examples of binary classes (positive or negative)
If these are plotted as points in feature space, how do we find a line (a discriminant function; in higher dimensions, a hyperplane) that separates the positives from the negatives?
In other words, if our examples have feature values x_1, ..., x_n and our target classes are {+, -}, we can write our learned rule as a weighted linear combination compared to a threshold:
    if x1 w1 + x2 w2 >= c
        return +
    else
        return -
    end if
We want to find the weight vector w = (w_1, ..., w_n) such that the examples are classified correctly and the mean squared error between target and predicted outputs is minimized.
Gradient Descent
Delta Training Rule: for each example, update each weight:

    w_i ← w_i + η (t - o) x_i

where η is a small learning rate (e.g., 0.001), t is the target output, and o is the perceptron's current output.
Repeated updates cause an initially random weight vector to migrate toward the weights that best separate the classes.
This has application in neural networks, where each node is "trained" to learn its threshold for firing.
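A minimal Python sketch of this update rule, under assumptions of my own: binary ±1 labels, a made-up tiny dataset, and a thresholded output (strictly, with a thresholded output this is the perceptron training rule; the gradient-descent delta rule applies the same update to the unthresholded sum):

    import random

    def train_perceptron(examples, epochs=1000, eta=0.001):
        # examples: list of (x, t) where x is a feature tuple and t is +1 or -1.
        # The threshold is folded in as the weight w[0] on a constant x0 = 1 term.
        n = len(examples[0][0])
        w = [random.uniform(-0.05, 0.05) for _ in range(n + 1)]
        for _ in range(epochs):
            for x, t in examples:
                xs = (1.0,) + tuple(x)                       # prepend constant term
                o = 1 if sum(wi * xi for wi, xi in zip(w, xs)) >= 0 else -1
                # delta-style update: w_i <- w_i + eta * (t - o) * x_i
                w = [wi + eta * (t - o) * xi for wi, xi in zip(w, xs)]
        return w

    # Hypothetical linearly separable data: label +1 only when both inputs are 1.
    data = [((0, 0), -1), ((1, 0), -1), ((0, 1), -1), ((1, 1), +1)]
    print(train_perceptron(data, epochs=2000, eta=0.01))

Because the data is linearly separable, the weights settle on a separating hyperplane; w[0] plays the role of the threshold c from the rule above.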
Summary
Machine learning boils down to improving a function or a model (hypothesis) h: X → Y:
- X is the instance space
- Y is a discrete output space (the set of class labels)
- h is a hypothesis from the hypothesis space H
  - for ID3, H is the space of all decision trees
  - for perceptrons, H is ℝ^(n+1) for an instance space of dimension n (don't forget the threshold is a weight on the constant term!)