CSCE 420 Lecture 28
Friday, April 26, 2013
Honors Lecture
Machine Learning
"Learning from experience" (in true light of "intelligence")
→ Adapt and Improve performance
- Knowledge Base Repair: correct errors in KB rules
- Neural Networks: a graph that feeds sensor inputs through a "network" of nodes to outputs
Supervised learning
Labeled examples:
- Labels
  - finite set of classes/categories
  - usually small, maybe only positive and negative classes
- Examples
  - described as feature vectors
  - the learned function maps a state of the features to an action (label)
Driving Example
Labels:
- c1 = accelerate
- c2 = do nothing
- c3 = brake
Features:
- F1 = light (red, green, yellow)
- F2 = pedestrians
- F3 = distance to car in front
- F4 = surface wet?
Training Examples:
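As a sketch of what labeled examples look like in this domain, here is a hypothetical Python encoding; the specific feature values and labels are made up for illustration, not taken from the lecture:

    # Hypothetical labeled examples for the driving domain.
    # Each example pairs a feature vector (F1..F4) with a class label.
    training_examples = [
        # (light,   pedestrians, dist_to_car_m, surface_wet)    label
        (("green",  False,       50.0,          False),         "accelerate"),  # c1
        (("green",  False,       10.0,          True),          "do nothing"),  # c2
        (("red",    True,        5.0,           False),         "brake"),       # c3
    ]

    for features, label in training_examples:
        print(features, "->", label)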
Unsupervised learning
Unlabeled examples: the learner must find structure in the data (e.g., clusters of similar examples) on its own
Reinforcement learning
Indirect feedback: rewards arrive only after a sequence of actions, rather than as a label on each example
Credit assignment: deciding which earlier actions deserve credit (or blame) for the eventual reward
Analyzing Learning
Usually plotted on learning curve graphs:
- X axis → number of examples
- Y axis → accuracy
- Usually plateaus before reaching 100% accuracy
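A minimal sketch of how such a curve might be drawn; the accuracy values here are synthetic, chosen only to illustrate the typical plateau shape:

    import matplotlib.pyplot as plt

    # Synthetic accuracies showing the typical shape: rapid early gains,
    # then a plateau below 100% accuracy.
    num_examples = [10, 25, 50, 100, 200, 400, 800]
    accuracy = [0.55, 0.68, 0.78, 0.85, 0.89, 0.91, 0.915]

    plt.plot(num_examples, accuracy, marker="o")
    plt.xlabel("number of training examples")
    plt.ylabel("accuracy")
    plt.ylim(0, 1)
    plt.show()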
Decision Trees
Each node represents a feature, and branches represent that feature's values:
    F1
    |-- True: F2
    |   |-- True:  (+)
    |   `-- False: (-)
    `-- False: (+)
How does one construct a decision tree?
ID3 Algorithm
At each node, ask: which feature gains the most information?
    ID3(examples) returns Tree:
        if examples are pure, make a leaf
        else
            for each feature Fi:
                divide examples into two subsets A and B
                calculate information gain
                    IG = H(examples) - sum(weight[b] * entropy[b], b in branches)
            for the Fi that maximizes IG:
                return Tree(Fi, ID3(A), ID3(B))
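A minimal Python sketch of the same idea, assuming discrete feature values and examples given as (feature dict, label) pairs; the function and variable names are mine, not from the lecture:

    from collections import Counter
    from math import log2

    def entropy(labels):
        # H(S) = -sum over classes c of p_c * log2(p_c)
        total = len(labels)
        return -sum((n / total) * log2(n / total)
                    for n in Counter(labels).values())

    def id3(examples, features):
        # examples: list of (feature_dict, label); features: list of feature names
        labels = [label for _, label in examples]
        if len(set(labels)) == 1:              # pure node -> leaf
            return labels[0]
        if not features:                       # no features left -> majority-vote leaf
            return Counter(labels).most_common(1)[0][0]

        def info_gain(f):
            # IG = H(examples) - sum(weight[b] * entropy[b] for branches b)
            gain = entropy(labels)
            for value in {ex[f] for ex, _ in examples}:
                branch = [label for ex, label in examples if ex[f] == value]
                gain -= (len(branch) / len(examples)) * entropy(branch)
            return gain

        best = max(features, key=info_gain)    # feature with maximum information gain
        remaining = [f for f in features if f != best]
        return {best: {value: id3([(ex, label) for ex, label in examples
                                   if ex[best] == value], remaining)
                       for value in {ex[best] for ex, _ in examples}}}

Calling, say, id3([({"F1": True, "F2": True}, "+"), ...], ["F1", "F2"]) returns a nested dict mirroring the tree drawn above.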
Entropy minimization
Over a set of n examples drawn from two classes, if we plot entropy (Y axis) against how many of the examples are positive (X axis), we get an inverted-U curve (roughly parabolic) with roots at 0 and n and a maximum at (n/2, 1).
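The shape follows from the standard two-class entropy formula (a standard definition; the notes leave it implicit):

    H(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-

where p_+ and p_- = 1 - p_+ are the fractions of positive and negative examples in S; H is 0 when all examples share one class and 1 when the split is even.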
Perceptrons: Linear Classifiers
Labeled examples of binary classes (positive or negative)
If these are plotted as points in feature space, how do we find a line (a discriminant function; in higher dimensions, a hyperplane) that separates the positives from the negatives?
In other words, if our examples have feature values x_1, ..., x_n and our target classes are {+, -}, we can write our learned rule as a weighted linear combination compared to a threshold:
    if x1 w1 + x2 w2 >= c
        return +
    else
        return -
    end if
We want to find the weight vector w = (w_1, ..., w_n) such that the examples are classified correctly and the mean squared error between target and predicted outputs is minimized.
Gradient Descent
Delta Training Rule: for each example, update each weight:

    w_i ← w_i + η (t - o) x_i

where η is a small learning rate (e.g., 0.001), t is the target output, and o is the perceptron's current output.
Repeated updates cause an initially random weight vector to migrate toward the weights that best separate the classes.
This has application in neural networks, where each node is "trained" to learn its threshold for firing.
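A minimal Python sketch of this update rule, under assumptions of my own: binary ±1 labels, a made-up tiny dataset, and a thresholded output (strictly, with a thresholded output this is the perceptron training rule; the gradient-descent delta rule applies the same update to the unthresholded sum):

    import random

    def train_perceptron(examples, epochs=1000, eta=0.001):
        # examples: list of (x, t) where x is a feature tuple and t is +1 or -1.
        # The threshold is folded in as the weight w[0] on a constant x0 = 1 term.
        n = len(examples[0][0])
        w = [random.uniform(-0.05, 0.05) for _ in range(n + 1)]
        for _ in range(epochs):
            for x, t in examples:
                xs = (1.0,) + tuple(x)                       # prepend constant term
                o = 1 if sum(wi * xi for wi, xi in zip(w, xs)) >= 0 else -1
                # delta-style update: w_i <- w_i + eta * (t - o) * x_i
                w = [wi + eta * (t - o) * xi for wi, xi in zip(w, xs)]
        return w

    # Hypothetical linearly separable data: label +1 only when both inputs are 1.
    data = [((0, 0), -1), ((1, 0), -1), ((0, 1), -1), ((1, 1), +1)]
    print(train_perceptron(data, epochs=2000, eta=0.01))

Because the data is linearly separable, the weights settle on a separating hyperplane; w[0] plays the role of the threshold c from the rule above.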
Summary
Machine learning boils down to improving a function or a model (hypothesis) h: X → Y:
- X is the instance space
- Y is a discrete output space (the set of class labels)
- h is a hypothesis from the hypothesis space H
  - for ID3, H is the space of all decision trees
  - for perceptrons, H is ℝ^(n+1) for an instance space of dimension n (don't forget the threshold is a weight on the constant term!)