Important Notation

Input features - input variables 
Target variable - output you are trying to predict
Training set - list of m training examples
Hypothesis - function that given x will predict y 
Regression problem - when the target variable is continuous
Classification problem - when the target variable is discrete 
Parameters (weights) - weights for each x input (parameterize the space of linear functions mapping from X to Y).
Intercept term - convention of letting x0 = 1, so that our θ0x0 term becomes simply θ0.

Parametric algorithm - an algorithm with a fixed number of parameters (thetas); once fit, we don't need to keep the entire training set around. So the amount we need to store is unrelated to the size of the training set.

Linear Regression
Least Mean Squares (LMS) Update Rule
One way to minimize the cost function. J is a convex quadratic function, so convergence MUST occur (there is a single global optimum and no other local optima).
Utilizes Gradient Descent 
Descends gradient to find minima 
Ordinary Least Squares Regression Model
Fits a linear regression by minimizing the least-squares cost function

J(θ) = (1/2) Σ_i (h_θ(x^(i)) − y^(i))²

where h_θ(x) = θᵀx is our hypothesis.
Using LMS Update Rule
Repeat the following until convergence: 
Weights calculated by

θ_j := θ_j + α (y^(i) − h_θ(x^(i))) x_j^(i)

(for each j, where α is the learning rate)
Batch gradient descent
Looks at every example in the training set at every step.
Stochastic Gradient Descent
Updates θ with each training example. Often best when the training set is large.
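The two gradient descent variants above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the learning rates, iteration counts, and toy data are assumptions chosen so both runs converge.

```python
import numpy as np

def batch_gd(X, y, alpha=0.5, iters=1000):
    # Batch: every step sums the LMS update over ALL m training examples.
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta += alpha * X.T @ (y - X @ theta) / len(y)  # 1/m keeps the step size stable
    return theta

def stochastic_gd(X, y, alpha=0.05, epochs=100, seed=0):
    # Stochastic: update theta after EACH training example.
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            theta += alpha * (y[i] - X[i] @ theta) * X[i]
    return theta

# Toy data with an intercept column x0 = 1; the true parameters are [1, 2].
X = np.column_stack([np.ones(50), np.linspace(0.0, 1.0, 50)])
y = X @ np.array([1.0, 2.0])
print(batch_gd(X, y), stochastic_gd(X, y))  # both should approach [1., 2.]
```

Batch sweeps the whole set per step; stochastic makes many cheap noisy steps, which is why it tends to win on large training sets.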
Using Normal equations
You can also minimize J in closed form, using the normal equations and matrix algebra.
Cost function minimized when

θ = (XᵀX)⁻¹ Xᵀ y⃗
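The closed-form normal-equation solution θ = (XᵀX)⁻¹Xᵀy can be checked numerically. A minimal sketch with made-up data (solving the linear system directly instead of forming an explicit inverse):

```python
import numpy as np

# Design matrix with an intercept column x0 = 1; toy data satisfying y = 1 + 2x exactly.
X = np.column_stack([np.ones(5), np.arange(5.0)])
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

# Normal equations: (X^T X) theta = X^T y.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # approximately [1., 2.]
```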
Why does least-squares cost function make sense?

Let us assume that the target variables and inputs are related via

y^(i) = θᵀx^(i) + ε^(i)

where ε^(i) is an error term, distributed IID according to a Normal distribution with mean 0 and some variance σ².

Then we have

p(y^(i) | x^(i); θ) = (1/(√(2π)σ)) exp(−(y^(i) − θᵀx^(i))² / (2σ²))

(since the error term is normally distributed)
Likelihood function

L(θ) = Π_i p(y^(i) | x^(i); θ)

So the "probability of the data" is:

L(θ) = Π_i (1/(√(2π)σ)) exp(−(y^(i) − θᵀx^(i))² / (2σ²))

Taking the log of the likelihood function:

ℓ(θ) = m log(1/(√(2π)σ)) − (1/σ²) · (1/2) Σ_i (y^(i) − θᵀx^(i))²

We see that maximizing ℓ is the same as minimizing the sum on the right, which is exactly our cost function J!

Locally Weighted Linear Regression
Definition
Basically the same thing as linear regression, but a non-parametric algorithm. In making a prediction at a particular query point x, each training example is weighted based on how "far" it is from that x: fit θ to minimize Σ_i w^(i)(y^(i) − θᵀx^(i))², where a fairly standard choice is

w^(i) = exp(−(x^(i) − x)² / (2τ²))

The w^(i) are non-negative values called weights; τ is the bandwidth parameter.
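The locally weighted scheme described above can be sketched in NumPy, assuming the standard Gaussian weighting w^(i) = exp(−(x^(i) − x)²/(2τ²)); the toy data and the bandwidth τ = 0.2 are illustrative choices, not prescribed values.

```python
import numpy as np

def lwr_predict(x_query, X, y, tau=0.5):
    # Weight each training example by its distance to the query point
    # (the intercept column contributes zero to the distance).
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    # Weighted normal equations: theta = (X^T W X)^{-1} X^T W y.
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta

# A curve no single global line fits well: y = x^2 on [-1, 1].
X = np.column_stack([np.ones(21), np.linspace(-1.0, 1.0, 21)])
y = X[:, 1] ** 2
pred = lwr_predict(np.array([1.0, 0.5]), X, y, tau=0.2)
print(pred)  # close to the true value 0.25
```

Note that θ is re-solved for every query point, which is exactly why the method is non-parametric: you must keep the whole training set around.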
Logistic Regression
Definition
We want to use logistic regression for classification problems (linear regression does poorly here, since its predictions are not confined to the values 0 and 1 that y takes; see also the perceptron learning algorithm).
Main Idea
Our logistic function (the sigmoid):

g(z) = 1 / (1 + e^(−z))
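The logistic function can be sanity-checked in a few lines, assuming the standard form g(z) = 1/(1 + e^(−z)):

```python
import math

def g(z):
    # The logistic (sigmoid) function squashes any real z into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

print(g(0))           # 0.5 -- the midpoint
print(g(10) > 0.99)   # saturates toward 1 for large z
print(g(-10) < 0.01)  # and toward 0 for very negative z
```

A handy identity used when differentiating the log likelihood: g′(z) = g(z)(1 − g(z)).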
Logistic Function Graph
[Figure: the sigmoid g(z), rising from 0 toward 1 with g(0) = 1/2.]
Hypothesis for logistic regression

h_θ(x) = g(θᵀx) = 1 / (1 + e^(−θᵀx))

To fit θ to data, assume

P(y = 1 | x; θ) = h_θ(x)
P(y = 0 | x; θ) = 1 − h_θ(x)

Likelihood function

L(θ) = Π_i (h_θ(x^(i)))^(y^(i)) (1 − h_θ(x^(i)))^(1 − y^(i))

Log likelihood

ℓ(θ) = Σ_i [ y^(i) log h_θ(x^(i)) + (1 − y^(i)) log(1 − h_θ(x^(i))) ]

Derivative of log likelihood (with respect to θ_j)

∂ℓ/∂θ_j = (y − h_θ(x)) x_j

Our update rule for logistic regression (gradient ascent):

θ_j := θ_j + α (y^(i) − h_θ(x^(i))) x_j^(i)

Same as LMS update rule!  Whoa.
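The gradient-ascent update for logistic regression can be sketched end to end. This is a minimal illustration under assumed settings (learning rate, iteration count, and a toy 1-D dataset where y = 1 exactly when x > 0):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, iters=2000):
    # Gradient ascent on the log likelihood. The update has the same
    # (y - h(x)) * x shape as the LMS rule, but h is now sigmoid(theta^T x).
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ theta)
        theta += alpha * X.T @ (y - h) / len(y)
    return theta

# Toy data with an intercept column x0 = 1: label 1 for positive x, 0 otherwise.
X = np.column_stack([np.ones(20),
                     np.concatenate([np.linspace(-2, -0.1, 10),
                                     np.linspace(0.1, 2, 10)])])
y = np.concatenate([np.zeros(10), np.ones(10)])
theta = fit_logistic(X, y)
preds = (sigmoid(X @ theta) >= 0.5).astype(float)
print((preds == y).mean())  # training accuracy
```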
Perceptron Learning Algorithm
Definition
A way to solve a classification problem with linear regression.  Let

g(z) = 1 if z ≥ 0, and 0 otherwise

and, as before, h_θ(x) = g(θᵀx).
Our update rule

θ_j := θ_j + α (y^(i) − h_θ(x^(i))) x_j^(i)

Same as LMS update rule/logistic regression!  Whoa.

Newton's Method to Maximize Log Likelihood
Why Newton's Method?
So far, we can maximize the log likelihood by using the normal equations, or by gradient ascent.  We can also use Newton's method.
Newton's Method
Newton's method finds a zero of a function f: pick a starting θ close to the root and repeat:

θ := θ − f(θ)/f′(θ)

To maximize ℓ, we look for a zero of its derivative, so repeat:

θ := θ − ℓ′(θ)/ℓ″(θ)

Our update rule (vector form)

θ := θ − H⁻¹ ∇_θ ℓ(θ)

where H is the Hessian of ℓ.