Important Notation

Input features - input variables 
Target variable - output you are trying to predict
Training set - list of m training examples
Hypothesis - function that given x will predict y 
Regression problem - when the target variable is continuous
Classification problem - when the target variable is discrete 
Parameters (weights) - weights for each x input (parameterize the space of linear functions mapping from X to Y).
Intercept term - convention of letting x0 = 1, so that our θ0x0 term becomes simply θ0.

Parametric algorithm - an algorithm with a fixed number of parameters (thetas); once fit, we don't need to keep the entire training set around. So the amount we need to store is unrelated to the size of the training set.

Linear Regression
Least Mean Squares (LMS) Update Rule
One way to minimize the cost function. J is a convex quadratic function, so convergence MUST occur (there is a single global optimum and no other local optima).
Utilizes Gradient Descent 
Descends gradient to find minima 
Ordinary Least Squares Regression Model
Fits a linear regression by minimizing the least-squares cost function

J(θ) = (1/2) Σ_i (h_θ(x^(i)) − y^(i))²

where h_θ(x) = θᵀx is our hypothesis.
Using LMS Update Rule
Repeat the following until convergence: 
Weights calculated by

θ_j := θ_j + α (y^(i) − h_θ(x^(i))) x_j^(i)

(for each j, where α is the learning rate)
Batch gradient descent
Looks at every example in the training set at every step.
Stochastic Gradient Descent
Updates θ with each training example. Often best when the training set is large.
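The two gradient descent variants above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the learning rates, iteration counts, and toy data are assumptions chosen so both runs converge.

```python
import numpy as np

def batch_gd(X, y, alpha=0.5, iters=1000):
    # Batch: every step sums the LMS update over ALL m training examples.
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta += alpha * X.T @ (y - X @ theta) / len(y)  # 1/m keeps the step size stable
    return theta

def stochastic_gd(X, y, alpha=0.05, epochs=100, seed=0):
    # Stochastic: update theta after EACH training example.
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            theta += alpha * (y[i] - X[i] @ theta) * X[i]
    return theta

# Toy data with an intercept column x0 = 1; the true parameters are [1, 2].
X = np.column_stack([np.ones(50), np.linspace(0.0, 1.0, 50)])
y = X @ np.array([1.0, 2.0])
print(batch_gd(X, y), stochastic_gd(X, y))  # both should approach [1., 2.]
```

Batch sweeps the whole set per step; stochastic makes many cheap noisy steps, which is why it tends to win on large training sets.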
Using Normal equations
You can also minimize J in closed form, using the normal equations and matrix algebra.
Cost function minimized when

θ = (XᵀX)⁻¹ Xᵀ y⃗
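The closed-form normal-equation solution θ = (XᵀX)⁻¹Xᵀy can be checked numerically. A minimal sketch with made-up data (solving the linear system directly instead of forming an explicit inverse):

```python
import numpy as np

# Design matrix with an intercept column x0 = 1; toy data satisfying y = 1 + 2x exactly.
X = np.column_stack([np.ones(5), np.arange(5.0)])
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

# Normal equations: (X^T X) theta = X^T y.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # approximately [1., 2.]
```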
Why does least-squares cost function make sense?

Let us assume that the target variables and inputs are related via

y^(i) = θᵀx^(i) + ε^(i)

where ε^(i) is an error term, distributed IID according to a Normal distribution with mean 0 and some variance σ².

Then we have

p(y^(i) | x^(i); θ) = (1/(√(2π)σ)) exp(−(y^(i) − θᵀx^(i))² / (2σ²))

(since the error term is normally distributed)
Likelihood function

L(θ) = Π_i p(y^(i) | x^(i); θ)

So the "probability of the data" is:

L(θ) = Π_i (1/(√(2π)σ)) exp(−(y^(i) − θᵀx^(i))² / (2σ²))

Taking the log of the likelihood function:

ℓ(θ) = m log(1/(√(2π)σ)) − (1/σ²) · (1/2) Σ_i (y^(i) − θᵀx^(i))²

We see that maximizing ℓ is the same as minimizing the sum on the right, which is exactly our cost function J!

Locally Weighted Linear Regression
Definition
Basically the same thing as linear regression, but a non-parametric algorithm. In making a prediction at a particular query point x, each training example is weighted based on how "far" it is from that x: fit θ to minimize Σ_i w^(i)(y^(i) − θᵀx^(i))², where a fairly standard choice is

w^(i) = exp(−(x^(i) − x)² / (2τ²))

The w^(i) are non-negative values called weights; τ is the bandwidth parameter.
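The locally weighted scheme described above can be sketched in NumPy, assuming the standard Gaussian weighting w^(i) = exp(−(x^(i) − x)²/(2τ²)); the toy data and the bandwidth τ = 0.2 are illustrative choices, not prescribed values.

```python
import numpy as np

def lwr_predict(x_query, X, y, tau=0.5):
    # Weight each training example by its distance to the query point
    # (the intercept column contributes zero to the distance).
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    # Weighted normal equations: theta = (X^T W X)^{-1} X^T W y.
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta

# A curve no single global line fits well: y = x^2 on [-1, 1].
X = np.column_stack([np.ones(21), np.linspace(-1.0, 1.0, 21)])
y = X[:, 1] ** 2
pred = lwr_predict(np.array([1.0, 0.5]), X, y, tau=0.2)
print(pred)  # close to the true value 0.25
```

Note that θ is re-solved for every query point, which is exactly why the method is non-parametric: you must keep the whole training set around.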
Logistic Regression
Definition
We want to use logistic regression for classification problems (linear regression does poorly here, since its predictions are not confined to the values 0 and 1 that y takes; see also the perceptron learning algorithm).
Main Idea
Our logistic function (the sigmoid):

g(z) = 1 / (1 + e^(−z))
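The logistic function can be sanity-checked in a few lines, assuming the standard form g(z) = 1/(1 + e^(−z)):

```python
import math

def g(z):
    # The logistic (sigmoid) function squashes any real z into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

print(g(0))           # 0.5 -- the midpoint
print(g(10) > 0.99)   # saturates toward 1 for large z
print(g(-10) < 0.01)  # and toward 0 for very negative z
```

A handy identity used when differentiating the log likelihood: g′(z) = g(z)(1 − g(z)).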
Logistic Function Graph
[Figure: the sigmoid g(z), rising from 0 toward 1 with g(0) = 1/2.]
Hypothesis for logistic regression

h_θ(x) = g(θᵀx) = 1 / (1 + e^(−θᵀx))

To fit θ to data, assume

P(y = 1 | x; θ) = h_θ(x)
P(y = 0 | x; θ) = 1 − h_θ(x)

Likelihood function

L(θ) = Π_i (h_θ(x^(i)))^(y^(i)) (1 − h_θ(x^(i)))^(1 − y^(i))

Log likelihood

ℓ(θ) = Σ_i [ y^(i) log h_θ(x^(i)) + (1 − y^(i)) log(1 − h_θ(x^(i))) ]

Derivative of log likelihood (with respect to θ_j)

∂ℓ/∂θ_j = (y − h_θ(x)) x_j

Our update rule for logistic regression (gradient ascent):

θ_j := θ_j + α (y^(i) − h_θ(x^(i))) x_j^(i)

Same as LMS update rule!  Whoa.
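The gradient-ascent update for logistic regression can be sketched end to end. This is a minimal illustration under assumed settings (learning rate, iteration count, and a toy 1-D dataset where y = 1 exactly when x > 0):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, iters=2000):
    # Gradient ascent on the log likelihood. The update has the same
    # (y - h(x)) * x shape as the LMS rule, but h is now sigmoid(theta^T x).
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ theta)
        theta += alpha * X.T @ (y - h) / len(y)
    return theta

# Toy data with an intercept column x0 = 1: label 1 for positive x, 0 otherwise.
X = np.column_stack([np.ones(20),
                     np.concatenate([np.linspace(-2, -0.1, 10),
                                     np.linspace(0.1, 2, 10)])])
y = np.concatenate([np.zeros(10), np.ones(10)])
theta = fit_logistic(X, y)
preds = (sigmoid(X @ theta) >= 0.5).astype(float)
print((preds == y).mean())  # training accuracy
```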
Perceptron Learning Algorithm
Definition
A way to solve a classification problem with linear regression.  Let

g(z) = 1 if z ≥ 0, and 0 otherwise

and, as before, h_θ(x) = g(θᵀx).
Our update rule

θ_j := θ_j + α (y^(i) − h_θ(x^(i))) x_j^(i)

Same as LMS update rule/logistic regression!  Whoa.

Newton's Method to Maximize Log Likelihood
Why Newton's Method?
So far, we can maximize the log likelihood by using the normal equations, or by gradient ascent.  We can also use Newton's method.
Newton's Method
Newton's method finds a zero of a function f: pick a starting θ close to the root and repeat:

θ := θ − f(θ)/f′(θ)

To maximize ℓ, we look for a zero of its derivative, so repeat:

θ := θ − ℓ′(θ)/ℓ″(θ)

Our update rule (vector form)

θ := θ − H⁻¹ ∇_θ ℓ(θ)

where H is the Hessian of ℓ.