Generative Learning Algorithms
Parameters

For a specific feature x_i, we keep one class-conditional parameter for y = 1,

(and likewise one for y = 0), plus the class prior.
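In the usual notation (a reconstruction; the original equations were stripped here):

\phi_{i|y=1} = p(x_i \mid y = 1), \qquad \phi_{i|y=0} = p(x_i \mid y = 0), \qquad \phi_y = p(y = 1)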

Difference between discriminative and generative models
Discriminative models learn p(y|x) directly, whereas generative models learn p(x|y) (together with the class prior p(y)).
Why this is useful

We can still get our original p(y|x) by computing it with Bayes' theorem:
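p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}

where p(x) = p(x \mid y = 1)\, p(y = 1) + p(x \mid y = 0)\, p(y = 0).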

(and we don't even need to compute p(x): when maximizing p(y|x) over y, p(x) is a constant that doesn't affect the arg max)
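\arg\max_y p(y \mid x) = \arg\max_y \frac{p(x \mid y)\, p(y)}{p(x)} = \arg\max_y p(x \mid y)\, p(y)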

Gaussian Discriminant Analysis
RECAP: Multivariate Normal Distribution 
Why we use this: now we're looking for p(x|y=something). Since x is a vector (multidimensional), we use a multivariate normal distribution to model each class-conditional p(x|y).
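Its density, for a mean vector \mu \in \mathbb{R}^n and covariance matrix \Sigma \in \mathbb{R}^{n \times n}:

p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)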
When we use this (GDA): when we have a classification problem in which the input features x are continuous-valued random variables.
GDA Model
The model is: 
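y \sim \mathrm{Bernoulli}(\phi)
x \mid y = 0 \sim \mathcal{N}(\mu_0, \Sigma)
x \mid y = 1 \sim \mathcal{N}(\mu_1, \Sigma)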

It is parameterized by the class prior φ, two mean vectors μ0 and μ1, and a single shared covariance matrix Σ.
AKA, written out as densities:
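p(y) = \phi^y (1 - \phi)^{1 - y}

p(x \mid y = 0) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu_0)^T \Sigma^{-1} (x - \mu_0) \right)

p(x \mid y = 1) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) \right)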
Thus, we have the log-likelihood function (the joint likelihood, since a generative model fits x and y together):
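\ell(\phi, \mu_0, \mu_1, \Sigma) = \log \prod_{i=1}^m p(x^{(i)}, y^{(i)}) = \log \prod_{i=1}^m p(x^{(i)} \mid y^{(i)})\, p(y^{(i)})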
Maximizing ℓ with respect to the parameters, we get the maximum likelihood estimates:
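\phi = \frac{1}{m} \sum_{i=1}^m 1\{y^{(i)} = 1\}

\mu_0 = \frac{\sum_{i=1}^m 1\{y^{(i)} = 0\}\, x^{(i)}}{\sum_{i=1}^m 1\{y^{(i)} = 0\}}, \qquad \mu_1 = \frac{\sum_{i=1}^m 1\{y^{(i)} = 1\}\, x^{(i)}}{\sum_{i=1}^m 1\{y^{(i)} = 1\}}

\Sigma = \frac{1}{m} \sum_{i=1}^m (x^{(i)} - \mu_{y^{(i)}})(x^{(i)} - \mu_{y^{(i)}})^T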
Relationship to logistic regression
Logistic regression also takes continuous-valued features x and classifies y. So, when is one model better than the other?
GDA vs Logistic Regression
If p(x|y) is multivariate Gaussian (with shared Σ), then it follows that p(y|x) is a logistic function of x. HOWEVER, the converse does not hold. Thus GDA makes stronger modeling assumptions about the data, which pays off only when those assumptions are correct.
- Specifically, when p(x|y) really is Gaussian (with shared covariance matrix), GDA is asymptotically efficient and more data-efficient (requires less training data to do "well").
- Conversely, when p(x|y) is not Gaussian, logistic regression is the more robust choice: it makes weaker assumptions, so with enough data it will outperform a misspecified GDA.
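A minimal numpy sketch of GDA using the closed-form estimates above; fit_gda/predict_gda and the log-space decision rule are illustrative choices, not from the notes:

```python
import numpy as np

def fit_gda(X, y):
    """Closed-form maximum likelihood estimates for GDA with shared covariance.

    X: (m, n) array of continuous features; y: (m,) array of 0/1 labels.
    """
    m, _ = X.shape
    phi = np.mean(y == 1)                          # p(y = 1)
    mu0 = X[y == 0].mean(axis=0)                   # mean of class-0 examples
    mu1 = X[y == 1].mean(axis=0)                   # mean of class-1 examples
    centered = X - np.where((y == 1)[:, None], mu1, mu0)
    Sigma = centered.T @ centered / m              # single shared covariance
    return phi, mu0, mu1, Sigma

def predict_gda(X, phi, mu0, mu1, Sigma):
    """argmax_y p(x|y) p(y); p(x) and the shared |Sigma| factor cancel."""
    Sinv = np.linalg.inv(Sigma)
    def score(mu, prior):
        d = X - mu
        # log N(x; mu, Sigma) + log prior, dropping terms common to both classes
        return -0.5 * np.einsum('ij,jk,ik->i', d, Sinv, d) + np.log(prior)
    return (score(mu1, phi) > score(mu0, 1 - phi)).astype(int)
```

Because Σ is shared between the two classes, the resulting decision boundary is linear in x, which is exactly why p(y|x) comes out logistic.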
Multi-variate Bernoulli event model
(Naive Bayes)
Overview
Why we use this: if we have a very high-dimensional feature vector, GDA becomes unreasonable because there are far too many parameters to estimate (the covariance matrix alone has on the order of n^2 entries).
When we use this: When we have feature vectors that are discrete (or we can discretize the features) and we have a classification problem.  
How phis are modified (multi-variate)
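In the multi-variate Bernoulli model each feature is binary, x_i \in \{0, 1\}, so the phis become:

\phi_{i|y=1} = p(x_i = 1 \mid y = 1), \qquad \phi_{i|y=0} = p(x_i = 1 \mid y = 0)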
Naive Bayes Assumption
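The assumption: the x_i are conditionally independent given y, so (by the chain rule plus the assumption)

p(x_1, \ldots, x_n \mid y) = \prod_{i=1}^n p(x_i \mid y)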
Likelihood function of the data
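\mathcal{L}(\phi_y, \phi_{i|y=0}, \phi_{i|y=1}) = \prod_{i=1}^m p(x^{(i)}, y^{(i)}) = \prod_{i=1}^m \left( \prod_{j=1}^n p(x_j^{(i)} \mid y^{(i)}) \right) p(y^{(i)})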
Why?
If some feature value is never seen in training, then its estimates φ_{i|y=1} and φ_{i|y=0} both evaluate to 0, so p(y|x) evaluates to 0/0 (that zero factor appears in the product in both the numerator and the denominator).
Parameters evaluated
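Maximizing the likelihood gives:

\phi_{j|y=1} = \frac{\sum_{i=1}^m 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 1\}}{\sum_{i=1}^m 1\{y^{(i)} = 1\}}, \qquad \phi_{j|y=0} = \frac{\sum_{i=1}^m 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 0\}}{\sum_{i=1}^m 1\{y^{(i)} = 0\}}, \qquad \phi_y = \frac{1}{m} \sum_{i=1}^m 1\{y^{(i)} = 1\}

(These are just empirical fractions, e.g. φ_{j|y=1} is the fraction of y = 1 examples in which feature j is on.)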
To make a prediction
Calculate p(y=1|x) and p(y=0|x), and choose the class with the higher probability.
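p(y = 1 \mid x) = \frac{\left( \prod_{i=1}^n p(x_i \mid y = 1) \right) p(y = 1)}{\left( \prod_{i=1}^n p(x_i \mid y = 1) \right) p(y = 1) + \left( \prod_{i=1}^n p(x_i \mid y = 0) \right) p(y = 0)}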
Main Idea
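For a multinomial z taking values in {1, ..., k}, the Laplace-smoothed estimate adds 1 to every count:

\phi_j = \frac{\sum_{i=1}^m 1\{z^{(i)} = j\} + 1}{m + k}

so no value ever gets estimated probability exactly 0.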
With Laplace Smoothing
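Applied to the naive Bayes phis (k = 2, since each feature is binary):

\phi_{j|y=1} = \frac{\sum_{i=1}^m 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 1\} + 1}{\sum_{i=1}^m 1\{y^{(i)} = 1\} + 2}

(and similarly for y = 0).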
Multinomial event model
(Naive Bayes)
A Naive Bayes algorithm
Like the multi-variate Bernoulli model, except each example is a variable-length sequence of words (e.g. an email), and word frequency is taken into account.
How phis are modified (multinomial)
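Now the phis are indexed by words in the vocabulary rather than by feature positions, and the distribution is shared across positions:

\phi_{k|y=1} = p(x_j = k \mid y = 1), \qquad \phi_{k|y=0} = p(x_j = k \mid y = 0), \qquad \phi_y = p(y = 1)

i.e. φ_{k|y=1} is the probability that any given word position holds word k of the vocabulary, given y = 1.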
Likelihood function of the data
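\mathcal{L}(\phi_y, \phi_{k|y}) = \prod_{i=1}^m \left( \prod_{j=1}^{n_i} p(x_j^{(i)} \mid y^{(i)}) \right) p(y^{(i)})

where n_i is the number of words in the i-th example.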
Parameters evaluated
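With Laplace smoothing (add 1 per word count in the numerator and |V|, the vocabulary size, in the denominator):

\phi_{k|y=1} = \frac{\sum_{i=1}^m \sum_{j=1}^{n_i} 1\{x_j^{(i)} = k \wedge y^{(i)} = 1\} + 1}{\sum_{i=1}^m 1\{y^{(i)} = 1\}\, n_i + |V|}

(and similarly for y = 0; φ_y is unchanged).

A minimal numpy sketch of this model; fit_multinomial_nb/predict_multinomial_nb and the list-of-word-index-arrays input format are illustrative assumptions:

```python
import numpy as np

def fit_multinomial_nb(docs, y, vocab_size):
    """Laplace-smoothed MLE for the multinomial event model.

    docs: list of 1-D int arrays, each holding one document's word indices.
    y: (m,) array of 0/1 labels.
    """
    counts = np.ones((2, vocab_size))        # the +1 from Laplace smoothing
    totals = np.full(2, float(vocab_size))   # the +|V| in the denominator
    for words, label in zip(docs, y):
        np.add.at(counts[label], words, 1)   # per-word counts for this class
        totals[label] += len(words)          # total words seen in this class
    phi_word = counts / totals[:, None]      # phi_{k|y=0} and phi_{k|y=1}
    phi_y = np.mean(np.asarray(y) == 1)      # p(y = 1)
    return phi_word, phi_y

def predict_multinomial_nb(doc, phi_word, phi_y):
    """Compare log p(x|y) p(y) for both classes; logs avoid underflow."""
    log_p1 = np.log(phi_word[1][doc]).sum() + np.log(phi_y)
    log_p0 = np.log(phi_word[0][doc]).sum() + np.log(1 - phi_y)
    return int(log_p1 > log_p0)
```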