Shingling 
Represent document as set of "shingles", or segments of the original data.
- Shingles might be large size, so can hash them to four bytes each.
However, there may be LOTS of shingles, and even hashed shingles will be too much space. 
Minhashing 
Pick N different permutations of rows, and then picks values for h by scanning from the top until we come to a 1. 
- This is a signature.
- Instead of permuting, can just pick N hash functions 
 Do this for n different hash functions, so we  get a signature matrix.
1) Must be locality Sensitive 
For F  to be (d1, d2, p1, p2) sensitive: 
1) Probability that f(x)=f(y) is at least p1, if d(x,y) < d1
2) Probability that f(x)=f(y) is at most p2, if d(x,y) > d2
Basically says that function (d vs p) should decrease downward. 
(IMPORTANT) Probabilities 
Assume two columns C1, C2 are 80% similar, 20 bands, with 5 ints per band. 
- Probability of C1, C2 identical in one band: P(identical_one) = (0.8)^5
- Probability of C1, C2 not being identical in any band: (1-P(identical_one))^20
Theory of Locality Sensitive Functions
1) Must be more likely to make close pairs be candidate pairs than distant pairs. 
2) Must be statistically independent.
3) Efficient. 
Banding Technique 
Key observation: similar documents will have similar shingling and thus similar signatures. AKA, they will have bands in their signatures that are exactly the same. 
- Break up signature into b bands (will have r = lengthOfSignature/b rows in each band). 
- Hash bands into buckets of candidates.
- Examine only candidates in buckets with each other  
Similarity to Jaccard Similarity 
Explains why we can do minhashing, and it's the same as computing Jaccard similarity of entire sets.
ie. P(minhash signature for documents to agree) = s = Jaccard similarity 
That is, probability of equal minhash values for two columns = Jaccard similarity! 
AND-OR
Description
Minhashing Specifics
We have N hash functions and M rows. We will have a matrix signature of N x M, initialized to all inf values. 
Algorithm: You iterate through each row of original matrix.
- Compute the hash functions of row #.
- For each column, if 1, then we update (we update signature with min of current values in column of signature and current hashed vals)
Example
Document = "abcab", k=2
Shingles = {"ab", "bc", "ca"}
Jaccard Similarity of Two Sets
Size of intersection of two sets, divided by size of union of two sets.
OR-AND
Description
   Login to remove ads X
Feedback | How-To