NLP

Regex: Disjunctions
Letters
matches any single character listed in the brackets.
Ex. [wW] matches "w" or "W"
Ranges
matches any character in the range, e.g. any capital letter or any digit.
Ex. [A-Z], [0-9]
Negations 
matches anything that is not in the brackets.
Ex. [^Ss]
Pipe
matches either alternative.
Ex. yours|mine
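A quick sketch of these disjunction operators using Python's `re` module (the example strings are illustrative):

```python
import re

# Character class: [wW] matches either "w" or "W".
assert re.search(r"[wW]oodchuck", "Woodchuck spotted")

# Ranges: [A-Z] any capital letter, [0-9] any digit.
assert re.findall(r"[0-9]", "room 42") == ["4", "2"]

# Negation: [^Ss] matches any character that is NOT "S" or "s".
assert re.findall(r"[^Ss]", "Ss!") == ["!"]

# Pipe: matches either full alternative.
assert re.findall(r"yours|mine", "yours and mine") == ["yours", "mine"]
```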
Regex: ? * + .
?
Optional previous character.
Ex. colou?r
*
0 or more of previous character.
Ex. oo*h
+
1 or more of the previous character.
.
any character.
Ex. beg.n --> begin, begun, beg3n.
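The same four operators, sketched with `re.fullmatch` so each pattern must match the whole string:

```python
import re

# ? : optional previous character — matches "color" and "colour".
assert re.fullmatch(r"colou?r", "color")
assert re.fullmatch(r"colou?r", "colour")

# * : zero or more of the previous character.
assert re.fullmatch(r"oo*h", "oh")
assert re.fullmatch(r"oo*h", "oooooh")

# + : one or more of the previous character.
assert re.fullmatch(r"o+h", "oooh")
assert not re.fullmatch(r"o+h", "h")

# . : any single character.
assert all(re.fullmatch(r"beg.n", w) for w in ["begin", "begun", "beg3n"])
```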
Regex: Anchors
^
beginning of line.
Ex. ^[A-Z] --> Palo Alto
$
end of line.
Ex. \.$ --> The end.
\
escapes the character.
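The anchors and escaping in Python's `re`:

```python
import re

# ^ : anchors the match at the beginning of the line.
assert re.match(r"^[A-Z]", "Palo Alto")
assert not re.match(r"^[A-Z]", "palo alto")

# $ : anchors at the end of the line; \. escapes the period
# so it matches a literal "." rather than any character.
assert re.search(r"\.$", "The end.")
assert not re.search(r"\.$", "The end")
```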

Regex

Errors
False positives
Matching strings that we should not have matched.
Ex. there, then, other
False negatives
Not matching strings that we should have matched.
Ex. pattern "the" failing to match "The"
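Both error types show up when searching for "the" in a sample sentence (the sentence is illustrative):

```python
import re

text = "The other one said there is the way."

# Pattern "the" alone: false negative — it misses the capitalized "The".
assert re.findall(r"the", text) == ["the", "the", "the"]  # in other, there, the

# Pattern "[tT]he": fixes the false negative but introduces false
# positives — it also matches inside "other" and "there".
assert len(re.findall(r"[tT]he", text)) == 4

# Word boundaries \b remove those false positives.
assert re.findall(r"\b[tT]he\b", text) == ["The", "the"]
```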

Word Tokenization

Segmenting/tokenizing
Lemma
same stem, part of speech, rough word sense.
Ex. cat and cats = same lemma
Wordform
the full inflected surface form.
Ex. cat and cats = different wordforms
Type
an element of the vocabulary.
Token
an instance of that type in running text.
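The type/token distinction in a few lines, using naive whitespace tokenization (the sentence is illustrative):

```python
# Tokens are running instances; types are the distinct vocabulary items.
text = "the cat sat on the mat"
tokens = text.split()      # naive whitespace tokenization
types = set(tokens)        # V, the set of types

N = len(tokens)            # N = number of tokens
V_size = len(types)        # |V| = size of vocabulary

assert N == 6
assert V_size == 5         # "the" occurs twice but is one type
```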
Normalizing word formats
Information Retrieval: indexed text & query terms have same form.
reduce all letters to lower case, since users tend to use lower case.

ML:
for sentiment analysis, case is helpful

We implicitly define equivalence classes of terms.

Alternative: asymmetric expansion.
Segmenting sentences
!, ? are relatively easy.
Period "." not so much.

Build a binary classifier.
Looks at a ".", decides EndOfSentence/NotEndOfSentence.
N = number of tokens
V = vocabulary = set of types
|V| size of vocabulary
Errors in Tokenization
punctuation, language issues
Lemmatization
Reduce inflections or variant forms to base form
ex. am, are, is --> be
ex. car, cars, car's, cars' --> car
Morphology
The small meaningful units that make up words
Stems: the core meaning-bearing units.
Affixes: bits and pieces that adhere to stems.
Stemming
Reduce terms to their stems in information retrieval.
Maximum Matching Word Segmentation Algorithm
Given a wordlist of Chinese and a string.
1. Start a pointer at the beginning of the string.
2. Find the longest word in dictionary that matches the string starting at pointer.
3. Move the pointer over the word in string.
4. Go to 2.
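The four steps above, sketched in Python. MaxMatch works well for Chinese; the toy English string below (with a made-up wordlist) shows the greedy failure mode:

```python
def max_match(s, wordlist):
    """Greedy left-to-right segmentation (MaxMatch): at each pointer
    position, take the longest dictionary word that matches."""
    words, i = [], 0
    while i < len(s):
        match = s[i]                    # fall back to a single character
        for j in range(len(s), i, -1):  # try the longest candidate first
            if s[i:j] in wordlist:
                match = s[i:j]
                break
        words.append(match)
        i += len(match)                 # move the pointer over the word
    return words

# Greedy matching grabs "theta" instead of "the" + "table":
wordlist = {"the", "theta", "table", "bled", "down", "own", "there"}
assert max_match("thetabledownthere", wordlist) == ["theta", "bled", "own", "there"]
```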
Porter's algorithm
1a.
removes plurals.
sses -> ss. caresses -> caress
ies -> i. ponies -> poni 
ss -> ss. caress -> caress
s -> . cats -> cat

1b. (the stem must contain a vowel)
(*v*)ing -> . walking -> walk
(*v*)ed -> . plastered -> plaster
...
2. for long stems
ational -> ate. relational -> relate
izer -> ize. digitizer -> digitize
ator -> ate operator -> operate
...
3.
al -> . revival -> reviv
able -> . adjustable -> adjust
ate -> . activate -> activ
...
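A minimal sketch of step 1a only, applying the longest-matching suffix rule from the table above (the full Porter algorithm has many more rules and conditions):

```python
def porter_step_1a(word):
    """Porter stemmer step 1a: plural-like suffixes,
    longest suffix first."""
    if word.endswith("sses"):
        return word[:-2]   # sses -> ss
    if word.endswith("ies"):
        return word[:-2]   # ies -> i
    if word.endswith("ss"):
        return word        # ss -> ss (unchanged)
    if word.endswith("s"):
        return word[:-1]   # s -> (deleted)
    return word

assert porter_step_1a("caresses") == "caress"
assert porter_step_1a("ponies") == "poni"
assert porter_step_1a("caress") == "caress"
assert porter_step_1a("cats") == "cat"
```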
Determining if a word is end-of-sentence:
a Decision Tree
1. Case of word with "."
Upper, Lower, Cap, Number
2. Case of word after "."
Upper, Lower, Cap, Number
3. Numeric features
Length of word with "."
Probability (word with "." occurs at eos).
Probability (word after "." occurs at bos).

* A decision tree is just an if-then-else statement.
* The interesting research is choosing the features
* Setting up the structure is often too hard to do by hand. too hard to pick thresholds for numeric features.
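Since a decision tree is just an if-then-else statement, a hand-built one looks like this. The feature tests and the abbreviation list here are illustrative, not learned from data:

```python
ABBREVIATIONS = frozenset({"Dr.", "Mr.", "Mrs.", "Inc."})

def is_end_of_sentence(word_with_period, next_word):
    """Toy hand-built decision tree for the "." decision."""
    if word_with_period in ABBREVIATIONS:
        return False                    # known abbreviation
    if next_word and next_word[0].isdigit():
        return False                    # e.g. inside a number like "3.14"
    if next_word and next_word[0].isupper():
        return True                     # next word is capitalized
    return False

assert is_end_of_sentence("end.", "The")
assert not is_end_of_sentence("Dr.", "Smith")
```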
Determining if a word is end-of-sentence
Decision Trees
Logistic regression
SVM
Neural Nets.

Minimum Edit Distance

Minimum Edit Distance
The minimum edit distance between two strings is the minimum number of editing operations needed to transform one into the other:
* Insertion
* Deletion
* Substitution
Levenshtein Distance
Each operation has a cost of 1.
Alternative: substitutions cost 2 (equivalent to a delete plus an insert); insertions and deletions cost 1.
How to find minimum edit distance
Initial state: the word we're transforming.
Operators: insert, delete, substitute.
Goal state: the word we're trying to get to.
Path cost: what we want to minimize.

Dynamic programming: tabulate the shortest path to each state, since the same states are revisited along many paths.
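The search formulation above reduces to the standard DP table, sketched here with the substitution cost as a parameter:

```python
def min_edit_distance(source, target, sub_cost=2):
    """Dynamic-programming minimum edit distance.
    sub_cost=2 is the variant where a substitution costs as much as
    a delete plus an insert; use sub_cost=1 for unit costs."""
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                     # deletions from source
    for j in range(1, m + 1):
        D[0][j] = j                     # insertions into target
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            D[i][j] = min(
                D[i - 1][j] + 1,                              # delete
                D[i][j - 1] + 1,                              # insert
                D[i - 1][j - 1] + (0 if same else sub_cost),  # substitute
            )
    return D[n][m]

assert min_edit_distance("intention", "execution") == 8
assert min_edit_distance("intention", "execution", sub_cost=1) == 5
```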
Computing Alignments
Edit distance isn't sufficient
* We often need to align each character of the two strings to each other
We do this by keeping a "backtrace"
Every time we enter a cell, remember where we came from
When we reach the end,
* Trace back the path from the upper right corner to read off the alignment
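The backtrace idea, sketched as an extension of the DP: each cell remembers where it came from, and reading the pointers back from the final cell yields the alignment ("*" marks an insertion or deletion):

```python
def align(source, target):
    """Edit-distance DP (substitution cost 2) with a backtrace."""
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0], ptr[i][0] = i, "up"
    for j in range(1, m + 1):
        D[0][j], ptr[0][j] = j, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 2
            # Every time we enter a cell, remember where we came from.
            D[i][j], ptr[i][j] = min(
                (D[i - 1][j - 1] + cost, "diag"),
                (D[i - 1][j] + 1, "up"),
                (D[i][j - 1] + 1, "left"),
            )
    # Trace back from the final cell to read off the alignment.
    i, j, top, bottom = n, m, [], []
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "diag":
            top.append(source[i - 1]); bottom.append(target[j - 1])
            i, j = i - 1, j - 1
        elif move == "up":
            top.append(source[i - 1]); bottom.append("*")  # deletion
            i -= 1
        else:
            top.append("*"); bottom.append(target[j - 1])  # insertion
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))
```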
Levenshtein Distance
Weighted Edit Distance
Spell Correction: some letters are more likely to be mistyped than others
Biology: certain kinds of deletions or insertions are more likely than others.
Alignments in two fields
In Natural Language Processing
 * We generally talk about distance (minimized) and weights
In Computational Biology
 * We generally talk about similarity (maximized) and scores
The Overlap Detection Variant

The Needleman-Wunsch Algorithm
The Smith-Waterman Algorithm (Local Alignment)
Ignore badly aligning regions
Termination:
1. If we want the best local alignment:
F_OPT = max over all i, j of F(i, j)
 Find F_OPT and track back
2. If we want all local alignments scoring > t
 For all i, j find F(i, j) > t, and trace back
Complicated by overlapping local alignments
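A sketch of the Smith-Waterman scoring recurrence: the max with 0 is what clips badly aligning regions, and F_OPT is tracked as the max over all cells. The match/mismatch/gap scores here are illustrative defaults:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Local alignment score (Smith-Waterman), returning F_OPT."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(0,                    # clip negative prefixes
                          F[i - 1][j - 1] + s,  # match/mismatch
                          F[i - 1][j] + gap,    # gap in b
                          F[i][j - 1] + gap)    # gap in a
            best = max(best, F[i][j])           # F_OPT = max_{i,j} F(i, j)
    return best

# The shared substring "GGTT" aligns locally despite unrelated flanks:
assert smith_waterman("AAAAGGTT", "CCGGTTCC") >= 8
```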