NLP

Regex: Disjunctions
Letters
matches any single character listed in the brackets.
Ex. [wW] matches "w" or "W"
Ranges
matches any character in the range, e.g. any capital letter or any digit.
Ex. [A-Z], [0-9]
Negations 
matches anything that is not in the brackets.
Ex. [^Ss]
Pipe
matches either alternative.
Ex. yours|mine
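A quick sketch of these disjunction operators using Python's `re` module (the example strings are illustrative):

```python
import re

# Character class: [wW] matches either "w" or "W".
assert re.search(r"[wW]oodchuck", "Woodchuck spotted")

# Ranges: [A-Z] any capital letter, [0-9] any digit.
assert re.findall(r"[0-9]", "room 42") == ["4", "2"]

# Negation: [^Ss] matches any character that is NOT "S" or "s".
assert re.findall(r"[^Ss]", "Ss!") == ["!"]

# Pipe: matches either full alternative.
assert re.findall(r"yours|mine", "yours and mine") == ["yours", "mine"]
```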
Regex: ? * + .
?
Optional previous character.
Ex. colou?r
*
0 or more of previous character.
Ex. oo*h
+
1 or more of the previous character.
.
any character.
Ex. beg.n --> begin, begun, beg3n.
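The same four operators, sketched with `re.fullmatch` so each pattern must match the whole string:

```python
import re

# ? : optional previous character — matches "color" and "colour".
assert re.fullmatch(r"colou?r", "color")
assert re.fullmatch(r"colou?r", "colour")

# * : zero or more of the previous character.
assert re.fullmatch(r"oo*h", "oh")
assert re.fullmatch(r"oo*h", "oooooh")

# + : one or more of the previous character.
assert re.fullmatch(r"o+h", "oooh")
assert not re.fullmatch(r"o+h", "h")

# . : any single character.
assert all(re.fullmatch(r"beg.n", w) for w in ["begin", "begun", "beg3n"])
```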
Regex: Anchors
^
beginning of line.
Ex. ^[A-Z] --> Palo Alto
$
end of line.
Ex. \.$ --> The end.
\
escapes the character.
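The anchors and escaping in Python's `re`:

```python
import re

# ^ : anchors the match at the beginning of the line.
assert re.match(r"^[A-Z]", "Palo Alto")
assert not re.match(r"^[A-Z]", "palo alto")

# $ : anchors at the end of the line; \. escapes the period
# so it matches a literal "." rather than any character.
assert re.search(r"\.$", "The end.")
assert not re.search(r"\.$", "The end")
```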

Regex

Errors
False positives
Matching strings that we should not have matched.
Ex. there, then, other
False negatives
Not matching strings that we should have matched.
Ex. pattern "the" failing to match "The"
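Both error types show up when searching for "the" in a sample sentence (the sentence is illustrative):

```python
import re

text = "The other one said there is the way."

# Pattern "the" alone: false negative — it misses the capitalized "The".
assert re.findall(r"the", text) == ["the", "the", "the"]  # in other, there, the

# Pattern "[tT]he": fixes the false negative but introduces false
# positives — it also matches inside "other" and "there".
assert len(re.findall(r"[tT]he", text)) == 4

# Word boundaries \b remove those false positives.
assert re.findall(r"\b[tT]he\b", text) == ["The", "the"]
```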

Word Tokenization

Segmenting/tokenizing
Lemma
same stem, part of speech, rough word sense.
Ex. cat and cats = same lemma
Wordform
the full inflected surface form.
Ex. cat and cats = different wordforms
Type
an element of the vocabulary.
Token
an instance of that type in running text.
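The type/token distinction in a few lines, using naive whitespace tokenization (the sentence is illustrative):

```python
# Tokens are running instances; types are the distinct vocabulary items.
text = "the cat sat on the mat"
tokens = text.split()      # naive whitespace tokenization
types = set(tokens)        # V, the set of types

N = len(tokens)            # N = number of tokens
V_size = len(types)        # |V| = size of vocabulary

assert N == 6
assert V_size == 5         # "the" occurs twice but is one type
```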
Normalizing word formats
Information Retrieval: indexed text & query terms have same form.
reduce all letters to lower case, since users tend to use lower case.

ML:
for sentiment analysis, case is helpful

We implicitly define equivalence classes of terms.

Alternative: asymmetric expansion.
Segmenting sentences
!, ? are relatively easy.
Period "." not so much.

Build a binary classifier.
Looks at a ".", decides EndOfSentence/NotEndOfSentence.
N = number of tokens
V = vocabulary = set of types
|V| size of vocabulary
Errors in Tokenization
punctuation, language issues
Lemmatization
Reduce inflections or variant forms to base form
ex. am, are, is --> be
ex. car, cars, car's, cars' --> car
Morphology
The small meaningful units that make up words
Stems: the core meaning-bearing units.
Affixes: bits and pieces that adhere to stems.
Stemming
Reduce terms to their stems in information retrieval.
Maximum Matching Word Segmentation Algorithm
Given a wordlist of Chinese and a string.
1. Start a pointer at the beginning of the string.
2. Find the longest word in dictionary that matches the string starting at pointer.
3. Move the pointer over the word in string.
4. Go to 2.
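The four steps above, sketched in Python. MaxMatch works well for Chinese; the toy English string below (with a made-up wordlist) shows the greedy failure mode:

```python
def max_match(s, wordlist):
    """Greedy left-to-right segmentation (MaxMatch): at each pointer
    position, take the longest dictionary word that matches."""
    words, i = [], 0
    while i < len(s):
        match = s[i]                    # fall back to a single character
        for j in range(len(s), i, -1):  # try the longest candidate first
            if s[i:j] in wordlist:
                match = s[i:j]
                break
        words.append(match)
        i += len(match)                 # move the pointer over the word
    return words

# Greedy matching grabs "theta" instead of "the" + "table":
wordlist = {"the", "theta", "table", "bled", "down", "own", "there"}
assert max_match("thetabledownthere", wordlist) == ["theta", "bled", "own", "there"]
```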
Porter's algorithm
1a.
removes plurals.
sses -> ss. caresses -> caress
ies -> i. ponies -> poni 
ss -> ss. caress -> caress
s -> . cats -> cat

1b. (the stem must contain a vowel)
(*v*)ing -> . walking -> walk
(*v*)ed -> . plastered -> plaster
...
2. for long stems
ational -> ate. relational -> relate
izer -> ize. digitizer -> digitize
ator -> ate operator -> operate
...
3.
al -> . revival -> reviv
able -> . adjustable -> adjust
ate -> . activate -> activ
...
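A minimal sketch of step 1a only, applying the longest-matching suffix rule from the table above (the full Porter algorithm has many more rules and conditions):

```python
def porter_step_1a(word):
    """Porter stemmer step 1a: plural-like suffixes,
    longest suffix first."""
    if word.endswith("sses"):
        return word[:-2]   # sses -> ss
    if word.endswith("ies"):
        return word[:-2]   # ies -> i
    if word.endswith("ss"):
        return word        # ss -> ss (unchanged)
    if word.endswith("s"):
        return word[:-1]   # s -> (deleted)
    return word

assert porter_step_1a("caresses") == "caress"
assert porter_step_1a("ponies") == "poni"
assert porter_step_1a("caress") == "caress"
assert porter_step_1a("cats") == "cat"
```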
Determining if a word is end-of-sentence:
a Decision Tree
1. Case of word with "."
Upper, Lower, Cap, Number
2. Case of word after "."
Upper, Lower, Cap, Number
3. Numeric features
Length of word with "."
Probability (word with "." occurs at eos).
Probability (word after "." occurs at bos).

* A decision tree is just an if-then-else statement.
* The interesting research is choosing the features
* Setting up the structure is often too hard to do by hand. too hard to pick thresholds for numeric features.
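Since a decision tree is just an if-then-else statement, a hand-built one looks like this. The feature tests and the abbreviation list here are illustrative, not learned from data:

```python
ABBREVIATIONS = frozenset({"Dr.", "Mr.", "Mrs.", "Inc."})

def is_end_of_sentence(word_with_period, next_word):
    """Toy hand-built decision tree for the "." decision."""
    if word_with_period in ABBREVIATIONS:
        return False                    # known abbreviation
    if next_word and next_word[0].isdigit():
        return False                    # e.g. inside a number like "3.14"
    if next_word and next_word[0].isupper():
        return True                     # next word is capitalized
    return False

assert is_end_of_sentence("end.", "The")
assert not is_end_of_sentence("Dr.", "Smith")
```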
Determining if a word is end-of-sentence
Decision Trees
Logistic regression
SVM
Neural Nets.

Minimum Edit Distance

Minimum Edit Distance
The minimum edit distance between two strings is the minimum number of editing operations needed to transform one into the other:
* Insertion
* Deletion
* Substitution
Levenshtein Distance
Each operation has a cost of 1.
Alternative: substitutions cost 2 (equivalent to a delete plus an insert); insertions and deletions cost 1.
How to find minimum edit distance
Initial state: the word we're transforming.
Operators: insert, delete, substitute.
Goal state: the word we're trying to get to.
Path cost: what we want to minimize.

Dynamic programming: tabulate the shortest path to each state, since the same states are revisited along many paths.
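The search formulation above reduces to the standard DP table, sketched here with the substitution cost as a parameter:

```python
def min_edit_distance(source, target, sub_cost=2):
    """Dynamic-programming minimum edit distance.
    sub_cost=2 is the variant where a substitution costs as much as
    a delete plus an insert; use sub_cost=1 for unit costs."""
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                     # deletions from source
    for j in range(1, m + 1):
        D[0][j] = j                     # insertions into target
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            D[i][j] = min(
                D[i - 1][j] + 1,                              # delete
                D[i][j - 1] + 1,                              # insert
                D[i - 1][j - 1] + (0 if same else sub_cost),  # substitute
            )
    return D[n][m]

assert min_edit_distance("intention", "execution") == 8
assert min_edit_distance("intention", "execution", sub_cost=1) == 5
```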
Computing Alignments
Edit distance isn't sufficient
* We often need to align each character of the two strings to each other
We do this by keeping a "backtrace"
Every time we enter a cell, remember where we came from
When we reach the end,
* Trace back the path from the upper right corner to read off the alignment
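The backtrace idea, sketched as an extension of the DP: each cell remembers where it came from, and reading the pointers back from the final cell yields the alignment ("*" marks an insertion or deletion):

```python
def align(source, target):
    """Edit-distance DP (substitution cost 2) with a backtrace."""
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0], ptr[i][0] = i, "up"
    for j in range(1, m + 1):
        D[0][j], ptr[0][j] = j, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 2
            # Every time we enter a cell, remember where we came from.
            D[i][j], ptr[i][j] = min(
                (D[i - 1][j - 1] + cost, "diag"),
                (D[i - 1][j] + 1, "up"),
                (D[i][j - 1] + 1, "left"),
            )
    # Trace back from the final cell to read off the alignment.
    i, j, top, bottom = n, m, [], []
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "diag":
            top.append(source[i - 1]); bottom.append(target[j - 1])
            i, j = i - 1, j - 1
        elif move == "up":
            top.append(source[i - 1]); bottom.append("*")  # deletion
            i -= 1
        else:
            top.append("*"); bottom.append(target[j - 1])  # insertion
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))
```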
Levenshtein Distance
Weighted Edit Distance
Spell Correction: some letters are more likely to be mistyped than others
Biology: certain kinds of deletions or insertions are more likely than others.
Alignments in two fields
In Natural Language Processing
 * We generally talk about distance (minimized) and weights
In Computational Biology
 * We generally talk about similarity (maximized) and scores
The Overlap Detection Variant

The Needleman-Wunsch Algorithm
The Smith-Waterman Algorithm (Local Alignment)
Ignore badly aligning regions
Termination:
1. If we want the best local alignment:
F_OPT = max over all i, j of F(i, j)
 Find F_OPT and track back
2. If we want all local alignments scoring > t
 For all i, j find F(i, j) > t, and trace back
Complicated by overlapping local alignments
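A sketch of the Smith-Waterman scoring recurrence: the max with 0 is what clips badly aligning regions, and F_OPT is tracked as the max over all cells. The match/mismatch/gap scores here are illustrative defaults:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Local alignment score (Smith-Waterman), returning F_OPT."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(0,                    # clip negative prefixes
                          F[i - 1][j - 1] + s,  # match/mismatch
                          F[i - 1][j] + gap,    # gap in b
                          F[i][j - 1] + gap)    # gap in a
            best = max(best, F[i][j])           # F_OPT = max_{i,j} F(i, j)
    return best

# The shared substring "GGTT" aligns locally despite unrelated flanks:
assert smith_waterman("AAAAGGTT", "CCGGTTCC") >= 8
```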