Life

Learning

球球
- Learn about trees and birds with 球球
- Passport
DS Tech
- Streamlit dashboard
NLP
Description
Deep Learning
- Learn: StatQuest, Book
- Implementation: PyTorch
Neural Network Basics
- Structure: Input -> Hidden Layers with Nodes -> Outputs
- Nodes: each node applies weights and a bias term to the inputs from the previous layer; the weighted inputs are summed before being passed on to the next layer
- Activation functions (ReLU, sigmoid, softplus, etc.) are the nonlinear building blocks from which the final function is constructed
- Weight optimization is done by computing the derivative (i.e. gradient) of the cost / loss function with respect to each parameter and moving in the direction that minimizes it (i.e. gradient descent)
- Backpropagation is an algorithm to calculate the gradient efficiently without needing to visit all the nodes over and over again
- Common cost / loss functions: SSR, MSE, cross entropy (negative log likelihood)
- Outputs: classification tasks usually use Softmax to turn the output scores into class probabilities
- Usually fully connected, meaning each node in a layer is connected to every node in the next layer (minimal PyTorch sketch after this list)
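A minimal PyTorch sketch of the points above (fully connected layers, ReLU activation, cross-entropy loss, one backprop + gradient-descent step); the layer sizes and the random data are made up for illustration:

    import torch
    import torch.nn as nn

    # Toy fully connected network: 4 input features -> 8 hidden nodes -> 3 classes
    model = nn.Sequential(
        nn.Linear(4, 8),   # weights + bias map inputs to hidden nodes
        nn.ReLU(),         # nonlinear activation
        nn.Linear(8, 3),   # hidden nodes -> raw class scores (logits)
    )

    loss_fn = nn.CrossEntropyLoss()          # softmax + negative log likelihood
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    X = torch.randn(16, 4)                   # fake batch of 16 examples
    y = torch.randint(0, 3, (16,))           # fake class labels

    optimizer.zero_grad()
    loss = loss_fn(model(X), y)              # forward pass + loss
    loss.backward()                          # backpropagation computes all gradients
    optimizer.step()                         # gradient-descent update on weights and biases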
Convolutional Neural Network -> Grid Data
- Basic idea: a convolution operation (also known as a kernel or filter) is used in one or more layers. Benefits: computationally cheaper (sparse connectivity; parameter sharing, since a filter's weights are reused across positions), and learned representations such as edge and shape detectors can be reused (parameter sharing, equivariant representations)
- Convolutional layer structure: convolution stage (linear activation), applying the filter to the inputs, shifted by a number of pixels determined by the stride, with padding used to preserve output size [(W - F + 2P)/S + 1] --> detector stage (nonlinear activation) --> pooling (summary statistics of nearby outputs) [(W - F)/S + 1]. Pooling makes the representation invariant to small changes in the input (see the output-size helper after this list).
- After pooling, the output matrix is flattened into a vector for a fully connected layer
- Resources: Stanford / PyTorch tutorial
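A tiny helper (hypothetical, not from any library) for the output-size formulas above, assuming square inputs and filters:

    def conv_output_size(w, f, s=1, p=0):
        # (W - F + 2P) / S + 1 for a convolution; use p=0 for the pooling formula (W - F) / S + 1
        return (w - f + 2 * p) // s + 1

    print(conv_output_size(32, 5, s=1, p=0))  # 28: 5x5 filter, stride 1, no padding
    print(conv_output_size(28, 2, s=2))       # 14: 2x2 pooling with stride 2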
Representation / Embedding
- Idea: embed tokens in a way such that variation (similarity and difference in meaning) is captured (see the nn.Embedding sketch after this list)
- Global: Word2vec, GloVe
- Context-based
- Character-based vs. word-based
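A minimal sketch of a learned embedding table using PyTorch's nn.Embedding; the vocabulary size, dimension, and token ids are arbitrary placeholders:

    import torch
    import torch.nn as nn

    vocab_size, embed_dim = 1000, 50
    embedding = nn.Embedding(vocab_size, embed_dim)   # one trainable vector per token id

    token_ids = torch.tensor([3, 17, 256])            # fake token ids
    vectors = embedding(token_ids)                    # shape: (3, 50)

    # Cosine similarity between two token vectors, the usual way embeddings are compared
    sim = torch.cosine_similarity(vectors[0], vectors[1], dim=0)
    print(vectors.shape, sim.item())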
Experimentation & Inference
- ML vs. Economics: prediction / classification vs. parameter estimation (causal inference). ML is data driven and minimizes a loss; Econ is theory driven and estimates counterfactuals (Athey paper)
- Causal inference techniques: Double ML (residual-on-residual sketch after this list), structural methods, CausalTree (good for understanding HTE)
- Systematic study materials
Textbook: Mixtape
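A minimal sketch of the Double ML idea (cross-fitted residual-on-residual regression) using scikit-learn; the simulated data and the random-forest nuisance models are my own assumptions for illustration:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_predict

    rng = np.random.default_rng(0)
    n = 2000
    X = rng.normal(size=(n, 5))                   # confounders
    T = X[:, 0] + rng.normal(size=n)              # treatment depends on X
    Y = 2.0 * T + X[:, 0] + rng.normal(size=n)    # true effect of T on Y is 2.0

    # Partial out X from both T and Y with cross-fitted predictions
    t_hat = cross_val_predict(RandomForestRegressor(n_estimators=100), X, T, cv=5)
    y_hat = cross_val_predict(RandomForestRegressor(n_estimators=100), X, Y, cv=5)
    t_res, y_res = T - t_hat, Y - y_hat

    # Regress residualized outcome on residualized treatment -> debiased effect estimate
    theta = np.sum(t_res * y_res) / np.sum(t_res ** 2)
    print(theta)   # should be close to 2.0

Cross-fitting (predicting each fold with models trained on the other folds) is what keeps the nuisance-model overfitting from leaking into the effect estimate.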
CNN Illustration 
Spatial Dimension details
- C1: W=32, F=5, S=1, P=0, Output Dim = (32 - 5 + 0) / 1 + 1 = 28
- S2: W=28, F=2, S=2, Output Dim = (28 - 2) / 2 + 1 = 14
- C3: W=14, F=5, S=1, P=0, Output Dim = (14 - 5 + 0) / 1 + 1 = 10
- S4: W=10, F=2, S=2, Output Dim = (10 - 2)/2 + 1 = 5
- Number of input features for FC = 16 * 5 * 5 = 400 (verified in the sketch below)
- Read more here
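The dimensions above can be checked directly in PyTorch; the channel counts (6 and 16) are assumptions consistent with a LeNet-5-style network, which the 16 * 5 * 5 figure suggests:

    import torch
    import torch.nn as nn

    x = torch.randn(1, 1, 32, 32)                      # one 32x32 single-channel image
    c1 = nn.Conv2d(1, 6, kernel_size=5)(x)             # -> (1, 6, 28, 28)
    s2 = nn.MaxPool2d(kernel_size=2, stride=2)(c1)     # -> (1, 6, 14, 14)
    c3 = nn.Conv2d(6, 16, kernel_size=5)(s2)           # -> (1, 16, 10, 10)
    s4 = nn.MaxPool2d(kernel_size=2, stride=2)(c3)     # -> (1, 16, 5, 5)
    print(s4.flatten(1).shape)                         # (1, 400) = 16 * 5 * 5 features for the FC layer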
Recurrent and Recursive Nets -> Sequential Data
- Why: good for sequential data; input and output dimensions do not need to match
- Challenges: computationally heavy (if too much information must be remembered), vanishing gradients
- Other structures: bidirectional RNNs
- Solutions: Gated Recurrent Unit (GRU), LSTM, Attention (limit context to nearby vectors through weights), Transformer (see the LSTM sketch below)
- Resource: Amidi from Stanford
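A minimal PyTorch LSTM sketch for sequential data; all sizes are arbitrary, and the last hidden state feeding a linear head illustrates that input length and output dimension need not match:

    import torch
    import torch.nn as nn

    batch, seq_len, n_features, hidden = 8, 20, 10, 32
    lstm = nn.LSTM(input_size=n_features, hidden_size=hidden, batch_first=True)
    head = nn.Linear(hidden, 1)                    # e.g. one prediction per sequence

    x = torch.randn(batch, seq_len, n_features)    # fake sequences
    outputs, (h_n, c_n) = lstm(x)                  # gates inside the LSTM mitigate vanishing gradients
    pred = head(h_n[-1])                           # last hidden state -> output; shape (8, 1)
    print(pred.shape)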

Work

Career Development
- Revise resume
- Research companies - media companies, tech companies, other

Ideas
- Food blogger (Kuaishou; master one platform and the rest follow)
Basic RNN Structure
DAG / Causal Graphs
Graph Types
- Chain:  T --> W --> Y
- Fork:  T <-- W --> Y
- Collider: T --> W <-- Y
d-separation: T and Y are d-separated by X if all paths between T and Y are blocked by X; when every backdoor path is blocked, the remaining association between T and Y is causal
Blocking backdoor paths establishes causal relations; achieve this by conditioning on forks (confounders) and not conditioning on colliders (simulation below)
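A small simulation of the fork and collider cases above, with a made-up data-generating process: conditioning on a fork (confounder) blocks the spurious path, while conditioning on a collider opens one:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    # Fork: T <-- W --> Y (T and Y are only associated through W)
    W = rng.normal(size=n)
    T = W + rng.normal(size=n)
    Y = W + rng.normal(size=n)
    print(np.corrcoef(T, Y)[0, 1])                 # non-zero: open backdoor path
    mask = np.abs(W) < 0.1                         # crude "conditioning on W"
    print(np.corrcoef(T[mask], Y[mask])[0, 1])     # ~0: path blocked

    # Collider: T --> W <-- Y (T and Y start out independent)
    T2 = rng.normal(size=n)
    Y2 = rng.normal(size=n)
    W2 = T2 + Y2 + rng.normal(size=n)
    print(np.corrcoef(T2, Y2)[0, 1])               # ~0: collider blocks the path
    mask2 = W2 > 1                                 # conditioning on the collider
    print(np.corrcoef(T2[mask2], Y2[mask2])[0, 1]) # negative: spurious association opened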
Causal Estimation Methods
- Concepts: Exchangeability (same average outcome even if subjects are swapped; treatment is exogenous); Positivity (probability of receiving treatment > 0); Common support (each stratum contains data for both treatment and control); ATE (Average Treatment Effect) combines (Y1|D=1 - Y0|D=1) and (Y1|D=0 - Y0|D=0); Conditional ATE (CATE); Local ATE (LATE), also called the complier average causal effect; ATT (Average Treatment Effect on the Treated) only considers (Y1|D=1 - Y0|D=1). ATT and ATE are the same in RCTs because Y0|D=1 equals Y0|D=0, and likewise for Y1 (i.e. baselines would be the same, outcomes would be the same) (explanation; simulation below)
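A toy potential-outcomes simulation (assumed data-generating process) showing why ATT can differ from ATE under confounded assignment but matches it under randomization:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    # Potential outcomes: the treatment effect tau varies across units
    Y0 = rng.normal(size=n)
    tau = rng.normal(loc=1.0, scale=0.5, size=n)
    Y1 = Y0 + tau
    D_confounded = (tau + rng.normal(size=n) > 1.0).astype(int)  # treatment targets high-effect units
    D_random = rng.integers(0, 2, size=n)                        # RCT-style assignment

    print(tau.mean())                        # ATE over everyone (~1.0)
    print(tau[D_confounded == 1].mean())     # ATT under confounded assignment (> ATE here)
    print(tau[D_random == 1].mean())         # ATT under randomization (~= ATE)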

- Estimation Methods
-- IP Weighting: models treatment P(A=a|L) and then computes the outcome in the reweighted pseudo-population
-- Standardization: models outcome E(Y|A=a, L=l) directly e.g. regularized regressions
-- Matching: exact match, distance based matching - bias correction methods can be used (ch5.3.2), propensity score based matching (ps + nn), coarsened exact matching
-- Instrumental Variables (IV) and 2SLS: estimates LATE
-- Regression Discontinuity (RDD): does not coexist with matching, estimates LATE
-- Diff-in-diff (DD): estimates ATT, requires parallel trend assumption
-- Synthetic control: e.g. forecasting to create a synthetic trend

-- Doubly robust estimator: combines IPW and standardization using a canonical link (AIPW sketch below)
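One common doubly robust form is the AIPW estimator; a sketch on simulated data, where either a correct propensity model or a correct outcome model is enough for consistency:

    import numpy as np
    from sklearn.linear_model import LinearRegression, LogisticRegression

    rng = np.random.default_rng(1)
    n = 20_000
    X = rng.normal(size=(n, 3))
    A = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
    Y = 1.5 * A + X[:, 0] + rng.normal(size=n)                   # true effect is 1.5

    e = LogisticRegression().fit(X, A).predict_proba(X)[:, 1]    # propensity model
    m1 = LinearRegression().fit(X[A == 1], Y[A == 1]).predict(X) # outcome model, treated
    m0 = LinearRegression().fit(X[A == 0], Y[A == 0]).predict(X) # outcome model, control

    # AIPW: outcome-model prediction plus an IP-weighted correction for its residual
    mu1 = np.mean(A * (Y - m1) / e + m1)
    mu0 = np.mean((1 - A) * (Y - m0) / (1 - e) + m0)
    print(mu1 - mu0)   # close to 1.5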

- Inverse Probability Weighting (IPW): IP weighting creates a pseudo-population, so we get two copies of the data for each individual under condition L: one receives treatment A and the other no treatment. Inverse probability weighting proportionately "bumps up" the under-represented arm within a condition L (Chapter 2.1). Graphically, IP weighting breaks the link between condition L and treatment A, so that the association between A and Y is causal. The key is that the propensity score model should be a good representation of A given L. Can use a robust variance estimator (GEE) to compute the average treatment effect and its confidence interval. IP weights: W = 1/f(A|L); stabilized IP weights: SW = f(A)/f(A|L), which can give narrower confidence intervals for non-saturated models (sketch below)
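A minimal IP-weighting sketch with a logistic-regression propensity model; the simulated data and the choice of model are assumptions:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n = 20_000
    L = rng.normal(size=(n, 3))                              # measured confounders
    A = rng.binomial(1, 1 / (1 + np.exp(-L[:, 0])))          # treatment depends on L
    Y = 1.5 * A + L[:, 0] + rng.normal(size=n)               # true effect is 1.5

    ps = LogisticRegression().fit(L, A).predict_proba(L)[:, 1]   # propensity score f(A=1|L)
    w = np.where(A == 1, 1 / ps, 1 / (1 - ps))                   # IP weights W = 1/f(A|L)

    # Weighted difference in means in the pseudo-population
    ate = np.average(Y[A == 1], weights=w[A == 1]) - np.average(Y[A == 0], weights=w[A == 0])
    print(ate)   # close to 1.5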
Linear regression: a type of standardization. When does it fail? When the relationship is non-linear
X-Learner: first, build one outcome model each for the treatment and control groups; then, impute the treatment effect for each observation using the opposite group's model (e.g. for a treated unit, observed Y1 minus the prediction from the control-group model); finally, build a model on the imputed treatment effects in each arm and combine the two estimates using propensity-score weights (sketch below)
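A minimal X-learner sketch following the three steps above; the simulated data and the random-forest base learners are arbitrary choices:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(2)
    n = 10_000
    X = rng.normal(size=(n, 3))
    D = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))
    Y = (1.0 + X[:, 1]) * D + X[:, 0] + rng.normal(size=n)    # heterogeneous effect 1 + X1

    # Step 1: one outcome model per arm
    m1 = RandomForestRegressor(n_estimators=100).fit(X[D == 1], Y[D == 1])
    m0 = RandomForestRegressor(n_estimators=100).fit(X[D == 0], Y[D == 0])

    # Step 2: impute individual effects using the opposite arm's model
    d1 = Y[D == 1] - m0.predict(X[D == 1])    # treated: observed Y1 minus predicted Y0
    d0 = m1.predict(X[D == 0]) - Y[D == 0]    # control: predicted Y1 minus observed Y0

    # Step 3: model the imputed effects, then blend the arms with the propensity score g(x)
    tau1 = RandomForestRegressor(n_estimators=100).fit(X[D == 1], d1)
    tau0 = RandomForestRegressor(n_estimators=100).fit(X[D == 0], d0)
    g = LogisticRegression().fit(X, D).predict_proba(X)[:, 1]
    cate = g * tau0.predict(X) + (1 - g) * tau1.predict(X)
    print(cate.mean())   # averages to roughly 1.0 here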
- Instrumental Variable: Z -> A -> Y, with U -> A and U -> Y (U unmeasured); Wald estimator COV(Y,Z)/COV(A,Z), or (E(Y|Z=1) - E(Y|Z=0)) / (E(A|Z=1) - E(A|Z=0)); estimates the complier average causal effect, or LATE (excluding defiers, always-takers, and never-takers); STRONG LIMITATIONS from Hernán and Robins: "...standard IV estimation is better reserved for settings with lots of unmeasured confounding, a truly dichotomous and time-fixed treatment, a strong and causal proposed instrument, and in which either effect homogeneity is expected to hold, or one is genuinely interested in the effect in the compliers and monotonicity is expected to hold." (sketch below)
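A minimal Wald-estimator sketch with a binary instrument Z and simulated (assumed) confounding and compliance:

    import numpy as np

    rng = np.random.default_rng(3)
    n = 50_000
    U = rng.normal(size=n)                               # unmeasured confounder
    Z = rng.binomial(1, 0.5, size=n)                     # instrument, e.g. randomized encouragement
    p_take = np.clip(0.1 + 0.5 * Z + 0.3 * (U > 0), 0, 1)
    A = rng.binomial(1, p_take)                          # treatment uptake depends on Z and on U
    Y = 2.0 * A + U + rng.normal(size=n)                 # true effect is 2.0 (homogeneous, so LATE = 2.0)

    # Wald estimator: (E[Y|Z=1] - E[Y|Z=0]) / (E[A|Z=1] - E[A|Z=0])
    late = (Y[Z == 1].mean() - Y[Z == 0].mean()) / (A[Z == 1].mean() - A[Z == 0].mean())
    print(late)   # close to 2.0; a naive treated-vs-untreated comparison would be biased by U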


RDD
IV
Causality vs. Association