
RNN model and NLP summary

Data processing

  1. one-hot encoding

Processing Text Data

  1. Tokenization (text to words)

  2. count word frequencies

    • The counts are stored in a hash table.
    • Sort the table so that the frequencies are in descending order.
    • Replace "frequency" by "index" (starting from 1).
    • The number of unique words is called the "vocabulary" (a sketch of this procedure follows the list).
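A minimal Python sketch of this bookkeeping (the toy corpus and variable names are illustrative, not from the original notes):

```python
from collections import Counter

# Tokenized corpus: a list of word lists (toy data for illustration).
corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat"]]

# Count word frequencies; a dict/Counter is the hash table.
freq = Counter(word for sentence in corpus for word in sentence)

# Sort so that the frequencies are in descending order.
sorted_words = sorted(freq.items(), key=lambda kv: kv[1], reverse=True)

# Replace "frequency" by "index" (starting from 1).
word_index = {word: i for i, (word, _) in enumerate(sorted_words, start=1)}

# The number of unique words is the vocabulary.
vocabulary = len(word_index)
print(word_index)    # e.g. {'the': 1, 'sat': 2, 'cat': 3, ...}
print(vocabulary)    # 6
```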

Text Processing and Word Embedding

Text processing

  1. tokenization

    Considerations in tokenization:

    • Upper case to lower case. ("Apple" to "apple"?)
    • Remove stop words, e.g., "the", "a", "of", etc.
    • Typo correction. ("goood" to "good".)
  2. build the dictionary

  3. align sequences (make every sequence the same length; see the Keras sketch below)
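The same three steps can be done with the Keras preprocessing utilities; a sketch under assumed placeholder values (vocabulary, word_num, and the toy texts are not from the original notes):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

vocabulary = 10000   # keep only the most frequent words (placeholder)
word_num = 500       # target sequence length (placeholder)
texts = ["This movie is great", "Boring and far too long"]   # toy data

# 1. Tokenization and 2. dictionary building (word -> index, by frequency).
tokenizer = Tokenizer(num_words=vocabulary)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

# 3. Align sequences: pad short ones / truncate long ones to word_num.
x = pad_sequences(sequences, maxlen=word_num)
print(x.shape)   # (number_of_texts, word_num)
```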

Word embedding

  1. One-hot encoding

    • First, represent words using one-hot vectors.
      • Suppose the dictionary contains v unique words (vocabulary = v).
      • Then the one-hot vectors are v-dimensional.
  2. word embedding

    The embedding dimension d is chosen by the user (a Keras sketch follows).
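A sketch of the embedding layer in Keras (all sizes are placeholders). The layer holds a vocabulary × d parameter matrix, so each word index is mapped to a d-dimensional vector:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding

vocabulary = 10000   # number of unique words (placeholder)
d = 8                # embedding dimension, chosen by the user (placeholder)
word_num = 500       # aligned sequence length (placeholder)

model = Sequential()
model.add(Embedding(vocabulary, d, input_length=word_num))
model.summary()      # the embedding layer has vocabulary * d parameters
```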

Recurrent Neural Networks (RNNs)

  1. h0 contains the information of "the";

    h1 contains the information of "the cat";

    h2 contains the information of "the cat sat";

    ...

    ht contains the information of the whole sentence.

  2. The entire RNN has only one parameter matrix A, no matter how long the chain is.

Simple RNN

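The Simple RNN update can be written as follows, where [h_{t-1}, x_t] denotes the concatenation of the previous state and the current input, and A is the single shared parameter matrix (a bias term may also be added):

$$
h_t = \tanh\big(A \cdot [\,h_{t-1},\; x_t\,]\big)
$$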

Simple RNN for IMDB Review


return_sequences = False: the RNN returns only the last state h_t (see the Keras sketch below).

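A minimal Keras sketch of this IMDB sentiment model; the layer sizes are placeholders rather than the exact values from the slides:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

vocabulary = 10000    # placeholder
embedding_dim = 32    # placeholder
word_num = 500        # placeholder
state_dim = 32        # placeholder

model = Sequential()
model.add(Embedding(vocabulary, embedding_dim, input_length=word_num))
# return_sequences=False: keep only the last state h_t.
model.add(SimpleRNN(state_dim, return_sequences=False))
model.add(Dense(1, activation='sigmoid'))   # positive vs. negative review
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()
```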

Shortcomings of Simple RNN

  1. simple RNN is good at short-term dependence

  2. simple RNN is bad at long-term dependence


Summary

  • An RNN reads one word at a time; the state h_t accumulates the information of everything seen so far, and the whole chain shares a single parameter matrix A.
  • Simple RNN handles short-term dependence well but largely forgets long-range information, so it works best on short sequences.

Long Short Term Memory (LSTM)

LSTM Model


  • Conveyor belt: the past information directly flows to the future

Forget gate

  • Forget gate (f): a vector (the same shape as the state h and the cell state c).

    • A value of zero means "let nothing through".
    • A value of one means "let everything through!"
    • It acts elementwise on the previous cell state.
  • How f is computed (see the formula below):
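The forget gate is computed from the previous state and the current input; σ is the element-wise sigmoid and W_f is the gate's own parameter matrix:

$$
f_t = \sigma\big(W_f \cdot [\,h_{t-1},\; x_t\,]\big)
$$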

Input gate

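The input gate has the same form as the forget gate, with its own parameter matrix W_i:

$$
i_t = \sigma\big(W_i \cdot [\,h_{t-1},\; x_t\,]\big)
$$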

New value

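The new value (the candidate content to be written onto the conveyor belt) uses tanh instead of the sigmoid, and the cell state is updated by combining it with the forget and input gates (∘ denotes element-wise multiplication):

$$
\tilde{C}_t = \tanh\big(W_c \cdot [\,h_{t-1},\; x_t\,]\big), \qquad
C_t = f_t \circ C_{t-1} + i_t \circ \tilde{C}_t
$$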

Output gate

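The output gate decides how much of the cell state is exposed as the new hidden state:

$$
o_t = \sigma\big(W_o \cdot [\,h_{t-1},\; x_t\,]\big), \qquad
h_t = o_t \circ \tanh(C_t)
$$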

LSTM: Number of parameters

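Each of the four parts (forget gate, input gate, new value, output gate) has its own parameter matrix of shape shape(h) × (shape(h) + shape(x)), so, ignoring bias terms:

$$
\#\text{parameters} = 4 \times \text{shape}(h) \times \big(\text{shape}(h) + \text{shape}(x)\big)
$$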

LSTM Using Keras

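A sketch of the LSTM version of the IMDB model; compared with the SimpleRNN sketch above only the recurrent layer changes (all sizes are placeholders):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocabulary, embedding_dim, word_num, state_dim = 10000, 32, 500, 32  # placeholders

model = Sequential()
model.add(Embedding(vocabulary, embedding_dim, input_length=word_num))
model.add(LSTM(state_dim, return_sequences=False, dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()
```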

Summary

  • LSTM adds a conveyor belt (the cell state) so that past information can flow directly to the future.
  • The forget, input, and output gates control what is erased from, written to, and read out of the cell state.
  • Each of the four parts has its own parameter matrix, so an LSTM has about four times as many parameters as a Simple RNN of the same state size.

Making RNNs More Effective

Stacked RNN

RNN layers can be stacked like dense or convolutional layers: the sequence of states produced by one RNN layer serves as the input sequence of the layer above it.

Stacked LSTM

In a stacked LSTM, every LSTM layer below the top one must set return_sequences = True so that the layer above receives the whole sequence of states; the top layer can use return_sequences = False when it is followed by a classifier (see the sketch below).

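A sketch of a three-layer stacked LSTM in Keras (the depth and sizes are placeholders); the lower layers return full sequences so that the layer above gets one input per time step:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocabulary, embedding_dim, word_num, state_dim = 10000, 32, 500, 32  # placeholders

model = Sequential()
model.add(Embedding(vocabulary, embedding_dim, input_length=word_num))
model.add(LSTM(state_dim, return_sequences=True))    # feeds a sequence upward
model.add(LSTM(state_dim, return_sequences=True))    # feeds a sequence upward
model.add(LSTM(state_dim, return_sequences=False))   # top layer: only the last state
model.add(Dense(1, activation='sigmoid'))
model.summary()
```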
Bidirectional RNN

A bidirectional RNN runs two chains over the sequence, one from left to right and one from right to left, and concatenates their states. If there is no upper RNN layer, then return only the final states of the two chains (their concatenation); the intermediate outputs can be discarded.
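A sketch of a bidirectional LSTM in Keras; the Bidirectional wrapper runs the two chains and concatenates their states (all sizes are placeholders):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Bidirectional, Dense

vocabulary, embedding_dim, word_num, state_dim = 10000, 32, 500, 32  # placeholders

model = Sequential()
model.add(Embedding(vocabulary, embedding_dim, input_length=word_num))
# With return_sequences=False and no RNN layer above, only the two chains'
# final states are kept (concatenated, so the output size is 2 * state_dim).
model.add(Bidirectional(LSTM(state_dim, return_sequences=False)))
model.add(Dense(1, activation='sigmoid'))
model.summary()
```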

Pretraining


Step 1: Train a model on a large dataset.

  • Perhaps a different problem.
  • Perhaps a different model.

Step 2: Keep only the embedding layer.

Step 3: Train the rest of the network on your own (small) dataset, keeping the pretrained embedding layer fixed (a Keras sketch follows).
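A sketch of Steps 2 and 3 in Keras, assuming a pretrained embedding matrix is already available (the file name, sizes, and upper layers are placeholders): the pretrained weights are loaded into the embedding layer, the layer is frozen, and only the layers above it are trained.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocabulary, d, word_num, state_dim = 10000, 32, 500, 32   # placeholders
# Hypothetical file holding a (vocabulary, d) matrix from the pretrained model.
embedding_matrix = np.load('embedding_matrix.npy')

model = Sequential()
model.add(Embedding(vocabulary, d, input_length=word_num,
                    weights=[embedding_matrix],   # Step 2: reuse the embedding layer
                    trainable=False))             # keep it fixed during training
model.add(LSTM(state_dim))                        # Step 3: train the rest on your data
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
```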

Summary

  • SimpleRNN and LSTM are two kinds of RNNs; always use LSTM instead of SimpleRNN.
  • Use Bi-RNN instead of RNN whenever possible.
  • Stacked RNN may be better than a single RNN layer (if the amount of training data is big).
  • Pretrain the embedding layer (if the amount of training data is small).

Machine Translation and Seq2Seq Model

Sequence-to-Sequence Model (Seq2Seq)

  1. Tokenization & Build Dictionary

    • Use 2 different tokenizers for the 2 languages.
    • Then build 2 different dictionaries.
  2. One-Hot encoding

  3. Train the Seq2Seq model: the encoder LSTM reads the source sentence and passes its final states to the decoder LSTM, which is trained to predict the target sentence one token at a time (a Keras sketch follows this list).
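A condensed sketch of the training graph with the Keras functional API (teacher forcing on one-hot token sequences; all names and sizes are placeholders, not the exact code from the slides):

```python
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense

latent_dim = 256        # placeholder
num_enc_tokens = 80     # size of the source-language dictionary (placeholder)
num_dec_tokens = 100    # size of the target-language dictionary (placeholder)

# Encoder: read the one-hot source sequence, keep only the final states (h, c).
encoder_inputs = Input(shape=(None, num_enc_tokens))
_, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)

# Decoder: start from the encoder's final states and predict the next token
# at every position of the (shifted) target sequence.
decoder_inputs = Input(shape=(None, num_dec_tokens))
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=[state_h, state_c])
decoder_outputs = Dense(num_dec_tokens, activation='softmax')(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
```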

Improvements

Use Bi-LSTM in the encoder; use a unidirectional LSTM in the decoder.

Attention

Shortcoming of Seq2Seq model: The final state is incapable of remembering a long sequence.


Seq2Seq model with attention

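With attention, the decoder no longer relies on the encoder's final state alone: at every decoding step it looks back at all encoder states h_1, ..., h_m. With s_j the current decoder state and "align" a learned scoring function whose outputs are normalized (e.g., by a softmax), the weights and the context vector are:

$$
\alpha_{ij} = \operatorname{align}(h_i,\; s_j), \qquad
c_j = \sum_{i=1}^{m} \alpha_{ij}\, h_i, \qquad \sum_{i=1}^{m} \alpha_{ij} = 1 .
$$

The context vector c_j, rather than the final encoder state alone, is used together with the decoder state and input for the next decoding step.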

Summary

  • Attention lets the decoder look at all encoder states at every step, so nothing has to be squeezed into the final state alone.
  • This greatly improves translation quality on long sentences, at the cost of extra computation for the attention weights.

Self-Attention

Self-attention applies the same idea within a single RNN: when computing a new state, the network also attends to all of its own previous states, which relieves the forgetting problem of a plain RNN.

Transformer Model

  • Transformer is a Seq2Seq model.

  • Transformer is not an RNN.

  • It is purely based on attention and dense layers.

  • It achieves higher accuracy than RNNs on large datasets.

Attention for Seq2Seq Model

Recall the attention mechanism of the Seq2Seq model above: at every decoding step, weights are computed from the current decoder state and all encoder states, and the context vector is their weighted average. This is the starting point for removing the RNN entirely.

Attention without RNN

The attention layer can be built without any RNN: keys and values are computed from the encoder inputs x_1, ..., x_m, and queries from the decoder inputs x'_1, ..., x'_t.
  1. Compute the attention weights for each query.

  2. Compute the context vector as the weighted average of the values.

  3. Repeat this process for every query (the formulas are written out below).
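Written out, with encoder inputs x_1, ..., x_m, decoder inputs x'_1, ..., x'_t, and the layer's three parameter matrices W_K, W_V, W_Q:

$$
k_i = W_K\, x_i, \qquad v_i = W_V\, x_i, \qquad q_j = W_Q\, x'_j ,
$$

$$
\alpha_{:j} = \operatorname{Softmax}\big(K^{\mathsf T} q_j\big) \in \mathbb{R}^{m}, \qquad
c_j = \sum_{i=1}^{m} \alpha_{ij}\, v_i ,
$$

where K = [k_1, ..., k_m]. Repeating this for j = 1, ..., t gives one context vector per query.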

Output of attention layer

The attention layer outputs one context vector per query: C = [c_1, ..., c_t].

Attention Layer for Machine Translation


RNN for machine translation: the state h is used as the feature vector.

Attention layer for machine translation: the context vector c is used as the feature vector.

Summary

  • An attention layer maps the encoder inputs (keys, values) and decoder inputs (queries) to context vectors, without using any recurrent state.
  • The context vectors play the role that the RNN states played before.

Self-Attention without RNN

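Self-attention is the same attention layer applied to a single input sequence x_1, ..., x_m: queries, keys, and values are all computed from it,

$$
q_i = W_Q\, x_i, \qquad k_i = W_K\, x_i, \qquad v_i = W_V\, x_i ,
$$

$$
\alpha_{:j} = \operatorname{Softmax}\big(K^{\mathsf T} q_j\big), \qquad
c_j = \sum_{i=1}^{m} \alpha_{ij}\, v_i ,
$$

so the output C = [c_1, ..., c_m] has one context vector per input position, and every c_j depends on the whole input sequence.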

Transformer model

Single-Head Self-Attention

A single-head self-attention layer is the self-attention layer described above; it has its own three parameter matrices.

Multi-Head Self-Attention

  • Use l single-head self-attentions (which do not share parameters).
    • A single-head self-attention has 3 parameter matrices: W_Q, W_K, W_V.
    • In total there are 3l parameter matrices.
  • Concatenate the outputs of the l single-head self-attentions.
    • Suppose each single-head self-attention's output is a d × m matrix.
    • Then the multi-head output has shape (l·d) × m (see below).
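Concretely, if the l heads produce outputs C^{(1)}, ..., C^{(l)}, each of shape d × m, they are stacked along the feature dimension:

$$
C = \begin{bmatrix} C^{(1)} \\ C^{(2)} \\ \vdots \\ C^{(l)} \end{bmatrix} \in \mathbb{R}^{(l d) \times m} .
$$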

Self-Attention Layer + Dense Layer

On top of the self-attention layer, a dense layer is applied to the output at each position; these dense layers are identical (they share the same parameters).

Stacked Self-Attention Layers

Self-attention + dense blocks can be stacked on top of one another, just like RNN layers were stacked earlier.

Transformer's Encoder

  • One encoder block consists of a multi-head self-attention layer followed by dense layers.

  • The block's input and output have the same shape, so a ResNet-style skip connection can be used: add the block's input to its output (sketched below).
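Schematically, one encoder block can then be written as follows (a simplification: the layer normalization used in the original Transformer is omitted here):

$$
y = x + \operatorname{MultiHeadSelfAttn}(x), \qquad
z = y + \operatorname{Dense}(y),
$$

where x, y, and z all have the same shape, which is exactly what makes the skip connections possible.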

Transformer's Decoder: One Block

One decoder block contains a self-attention layer over the decoder inputs, an attention layer over the encoder's outputs, and dense layers.

Transformer

The full Transformer stacks several encoder blocks and several decoder blocks (6 of each in the original paper) into a Seq2Seq model.

Summary

  • Transformer is a Seq2Seq model built purely from attention, self-attention, and dense layers; it contains no RNN.
  • The encoder stacks multi-head self-attention + dense blocks; the decoder blocks additionally attend to the encoder's outputs.
  • On large datasets it achieves higher accuracy than RNN-based Seq2Seq models.
