
RNN model and NLP summary

Data processing

  1. one-hot encoding

Processing Text Data

  1. Tokenization (text to words)

  2. count word frequencies

    • The counts are stored in a hash table.
    • Sort the table so that the frequencies are in descending order.
    • Replace "frequency" by "index" (starting from 1).
    • The number of unique words is called the "vocabulary" (a sketch of this procedure follows the list).
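A minimal Python sketch of this bookkeeping (the toy corpus and variable names are illustrative, not from the original notes):

```python
from collections import Counter

# Tokenized corpus: a list of word lists (toy data for illustration).
corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat"]]

# Count word frequencies; a dict/Counter is the hash table.
freq = Counter(word for sentence in corpus for word in sentence)

# Sort so that the frequencies are in descending order.
sorted_words = sorted(freq.items(), key=lambda kv: kv[1], reverse=True)

# Replace "frequency" by "index" (starting from 1).
word_index = {word: i for i, (word, _) in enumerate(sorted_words, start=1)}

# The number of unique words is the vocabulary.
vocabulary = len(word_index)
print(word_index)    # e.g. {'the': 1, 'sat': 2, 'cat': 3, ...}
print(vocabulary)    # 6
```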

Text Processing and Word Embedding

Text processing

  1. tokenization

    Considerations in tokenization:

    • Upper case to lower case. ("Apple" to "apple"?)
    • Remove stop words, e.g., "the", "a", "of", etc.
    • Typo correction. ("goood" to "good".)
  2. build the dictionary

  3. align sequences (make every sequence the same length; see the Keras sketch below)
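The same three steps can be done with the Keras preprocessing utilities; a sketch under assumed placeholder values (vocabulary, word_num, and the toy texts are not from the original notes):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

vocabulary = 10000   # keep only the most frequent words (placeholder)
word_num = 500       # target sequence length (placeholder)
texts = ["This movie is great", "Boring and far too long"]   # toy data

# 1. Tokenization and 2. dictionary building (word -> index, by frequency).
tokenizer = Tokenizer(num_words=vocabulary)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

# 3. Align sequences: pad short ones / truncate long ones to word_num.
x = pad_sequences(sequences, maxlen=word_num)
print(x.shape)   # (number_of_texts, word_num)
```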

Word embedding

  1. One-hot encoding

    • First, represent words using one-hot vectors.
      • Suppose the dictionary contains v unique words (vocabulary = v).
      • Then the one-hot vectors are v-dimensional.
  2. word embedding

    The embedding dimension d is chosen by the user (a Keras sketch follows).
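A sketch of the embedding layer in Keras (all sizes are placeholders). The layer holds a vocabulary × d parameter matrix, so each word index is mapped to a d-dimensional vector:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding

vocabulary = 10000   # number of unique words (placeholder)
d = 8                # embedding dimension, chosen by the user (placeholder)
word_num = 500       # aligned sequence length (placeholder)

model = Sequential()
model.add(Embedding(vocabulary, d, input_length=word_num))
model.summary()      # the embedding layer has vocabulary * d parameters
```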

Recurrent Neural Networks (RNNs)

  1. h0 contains the information of "the";

    h1 contains the information of "the cat";

    h2 contains the information of "the cat sat";

    ...

    ht contains the information of the whole sentence.

  2. The entire RNN has only one parameter matrix A, no matter how long the chain is.

Simple RNN

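The Simple RNN update can be written as follows, where [h_{t-1}, x_t] denotes the concatenation of the previous state and the current input, and A is the single shared parameter matrix (a bias term may also be added):

$$
h_t = \tanh\big(A \cdot [\,h_{t-1},\; x_t\,]\big)
$$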

Simple RNN for IMDB Review


return_sequences = False: the RNN returns only the last state h_t (see the Keras sketch below).

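A minimal Keras sketch of this IMDB sentiment model; the layer sizes are placeholders rather than the exact values from the slides:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

vocabulary = 10000    # placeholder
embedding_dim = 32    # placeholder
word_num = 500        # placeholder
state_dim = 32        # placeholder

model = Sequential()
model.add(Embedding(vocabulary, embedding_dim, input_length=word_num))
# return_sequences=False: keep only the last state h_t.
model.add(SimpleRNN(state_dim, return_sequences=False))
model.add(Dense(1, activation='sigmoid'))   # positive vs. negative review
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()
```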

Shortcomings of Simple RNN

  1. simple RNN is good at short-term dependence

  2. simple RNN is bad at long-term dependence


Summary

  • An RNN reads one word at a time; the state h_t accumulates the information of everything seen so far, and the whole chain shares a single parameter matrix A.
  • Simple RNN handles short-term dependence well but largely forgets long-range information, so it works best on short sequences.

Long Short Term Memory (LSTM)

LSTM Model


  • Conveyor belt: the past information directly flows to the future

Forget gate

  • Forget gate (f): a vector (the same shape as the state h and the cell state c).

    • A value of zero means "let nothing through".
    • A value of one means "let everything through!"
    • It acts elementwise on the previous cell state.
  • How f is computed (see the formula below):
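The forget gate is computed from the previous state and the current input; σ is the element-wise sigmoid and W_f is the gate's own parameter matrix:

$$
f_t = \sigma\big(W_f \cdot [\,h_{t-1},\; x_t\,]\big)
$$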

Input gate

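The input gate has the same form as the forget gate, with its own parameter matrix W_i:

$$
i_t = \sigma\big(W_i \cdot [\,h_{t-1},\; x_t\,]\big)
$$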

New value

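The new value (the candidate content to be written onto the conveyor belt) uses tanh instead of the sigmoid, and the cell state is updated by combining it with the forget and input gates (∘ denotes element-wise multiplication):

$$
\tilde{C}_t = \tanh\big(W_c \cdot [\,h_{t-1},\; x_t\,]\big), \qquad
C_t = f_t \circ C_{t-1} + i_t \circ \tilde{C}_t
$$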

Output gate

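The output gate decides how much of the cell state is exposed as the new hidden state:

$$
o_t = \sigma\big(W_o \cdot [\,h_{t-1},\; x_t\,]\big), \qquad
h_t = o_t \circ \tanh(C_t)
$$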

LSTM: Number of parameters

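Each of the four parts (forget gate, input gate, new value, output gate) has its own parameter matrix of shape shape(h) × (shape(h) + shape(x)), so, ignoring bias terms:

$$
\#\text{parameters} = 4 \times \text{shape}(h) \times \big(\text{shape}(h) + \text{shape}(x)\big)
$$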

LSTM Using Keras

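A sketch of the LSTM version of the IMDB model; compared with the SimpleRNN sketch above only the recurrent layer changes (all sizes are placeholders):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocabulary, embedding_dim, word_num, state_dim = 10000, 32, 500, 32  # placeholders

model = Sequential()
model.add(Embedding(vocabulary, embedding_dim, input_length=word_num))
model.add(LSTM(state_dim, return_sequences=False, dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()
```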

Summary

  • LSTM adds a conveyor belt (the cell state) so that past information can flow directly to the future.
  • The forget, input, and output gates control what is erased from, written to, and read out of the cell state.
  • Each of the four parts has its own parameter matrix, so an LSTM has about four times as many parameters as a Simple RNN of the same state size.

Making RNNs More Effective

Stacked RNN

RNN layers can be stacked like dense or convolutional layers: the sequence of states produced by one RNN layer serves as the input sequence of the layer above it.

Stacked LSTM

In a stacked LSTM, every LSTM layer below the top one must set return_sequences = True so that the layer above receives the whole sequence of states; the top layer can use return_sequences = False when it is followed by a classifier (see the sketch below).

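A sketch of a three-layer stacked LSTM in Keras (the depth and sizes are placeholders); the lower layers return full sequences so that the layer above gets one input per time step:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocabulary, embedding_dim, word_num, state_dim = 10000, 32, 500, 32  # placeholders

model = Sequential()
model.add(Embedding(vocabulary, embedding_dim, input_length=word_num))
model.add(LSTM(state_dim, return_sequences=True))    # feeds a sequence upward
model.add(LSTM(state_dim, return_sequences=True))    # feeds a sequence upward
model.add(LSTM(state_dim, return_sequences=False))   # top layer: only the last state
model.add(Dense(1, activation='sigmoid'))
model.summary()
```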
Bidirectional RNN

A bidirectional RNN runs two chains over the sequence, one from left to right and one from right to left, and concatenates their states. If there is no upper RNN layer, then return only the final states of the two chains (their concatenation); the intermediate outputs can be discarded.
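A sketch of a bidirectional LSTM in Keras; the Bidirectional wrapper runs the two chains and concatenates their states (all sizes are placeholders):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Bidirectional, Dense

vocabulary, embedding_dim, word_num, state_dim = 10000, 32, 500, 32  # placeholders

model = Sequential()
model.add(Embedding(vocabulary, embedding_dim, input_length=word_num))
# With return_sequences=False and no RNN layer above, only the two chains'
# final states are kept (concatenated, so the output size is 2 * state_dim).
model.add(Bidirectional(LSTM(state_dim, return_sequences=False)))
model.add(Dense(1, activation='sigmoid'))
model.summary()
```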

Pretraining


Step 1: Train a model on a large dataset.

  • Perhaps a different problem.
  • Perhaps a different model.

Step 2: Keep only the embedding layer.

Step 3: Train the rest of the network on your own (small) dataset, keeping the pretrained embedding layer fixed (a Keras sketch follows).
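A sketch of Steps 2 and 3 in Keras, assuming a pretrained embedding matrix is already available (the file name, sizes, and upper layers are placeholders): the pretrained weights are loaded into the embedding layer, the layer is frozen, and only the layers above it are trained.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocabulary, d, word_num, state_dim = 10000, 32, 500, 32   # placeholders
# Hypothetical file holding a (vocabulary, d) matrix from the pretrained model.
embedding_matrix = np.load('embedding_matrix.npy')

model = Sequential()
model.add(Embedding(vocabulary, d, input_length=word_num,
                    weights=[embedding_matrix],   # Step 2: reuse the embedding layer
                    trainable=False))             # keep it fixed during training
model.add(LSTM(state_dim))                        # Step 3: train the rest on your data
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
```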

Summary

  • SimpleRNN and LSTM are two kinds of RNNs; always use LSTM instead of SimpleRNN.
  • Use Bi-RNN instead of RNN whenever possible.
  • Stacked RNN may be better than a single RNN layer (if the amount of training data is big).
  • Pretrain the embedding layer (if the amount of training data is small).

Machine Translation and Seq2Seq Model

Sequence-to-Sequence Model (Seq2Seq)

  1. Tokenization & Build Dictionary

    • Use 2 different tokenizers for the 2 languages.
    • Then build 2 different dictionaries.
  2. One-Hot encoding

  3. Train the Seq2Seq model: the encoder LSTM reads the source sentence and passes its final states to the decoder LSTM, which is trained to predict the target sentence one token at a time (a Keras sketch follows this list).
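A condensed sketch of the training graph with the Keras functional API (teacher forcing on one-hot token sequences; all names and sizes are placeholders, not the exact code from the slides):

```python
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense

latent_dim = 256        # placeholder
num_enc_tokens = 80     # size of the source-language dictionary (placeholder)
num_dec_tokens = 100    # size of the target-language dictionary (placeholder)

# Encoder: read the one-hot source sequence, keep only the final states (h, c).
encoder_inputs = Input(shape=(None, num_enc_tokens))
_, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)

# Decoder: start from the encoder's final states and predict the next token
# at every position of the (shifted) target sequence.
decoder_inputs = Input(shape=(None, num_dec_tokens))
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=[state_h, state_c])
decoder_outputs = Dense(num_dec_tokens, activation='softmax')(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
```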

Improvements

Use Bi-LSTM in the encoder; use a unidirectional LSTM in the decoder.

Attention

Shortcoming of Seq2Seq model: The final state is incapable of remembering a long sequence.


Seq2Seq model with attention

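With attention, the decoder no longer relies on the encoder's final state alone: at every decoding step it looks back at all encoder states h_1, ..., h_m. With s_j the current decoder state and "align" a learned scoring function whose outputs are normalized (e.g., by a softmax), the weights and the context vector are:

$$
\alpha_{ij} = \operatorname{align}(h_i,\; s_j), \qquad
c_j = \sum_{i=1}^{m} \alpha_{ij}\, h_i, \qquad \sum_{i=1}^{m} \alpha_{ij} = 1 .
$$

The context vector c_j, rather than the final encoder state alone, is used together with the decoder state and input for the next decoding step.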

Summary

  • Attention lets the decoder look at all encoder states at every step, so nothing has to be squeezed into the final state alone.
  • This greatly improves translation quality on long sentences, at the cost of extra computation for the attention weights.

Self-Attention

Self-attention applies the same idea within a single RNN: when computing a new state, the network also attends to all of its own previous states, which relieves the forgetting problem of a plain RNN.

Transformer Model

  • Transformer is a Seq2Seq model.

  • Transformer is not an RNN.

  • It is purely based on attention and dense layers.

  • It achieves higher accuracy than RNNs on large datasets.

Attention for Seq2Seq Model

Recall the attention mechanism of the Seq2Seq model above: at every decoding step, weights are computed from the current decoder state and all encoder states, and the context vector is their weighted average. This is the starting point for removing the RNN entirely.

Attention without RNN

The attention layer can be built without any RNN: keys and values are computed from the encoder inputs x_1, ..., x_m, and queries from the decoder inputs x'_1, ..., x'_t.
  1. Compute the attention weights for each query.

  2. Compute the context vector as the weighted average of the values.

  3. Repeat this process for every query (the formulas are written out below).
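Written out, with encoder inputs x_1, ..., x_m, decoder inputs x'_1, ..., x'_t, and the layer's three parameter matrices W_K, W_V, W_Q:

$$
k_i = W_K\, x_i, \qquad v_i = W_V\, x_i, \qquad q_j = W_Q\, x'_j ,
$$

$$
\alpha_{:j} = \operatorname{Softmax}\big(K^{\mathsf T} q_j\big) \in \mathbb{R}^{m}, \qquad
c_j = \sum_{i=1}^{m} \alpha_{ij}\, v_i ,
$$

where K = [k_1, ..., k_m]. Repeating this for j = 1, ..., t gives one context vector per query.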

Output of attention layer

The attention layer outputs one context vector per query: C = [c_1, ..., c_t].

Attention Layer for Machine Translation


RNN for machine translation: the state h is used as the feature vector.

Attention layer for machine translation: the context vector c is used as the feature vector.

Summary

  • An attention layer maps the encoder inputs (keys, values) and decoder inputs (queries) to context vectors, without using any recurrent state.
  • The context vectors play the role that the RNN states played before.

Self-Attention without RNN

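Self-attention is the same attention layer applied to a single input sequence x_1, ..., x_m: queries, keys, and values are all computed from it,

$$
q_i = W_Q\, x_i, \qquad k_i = W_K\, x_i, \qquad v_i = W_V\, x_i ,
$$

$$
\alpha_{:j} = \operatorname{Softmax}\big(K^{\mathsf T} q_j\big), \qquad
c_j = \sum_{i=1}^{m} \alpha_{ij}\, v_i ,
$$

so the output C = [c_1, ..., c_m] has one context vector per input position, and every c_j depends on the whole input sequence.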

Transformer model

Single-Head Self-Attention

A single-head self-attention layer is the self-attention layer described above; it has its own three parameter matrices.

Multi-Head Self-Attention

  • Use l single-head self-attentions (which do not share parameters).
    • A single-head self-attention has 3 parameter matrices: W_Q, W_K, W_V.
    • In total there are 3l parameter matrices.
  • Concatenate the outputs of the l single-head self-attentions.
    • Suppose each single-head self-attention's output is a d × m matrix.
    • Then the multi-head output has shape (l·d) × m (see below).
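Concretely, if the l heads produce outputs C^{(1)}, ..., C^{(l)}, each of shape d × m, they are stacked along the feature dimension:

$$
C = \begin{bmatrix} C^{(1)} \\ C^{(2)} \\ \vdots \\ C^{(l)} \end{bmatrix} \in \mathbb{R}^{(l d) \times m} .
$$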

Self-Attention Layer + Dense Layer

On top of the self-attention layer, a dense layer is applied to the output at each position; these dense layers are identical (they share the same parameters).

Stacked Self-Attention Layers

Self-attention + dense blocks can be stacked on top of one another, just like RNN layers were stacked earlier.

Transformer's Encoder

  • One encoder block consists of a multi-head self-attention layer followed by dense layers.

  • The block's input and output have the same shape, so a ResNet-style skip connection can be used: add the block's input to its output (sketched below).
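Schematically, one encoder block can then be written as follows (a simplification: the layer normalization used in the original Transformer is omitted here):

$$
y = x + \operatorname{MultiHeadSelfAttn}(x), \qquad
z = y + \operatorname{Dense}(y),
$$

where x, y, and z all have the same shape, which is exactly what makes the skip connections possible.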

Transformer's Decoder: One Block

One decoder block contains a self-attention layer over the decoder inputs, an attention layer over the encoder's outputs, and dense layers.

Transformer

The full Transformer stacks several encoder blocks and several decoder blocks (6 of each in the original paper) into a Seq2Seq model.

Summary

  • Transformer is a Seq2Seq model built purely from attention, self-attention, and dense layers; it contains no RNN.
  • The encoder stacks multi-head self-attention + dense blocks; the decoder blocks additionally attend to the encoder's outputs.
  • On large datasets it achieves higher accuracy than RNN-based Seq2Seq models.
