Multi-Layer Perceptron (MLP)

Concept

Batch: The number of input vectors, which is 2 in the example.
Hidden layer: Intermediate layers where the network extracts features using weights, biases, and non-linear activation functions.
Input/Output layer: As the names suggest, the layers where data enters and where the result leaves the network.

For each layer

$A_{i + 1} = \mathrm{ReLU} (W_{i} A_{i} + b_{i})$

where $A_{0} = X$ is the input.
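As a minimal sketch of one layer, the snippet below applies weights and a bias to each input vector and then ReLU. The weights, biases, and the batch of 2 input vectors are made-up numbers for illustration, and the helper names `relu` and `dense_relu` are hypothetical.

```python
def relu(v):
    """Apply ReLU element-wise to a list of numbers."""
    return [max(0.0, x) for x in v]


def dense_relu(W, a, b):
    """Compute ReLU(W a + b) for one input vector a.

    W is a list of rows (one row per output node); b is a list of biases.
    """
    z = [sum(w * x for w, x in zip(row, a)) + bias for row, bias in zip(W, b)]
    return relu(z)


# Made-up example: 2 inputs -> 3 hidden nodes, batch of 2 input vectors.
W = [[1.0, -1.0], [0.5, 0.5], [-2.0, 1.0]]
b = [0.0, 1.0, 0.5]
batch = [[1.0, 2.0], [3.0, 1.0]]
hidden = [dense_relu(W, x, b) for x in batch]
print(hidden)  # [[0.0, 2.5, 0.5], [2.0, 3.0, 0.0]]
```

Negative pre-activations are clipped to 0, which is where the non-linearity comes from.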

Before final result

We should use Sigmoid or Softmax to squash the result into the range between 0 and 1.

Sigmoid

Sigmoid is used when there's only ONE output value for each input in the batch, for example in binary classification.

It is defined as: $σ (x) = \frac{1}{1 + e ^{- x}}$
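A small sketch of the Sigmoid definition in plain Python. The branch on the sign of `x` is an addition of mine (not part of the definition) that avoids overflow in `math.exp` for very negative inputs.

```python
import math


def sigmoid(x):
    """Return 1 / (1 + e^(-x)), computed without overflowing for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use the equivalent form e^x / (1 + e^x).
    e = math.exp(x)
    return e / (1.0 + e)


print(sigmoid(0.0))  # 0.5
```

Large positive inputs approach 1 and large negative inputs approach 0, so the output can be read as a probability.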

Softmax

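Softmax is the most commonly used output activation for multi-class classification. It maps every value through $e^{x}$ and then divides by the total, so each result is that value's share of the sum and all results add up to 1.

$\mathrm{Softmax} (d_{i}) = \frac{e ^{d_{i}}}{\sum _{k = 0}^{n - 1} e ^{d_{k}}}$

where $n$ is the number of output values.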
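A short sketch of Softmax over a list of values. Subtracting the maximum before exponentiating is a common numerical-stability trick that I am assuming here; it does not change the result.

```python
import math


def softmax(d):
    """Return the Softmax of a list of numbers; the results sum to 1."""
    m = max(d)  # shift by the maximum so math.exp never overflows
    exps = [math.exp(x - m) for x in d]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
```

The largest input gets the largest share, and equal inputs get equal shares.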

Backpropagation

Backpropagation is short for Backward Propagation of Errors.

Formula & Expression

First, let's look at the basic expression of a neural network which has only one node in each layer.

$a^{(L)} = σ (w^{(L)} a^{(L - 1)} + b^{(L)})$

The following will be based on a more complex situation, which has multiple nodes in each layer, reflecting how they are structured in practice.

Let's take a look at some concepts.

$L$ : Layer, the variable in this formula.
$a_{n}^{(L)}$ : The value of node $n$ in layer $L$ .
$w_{j k}^{(L)}$ : The weight between $a_{k}^{(L - 1)}$ and $a_{j}^{(L)}$ .
$b$ : Bias, a specific number.
$σ$ : A non-linear function, usually Sigmoid or ReLU.

We use $z$ to represent the whole bunch of things inside $σ$ .

$z_{j}^{(L)} = \sum _{k = 0}^{n_{L - 1} - 1} w_{j k}^{(L)} a_{k}^{(L - 1)} + b_{j}^{(L)}$

where $n_{L - 1}$ is the number of nodes in layer $L - 1$ .

So that $a$ can be written as:

$a_{j}^{(L)} = σ (z_{j}^{(L)})$

And the loss $C_{0}$ would be the sum of squared errors in the last layer.

$C_{0} = \sum _{j = 0}^{n_{L} - 1} (a_{j}^{(L)} - y_{j})^{2}$

where $y_{j}$ is the given target.
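The definitions of $z$, $a$, and $C_{0}$ can be sketched for a single layer as follows. This assumes Sigmoid as $σ$; the weights, biases, inputs, and targets are made-up numbers, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(w, b, a_prev):
    """Return (z, a) for one layer; w[j][k] links a_prev[k] to node j."""
    z = [sum(w[j][k] * a_prev[k] for k in range(len(a_prev))) + b[j]
         for j in range(len(b))]
    a = [sigmoid(z_j) for z_j in z]
    return z, a


def cost(a_last, y):
    """C_0: the sum of squared errors over the nodes of the last layer."""
    return sum((a_j - y_j) ** 2 for a_j, y_j in zip(a_last, y))


# Made-up example with 2 nodes in each layer.
w = [[0.0, 0.0], [1.0, -1.0]]
b = [0.0, 0.0]
a_prev = [0.5, 0.5]
z, a = layer_forward(w, b, a_prev)
c0 = cost(a, [1.0, 0.0])
print(z, a, c0)  # [0.0, 0.0] [0.5, 0.5] 0.5
```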

Chain Rule

$\frac{\partial C _{0}}{\partial w _{j k}^{(L)}} = \frac{\partial z _{j}^{(L)}}{\partial w _{j k}^{(L)}} \frac{\partial a _{j}^{(L)}}{\partial z _{j}^{(L)}} \frac{\partial C _{0}}{\partial a _{j}^{(L)}}$

According to the previous formulas, each part would be:

$\frac{\partial C _{0}}{\partial a _{j}^{(L)}} = 2 (a_{j}^{(L)} - y_{j})$

$\frac{\partial a _{j}^{(L)}}{\partial z _{j}^{(L)}} = σ^{'} (z_{j}^{(L)})$

$\frac{\partial z _{j}^{(L)}}{\partial w _{j k}^{(L)}} = a_{k}^{(L - 1)}$

So,

$\frac{\partial C _{0}}{\partial w _{j k}^{(L)}} = a_{k}^{(L - 1)} \cdot σ^{'} (z_{j}^{(L)}) \cdot 2 (a_{j}^{(L)} - y_{j})$

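At first glance this looks too simple, but for the last layer it is complete: $w_{j k}^{(L)}$ affects $C_{0}$ only through $a_{j}^{(L)}$ . The quantity that has many paths to $C_{0}$ is the previous layer's value $a_{k}^{(L - 1)}$ , which feeds every node in layer $L$ , so its derivative has to sum over all of them:

$\frac{\partial C _{0}}{\partial a _{k}^{(L - 1)}} = \sum _{j = 0}^{n_{L} - 1} w_{j k}^{(L)} \cdot σ^{'} (z_{j}^{(L)}) \cdot 2 (a_{j}^{(L)} - y_{j})$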
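One way to test the chain-rule formula for a last-layer weight is to compare it with a finite-difference estimate. The sketch below assumes Sigmoid as $σ$ (so $σ'(z) = σ(z)(1 - σ(z))$) and uses made-up numbers; all function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward_cost(w, b, a_prev, y):
    """Cost C_0 of a single Sigmoid output layer."""
    total = 0.0
    for j in range(len(b)):
        z_j = sum(w[j][k] * a_prev[k] for k in range(len(a_prev))) + b[j]
        total += (sigmoid(z_j) - y[j]) ** 2
    return total


def analytic_grad(w, b, a_prev, y, j, k):
    """dC_0/dw_jk from the chain rule: a_k * sigma'(z_j) * 2 (a_j - y_j)."""
    z_j = sum(w[j][i] * a_prev[i] for i in range(len(a_prev))) + b[j]
    a_j = sigmoid(z_j)
    return a_prev[k] * a_j * (1.0 - a_j) * 2.0 * (a_j - y[j])


def numeric_grad(w, b, a_prev, y, j, k, eps=1e-6):
    """Central finite difference with respect to the same weight."""
    w_plus = [row[:] for row in w]
    w_minus = [row[:] for row in w]
    w_plus[j][k] += eps
    w_minus[j][k] -= eps
    return (forward_cost(w_plus, b, a_prev, y)
            - forward_cost(w_minus, b, a_prev, y)) / (2.0 * eps)


# Made-up example values.
w = [[0.3, -0.2], [0.1, 0.4]]
b = [0.05, -0.1]
a_prev = [0.6, 0.9]
y = [1.0, 0.0]
max_gap = max(abs(analytic_grad(w, b, a_prev, y, j, k)
                  - numeric_grad(w, b, a_prev, y, j, k))
              for j in range(2) for k in range(2))
```

If the two estimates agree up to rounding error for every `j` and `k`, the formula is sufficient for weights in the last layer.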

Excel Document

The document is too complicated and comes without any explanation. I will dive that deep someday when I read the source code of a deep learning algorithm.

RNN

This document is also highly abstract, so I will set it aside and try to learn on my own.

Why RNN?

In a traditional neural network, adjacent layers are fully connected, and nodes in the same layer don't connect to each other. So the data flows from the input layer, through the hidden layers, and finally to the output layer. The network treats the input as a single static object.

What about RNN? In an RNN, a node is connected to itself and continuously updates its value. We can also unroll the diagram into a row, with each copy of the node representing its value at a specific time step. So it is good at processing sequences, where the order of the items also matters.

What is RNN?

The main node in RNN is continuously updating the value of itself, so it can be considered as a neural network with some loop weights.

The formula below is written by myself according to my understanding. It is only a rough note, not a rigorous definition.

$a_{t} = σ (w_{1} \cdot I_{t} + w_{2} \cdot a_{t - 1} + b_{1})$

$O_{t} = w_{3} \cdot a_{t} + b_{2}$

$I_{t}$ : The input at time $t$ . We can also call this time the depth of loop.
$O_{t}$ : The output at time $t$ .
$w$ , $b$ : Weight and bias. They remain the same in every loop.
$σ$ : The activation function. ReLU can be used as before, though $tanh$ (defined in the LSTM part) is the more common choice for RNNs.
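The two formulas above can be sketched with scalar values as follows. The weights, biases, and inputs are made-up numbers, the starting value $a_{0} = 0$ is my assumption, and the activation is a parameter (ReLU by default to follow the note).

```python
def relu(x):
    return max(0.0, x)


def rnn_step(i_t, a_prev, w1, w2, w3, b1, b2, act=relu):
    """One loop: update the node value, then compute the output."""
    a_t = act(w1 * i_t + w2 * a_prev + b1)
    o_t = w3 * a_t + b2
    return a_t, o_t


def rnn_run(inputs, w1, w2, w3, b1, b2, act=relu):
    """Feed a sequence through the loop, reusing the same weights each time."""
    a_t = 0.0  # assumed initial node value
    outputs = []
    for i_t in inputs:
        a_t, o_t = rnn_step(i_t, a_t, w1, w2, w3, b1, b2, act)
        outputs.append(o_t)
    return outputs


forward = rnn_run([1.0, 2.0, 3.0], w1=1.0, w2=0.5, w3=2.0, b1=0.0, b2=1.0)
backward = rnn_run([3.0, 2.0, 1.0], w1=1.0, w2=0.5, w3=2.0, b1=0.0, b2=1.0)
print(forward)   # [3.0, 6.0, 9.5]
print(backward)  # [7.0, 8.0, 6.5]
```

The same three inputs in a different order give a different final output, which is the sense in which the loop is sensitive to order.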

Advantage: It's good at handling sequences, especially data with a time axis.

Disadvantage: The vanishing/exploding gradient problem, which I'll talk about in the next part.

Long Short Term Memory

~~Complaint: Why are these documents more and more complex??? Without any explanation??? I can hardly understand even if I have learned so many things from online and AI.~~

Why LSTM?

As a specialized form of RNN, LSTM greatly reduces the vanishing gradient problem.

What is a gradient? In backpropagation, we calculate how much the loss changes when each weight changes slightly. This derivative is called the "gradient", and it decides the direction and size of each weight's adjustment.

What is the Vanishing Gradient Problem? When calculating the adjustment for a layer far from the output layer, we use the Chain Rule, which multiplies one factor per layer. These factors are often smaller than 1, so their product gets tinier and tinier. By the time the gradient signal reaches the first few layers, it has "vanished", and with the limited precision of floating-point numbers it can even round to zero. What's worse, in an RNN every loop acts like a "layer", so the effective "depth" grows quickly with the sequence length and the problem becomes severe. Another reason is that activation functions such as Sigmoid have a small derivative (at most 0.25 for Sigmoid), which makes the gradient even smaller.

What will happen with the Vanishing Gradient Problem? For general neural networks, the weights in early layers hardly update, so the network may fail to learn the fundamental features of the data. For RNN especially, it will "forget" what it saw at the beginning of a long sequence.

Of course, there's an opposite problem, the Exploding Gradient problem, which has the same cause: the repeated multiplication in the Chain Rule, but with factors larger than 1.
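A rough numerical illustration of both problems: multiplying the same local factor once per layer (or per loop). Using one identical factor everywhere is a simplifying assumption; 0.25 is the largest value the Sigmoid derivative can take, and 1.5 is an arbitrary factor larger than 1.

```python
def chained_gradient(local_factor, depth):
    """Multiply the same per-layer factor `depth` times, as the Chain Rule would."""
    g = 1.0
    for _ in range(depth):
        g *= local_factor
    return g


shallow = chained_gradient(0.25, 5)     # still a usable size
deep = chained_gradient(0.25, 50)       # practically zero
exploding = chained_gradient(1.5, 50)   # grows out of control
print(shallow, deep, exploding)
```

With 50 steps the product of small factors is far below anything useful for an update, while the product of factors just above 1 is already in the hundreds of millions.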

To address the vanishing gradient problem in particular, here's LSTM.

What is LSTM?

Refer to this video on Bilibili.

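In this part, I will use $f_{n}$ to represent a linear function with its own fixed weights and bias, where $n$ identifies the different functions. In detail, the function is defined like this.

$f_{n} (h, x) = W_{n} x + U_{n} h + b_{n}$

So in $f_{n} (S T M, I P T)$ below, $h$ is the short-term memory and $x$ is the input.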

And I will also introduce a new activation function, the $tanh$ function. It always gives a result between -1 and 1.

For the convenience of communication, let's define some short forms.

$L T M$ : long-term memory.
$S T M$ : short-term memory.
$I P T$ : input value.
$O P T$ : output value.
$σ (x)$ : the Sigmoid function.
$tanh (x)$ : the $tanh$ function.

$tanh (x) = \frac{e ^{x} - e ^{- x}}{e ^{x} + e ^{- x}}$

Now, let's look at the LSTM itself.

For each loop in LSTM, there are the long-term memory, the short-term memory, the input and the output. And in each loop, there are mainly three parts. I'm now going to dive into one single loop and explain these parts.

The first part is the Forget Gate. As the name suggests, it updates the long-term memory by forgetting part of it.

$L T M = L T M \cdot σ (f_{1} (S T M, I P T))$

Here, $L T M$ is multiplied by a factor between 0 and 1 which indicates what fraction of the long-term memory should be kept.

The second part is the Input Gate. It also updates the long-term memory, but by adding a new value to it.

$L T M = L T M + (tanh (f_{2} (S T M, I P T)) \cdot σ (f_{3} (S T M, I P T)))$

In this formula, the $tanh (f_{2} (S T M, I P T))$ part gives the Potential Long-Term Memory, and the $σ (f_{3} (S T M, I P T))$ part gives the fraction of that potential long-term memory that should be kept.

The final part is called the Output Gate. It generates a new value for the Short-Term Memory.

$S T M = tanh (L T M) \cdot σ (f_{4} (S T M, I P T))$

Much like the former formula, the $tanh (L T M)$ part is the Potential Short-Term Memory and the $σ (f_{4} (S T M, I P T))$ part is the fraction of it to be remembered.

After a whole loop, the updated long-term and short-term memory, combined with a new input value, will be transferred to the next loop.
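The three gates of one loop can be sketched with scalar values as below. Representing each $f_{n}$ as a tuple `(w_stm, w_ipt, bias)` is my own assumption for illustration, as are the function names.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(params, stm, ipt):
    """f_n(STM, IPT): weighted sum of the two values plus a bias."""
    w_stm, w_ipt, bias = params
    return w_stm * stm + w_ipt * ipt + bias


def lstm_loop(ltm, stm, ipt, f1, f2, f3, f4):
    """Run one loop and return the updated (LTM, STM)."""
    # Forget Gate: keep only a fraction of the long-term memory.
    ltm = ltm * sigmoid(linear(f1, stm, ipt))
    # Input Gate: add a fraction of the potential long-term memory.
    ltm = ltm + math.tanh(linear(f2, stm, ipt)) * sigmoid(linear(f3, stm, ipt))
    # Output Gate: the new short-term memory is a fraction of tanh(LTM).
    stm = math.tanh(ltm) * sigmoid(linear(f4, stm, ipt))
    return ltm, stm


# With all weights and biases at 0, every Sigmoid gives 0.5 and tanh(0) is 0.
zero = (0.0, 0.0, 0.0)
new_ltm, new_stm = lstm_loop(2.0, 1.0, 1.0, zero, zero, zero, zero)
print(new_ltm)  # 1.0
```

In this degenerate case the Forget Gate halves the long-term memory and the Input Gate adds nothing, which makes the role of each gate easy to see.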

A more rigorous statement

To make a more rigorous statement, we have to define some variables.

$c_{t}$ : the Cell State at time $t$ , which is actually the value of the long-term memory.
$h_{t}$ : the Hidden State at time $t$ , representing the value of the short-term memory.
$x_{t}$ : Input value at time $t$ .

And we also have to introduce a symbol, $⊙$ , which is called the element-wise product.

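$\begin{bmatrix} a_{1} \\ a_{2} \\ ⋮ \\ a_{n} \end{bmatrix} ⊙ \begin{bmatrix} b_{1} \\ b_{2} \\ ⋮ \\ b_{n} \end{bmatrix} = \begin{bmatrix} a_{1} b_{1} \\ a_{2} b_{2} \\ ⋮ \\ a_{n} b_{n} \end{bmatrix}$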

We can finally appreciate the formula.

Here's the Forget Gate.

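$f_{t} = σ (W_{f} \cdot x_{t} + U_{f} \cdot h_{t - 1} + b_{f})$

It keeps $f_{t} ⊙ c_{t - 1}$ of the previous cell state. This is only the first term of $c_{t}$ ; the Input Gate adds the second.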

And here's the Input Gate ( $i_{t}$ ) and Candidate Cell State ( $\tilde{c}_{t}$ ):

$i_{t} = σ (W_{i} \cdot x_{t} + U_{i} \cdot h_{t - 1} + b_{i})$

$\tilde{c}_{t} = tanh (W_{c} \cdot x_{t} + U_{c} \cdot h_{t - 1} + b_{c})$

$c_{t} = (f_{t} ⊙ c_{t - 1}) + (i_{t} ⊙ \tilde{c}_{t})$

And the Output Gate ( $o_{t}$ )

$o_{t} = σ (W_{o} \cdot x_{t} + U_{o} \cdot h_{t - 1} + b_{o})$

$h_{t} = o_{t} ⊙ tanh (c_{t})$

The $h_{t}$ here is the final output.
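The vector form can be sketched in plain Python with lists standing in for vectors and matrices. The parameter layout (a dict from gate name to a `(W, U, b)` tuple) and the helper names are hypothetical choices of mine, and the zero-valued parameters are only there to make the result easy to check by hand.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, x, U, h, b):
    """Return W x + U h + b as a list (W and U are lists of rows)."""
    Wx = [sum(w * v for w, v in zip(row, x)) for row in W]
    Uh = [sum(u * v for u, v in zip(row, h)) for row in U]
    return [p + q + r for p, q, r in zip(Wx, Uh, b)]


def hadamard(a, b):
    """Element-wise product of two equally long lists."""
    return [p * q for p, q in zip(a, b)]


def gate(params, x_t, h_prev, act):
    W, U, b = params
    return [act(v) for v in affine(W, x_t, U, h_prev, b)]


def lstm_step(x_t, h_prev, c_prev, p):
    """One time step; p maps 'f', 'i', 'c', 'o' to (W, U, b)."""
    f_t = gate(p["f"], x_t, h_prev, sigmoid)        # Forget Gate
    i_t = gate(p["i"], x_t, h_prev, sigmoid)        # Input Gate
    c_tilde = gate(p["c"], x_t, h_prev, math.tanh)  # Candidate Cell State
    o_t = gate(p["o"], x_t, h_prev, sigmoid)        # Output Gate
    c_t = [u + v for u, v in zip(hadamard(f_t, c_prev), hadamard(i_t, c_tilde))]
    h_t = hadamard(o_t, [math.tanh(c) for c in c_t])
    return h_t, c_t


# Made-up shapes: 1 input value, 2 hidden values, all parameters zero.
zero = ([[0.0], [0.0]], [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
params = {name: zero for name in "fico"}
h_t, c_t = lstm_step([1.0], [0.3, -0.3], [2.0, -4.0], params)
print(c_t)  # [1.0, -2.0]
```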

xLSTM

Skip for now.

The PDF file of the paper is here.

ResNet

Mostly about adding the value of an earlier node to a later calculation (a skip connection). But I won't dive deep here either. I'm not going to admit that the fact is I couldn't understand it. 🤣
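As far as I understand it, the core idea fits in a few lines: the block returns its input plus whatever the inner layers compute. This is only a sketch of that idea, not the full ResNet architecture; `residual_block` and the toy inner functions are hypothetical.

```python
def residual_block(x, f):
    """Return x + f(x) element-wise; f must keep the vector length."""
    fx = f(x)
    return [x_i + f_i for x_i, f_i in zip(x, fx)]


# If the inner layers output zeros, the input passes through unchanged.
unchanged = residual_block([1.0, 2.0], lambda v: [0.0 for _ in v])
# A toy inner function: ReLU(2 * x).
shifted = residual_block([1.0, -2.0], lambda v: [max(0.0, 2.0 * t) for t in v])
print(unchanged, shifted)  # [1.0, 2.0] [3.0, -2.0]
```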

Transformer

What is Transformer?

Transformer is mainly based on Query, Key, and Value, also called QKV.

In a Transformer model, every token is represented by a vector. The Q, K, and V vectors of a token are calculated by multiplying its vector by the matrices $W_{Q}$ , $W_{K}$ , and $W_{V}$ .

What are Q, K, and V? The Query describes what a token is looking for in the tokens around it. The Key is what each token offers to be matched against a Query; the better a Key matches a Query, the more attention that token gets. The Value represents the real meaning of the token, which is what gets passed along.

Then, here comes the main formula.

$\mathrm{Attention} (Q, K, V) = \mathrm{Softmax} \left( \frac{Q K^{T}}{\sqrt{d_{K}}} \right) V$

where $d_{K}$ is the dimension of the Key vectors.
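A small sketch of the attention formula in plain Python, using the standard scaled dot-product form: each Query is dotted with every Key, the scores are divided by $\sqrt{d_{K}}$ , Softmax turns them into weights, and the weights mix the Values. Q, K, and V are lists of row vectors (one row per token) with made-up numbers, and the function names are hypothetical.

```python
import math


def softmax(row):
    m = max(row)  # shift for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Score of this Query against every Key, scaled by sqrt(d_k).
        scores = [sum(q_i * k_i for q_i, k_i in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # Weighted average of the Value rows.
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out


# All-zero Queries and Keys give equal weights, so the output is the mean of V.
uniform = attention([[0.0, 0.0], [0.0, 0.0]],
                    [[0.0, 0.0], [0.0, 0.0]],
                    [[1.0, 2.0], [3.0, 4.0]])
# A Query that matches the first Key strongly picks out the first Value.
focused = attention([[10.0, 0.0]], [[10.0, 0.0], [0.0, 10.0]], [[1.0], [0.0]])
print(uniform)  # [[2.0, 3.0], [2.0, 3.0]]
```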