Advanced Artificial Intelligence


1 Advanced Artificial Intelligence
Lecture 5: Neural Networks

2 Outline
Perceptron Introduction
Deep Neural Network Structure
Backpropagation

3 Perceptron Introduction
The perceptron is inspired by the biological neuron; it is a classifier. [Figure: a perceptron updating its linear decision boundary as more training examples are added.]

4 Single layer one-input Perceptron
$a = w_1 x_1 + b$, $Y = \theta(a)$, where $\theta(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases}$. Example parameters: $w_1 = -1.5$, $b = -0.5$.

5 Single layer multi-input Perceptron
$a = w_1 x_1 + w_2 x_2 + b$, $Y = \theta(a)$, with the same step function $\theta(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases}$.
Learning rule: $E = \hat{Y} - Y$, $w_i \leftarrow w_i + \alpha \cdot E \cdot x_i$, $b \leftarrow b + \alpha \cdot E$.
A single-layer perceptron is a linear classifier.
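As a concrete illustration of the update rule above, here is a minimal NumPy sketch of perceptron training; the AND-gate data, learning rate, and epoch count are illustrative choices, not taken from the slides.

import numpy as np

def train_perceptron(X, y, alpha=0.1, epochs=10):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x_i, target in zip(X, y):
            a = np.dot(w, x_i) + b      # a = w1*x1 + w2*x2 + b
            Y = 1 if a >= 0 else 0      # step activation theta(a)
            E = target - Y              # E = Y_hat - Y
            w = w + alpha * E * x_i     # w_i <- w_i + alpha * E * x_i
            b = b + alpha * E           # b  <- b + alpha * E
    return w, b

# Example: learn the (linearly separable) AND function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
print(train_perceptron(X, y))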

6 Single Hidden layer multi-input Perceptron
A perceptron with multiple inputs and a single hidden node. It is still a linear classifier: the decision boundary is a separating hyperplane.

7 Non-linear activation Perceptron
$a = w_1 x_1 + w_2 x_2 + b$, $Y = \sigma(a)$, where $\sigma$ is the sigmoid activation function.

8 Non-linear activation Perceptron
With the sigmoid activation function $\sigma(x) = \frac{1}{1 + e^{-x}}$, the output varies smoothly between 0 and 1 instead of switching abruptly at the decision boundary.

9 Outline
Perceptron Introduction
Deep Neural Network Structure
Backpropagation

10 Deep Neural Network
One neuron (perceptron): linear separation
One hidden layer: realization of convex regions
Two hidden layers: realization of non-convex regions
Multiple hidden layers with non-linear activations: all the complex shapes

11 Deep Structure A deep neural network can handle almost any classification or regression task.

12 Why Deep and Thin Source of the slide:

13 Deep NN Structure
Input layer, hidden layer, output layer.
Forward propagation (assume the activation function is sigmoid):
Step 1: input -> hidden layer: $net_{h1} = w_1 \cdot i_1 + w_2 \cdot i_2 + b_1 \cdot 1$, $out_{h1} = \frac{1}{1 + e^{-net_{h1}}}$
Step 2: hidden -> output layer: $net_{o1} = w_5 \cdot out_{h1} + w_6 \cdot out_{h2} + b_2 \cdot 1$, $out_{o1} = \frac{1}{1 + e^{-net_{o1}}}$
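To make the two steps concrete, here is a small NumPy sketch of the forward pass for a 2-2-2 network with sigmoid activations. All numeric inputs and weights are illustrative values only; the slide's concrete numbers are not preserved in this transcript.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

i1, i2 = 0.05, 0.10                        # example inputs
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30    # input -> hidden weights
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55    # hidden -> output weights
b1, b2 = 0.35, 0.60                        # hidden and output biases

# Step 1: input -> hidden layer
net_h1 = w1 * i1 + w2 * i2 + b1 * 1
net_h2 = w3 * i1 + w4 * i2 + b1 * 1
out_h1, out_h2 = sigmoid(net_h1), sigmoid(net_h2)

# Step 2: hidden -> output layer
net_o1 = w5 * out_h1 + w6 * out_h2 + b2 * 1
net_o2 = w7 * out_h1 + w8 * out_h2 + b2 * 1
out_o1, out_o2 = sigmoid(net_o1), sigmoid(net_o2)
print(out_o1, out_o2)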

14 Outline
Perceptron Introduction
Deep Neural Network Structure
Backpropagation

15 Backward Propagation: Weight Update
$out(y) = \frac{1}{1 + e^{-y}}$, $net(x) = w_1 x + w_2 x + b$
Weight update: $w_1 \leftarrow w_1 + \eta \frac{\partial E}{\partial w_1}$ ($\eta$ is the learning rate).

16 Output layer weight update
Backward propagation:
Step 1: total cost: $E_{total} = \sum_n (target_n - output_n)^2$
Step 2: output -> hidden layer weight update (chain rule):
$\frac{\partial E_{total}}{\partial w_5} = \frac{\partial E_{total}}{\partial out_{o1}} \cdot \frac{\partial out_{o1}}{\partial net_{o1}} \cdot \frac{\partial net_{o1}}{\partial w_5}$
with $out(y) = \frac{1}{1 + e^{-y}}$ and $net(x) = w_1 x + w_2 x + b$.
Update: $w_5 \leftarrow w_5 + \eta \frac{\partial E}{\partial w_5}$ (similar to the single perceptron).

17 Hidden layer weight update
Backward propagation:
Step 3: hidden -> input layer weight update:
$\frac{\partial E_{total}}{\partial w_1} = \frac{\partial E_{total}}{\partial out_{h1}} \cdot \frac{\partial out_{h1}}{\partial net_{h1}} \cdot \frac{\partial net_{h1}}{\partial w_1}$
Because $out_{h1}$ feeds every output neuron, its error contributions are summed:
$\frac{\partial E_{total}}{\partial out_{h1}} = \frac{\partial E_{o1}}{\partial out_{h1}} + \frac{\partial E_{o2}}{\partial out_{h1}}$, where $\frac{\partial E_{o1}}{\partial out_{h1}} = \frac{\partial E_{o1}}{\partial net_{o1}} \cdot \frac{\partial net_{o1}}{\partial out_{h1}}$
Putting it together:
$\frac{\partial E_{total}}{\partial w_1} = \left( \sum_o \frac{\partial E_o}{\partial net_o} \cdot \frac{\partial net_o}{\partial out_{h1}} \right) \cdot \frac{\partial out_{h1}}{\partial net_{h1}} \cdot \frac{\partial net_{h1}}{\partial w_1}$
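Continuing the NumPy sketch from slide 13, the code below computes $\frac{\partial E_{total}}{\partial w_5}$ and $\frac{\partial E_{total}}{\partial w_1}$ exactly as in Steps 1 through 3. The targets and learning rate are again made-up values, and the update is written in the standard gradient-descent (minus the gradient) form.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Same illustrative 2-2-2 network as the forward-pass sketch on slide 13
i1, i2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30   # input -> hidden
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55   # hidden -> output
b1, b2 = 0.35, 0.60
target_o1, target_o2 = 0.01, 0.99         # made-up targets
eta = 0.5                                 # made-up learning rate

# Forward pass (as on slide 13)
out_h1 = sigmoid(w1 * i1 + w2 * i2 + b1)
out_h2 = sigmoid(w3 * i1 + w4 * i2 + b1)
out_o1 = sigmoid(w5 * out_h1 + w6 * out_h2 + b2)
out_o2 = sigmoid(w7 * out_h1 + w8 * out_h2 + b2)

# Step 1: total cost, E_total = sum (target - output)^2
E_total = (target_o1 - out_o1) ** 2 + (target_o2 - out_o2) ** 2

# Step 2: output-layer weight w5 (chain rule, three factors)
dE_dout_o1 = -2 * (target_o1 - out_o1)
dout_o1_dnet_o1 = out_o1 * (1 - out_o1)      # sigmoid derivative
dE_dw5 = dE_dout_o1 * dout_o1_dnet_o1 * out_h1

# Step 3: hidden-layer weight w1 (error signals from both outputs are summed)
delta_o1 = dE_dout_o1 * dout_o1_dnet_o1
delta_o2 = -2 * (target_o2 - out_o2) * out_o2 * (1 - out_o2)
dE_dout_h1 = delta_o1 * w5 + delta_o2 * w7
dE_dw1 = dE_dout_h1 * out_h1 * (1 - out_h1) * i1

# Gradient-descent updates
w5_new = w5 - eta * dE_dw5
w1_new = w1 - eta * dE_dw1
print(E_total, dE_dw5, dE_dw1)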

18 Gradient Descent
A network can have millions of parameters $\theta$ (weights and biases). Gradient descent starts from initial parameters $\theta^0$ and repeatedly updates them using the gradient of the loss. To compute the gradients of millions of parameters efficiently, we use backpropagation.
Source of the slide:

19 Chain Rule
Case 1: $y = g(x)$, $z = h(y)$, so $\frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx}$.
Case 2: $x = g(s)$, $y = h(s)$, $z = k(x, y)$, so $\frac{dz}{ds} = \frac{\partial z}{\partial x} \frac{dx}{ds} + \frac{\partial z}{\partial y} \frac{dy}{ds}$.
Source of the slide:

20 Backpropagation
Total loss over the N training examples: $L(\theta) = \sum_{n=1}^{N} l^n(\theta)$, so $\frac{\partial L(\theta)}{\partial w} = \sum_{n=1}^{N} \frac{\partial l^n(\theta)}{\partial w}$.
The network with parameters $\theta$ maps each input $x^n$ to a prediction $y^n$, and $l^n$ measures the distance between $y^n$ and the target $\hat{y}^n$.
Source of the slide:

21 Backpropagation
For a neuron with pre-activation $z = x_1 w_1 + x_2 w_2 + b$, the chain rule gives $\frac{\partial l}{\partial w} = \frac{\partial z}{\partial w} \cdot \frac{\partial l}{\partial z}$.
Forward pass: compute $\frac{\partial z}{\partial w}$ for all parameters.
Backward pass: compute $\frac{\partial l}{\partial z}$ for all activation function inputs $z$.
Source of the slide:

22 Backpropagation - Forward pass
Compute $\frac{\partial z}{\partial w}$ for all parameters. With $z = x_1 w_1 + x_2 w_2 + b$:
$\frac{\partial z}{\partial w_1} = x_1$ and $\frac{\partial z}{\partial w_2} = x_2$, that is, the value of the input connected by the weight.
Source of the slide:

23 Backpropagation - Forward pass
Compute $\frac{\partial z}{\partial w}$ for all parameters.
[Figure: a numeric example network; each $\frac{\partial z}{\partial w}$ is simply the activation value feeding into that weight, e.g. $-1$, $0.12$, and $0.11$ at successive layers.]
That's it. We have done the forward pass.
Source of the slide:

24 Backpropagation - Backward pass
Compute $\frac{\partial l}{\partial z}$ for all activation function inputs $z$. With $a = \sigma(z)$:
$\frac{\partial l}{\partial z} = \frac{\partial a}{\partial z} \cdot \frac{\partial l}{\partial a} = \sigma'(z) \cdot \frac{\partial l}{\partial a}$
Source of the slide:

25 Backpropagation - Backward pass
Compute $\frac{\partial l}{\partial z}$ for all activation function inputs $z$. The activation $a = \sigma(z)$ feeds two downstream neurons, $z' = a w_3 + \cdots$ and $z'' = a w_4 + \cdots$.
By the chain rule: $\frac{\partial l}{\partial z} = \frac{\partial a}{\partial z} \cdot \frac{\partial l}{\partial a}$, where $\frac{\partial l}{\partial a} = \frac{\partial z'}{\partial a} \frac{\partial l}{\partial z'} + \frac{\partial z''}{\partial a} \frac{\partial l}{\partial z''} = w_3 \frac{\partial l}{\partial z'} + w_4 \frac{\partial l}{\partial z''}$ (assume $\frac{\partial l}{\partial z'}$ and $\frac{\partial l}{\partial z''}$ are already known).
Source of the slide:

26 Backpropagation - Backward pass
Compute $\frac{\partial l}{\partial z}$ for all activation function inputs $z$. Combining the two factors:
$\frac{\partial l}{\partial z} = \sigma'(z) \left[ w_3 \frac{\partial l}{\partial z'} + w_4 \frac{\partial l}{\partial z''} \right]$
Source of the slide:

27 Backpropagation - Backward pass
$\frac{\partial l}{\partial z} = \sigma'(z) \left[ w_3 \frac{\partial l}{\partial z'} + w_4 \frac{\partial l}{\partial z''} \right]$
Here $\sigma'(z)$ is a constant, because $z$ was already determined in the forward pass.
Source of the slide:

28 Backpropagation - Backward pass
Compute $\frac{\partial l}{\partial z}$ for all activation function inputs $z$.
Case 1. Output layer: $z'$ and $z''$ feed the outputs $y_1$ and $y_2$ directly, so
$\frac{\partial l}{\partial z'} = \frac{\partial y_1}{\partial z'} \frac{\partial l}{\partial y_1}$ and $\frac{\partial l}{\partial z''} = \frac{\partial y_2}{\partial z''} \frac{\partial l}{\partial y_2}$. Done!
Source of the slide:

29 Backpropagation - Backward pass
Compute $\frac{\partial l}{\partial z}$ for all activation function inputs $z$.
Case 2. Not the output layer: $z'$ and $z''$ feed further layers, so $\frac{\partial l}{\partial z'}$ and $\frac{\partial l}{\partial z''}$ are not yet known.
Source of the slide:

30 Backpropagation - Backward pass
Compute $\frac{\partial l}{\partial z}$ for all activation function inputs $z$.
Case 2. Not the output layer: $a' = \sigma(z')$ is connected by $w_5$ and $w_6$ to the next-layer inputs $z_a$ and $z_b$, so $\frac{\partial l}{\partial z'}$ can be computed from $\frac{\partial l}{\partial z_a}$ and $\frac{\partial l}{\partial z_b}$ in exactly the same way as before.
Source of the slide:

31 Backpropagation - Backward pass
Compute $\frac{\partial l}{\partial z}$ for all activation function inputs $z$.
Case 2. Not the output layer: compute $\frac{\partial l}{\partial z}$ recursively, e.g. $\frac{\partial l}{\partial z'} = \sigma'(z') \left[ w_5 \frac{\partial l}{\partial z_a} + w_6 \frac{\partial l}{\partial z_b} \right]$, until we reach the output layer.
Source of the slide:

32 Backpropagation - Backward pass
Compute $\frac{\partial l}{\partial z}$ for all activation function inputs $z$, starting from the output layer.
[Figure: a network with inputs $x_1, x_2$, activation inputs $z_1, \dots, z_6$, and outputs $y_1, y_2$; $\frac{\partial l}{\partial z_5}$ and $\frac{\partial l}{\partial z_6}$ are computed first and then propagated back to $\frac{\partial l}{\partial z_3}, \frac{\partial l}{\partial z_4}$ and $\frac{\partial l}{\partial z_1}, \frac{\partial l}{\partial z_2}$.]
Source of the slide:

33 Backpropagation - Backward pass
Compute $\frac{\partial l}{\partial z}$ from the output layer backward.
[Figure: the same network traversed in reverse; each backward step multiplies the incoming derivatives by the $\sigma'(z_i)$ values saved from the forward pass.]
Source of the slide:

34 Backpropagation - Summary
Forward pass: compute $\frac{\partial z}{\partial w} = a$ (the activation value attached to the weight) for all $w$.
Backward pass: compute $\frac{\partial l}{\partial z}$ for all activation function inputs $z$.
Combine: $\frac{\partial l}{\partial w} = \frac{\partial z}{\partial w} \times \frac{\partial l}{\partial z} = a \cdot \frac{\partial l}{\partial z}$ for all $w$.
Source of the slide:

35 Backpropagation Implementation
Consider the bitmap character images in the figure. We'll train a neural network that takes the cells of this image as input (35 independent cells) and activates one of ten output cells representing the recognized pattern. While any of the output cells could be activated, we take the one with the largest activation, in a style called winner-takes-all. Source of the picture: Artificial Intelligence: A Systems Approach, by M. Tim Jones

36 Backpropagation Implementation
The neural network that we'll use is called a winner-takes-all network: it has a number of output nodes, and we select the one with the largest activation. The largest activation indicates the digit that was recognized. Source of the picture: Artificial Intelligence: A Systems Approach, by M. Tim Jones

37 Backpropagation Implementation
The input layer consists of 35 input cells (one for each pixel in the image), with 10 cells in the hidden layer. The output layer consists of 10 cells, one for each potential classification. The network is fully interconnected: 350 connections between the input and hidden layer and another 100 connections between the hidden and output layer (450 weights in total, plus the bias weights). Source of the picture: Artificial Intelligence: A Systems Approach, by M. Tim Jones

38 Backpropagation Implementation
Neural network representation (inputs, activations, and weights):

#define INPUT_NEURONS 35
#define HIDDEN_NEURONS 10
#define OUTPUT_NEURONS 10

double inputs[INPUT_NEURONS+1];
double hidden[HIDDEN_NEURONS+1];
double outputs[OUTPUT_NEURONS];

double w_h_i[HIDDEN_NEURONS][INPUT_NEURONS+1];
double w_o_h[OUTPUT_NEURONS][HIDDEN_NEURONS+1];

Source of the code: Artificial Intelligence: A Systems Approach, by M. Tim Jones

39 Backpropagation Implementation
Calculating the output activations with the feed_forward function:

void feed_forward( void )
{
  int i, j;

  /* Calculate outputs of the hidden layer */
  for (i = 0 ; i < HIDDEN_NEURONS ; i++) {
    hidden[i] = 0.0;
    for (j = 0 ; j < INPUT_NEURONS+1 ; j++) {
      hidden[i] += (w_h_i[i][j] * inputs[j]);
    }
    hidden[i] = sigmoid( hidden[i] );
  }

  /* Calculate outputs for the output layer */
  for (i = 0 ; i < OUTPUT_NEURONS ; i++) {
    outputs[i] = 0.0;
    for (j = 0 ; j < HIDDEN_NEURONS+1 ; j++) {
      outputs[i] += (w_o_h[i][j] * hidden[j]);
    }
    outputs[i] = sigmoid( outputs[i] );
  }
}

Source of the code: Artificial Intelligence: A Systems Approach, by M. Tim Jones

40 Backpropagation Implementation
Updating the weights with the backpropagation algorithm:

void backpropagate_error( int test )
{
  int out, hid, inp;
  double err_out[OUTPUT_NEURONS];
  double err_hid[HIDDEN_NEURONS];

  /* Compute the error for the output nodes (Equation 8.6) */
  for (out = 0 ; out < OUTPUT_NEURONS ; out++) {
    err_out[out] =
      ((double)tests[test].output[out] - outputs[out]) * sigmoid_d(outputs[out]);
  }

Source of the code: Artificial Intelligence: A Systems Approach, by M. Tim Jones

41 Backpropagation Implementation
Updating the weights with the backpropagation algorithm (continued):

  /* Compute the error for the hidden nodes (Equation 8.7) */
  for (hid = 0 ; hid < HIDDEN_NEURONS ; hid++) {
    err_hid[hid] = 0.0;
    /* Include error contribution for all output nodes */
    for (out = 0 ; out < OUTPUT_NEURONS ; out++) {
      err_hid[hid] += err_out[out] * w_o_h[out][hid];
    }
    err_hid[hid] *= sigmoid_d( hidden[hid] );
  }

Source of the code: Artificial Intelligence: A Systems Approach, by M. Tim Jones

42 Backpropagation Implementation
Updating the weights with the backpropagation algorithm (continued):

  /* Adjust the weights from the hidden to output layer (Equation 8.9) */
  for (out = 0 ; out < OUTPUT_NEURONS ; out++) {
    for (hid = 0 ; hid < HIDDEN_NEURONS ; hid++) {
      w_o_h[out][hid] += RHO * err_out[out] * hidden[hid];
    }
  }

Source of the code: Artificial Intelligence: A Systems Approach, by M. Tim Jones

43 Backpropagation Implementation
Updating the weights with the backpropagation algorithm (continued):

  /* Adjust the weights from the input to hidden layer (Equation 8.9) */
  for (hid = 0 ; hid < HIDDEN_NEURONS ; hid++) {
    for (inp = 0 ; inp < INPUT_NEURONS+1 ; inp++) {
      w_h_i[hid][inp] += RHO * err_hid[hid] * inputs[inp];
    }
  }

  return;
}

Source of the code: Artificial Intelligence: A Systems Approach, by M. Tim Jones

44 Backpropagation Implementation
The training and test loop (main function):

int main( void )
{
  double mse, noise_prob;
  int test, i, j;

  RANDINIT();
  init_network();

  /* Training Loop */
  do {

    /* Pick a test at random */
    test = RANDMAX(MAX_TESTS);

Source of the code: Artificial Intelligence: A Systems Approach, by M. Tim Jones

45 Backpropagation Implementation
The training and test loop (main function, continued):

    /* Grab input image (with no noise) */
    set_network_inputs( test, 0.0 );

    /* Feed this data set forward */
    feed_forward();

    /* Backpropagate the error */
    backpropagate_error( test );

    /* Calculate the current MSE */
    mse = calculate_mse( test );

  } while (mse > 0.001);

Source of the code: Artificial Intelligence: A Systems Approach, by M. Tim Jones

46 Backpropagation Implementation
The training and test loop (main function, continued):

  /* Now, let's test the network with increasing amounts of noise */
  test = RANDMAX(MAX_TESTS);

  /* Start with 5% noise probability, end with 25% (per pixel) */
  noise_prob = 0.05;
  for (i = 0 ; i < 5 ; i++) {
    set_network_inputs( test, noise_prob );
    feed_forward();
    for (j = 0 ; j < INPUT_NEURONS ; j++) {
      if ((j % 5) == 0) printf("\n");
      printf("%d ", (int)inputs[j]);
    }
    printf( "\nclassified as %d\n\n", classifier() );
    noise_prob += 0.05;
  }

  return 0;
}

Source of the code: Artificial Intelligence: A Systems Approach, by M. Tim Jones

47 Backpropagation Implementation
The figure graphically illustrates the generalization capabilities of the network trained with error backpropagation. In both cases, once the noise probability reaches 20%, the image is no longer recognized correctly. What's shown in main is a common pattern for neural network training and use: once a neural network has been trained, the weights can be saved and reused in the given application. Source of the text: Artificial Intelligence: A Systems Approach, by M. Tim Jones

48 Backpropagation Implementation
[Figure: test images under increasing noise. The baseline image for "six" is recognized as six, six, six, then four, then eight; the baseline image for "one" is recognized as one, one, one, then seven, then five.] Source of the picture: Artificial Intelligence: A Systems Approach, by M. Tim Jones

49 MNIST Based on Keras Import library files
Source of the code:
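The import code shown on the original slide is not preserved in this transcript. A minimal sketch of the imports a Keras MNIST example of this kind typically needs (exact module paths can differ between Keras versions) is:

import numpy as np
import matplotlib.pyplot as plt

from keras.datasets import mnist          # MNIST loader
from keras.models import Sequential       # linear stack of layers
from keras.layers import Dense            # fully connected layer
from keras.utils import to_categorical    # one-hot encoding of labels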

50 MNIST Based on Keras Description of the MNIST handwritten digit recognition problem: this is a digit recognition task with 10 classes, the digits 0 to 9. Each image is a 28x28 pixel square (784 pixels in total). 60,000 images are used to train the model, and a separate set of 10,000 images is used to test it. Source of the code:

51 MNIST Based on Keras Load image data
The Keras deep learning library provides a convenient method for loading the MNIST dataset. The dataset is downloaded automatically the first time this function is called and is stored as a 15 MB file at ~/.keras/datasets/mnist.npz in your home directory. This is very convenient for developing and testing deep learning models. To show how easy it is to load MNIST, we first write a small script that downloads the data and visualizes the first image in the training set. Source of the code:
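The loading code itself is not in the transcript; a sketch that matches the description above (using the imports from slide 49) could be:

# Downloads ~/.keras/datasets/mnist.npz on the first call, then loads it from disk
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# Visualize the first image in the training set
plt.imshow(X_train[0], cmap='gray')
plt.title('label: %d' % y_train[0])
plt.show()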

52 MNIST Based on Keras Load image data
You should be able to see that the digit in this image is a 0. This shows what an image really is: essentially a two-dimensional matrix (or a three-dimensional tensor for color images). The images used here are grayscale, with no color, so a single two-dimensional matrix is enough. Besides viewing the picture as above, we more often work with the data as shown below. Source of the code:
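The transcript does not preserve what "the data as shown below" looked like; a plausible sketch is simply inspecting the raw arrays and their shapes:

print(X_train.shape, y_train.shape)   # (60000, 28, 28) (60000,)
print(X_test.shape, y_test.shape)     # (10000, 28, 28) (10000,)
print(X_train[0][10])                 # one row of raw pixel values in the range 0..255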

53 MNIST Based on Keras Load image data
Source of the code:

54 MNIST Based on Keras Adjust the data format for easy calculation
Our neural network takes a vector as input, so we need to reshape the images so that each 28x28 image becomes a single 784-dimensional vector. We also scale the pixel values from the range [0, 255] down to [0, 1]. Source of the code:
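A sketch of the reshaping and scaling step described above, continuing the variables from the previous slides; the one-hot encoding of the labels is also done here because the categorical cross-entropy loss used later expects it:

# Flatten each 28x28 image into a 784-dimensional vector and scale to [0, 1]
X_train = X_train.reshape(60000, 784).astype('float32') / 255
X_test = X_test.reshape(10000, 784).astype('float32') / 255

# One-hot encode the labels: 10 classes, one per digit
Y_train = to_categorical(y_train, 10)
Y_test = to_categorical(y_test, 10)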

55 MNIST Based on Keras Create a network Here we will build a simple 3-layer fully connected network.
Source of the code:
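The slide's exact layer sizes are not preserved in the transcript; a sketch of a simple 3-layer fully connected network of this shape, assuming 512-unit hidden layers, is:

model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(784,)))   # hidden layer 1
model.add(Dense(512, activation='relu'))                        # hidden layer 2
model.add(Dense(10, activation='softmax'))                      # 10 output classes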

56 MNIST Based on Keras Compile model
Keras is built on top of TensorFlow. These two packages let you define a computation graph in Python, which can then be compiled and run efficiently on a CPU or GPU, without the overhead of the Python interpreter. When building a model, Keras asks you to specify a loss function and an optimizer. The loss function we use here is called categorical cross-entropy (categorical_crossentropy); it is well suited to comparing two probability distributions. Here, our predictions are probability distributions over the ten digits (for example, "we are 80% sure this image is a 3, 10% sure it is an 8, 5% that it is a 2, and so on"), while the targets Y_train and Y_test assign 100% probability to the correct class and 0 to every other class. Cross-entropy measures how different the predicted distribution is from the target distribution. The optimizer helps determine how quickly the model learns. We will not go into the details here, but "adam" is usually a good choice. Source of the code:
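The corresponding compile call, a sketch using the loss function and optimizer named above:

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])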

57 MNIST Based on Keras Training model
This is the fun part: you can feed the training data loaded earlier into this model, and it will learn to classify digits. Source of the code:
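A sketch of the training call; the batch size and number of epochs are illustrative choices, not taken from the slide:

model.fit(X_train, Y_train,
          batch_size=128, epochs=5,
          verbose=1,
          validation_data=(X_test, Y_test))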

58 MNIST Based on Keras Evaluate performance
score[0] : loss
score[1] : accuracy
Source of the code:
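A sketch of the evaluation step that produces the score values described above:

score = model.evaluate(X_test, Y_test, verbose=0)
print('Test loss:', score[0])        # score[0] : loss
print('Test accuracy:', score[1])    # score[1] : accuracy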

59 MNIST Based on Keras Check output
It is always a good idea to check the output and make sure everything looks right. Here we will look at some examples that are classified correctly, as well as some that are classified incorrectly. Source of the code:
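The inspection code is not preserved in the transcript; a sketch that finds correctly and incorrectly classified test digits and displays one of each (reusing the reshaped X_test and the integer labels y_test from earlier) could be:

# Predicted class = index of the largest output probability
predicted_classes = np.argmax(model.predict(X_test), axis=1)

correct = np.nonzero(predicted_classes == y_test)[0]
incorrect = np.nonzero(predicted_classes != y_test)[0]

# Show one correctly and one incorrectly classified digit
for idx, kind in [(correct[0], 'correct'), (incorrect[0], 'incorrect')]:
    plt.imshow(X_test[idx].reshape(28, 28), cmap='gray')
    plt.title('%s: predicted %d, true %d' % (kind, predicted_classes[idx], y_test[idx]))
    plt.show()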

60 MNIST Based on Keras Check output
It is always a good idea to check the output and make sure everything looks right. Here we will look at some examples that are classified correctly, as well as some that are classified incorrectly. Source of the code:

