CS-E4050 - Deep Learning, Session 4: Convolutional Networks. Jyri Kivinen, Aalto University, 23 September 2015. Credits: Thanks to Tapani Raiko for slide material.
Some desiderata for pattern recognition systems (?)
- Stable and predictable response change under input transformations (say, shifted input)
- The ability to deal with varying-sized inputs (in terms of unit cardinality) and input patterns (in terms of their scale)
- Avoiding the need to learn properties that can easily be encoded into the model
- Parsimonious representations
Feed-forward networks for high-dimensional and highly variable content
Imagine a set of 1000x1000 images (rather low-resolution compared to what even most mobile-phone cameras can capture these days) for training and to be analyzed by a regular feed-forward network, say an MLP. Suppose the model is to define a classifier, say classifying images into some predefined categories, such as interesting and not interesting.
- One million (weight) connections per hidden unit in the first layer, even just for gray-scale images.
- Without e.g. parameter sharing, this could well be a really hard problem to scale to; e.g. far too many parameters to fit well with the data one has (curse of dimensionality).
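The parameter blow-up is plain arithmetic; a quick sketch (the hidden-layer width of 1000 is a hypothetical choice, not from the slides):

```python
# Back-of-the-envelope weight count for a fully connected first layer
# on 1000x1000 gray-scale images (hidden-layer width is illustrative).

def fc_weights(n_in, n_out):
    """Number of weight connections (biases ignored) in a dense layer."""
    return n_in * n_out

n_pixels = 1000 * 1000   # one input unit per gray-scale pixel
n_hidden = 1000          # hypothetical hidden-layer width

print(fc_weights(n_pixels, 1))         # 1000000: one million weights per hidden unit
print(fc_weights(n_pixels, n_hidden))  # 1000000000: a billion weights in one layer
```

Even a modest hidden layer already yields on the order of a billion first-layer weights.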
Feed-forward networks for high-dimensional and highly variable content (cont.)
Same setting as on the previous slide: 1000x1000 images analyzed by a regular feed-forward network (say an MLP) defining a classifier over predefined categories.
- How to deal with test images of (even slightly) different size?
- How would the unit activations differ between an image and a shifted copy of it?
- What would be good approaches for handling these potential issues?
Convolutional (deterministic) feed-forward networks: the main ingredients
- Grid arrangement and position encoding: units in the input layer retain, and hidden-unit layer(s) are arranged to have, a stacked-grid topology with coordinated position indexing.
- Local connectivity: each hidden unit receives input only from those input-layer units whose positions lie within its so-called receptive field.
- Parameter sharing: hidden-unit layers consist of multiple grids of hidden units called feature planes, and parameters are shared within a feature plane; this can be used to obtain translation equivariance (and thereby e.g. stable and predictable response change to shifted input).
- Pooling: used e.g. to obtain local translation invariance (e.g. to allow local spatial input deformations) and to reduce dimensions.
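The first three ingredients can be sketched in a few lines: one shared 3x3 filter slid over an image gives one feature plane, and shifting the input shifts the response (a minimal sketch with an identity activation and "valid" boundaries; the sizes are illustrative, not from the slides):

```python
import numpy as np

def conv_plane(image, kernel):
    """One feature plane: one shared kernel, local 3x3 connectivity."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # local connectivity: each output unit sees only a kh x kw patch;
            # parameter sharing: the same kernel is used at every position
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 8))
k = rng.normal(size=(3, 3))

# Translation equivariance: shifting the input shifts the feature plane.
shifted = np.roll(img, 1, axis=1)
a, b = conv_plane(img, k), conv_plane(shifted, k)
print(np.allclose(a[:, :-1], b[:, 1:]))  # True (away from the wrap-around column)
```

The shared kernel is what makes the output shift along with the input; an unshared (fully connected) layer has no such guarantee.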
Convolutional networks: from full to local connections (example, credit: Tapani Raiko)
Convolutional networks: parameter sharing (example, credit: Tapani Raiko)
Convolutional networks: first layer without pooling (example, credit: Tapani Raiko)
Convolutional networks: first layer with pooling (example, credit: Tapani Raiko)
Convolutional networks: full network (example, credit: Tapani Raiko)
Convolutional vs. non-convolutional (slide credit: Tapani Raiko)
Number of weights (ignoring biases): 5×5×9 + 5×5×9×16 + 7×7×16×10 = 225 + 3600 + 7840 = 11665
Sizes of signals h: 28×28, 28×28×9, 14×14×16 = 784, 7056, 3136
Compare to an example non-convolutional network:
Weights: 28×28×225 + 225×144 + 144×10 = 176400 + 32400 + 1440 = 210240
Signals: 784, 225, 144
The convolutional network has more signals but fewer parameters.
Could we scale the networks to the 1000x1000 images? Would/should some of their properties change?
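The counts above are easy to verify directly:

```python
# Sanity-checking the slide's weight and signal counts (biases ignored).

# Convolutional network: two 5x5 conv layers (9 and 16 feature planes),
# then a fully connected 7x7x16 -> 10 output layer.
conv_weights = 5*5*9 + 5*5*9*16 + 7*7*16*10
conv_signals = (28*28, 28*28*9, 14*14*16)

# Non-convolutional comparison: 784 -> 225 -> 144 -> 10 MLP.
mlp_weights = 28*28*225 + 225*144 + 144*10
mlp_signals = (784, 225, 144)

print(conv_weights, mlp_weights)  # 11665 210240
print(conv_signals)               # (784, 7056, 3136)
```

Roughly 18x fewer parameters for the convolutional network, despite producing many more intermediate activations.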
Some example application input types (credit: Tapani Raiko)

Tensor | Single channel                          | Multi-channel
1-D    | Raw audio (mono)                        | Motion capture
2-D    | Audio + (short-time) Fourier transform  | Game of Go
3-D    | Brain imaging                           | Colour video
Some possible connectivity, parameter sharing, and pooling function variants
Parameter-sharing patterns per convolutional layer:
- No sharing of biases
- Spatial parameter-sharing patterns (e.g. tiled-convolutional)
Input connectivity patterns of units within a feature plane:
- Overlapping (regular-convolutional)
- Non-overlapping (tiled-convolutional)
Fixed pooling function variants:
- Linear functions: average-pooling
- Non-linear functions: max-pooling
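The two fixed pooling variants can be compared side by side (a sketch; the non-overlapping 2x2 window is an illustrative choice):

```python
import numpy as np

def pool2x2(plane, reduce_fn):
    """Pool a feature plane over non-overlapping 2x2 regions."""
    H, W = plane.shape
    # view the plane as a grid of 2x2 blocks, then reduce each block
    blocks = plane[:H//2*2, :W//2*2].reshape(H//2, 2, W//2, 2)
    return reduce_fn(blocks, axis=(1, 3))

x = np.array([[1., 2., 3., 4.],
              [5., 6., 7., 8.],
              [9., 8., 7., 6.],
              [5., 4., 3., 2.]])

print(pool2x2(x, np.max))   # [[6. 8.] [9. 7.]]
print(pool2x2(x, np.mean))  # [[3.5 5.5] [6.5 4.5]]
```

Max-pooling keeps the strongest local response (non-linear); average-pooling is a linear down-sampling. Both halve each spatial dimension here.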
But why called convolutional?
As in feed-forward neural networks generally, each hidden unit can be seen as taking as input a linear combination of the activations of the units connecting to it (a bias unit fixed to one, plus the other units); it first linearly transforms its input and then applies the activation function to obtain its activation/state.
The per-layer linear-transform computations can be implemented in multiple ways:
- Convolutional-layer unit inputs can be computed via convolution/cross-correlation of the connecting-unit activations with filter kernels defined by the feature-plane-specific weights (filters). Highly parallelizable; don't even think about sliding windows!
- (Fast) Fourier-transform (FFT) based approaches may be inconvenient/problematic (e.g. boundary-handling issues).
Convolution operator (slide from Tapani Raiko)
(w ∗ x)[t] = Σ_a w[a] x[t − a]
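The definition written out directly, checked against NumPy's built-in convolution (a sketch; the filter and signal values are arbitrary):

```python
import numpy as np

def conv1d(w, x):
    """(w * x)[t] = sum_a w[a] x[t - a], 'full' output of length |w|+|x|-1."""
    T = len(w) + len(x) - 1
    out = np.zeros(T)
    for t in range(T):
        for a, wa in enumerate(w):
            if 0 <= t - a < len(x):
                out[t] += wa * x[t - a]
    return out

w = np.array([1., 0., -1.])
x = np.array([2., 3., 5., 7.])
print(conv1d(w, x))                                # [ 2.  3.  3.  4. -5. -7.]
print(np.allclose(conv1d(w, x), np.convolve(w, x)))  # True
```

Note the `t − a` index: convolution flips the kernel relative to cross-correlation; in a network with learned filters the distinction is immaterial, since the weights can absorb the flip.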
Handling boundaries in convolutional layers
Border units may need special treatment. Some boundary-handling implementation variants, when doing the convolutions:
- Pad the inputs when computing the outputs: with zeroes, or e.g. from the input data (e.g. wrap-around borders). Can be used to retain a fixed grid size from input to output.
- Don't pad the inputs when computing the outputs: each output unit receives the same number of connections from the input data; the output grid dimensions are smaller.
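The two variants correspond to NumPy's 'same' (zero-padded) and 'valid' (no padding) modes; a 1-D sketch of the resulting output sizes (the signal and filter here are illustrative):

```python
import numpy as np

x = np.arange(10.0)    # input "grid" of 10 units
w = np.ones(3) / 3.0   # 3-tap averaging filter

same  = np.convolve(x, w, mode='same')   # zero-padded: grid size retained
valid = np.convolve(x, w, mode='valid')  # no padding: output shrinks by |w|-1

print(len(same), len(valid))  # 10 8
```

With zero padding the border outputs mix in artificial zeros; without padding every output unit sees a full receptive field, at the cost of a smaller grid.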
Home exercises
Familiarize yourself with the Convolutional Neural Networks (LeNet) tutorial: http://deeplearning.net/tutorial/lenet.html.
Reported task: implement and experiment with gradient-descent-based training of a convolutional (feed-forward) network; details on the next slide.
Home exercises (cont.)
Reported task: implement and experiment with gradient-descent-based training of a convolutional (feed-forward) network:
- Choose the details (data, network structure, etc.) yourself, except: have at least one convolutional layer, and have pooling.
- Experiment with the approach, providing the following visualization: the objective function evaluated on the training data as a function of the number of training epochs. You could also plot further things such as learned network parameters, unit activations (as a response to some data), or (other) important training diagnostics.
- Provide a description of your approach and the results in the report (including the visualization(s) and the most important lines of code).
Materials, references
Gradients to inputs:
- Zeiler, M. D., and Fergus, R.: Visualizing and understanding convolutional networks. In Proc. European Conference on Computer Vision (ECCV), 2014.
- LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86 (1998) 2278-2324.
- LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D.: Backpropagation applied to handwritten zip code recognition. Neural Computation 1(4) (1989) 541-551.
- Waibel, A., Hanazawa, T., Hinton, G. E., Shikano, K., Lang, K.: Phoneme recognition using time-delay neural networks. IEEE Trans. ASSP 37 (1989) 328-339.