{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "

My goal in this series is to deploy a neural network capable of identifying and localizing pedestrians in an image (the combination of image classification and localization is called object detection). In the first part of the series, I will do this by downloading a pretrained model and using transfer learning to fine-tune it for my problem. In the second part, I will try to create and deploy a model from scratch.

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "toc": true }, "source": [ "

Table of Contents

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Overview" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Motivation and Applications" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

My main motivation for this project was simply to gain an understanding of deep learning, a field I knew nothing about beforehand. Furthermore, object detection problems tend to involve much deeper networks than plain image classification, and the concepts involved are highly transferable to other deep learning domains.

\n", "

Of course, object detection has some pretty cool applications in its own right. For example, security companies build on top of object detection models to do things like gait analysis and tracking people across multiple cameras. Self-driving cars need to be able to perform object detection to avoid hitting pedestrians and other cars. And one could imagine lots of fun home applications for object detection.

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Why Transfer Learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Practically speaking, you will almost always use transfer learning when dealing with neural network. The reason for this is twofold:\n", "1. Using a pretrained model drastically cuts down the amount of resources needed to fine tune a network. By using a pretrained model you will require fewer training samples and less computing time for a comparable level of accuracy.\n", "2. It is unlikely that you will be creating a deep learning model that is completely new. Most new models are really variations of existing models, so it makes sense to take advantage of existing work\n", "Consequently, transfer learning is one of the most important skills you can have with respect to deep learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Background" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What is a Neural Network" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "To understand what a neural network is, we can break the concept down into two components, structure and learning. \n", "\n", "**Structure**: Although different model types can have additional components, all neural networks have an input layer, some number of hidden layers, and an output layer, as seen in the picture bellow.\n", "\n", "\n", "\n", "The layers in turn are made up of nodes called neurons. All a neuron is is a container that holds a single number. \n", "\n", "Breaking things down, the neurons in the input layer correspond to the data points we are analyzing. For example, if we are analyzing a picture, the input layer neurons correspond to the pixel values of that picture. Each neuron is then multiplied by some value, called a weight, and added together to produce a new value, which will become a neuron for the first hidden layer. So each neuron is the result of a different linear transformation on the previous layer. The neurons then undergo a non-linear transformation, called an activation function. For example, a common activation function is to map positive values to themselves and negative values to 0. The whole point of the activation function is to make our system capable of solving non-linear systems. This process is then repeated for each hidden layer, until we get to the output layer. The output layer represents the data we want. For example, if we have a dog classifier, the output layer would be 0 for \"no dog\" and 1 for \"dog\".\n", "\n", "To help visualize the process outlined, you can look at this simplified neural network:\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The initial neurons are (1,1) and the first transform is (0.712, 0.0112), so the initial value of the first neuron of the hidden layer is \n", "$$(1,1)*(0.712, 0.0112)= 0.712+0.0112 = 0.824 $$\n", "That value is then placed through the activation function. In this case, the activation function is called the Sigmoid function, $S(x)= \\frac{1}{e^{-x}}$ so as a final value we get:\n", "$$S(0.824) = 0.69$$\n", "That process is then repeated for every neuron in the first layer of the model. Then to get our output layer of one neuron, we put the hidden layer (0.69, 0.77, 0.68) and apply the transform (0.116, 0.329, 0.708) to get the output value of 0.69." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that you know how neural networks are structured, you're probably wondering why they are structured this way, since it just looks like a lot of wheel spinning on the surface. Again, we will break this down into parts, but the main benefit of this structure is that it allows us to model complex interactions within our data. Starting from the input layer, the weights allow all of our data to interact with each other. Some weights will be higher, which means some of the input data will dominate the neuron it's mapped to, and some of the weights may even be 0, meaning that some data won't have any effect on that neuron. Therefore each neuron actually represents some interaction in the data, i.e. a neuron is a feature that we would have to specifically program in a supervised learning model. The cool thing is that each additional layer allows the features from the previous layer to interact with one another to create another feature in the next hidden layer. As a concrete example, we may input an image, and the first layer may extract distinct edges from that image, and the next layer may extract distinct shapes. Finally, the activation function serves two purposes, it allows the model to capture non-linearity, and it bounds the possible value of the neurons, which is useful for training." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Training**: While there is a lot of mysticism surrounding deep learning models in pop-sci, you can really just think of neural networks as non-linear, differential optimization problems. Like any other model, you have inputs, outputs, and targets and you want the model to produce outputs as close to the targets as possible (for whatever metric of closeness you have defined). In the case of a neural network, once the structure is locked in, the only way to change the outputs is to change the value of the weights in each layer. This is done through gradient descent." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What is Transfer Learning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Transfer learning is essentially repurposing an an existing deep learning model for a new but similar task. \n", "\n", "To understand when and why transfer learning is effective, remember that the layers in a trained model represent features that the model has learned to extract. Furthermore, the features extracted tend to become more particular to the problem as we go into deeper layers. For example, an animal classifier may have a layer that extracts basic shapes from the image, another layer that recognizes textures, and so on. As you can see from the example, the features a model learns to extract are often useful for similar tasks. So if we wanted to create a face detector, we could start from randomized layers and see which features emerged from our model. Or we could take advantage of the fact that the features extracted by our first model are useful for other image classification tasks, and use that as a starting point rather than reinvent the wheel.\n", "\n", "A slightly more technical way of looking at transfer learning is from an optimization perspective. Remember, a neural network is composed of layers of weights that transform the data in each layer. And when we train a model, we are moving those weights to a more optimal value through gradient descent. When we use a new model, those weights' values are completely random to start. 
However, if we assume similar inputs and outputs, then the weights of our pretrained model are likely much closer to their optimal values than completely random weights, so the pretrained model requires less adjustment (fewer training steps) to be optimized.\n", "\n", "We can break transfer learning down into the following steps:\n", "1. Download a model that has already been trained on a similar dataset\n", "2. Make whatever adjustments you want to test on the model (e.g. adjust the learning rate)\n", "3. Strip the last layers of the existing model and replace them with randomized layers\n", "4. Fine-tune the layers on the new data\n", "5. Verify the results\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Modeling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here I will walk through the steps I took to get my model up and running, from installing TensorFlow, to preparing the data, to training the model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Installing TensorFlow and the Object Detection API" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To start, I highly recommend creating a virtual environment, either through Anaconda or Python's venv, to help manage version requirements. Within your environment, install the necessary packages:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pip install tensorflow\n", "pip install Cython\n", "pip install pillow\n", "pip install lxml\n", "pip install jupyter\n", "pip install matplotlib" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, to install the Object Detection API, download the tensorflow models repo by running the following command in your terminal:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "git clone https://github.com/tensorflow/models.git" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, in the terminal, cd into tensorflow/models/research/ and compile the protobuf files by running:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# From tensorflow/models/research/\n", "protoc object_detection/protos/*.proto --python_out=." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, add the libraries to your PYTHONPATH by running:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# From tensorflow/models/research/\n", "export PYTHONPATH=$PYTHONPATH:`pwd`:`pwd`/slim" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that you will need to re-run this export command every time you open a new terminal window." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preparing the Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

The data I am using comes from https://www.kaggle.com/smeschke/pedestrian-dataset#crosswalk.csv. It consists of 3 videos of pedestrians using crosswalks in different situations, as well as a CSV file for each video that gives the bounding box for the pedestrian in each frame in the format (x, y, w, h), where (x, y) is the top-left corner of the box and w and h are its width and height.
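
\n", "

For reference, here is a quick way to peek at the raw annotations (this assumes the CSV columns are named x, y, w, h, which is what the reformatting code later in this post relies on):

```python
import pandas as pd

# inspect the raw bounding box annotations for one of the videos
df = pd.read_csv('data/crosswalk.csv')
print(df.head())
```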

\n", "

My goal in this section is to break each video down into its component frames in JPEG format. I then need to create a dataframe containing the following columns: file path, width, height, class, xmin, ymin, xmax, ymax. Converting the box format is straightforward: xmin = x, ymin = y, xmax = x + w, ymax = y + h.

\n", "

It is also worth noting that all of the models in the Object Detection API are size-agnostic; they perform any necessary image padding and scaling for you. However, I will go over how to do data standardization and augmentation in the next post.
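
\n", "

The resizing behavior lives in the image_resizer block of each model's pipeline config. For example, the Faster R-CNN configs that ship with the API contain a block like the following (the exact values vary by model, so treat these as illustrative):

```
image_resizer {
  keep_aspect_ratio_resizer {
    min_dimension: 600
    max_dimension: 1024
  }
}
```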

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extract Images from Video" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because the data comes as video files, I will use cv2 to read it and capture frames. The code bellow carries out the following steps:\n", "\n", "1. Gets a list of videos in the data directories\n", "2. Defines a function which\n", " 1. reads a video file\n", " 2. creates a directory named after the video if none exists\n", " 3. captures a frame and saves it as a jpeg to the directory\n", "3. applies the function to all files in the list" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [], "source": [ "import cv2\n", "import pandas as pd\n", "import numpy as np\n", "import os" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "#get list of video files\n", "dataPath = 'data/'\n", "dataFiles = os.listdir(dataPath)\n", "videoFiles = [dataPath+file for file in dataFiles if file.endswith('.avi')]" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "#define function to turn video into images\n", "def frame_capture(path): \n", " \n", " cap = cv2.VideoCapture(path) \n", " currentFrame = 0\n", " directory = path.strip('.avi')\n", " try:\n", " if not os.path.exists(directory):\n", " os.makedirs(directory)\n", " except OSError:\n", " print ('Error: Creating directory of data')\n", " # checks whether frames were extracted \n", " success = 1\n", "\n", " while success:\n", " # Capture frame-by-frame\n", " success, frame = cap.read()\n", "\n", " # Saves image of the current frame in jpg file\n", " name = directory + '/frame' + str(currentFrame) + '.jpg'\n", " cv2.imwrite(name, frame)\n", "\n", " # To stop duplicate images\n", " currentFrame += 1\n", "\n", " # When everything done, release the capture\n", " cap.release()\n", " cv2.destroyAllWindows()\n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#turn each video file into a directory of image files\n", "for file in videoFiles:\n", " frame_capture(file)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create CSV files for training/testing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that I have my images, I need to combine my image file paths with my bounding box data in the format required by the object detection API. I also need to split my data into a training set and testing set." 
] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [], "source": [ "boundingBoxFiles = ['data/night.csv', 'data/fourway.csv', 'data/crosswalk.csv'] " ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "#Reformat CSV data into required format\n", "pedestrian_labels = pd.DataFrame()\n", "for file in boundingBoxFiles:\n", " name = file.replace('/','.').split('.')[1]\n", " df = pd.read_csv(file)\n", " new_df = pd.DataFrame()\n", " #create columns for new dataframe\n", " new_df['filename'] = df.index.astype(str)\n", " new_df['filename'] = 'data/'+ name+ '/frame'+ new_df['filename']+ '.jpg'\n", " new_df['width'] = df.w\n", " new_df['height'] = df.h\n", " new_df['class'] = 'pedestrian'\n", " new_df['xmin'] = df.x\n", " new_df['ymin'] = df.y\n", " new_df['xmax'] = df.x + df.w\n", " new_df['ymax'] = df.y + df.h\n", " #store to central data frame\n", " pedestrian_labels = pedestrian_labels.append(new_df)" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [], "source": [ "#split data into training and testing set, then save file\n", "train_labels = pedestrian_labels.sample(frac=0.8,random_state=17)\n", "test_labels = pedestrian_labels.drop(train_labels.index)\n", "pedestrian_labels.to_csv('data/pedestrian_labels.csv', index=False)\n", "train_labels.to_csv('data/train_labels.csv', index=False)\n", "test_labels.to_csv('data/test_labels.csv', index=False)" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
filenamewidthheightclassxminyminxmaxymax
118data/night/frame118.jpg156312pedestrian16314731787785
400data/night/frame400.jpg172345pedestrian11994841371829
439data/night/frame439.jpg145291pedestrian10404701185761
452data/night/frame452.jpg143286pedestrian9664691109755
455data/night/frame455.jpg141283pedestrian9464661087749
\n", "
" ], "text/plain": [ " filename width height class xmin ymin xmax \\\n", "118 data/night/frame118.jpg 156 312 pedestrian 1631 473 1787 \n", "400 data/night/frame400.jpg 172 345 pedestrian 1199 484 1371 \n", "439 data/night/frame439.jpg 145 291 pedestrian 1040 470 1185 \n", "452 data/night/frame452.jpg 143 286 pedestrian 966 469 1109 \n", "455 data/night/frame455.jpg 141 283 pedestrian 946 466 1087 \n", "\n", " ymax \n", "118 785 \n", "400 829 \n", "439 761 \n", "452 755 \n", "455 749 " ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#check to make sure data is in correct format\n", "train_labels.head()\n", "test_labels.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Convert CSV to TFR Format" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to use the models, tensorflow requires you to have your data in TFRecord format. Luckily, the object detection API has a script for converting the CSV files created in the previous step to TFR files. The script is called \"generate_tfrecord.py\" and it is located under the file path \"TensorFlow/models/research/object_detection/legacy\". But I'll include the code bellow so you can just copy and paste. I will also give a link to a video which does a good job of explaining the TFRecord format and how to use it.\n", "https://www.youtube.com/watch?v=oxrcZ9uUblI" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "Usage:\n", "\n", "# Create train data:\n", "python generate_tfrecord.py --label=