Build, Train and Deploy A Real-World Flower Classifier of 102 Flower Types (2024)

With TensorFlow 2.3, Amazon SageMaker Python SDK 2.5.x and Custom SageMaker Training & Serving Docker Containers

Juv Chan

Published in

Towards Data Science

juvchan/amazon-sagemaker-tensorflow-custom-containers

This project shows step-by-step guide on how to build a real-world flower classifier of 102 flower types using…

github.com

Below is the list of system, hardware, software and Python packages that are used to develop and test the project.

Ubuntu 18.04.5 LTS
Docker 19.03.12
Python 3.8.5
Conda 4.8.4
NVIDIA GeForce RTX 2070
NVIDIA Container Runtime Library 1.20
NVIDIA CUDA Toolkit 10.1
sagemaker 2.5.3
sagemaker-tensorflow-training 20.1.2
tensorflow-gpu 2.3.0
tensorflow-datasets 3.2.1
tensorflow-hub 0.9.0
tensorflow-model-server 2.3.0
jupyterlab 2.2.6
Pillow 7.2.0
matplotlib 3.3.1

TensorFlow Datasets (TFDS) is a collection of public datasets ready to use with TensorFlow, JAX and other machine learning frameworks. All TFDS datasets are exposed as tf.data.Datasets, which are easy to use for high-performance input pipelines.

At the time of writing, TFDS offers a total of 195 ready-to-use datasets, two of which are flower datasets: oxford_flowers102 and tf_flowers.

The oxford_flowers102 dataset is used because it is the larger of the two and covers more flower categories.

import tensorflow as tf
import tensorflow_datasets as tfds

ds_name = 'oxford_flowers102'
# Note the ordering: the TFDS 'test' split (the largest of the three) is unpacked
# into train_examples, and the TFDS 'train' split into test_examples.
splits = ['test', 'validation', 'train']
ds, info = tfds.load(ds_name, split=splits, with_info=True)
(train_examples, validation_examples, test_examples) = ds

print(f"Number of flower types: {info.features['label'].num_classes}")
print(f"Number of training examples: {tf.data.experimental.cardinality(train_examples)}")
print(f"Number of validation examples: {tf.data.experimental.cardinality(validation_examples)}")
print(f"Number of test examples: {tf.data.experimental.cardinality(test_examples)}\n")
print('Flower types full list:')
print(info.features['label'].names)
# Display a 2 x 8 grid of sample training images
tfds.show_examples(train_examples, info, rows=2, cols=8)

Amazon SageMaker lets you run a training script or inference code in the same way you would run it outside SageMaker. One difference is that a training script running in a SageMaker container can read the SageMaker Containers environment variables, e.g. SM_MODEL_DIR, SM_NUM_GPUS and SM_NUM_CPUS.
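As a rough illustration of how a training script might pick up these environment variables, here is a minimal sketch. It is not the project's actual training script. The hyperparameter names (epochs, batch_size, learning_rate) are the ones mentioned later in this article, but the default values, the --sm-model-dir argument name and the /opt/ml/model fallback are assumptions made for this example.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and SageMaker environment settings.

    Defaults below are illustrative assumptions, not the project's values.
    """
    parser = argparse.ArgumentParser()

    # Hyperparameters, passed by SageMaker as command-line arguments.
    parser.add_argument("--epochs", type=int, default=5)
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--learning_rate", type=float, default=0.001)

    # Values read from the SageMaker Containers environment variables,
    # with fallbacks so the script also runs outside a container.
    parser.add_argument("--sm-model-dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument("--num-gpus", type=int,
                        default=int(os.environ.get("SM_NUM_GPUS", "0")))
    parser.add_argument("--num-cpus", type=int,
                        default=int(os.environ.get("SM_NUM_CPUS", "1")))

    # parse_known_args ignores any extra arguments the platform may add.
    args, _unknown = parser.parse_known_args(argv)
    return args


if __name__ == "__main__":
    args = parse_args()
    print(f"Training for {args.epochs} epochs, saving to {args.sm_model_dir}")
```

Reading the environment variables as argument defaults keeps the script runnable both inside and outside a SageMaker container, since every value can still be overridden on the command line.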

Amazon SageMaker always uses Docker containers when running scripts, training algorithms or deploying models. Amazon SageMaker provides containers for its built-in algorithms and pre-built Docker images for some of the most common machine learning frameworks. You can also create your own container images to manage more advanced use cases not addressed by the containers provided by Amazon SageMaker.

The custom training script is as shown below:

TensorFlow Hub is a library of reusable pre-trained machine learning models for transfer learning in different problem domains. For this flower classification problem, we evaluate the pre-trained image feature vectors based on different image model architectures and datasets from TF-Hub as below for transfer learning on the oxford_flowers102 dataset.

In the final training script, the Inception V3 (iNaturalist) feature vector pre-trained model is used for transfer learning because it performs best among the models evaluated (~95% test accuracy over 5 epochs without fine-tuning). This model uses the Inception V3 architecture and was trained on the iNaturalist (iNat) 2017 dataset of over 5,000 different species of plants and animals from https://www.inaturalist.org/. In contrast, the ImageNet 2012 dataset has only 1,000 classes, very few of which are flower types.
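To make the transfer learning setup more concrete, the sketch below shows one common way to put a new classification head on top of a frozen TF-Hub feature vector with Keras. It is an illustration only, not the project's training script. The TF-Hub handle, the 299 x 299 input size, the optimizer and the learning rate are assumptions for this example, so check tfhub.dev for the exact model and version before relying on them. It targets the TensorFlow 2.3 and tensorflow-hub 0.9.0 versions listed above.

```python
NUM_CLASSES = 102        # flower types in oxford_flowers102
IMAGE_SIZE = (299, 299)  # assumed input size for Inception V3

# Assumed TF-Hub handle for the Inception V3 (iNaturalist) feature vector.
FEATURE_VECTOR_URL = (
    "https://tfhub.dev/google/inaturalist/inception_v3/feature_vector/4"
)


def build_model(handle=FEATURE_VECTOR_URL, num_classes=NUM_CLASSES,
                learning_rate=0.001):
    """Build a classifier on top of a frozen pre-trained feature extractor."""
    # Imported here so this module can be loaded without TensorFlow installed.
    import tensorflow as tf
    import tensorflow_hub as hub

    model = tf.keras.Sequential([
        # trainable=False keeps the pre-trained weights fixed (no fine-tuning).
        hub.KerasLayer(handle, input_shape=IMAGE_SIZE + (3,), trainable=False),
        # New classification head, trained from scratch on the flower labels.
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Only the final Dense layer is trained here, which matches the "without fine-tuning" setting described above. TF-Hub image feature vectors generally expect float inputs scaled to [0, 1], so the input pipeline would need to resize and rescale the images accordingly.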

TensorFlow Serving is a flexible, high-performance serving system for machine learning models, designed for production environments. It is part of TensorFlow Extended (TFX), an end-to-end platform for deploying production Machine Learning (ML) pipelines. The TensorFlow Serving ModelServer binary is available in two variants: tensorflow-model-server and tensorflow-model-server-universal. The TensorFlow Serving ModelServer supports both gRPC APIs and RESTful APIs.
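For readers unfamiliar with the RESTful API, the sketch below shows the general shape of a TensorFlow Serving predict request and response in the row ("instances") format. The helper names, the model name "flowers" and the localhost address are hypothetical and chosen for illustration. Port 8501 is TensorFlow Serving's conventional REST port, but the project's container may be configured differently.

```python
import json


def predict_url(model_name, host="localhost", port=8501):
    """Return the REST predict URL for a model served by TensorFlow Serving."""
    return f"http://{host}:{port}/v1/models/{model_name}:predict"


def build_predict_request(instances):
    """Serialize a batch of inputs in the row ("instances") format."""
    return json.dumps({"instances": instances})


def parse_predict_response(body):
    """Extract the list of predictions from a JSON response body."""
    return json.loads(body)["predictions"]


if __name__ == "__main__":
    # A tiny made-up input; a real request would carry image tensors as nested lists.
    print(predict_url("flowers"))
    print(build_predict_request([[0.0, 0.5, 1.0]]))
```

The request body would be sent with an HTTP POST to the predict URL, and each entry in the returned "predictions" list corresponds to one entry in "instances".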

In the inference code, tensorflow-model-server is used to serve the model via RESTful APIs from the location where it is exported in the SageMaker container. The tensorflow-model-server variant is a fully optimized server that uses some platform-specific compiler optimizations and should be the preferred option for users. The inference code is as shown below:

Amazon SageMaker utilizes Docker containers to run all training jobs and inference endpoints. Amazon SageMaker provides pre-built Docker containers that support machine learning frameworks such as SageMaker Scikit-learn Container, SageMaker XGBoost Container, SageMaker SparkML Serving Container, Deep Learning Containers (TensorFlow, PyTorch, MXNet and Chainer) as well as SageMaker RL (Reinforcement Learning) Container for training and inference. These pre-built SageMaker containers should be sufficient for general purpose machine learning training and inference scenarios.

There are some scenarios that the pre-built SageMaker containers cannot support, e.g.

Using unsupported machine learning framework versions
Using third-party packages, libraries, run-times or dependencies which are not available in the pre-built SageMaker container
Using custom machine learning algorithms

Amazon SageMaker supports user-provided custom Docker images and containers for the advanced scenarios above. Users can use any programming language, framework or packages to build their own Docker image and container that are tailored for their machine learning scenario with Amazon SageMaker.

In this flower classification scenario, custom Docker image and containers are used for the training and inference because the pre-built SageMaker TensorFlow containers do not have the packages required for the training, i.e. tensorflow_hub and tensorflow_datasets. Below is the Dockerfile used to build the custom Docker image.

The Docker command below is used to build the custom Docker image used for both training and hosting with SageMaker for this project.

docker build ./container/ -t sagemaker-custom-tensorflow-container-gpu:1.0

After the Docker image is built successfully, use the Docker command below to verify that the new image is listed as expected.

docker images

The SageMaker Python SDK supports local mode, which allows users to create estimators, train models and deploy them to their local environments. This is very useful and cost-effective for anyone who wants to prototype, build, develop and test his or her machine learning projects in a Jupyter Notebook with the SageMaker Python SDK on the local instance before running in the cloud.

The Amazon SageMaker local mode supports local CPU instances (single and multiple-instance) and local GPU instances (single instance). It also allows users to switch seamlessly between local and cloud instances (i.e. Amazon EC2 instances) by changing the instance_type argument of the SageMaker Estimator object (Note: This argument was previously known as train_instance_type in SageMaker Python SDK 1.x). Everything else works the same.

In this scenario, the local GPU instance is used by default if available; otherwise the local CPU instance is used as a fallback. Note that the output_path is set to the local current directory (file://.), so the trained model artifacts are written to the local current directory instead of being uploaded to Amazon S3. The image_uri is set to the custom Docker image built locally, so that SageMaker does not fetch one of the pre-built Docker images based on framework and version. You can refer to the latest SageMaker TensorFlow Estimator and SageMaker Estimator Base API documentation for full details.

In addition, hyperparameters can be passed to the training script by setting the hyperparameters argument of the SageMaker Estimator object. Which hyperparameters can be set depends on the hyperparameters used in the training script. In this case, they are ‘epochs’, ‘batch_size’ and ‘learning_rate’.
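The configuration described above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code. The nvidia-smi check, the train.py entry point, the helper function names and the hyperparameter values are all assumptions for this example. The image name is the one built earlier in this article.

```python
import shutil
import subprocess


def pick_local_instance_type():
    """Return 'local_gpu' if nvidia-smi runs successfully, else 'local'.

    Using nvidia-smi as the GPU check is an assumption for this sketch.
    """
    if shutil.which("nvidia-smi") is None:
        return "local"
    result = subprocess.run(["nvidia-smi"], stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return "local_gpu" if result.returncode == 0 else "local"


def estimator_kwargs(role, instance_type,
                     image_uri="sagemaker-custom-tensorflow-container-gpu:1.0"):
    """Collect the arguments for a local-mode TensorFlow estimator."""
    return {
        "entry_point": "train.py",   # hypothetical training script name
        "role": role,                # IAM role ARN, supplied by the caller
        "instance_count": 1,
        "instance_type": instance_type,
        "image_uri": image_uri,      # custom image instead of a pre-built one
        "output_path": "file://.",   # keep model artifacts in the current directory
        "hyperparameters": {         # illustrative values only
            "epochs": 5,
            "batch_size": 32,
            "learning_rate": 0.001,
        },
    }


def build_estimator(**kwargs):
    """Create the estimator; requires the sagemaker package to be installed."""
    from sagemaker.tensorflow import TensorFlow
    return TensorFlow(**kwargs)
```

With these helpers, something like `tf_local_estimator = build_estimator(**estimator_kwargs(role, pick_local_instance_type()))` followed by `tf_local_estimator.fit()` would start a local training job. Switching to a cloud instance would then only require a different instance_type value and an image that is reachable from the cloud.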

After the SageMaker training job is completed, the Docker container that ran the job exits. When the training has completed successfully, the trained model can be deployed to a local SageMaker endpoint by calling the deploy method of the SageMaker Estimator object and setting the instance_type to a local instance type (i.e. local_gpu or local).

A new Docker container is then started to run the custom inference code (i.e. the serve program), which runs the TensorFlow Serving ModelServer to serve the model for real-time inference. The ModelServer serves in RESTful API mode and expects both the request and response data in JSON format. Once the local SageMaker endpoint is deployed successfully, users can make prediction requests to the endpoint and get prediction responses in real time.

tf_local_predictor = tf_local_estimator.deploy(initial_instance_count=1,
                                               instance_type=instance_type)

To evaluate the performance of this flower classification model using the accuracy metric, flower images from external sources that are independent of the oxford_flowers102 dataset are used. These test images come mainly from websites that provide high-quality free images, such as unsplash.com and pixabay.com, as well as from self-taken photos.
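As a small illustration of the accuracy metric used here, the sketch below computes top-1 accuracy from per-class scores. It assumes each prediction is a plain list of scores, one per class, and that the true labels are class indices. The actual response structure of the endpoint may differ, and the numbers in the usage example are made up.

```python
def top1(scores):
    """Return (class_index, score) of the highest-scoring class."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, scores[best]


def accuracy(predictions, true_labels):
    """Fraction of examples whose top-1 predicted class matches the label."""
    if len(predictions) != len(true_labels):
        raise ValueError("predictions and true_labels must have the same length")
    if not true_labels:
        raise ValueError("at least one example is required")
    correct = sum(
        1 for scores, label in zip(predictions, true_labels)
        if top1(scores)[0] == label
    )
    return correct / len(true_labels)


if __name__ == "__main__":
    # Made-up scores for three images over three classes.
    preds = [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1], [0.2, 0.3, 0.5]]
    labels = [1, 0, 1]
    print(f"Top-1 accuracy: {accuracy(preds, labels):.2f}")
```

For the real model, each score list would have 102 entries, and the predicted index could be mapped back to a flower name using the label names reported by the dataset info.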

The final flower classification model is evaluated against a set of real-world flower images of different types from external sources to test how well it generalizes to unseen data. The model classifies all of these unseen flower images correctly. The model size is approximately 80 MB, which could be considered reasonably compact for edge or web deployment in production. In summary, the model performs well on this small set of unseen data, although a larger test set would be needed to draw firmer conclusions.

Due to time and resource constraints, the solution here may not follow best practices or offer optimal designs and implementations. Here are some ideas that could be useful for anyone interested in contributing improvements to the current solution.

Apply Data Augmentation, i.e. random (but realistic) transformations such as rotation, flip, crop, brightness and contrast etc. on the training dataset to increase its size and diversity.
Use Keras preprocessing layers. Keras provides preprocessing layers such as Image preprocessing layers and Image Data Augmentation preprocessing layers which can be combined and exported as part of a Keras SavedModel. As a result, the model can accept raw images as input.
Convert the TensorFlow model (SavedModel format) to a TensorFlow Lite model (.tflite) for edge deployment and optimization on mobile and IoT devices.
Optimize the TensorFlow Serving signature (SignatureDefs in SavedModel) to minimize the prediction output data structure and payload size. The current model prediction output returns the predicted class and score for all 102 flower types.
Use TensorFlow Profiler tools to track, analyze and optimize the performance of the TensorFlow model.
Use Intel Distribution of OpenVINO toolkit for the model’s optimization and high-performance inference on Intel hardware such as CPU, iGPU, VPU or FPGA.
Optimize the Docker image size.
Add unit tests for the TensorFlow training script.
Add unit tests for the Dockerfile.

After the machine learning workflow has been tested and confirmed to work as expected in the local environment, the next step is to fully migrate this workflow to AWS Cloud with an Amazon SageMaker Notebook Instance. In the next guide, I will demonstrate how to adapt this Jupyter notebook to run on a SageMaker Notebook Instance as well as how to push the custom Docker image to the Amazon Elastic Container Registry (ECR) so that the whole workflow is fully hosted and managed in AWS.

It is always a best practice to clean up obsolete resources or sessions at the end to reclaim compute, memory and storage resources, and to save cost when working in a cloud or distributed environment. For this scenario, the local SageMaker inference endpoint as well as the SageMaker containers are deleted as shown below.

tf_local_predictor.delete_endpoint()

docker container ls -a

docker rm $(docker ps -a -q)
docker container ls -a

References