Alberto Mosconi

What Butterfly Is This?

An explanation of how I built a mobile application that classifies photos of butterflies and moths into 100 different species, using the power of Artificial Intelligence.

Table of Contents

  1. Motivation
  2. Overview
  3. Machine Learning Model
  4. Mobile Application

Motivation

This project was developed as the final work for the course Advanced Computational Techniques for Big Imaging and Signal Data, taken in the second semester of my Master's Degree in Artificial Intelligence for Science and Technology at the University of Milano-Bicocca.

Overview

Upon opening the app, a live camera feed is shown, with the option to take a picture or upload one from the phone's gallery.

After the analysis is complete, a results page appears containing the species name, a brief description, a reference photo and a link to the Wikipedia article.

In addition, the confidence of the prediction is shown, to give the user a sense of how reliable the result is.

[Figure: app screenshots]

Machine Learning Model

Below I first describe the datasets used, then move on to the model architectures and training procedure, and finally to the performance evaluation.

To implement the models and train them I used the PyTorch library.

Dataset

The data for this project can be found on kaggle.com, and consists of a main dataset of around 13k images of butterflies and moths, evenly split across 100 classes.

Since there was no class for images containing neither a butterfly nor a moth, I had to create one myself, by merging the main dataset with a sample from two smaller datasets: one of other insects and one of just leaves.

All these images are 224x224 pixels in size and have 3 color channels (RGB).
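As a rough sketch of how such a dataset can be loaded in PyTorch (the folder layout and batch size here are assumptions, not necessarily the ones I used):

```python
import torch
from torchvision import datasets, transforms

# The images are already 224x224 RGB, so preprocessing only converts them
# to tensors and normalizes them with the usual ImageNet statistics.
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Hypothetical layout: data/train/<class name>/<image>.jpg, one folder per class.
train_set = datasets.ImageFolder("data/train", transform=preprocess)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)
```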

Model Architectures

Since this is a computer vision problem, I chose to employ two of the best-known architectures in the field: ResNet (2015) and Swin Transformers (2021).

The residual connections in ResNet models allow for very deep networks while maintaining low complexity, making them well suited to handling a high number of classes. More specifically, I trained ResNet18, ResNet50 and ResNet152 models, where the number indicates the depth in layers.

On the other hand, Shifted Windows Vision Transformers are a more efficient version of the original Vision Transformer (2020), since they limit self-attention computation to non-overlapping local windows while also allowing for cross-window connections. In this case, the implementations I chose are Swin-T, Swin-S and Swin-B.

[Figure: the ResNet and Swin Transformer architectures]
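As an illustration, this is roughly how these pretrained variants can be instantiated with torchvision and adapted to the task; the weight enums are torchvision's, while the class count (100 species plus the extra "none" class) is my assumption about the final head size:

```python
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 101  # 100 species plus the extra "no butterfly or moth" class

# ResNet variants expose their classifier as `fc`.
resnet = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
resnet.fc = nn.Linear(resnet.fc.in_features, NUM_CLASSES)

# Swin Transformer variants expose their classifier as `head`.
swin = models.swin_t(weights=models.Swin_T_Weights.IMAGENET1K_V1)
swin.head = nn.Linear(swin.head.in_features, NUM_CLASSES)
```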

Training Procedure

During training the dataset is augmented by giving each image a random chance of being horizontally flipped or rotated. This way the model should learn to be invariant to these kinds of transformations.

[Figure: augmentation examples]
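A minimal version of such an augmentation pipeline with torchvision transforms could look like this (the flip probability and rotation range are illustrative):

```python
from torchvision import transforms

# Each training image gets a random chance of being flipped horizontally
# and is rotated by a small random angle before being normalized.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```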

Finally, to reach better results I started from pretrained checkpoints and fine-tuned them on the dataset. The pretraining was done on the ImageNet dataset, a large visual database of more than 14 million images spanning more than 20,000 categories.
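The fine-tuning itself follows a standard PyTorch training loop. The sketch below continues from the snippets above; the optimizer and learning rate are placeholders rather than the exact hyperparameters I used:

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = resnet.to(device)  # e.g. the pretrained ResNet18 defined earlier

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One epoch over the training set; in practice this runs for several epochs.
model.train()
for images, labels in train_loader:
    images, labels = images.to(device), labels.to(device)
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```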

Performance Evaluation

After all models were done training I wanted to determine which was best for my use case. Since the product is a mobile application, on top of high classification accuracy I was also looking for high inference speed, to keep loading times short and the user experience as smooth as possible.

By plotting the accuracy with respect to the CPU inference speed we can see that the optimal model is ResNet18.

[Figure: accuracy vs. CPU inference speed for all models]
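One simple way to measure the CPU inference speed of each candidate is to time repeated forward passes on a dummy input, roughly like this:

```python
import time
import torch

model.eval()
model.to("cpu")
dummy = torch.randn(1, 3, 224, 224)  # a single 224x224 RGB image

with torch.no_grad():
    model(dummy)  # warm-up pass, excluded from the measurement
    runs = 50
    start = time.perf_counter()
    for _ in range(runs):
        model(dummy)
    elapsed = time.perf_counter() - start

print(f"mean CPU inference time: {elapsed / runs * 1000:.1f} ms")
```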

Measuring ResNet18's performance on the test set (1.3k images) yields the values reported in the table below.

accuracy   precision   recall
93.02%     93.25%      93.25%
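These metrics can be computed, for example, with scikit-learn by collecting the predictions over the test set (test_loader here is an assumed DataLoader built like the training one; macro averaging is a natural choice for a balanced dataset):

```python
import torch
from sklearn.metrics import accuracy_score, precision_score, recall_score

model.eval()
all_preds, all_labels = [], []
with torch.no_grad():
    for images, labels in test_loader:
        # take the class with the highest score for each image
        all_preds += model(images).argmax(dim=1).tolist()
        all_labels += labels.tolist()

print("accuracy :", accuracy_score(all_labels, all_preds))
print("precision:", precision_score(all_labels, all_preds, average="macro"))
print("recall   :", recall_score(all_labels, all_preds, average="macro"))
```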

Mobile Application

The mobile application was developed with React Native through the Expo framework, so that it can be compiled for both Android and iOS while keeping a single codebase.

Once taken, the image is sent to a Python server built with Flask, which runs the inference and returns the predicted class. This server was deployed on the AWS Elastic Compute Cloud (EC2) service.
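As a rough sketch, the inference endpoint can look like the following; the route name, the "image" field and the checkpoint file are hypothetical, but the overall flow (decode the upload, preprocess, run the model, return class and confidence) matches what the app needs:

```python
import io

import torch
from flask import Flask, jsonify, request
from PIL import Image
from torchvision import transforms

app = Flask(__name__)

# Hypothetical checkpoint saved after fine-tuning; loaded once at startup.
model = torch.load("resnet18_butterflies.pt", map_location="cpu")
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@app.route("/predict", methods=["POST"])
def predict():
    # The app uploads the photo as a multipart file field named "image".
    image = Image.open(io.BytesIO(request.files["image"].read())).convert("RGB")
    with torch.no_grad():
        probs = torch.softmax(model(preprocess(image).unsqueeze(0)), dim=1)
        confidence, class_idx = probs.max(dim=1)
    # The app maps the class index to the species name, description and Wikipedia link.
    return jsonify({"class": int(class_idx), "confidence": float(confidence)})
```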