Data Collection and Preparation: Join us through the ML Project Life Cycle

Photo by Drew Graham on Unsplash

Welcome back to our blog post series where you'll learn how we run machine learning projects. In the first part of this series you've already learned about the different phases of the project lifecycle. Today, we take a closer look at the first two very important phases: data collection and data preparation.

You should definitely read on if you plan to implement AI-supported tasks in your company, regardless of whether you are a project manager, engineer, or decision-maker.

The six phases of our machine learning life cycle

Data Collection

Defining the Requirements

The first step in any machine learning project is to collect enough samples with the correct metadata, because this data is what our model learns from and bases its predictions on. But what data do we actually need? To find out, we first need to define the requirements together with our customer. Let's look at an example from our practice.

An umbrella federation for dance uses a Digital Asset Management system (DAM) as a central hub for images from all its members. These members are regional foundations and associations, professional dance schools, archives, stages, theatres, dance companies, and also individual dancers. This diverse community regularly uploads images from events, projects, and daily work to the platform. The images are used for the federation's website, marketing, and social media campaigns.

What the dance federation needs is an automatic curation of the hundreds of images uploaded to the DAM every day. Are these images suitable for presenting the work of their members in a visually appealing way? The goal is to find aesthetic images faster when selecting new marketing collateral, ideally by simply using a filter option for aesthetic images. Our task is to classify the images as aesthetic or unaesthetic.

We cannot emphasize enough that it's crucial to define the requirements in detail before starting to collect data, because this is the foundation for the entire project.

Evaluating and Collecting

Now that we know the requirements, it's time to gather the data for the training. We need two different categories of data:

  • Aesthetic images of dance and related events (e.g. award ceremonies)
  • Unaesthetic images of dance and related events

The question is: where do we get the data from, and how much do we need? Of course, we first try to get the data directly from our customer. Let's say 10k labeled images are required. This should be sufficient to fine-tune a pretrained model via transfer learning.

In our case, the dance federation has 15,000 images, of which 1,500 have already been used as marketing collateral. We can regard these 1,500 images as aesthetic, and we label them accordingly. The other 13,500 images are unclear: they have not been used in any marketing activities, so we don't know whether they can be considered aesthetic. We take a subset of 5,000 of these images to add to our dataset. To do so, we have to label them first.

These 5,000 images are labeled manually by workers from Amazon Mechanical Turk. Since aesthetics is very subjective, it makes sense to have each image evaluated several times by different people and to label it according to the majority vote. After labeling is finished, we have 1,300 further images labeled as aesthetic and 3,700 images labeled as unaesthetic.
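The majority rule can be sketched in a few lines of Python. This is a minimal illustration, not our production labeling pipeline: the file names and votes are hypothetical, and returning no label on a tie (so the image can get another vote) is an assumption.

```python
from collections import Counter

def majority_label(votes, tie_label=None):
    """Return the most common label, or tie_label when the top labels tie."""
    counts = Counter(votes).most_common()
    if not counts:
        return tie_label
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_label  # no clear majority
    return counts[0][0]

# Hypothetical votes collected from several annotators per image.
annotations = {
    "img_001.jpg": ["aesthetic", "aesthetic", "unaesthetic"],
    "img_002.jpg": ["unaesthetic", "unaesthetic", "unaesthetic"],
    "img_003.jpg": ["aesthetic", "unaesthetic"],  # tie: needs another vote
}

labels = {name: majority_label(votes) for name, votes in annotations.items()}
```

Asking an odd number of annotators per image avoids ties for a two-class task like ours.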

To make the data more balanced, we need more aesthetic images. We can search dataset search engines like Academic Torrents or Google's Dataset Search. We must bear in mind that available datasets can only be used if the license of the material (images as well as their metadata) allows commercial use. Many datasets may only be used for academic and research purposes.

But there are also open-access datasets (check out this compilation of dataset finders and free datasets). In our case we've found a good dataset of 900 images. These are aesthetic dance photos labeled with the keywords aesthetic and dance. We add them to our training data.

We have to admit that we did not reach the targeted 10k images, but we have 7.4k images (1,500 + 5,000 + 900), which is a good amount to start with.

Data sources and how the dataset is composed

Our resulting dataset is very well balanced: we have the same number of aesthetic and unaesthetic images (3,700 each). But in most projects a fully balanced dataset is out of reach. In those cases we accept small imbalances rather than throwing away valuable data. The imbalance should stay small enough that the model does not simply learn to favor the larger class.
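As a quick sanity check, the composition of the dataset can be recomputed from the counts given in the text. The source names and the ratio used as an imbalance measure are our own choices for this sketch.

```python
# Image counts per source and label, taken from the text above.
sources = {
    "marketing collateral": {"aesthetic": 1500, "unaesthetic": 0},
    "manually labeled subset": {"aesthetic": 1300, "unaesthetic": 3700},
    "open-access dataset": {"aesthetic": 900, "unaesthetic": 0},
}

# Sum the counts per label across all sources.
totals = {"aesthetic": 0, "unaesthetic": 0}
for counts in sources.values():
    for label, n in counts.items():
        totals[label] += n

total_images = sum(totals.values())

# Ratio of the larger class to the smaller one; 1.0 means perfectly balanced.
imbalance_ratio = max(totals.values()) / min(totals.values())

print(totals, total_images, imbalance_ratio)
```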

Data Preparation

After we have collected all the data we need, we have to prepare it for training, which means cleaning the data and preprocessing it.

The collection process is quite complex and time-consuming, and the same applies to data preparation. Our data scientists spend most of their time and effort in a project on gathering the data, bringing it into the right shape, and refining it.

Data Cleaning

The complexity of the cleaning process depends heavily on the number of data sources we use, the number of categories that need to be classified, and how complex those categories are.

In our example project we can mostly trust the data: it is already very well balanced and comes from trustworthy sources. Therefore, only a few cleaning steps are necessary. First, we filter out and remove corrupt files (images which do not load). Second, we spot-check a random sample of the 900 images from the free dataset to see if they are labeled correctly. If we detect incorrectly labeled images, we remove them from the dataset.
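Both cleaning steps can be sketched with the Python standard library. Note the assumptions: the header check below is only a cheap first pass that catches empty, missing, or non-image files, so it is a stand-in for actually decoding each image with an imaging library. The function names, the sample size, and the fixed seed are hypothetical.

```python
import random

# Leading bytes ("magic numbers") of two common image formats.
SIGNATURES = {
    "jpeg": b"\xff\xd8\xff",
    "png": b"\x89PNG\r\n\x1a\n",
}

def looks_like_image(path):
    """Cheap first pass: True if the file starts with a known image signature."""
    try:
        with open(path, "rb") as f:
            header = f.read(8)
    except OSError:
        return False  # unreadable or missing files count as corrupt
    return any(header.startswith(sig) for sig in SIGNATURES.values())

def split_corrupt(paths):
    """Partition paths into (usable, corrupt) lists based on the header check."""
    usable, corrupt = [], []
    for p in paths:
        (usable if looks_like_image(p) else corrupt).append(p)
    return usable, corrupt

def spot_check_sample(paths, k=50, seed=0):
    """Draw a reproducible random sample of files for manual label review."""
    rng = random.Random(seed)
    paths = list(paths)
    return rng.sample(paths, min(k, len(paths)))
```

A fixed seed keeps the spot-check sample reproducible, so a second reviewer can look at exactly the same images.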

Data Preprocessing

To normalize the data, we scale the images to the input size the model expects and convert them to the right format.
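One way to scale an image without cutting anything off is to fit it into a square and pad the rest. The sketch below only computes the target dimensions. The 224-pixel target and the padding approach are assumptions made for illustration; the right size depends on the pretrained model that is used.

```python
def fit_within(width, height, target=224):
    """Scale (width, height) to fit a target x target square, keeping aspect ratio.

    Returns the new size and the total padding needed on each axis
    to fill the square.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    return (new_w, new_h), (target - new_w, target - new_h)

# A hypothetical 4000 x 3000 photo: the longer side is scaled to the target,
# the shorter side is padded.
size, padding = fit_within(4000, 3000)
```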

Additionally, we augment the dataset by generating modified versions of each image. The goal is to introduce common variations the model should be aware of. Since our model should predict aesthetics, we must not make any changes that could alter the aesthetics of the image. For example, we do not crop images, so that the image layout is retained.

Data augmentation should also keep the model from latching onto false evidence. Let's say we have many aesthetic images of a dancer in a red dress. The model could interpret this to mean that all images with red clothes are aesthetic. The solution would be to bring in some altered images by desaturating, changing colours, or adding black and white variants.
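To illustrate the desaturation idea, here is a minimal sketch that works on plain RGB tuples instead of real image files. The pixel values are hypothetical, and the gray value uses the common ITU-R BT.601 luma weights, which is one of several reasonable choices.

```python
def desaturate(pixels, amount=1.0):
    """Blend each RGB pixel toward its gray value; amount=1.0 gives full grayscale."""
    if not 0.0 <= amount <= 1.0:
        raise ValueError("amount must be between 0 and 1")
    out = []
    for r, g, b in pixels:
        gray = 0.299 * r + 0.587 * g + 0.114 * b  # ITU-R BT.601 luma weights
        out.append(tuple(round(c + (gray - c) * amount) for c in (r, g, b)))
    return out

# Hypothetical pixels from a red dress.
red_dress = [(200, 30, 40), (180, 20, 35)]

# Original, half-desaturated, and black-and-white variants.
variants = [desaturate(red_dress, amount) for amount in (0.0, 0.5, 1.0)]
```

In a real pipeline the same operation would be applied to whole images with an imaging library, and the variants would be added to the training set with the label of the original image.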

We have now collected all our data and prepared it for model training. Stay tuned for the next article in this blog post series, where we dive deep into model evaluation and model training. Meanwhile, if you need someone to help you out with your AI project, just get in touch.
