The best source of training data is the real world

Does data preparation really take up 80% of an AI application's development time? Numerous articles and studies have argued so. While I cannot confirm or refute this figure, we can all agree that it takes a lot of time and effort.

Training datasets are the knowledge and experience we feed into the model to make it smart. The quality of transmitted data, knowledge, and experience is critical for education. Where can you get the best educational information about what surrounds us? The answer is the real world.
Artificial intelligence is not as smart as we are yet, but it has a considerable advantage: the sheer amount of data it can remember and process. Once you feed an object into the model, it learns to recognize that object and retains its details. The only thing you need is comprehensive training data, and that takes a lot of time to get. Unless you have an unlimited number of agents walking around, taking pictures of everything will take quite a while. Do not forget to capture the metadata for each object; you will need it during the annotation process.

Let's assume that you have developed a visual search application that lets your marketplace customers find the products they want by image. Here are a few examples of datasets that can be created by photographing the real world.


Dishes

Dishes are easy to collect. There are a lot of them. They can stand by themselves, unlike clothes, which must be laid out or hung, so it is much easier to take many shots of the same object from different angles. The utensil's manufacturer is rarely important, although we collect this information whenever possible. If you are using the data to train a visual search application, the user most likely needs a utensil of a similar shape and colour, not one from a particular manufacturer.
Lots of objects, little metadata.

Home accessories

Home accessories are a little more complicated. There are many of them around us, but they fall into many categories. If you only need images of wall lamps, the choice around you turns out to be rather limited. I just realized that I don't even have one in my home.

Home appliances

The main difficulty with home appliances is that there are not many of them. Imagine you need 1,000 images of coffee machines, and no model may appear more than five times. That means at least 200 different models. How many coffee machines do you have at home? It's good if there is at least one. Only 199 to go.
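
The coffee machine arithmetic above can be written as a quick check. This is only a minimal sketch: the function name `min_distinct_models` is made up for illustration, and it assumes the only constraint is a cap on how many images a single model may contribute.

```python
def min_distinct_models(target_images: int, max_images_per_model: int) -> int:
    """Smallest number of distinct models needed to reach target_images
    when no single model may appear more than max_images_per_model times."""
    if target_images < 0 or max_images_per_model <= 0:
        raise ValueError("target must be >= 0 and the per-model cap must be > 0")
    # Integer ceiling division: round up without going through floats.
    return -(-target_images // max_images_per_model)


# Figures from the example above: 1,000 images, at most five per model.
models_needed = min_distinct_models(1000, 5)
models_at_home = 1  # assumption: you own exactly one coffee machine
print(models_needed, models_needed - models_at_home)
```

With the numbers from the text, this gives 200 models in total, 199 of which you still have to find somewhere.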

Fashion and footwear

By contrast, there are lots of clothes and shoes. The fashion industry is growing rapidly, mainly online, and it requires a lot of data. You might raid your own wardrobe or just visit friends and relatives. On the other hand, if you need to collect a lot of metadata (brand, category, gender, etc.), you need to be very precise. In addition, if you need multiple images of each SKU, clothing is physically demanding to shoot. In a single day, our collectors could do anywhere from one hundred to three hundred squats while collecting a clothing dataset. And that's for just 100 SKUs. Imagine if you need tens of thousands of those images.

Dataset annotation

Dataset annotation is not easy, and it's not just about the pixel precision that all dataset labelers talk about. Along with the images, a lot of metadata needs to be collected and transferred. One of the last datasets we collected consisted of 12,000 images and 5,000 unique SKUs. Each SKU appeared at least three times, and each image contained up to five objects. We had to collect the brand, product category, and SKU number for every object and enter this information into the JSON annotation file. And, of course, people make mistakes. At every stage of dataset creation, we double-check everything, from the accuracy of the bounding boxes to the precision of each object's additional attributes.
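
To make the double-checking step more concrete, here is a hedged sketch of an automated check over a JSON annotation file. The record layout is an assumption made for illustration, not the format we actually use: the field names (`file`, `width`, `height`, `objects`, `bbox`, `brand`, `category`, `sku`) are hypothetical, and bounding boxes are assumed to be `[x, y, width, height]` in pixels. The limits mirror the dataset described above: up to five objects per image and at least three appearances per SKU.

```python
import json
from collections import Counter

# Hypothetical field names and limits, chosen to mirror the dataset above.
REQUIRED_ATTRIBUTES = ("brand", "category", "sku")
MAX_OBJECTS_PER_IMAGE = 5
MIN_APPEARANCES_PER_SKU = 3


def check_image(record):
    """Return a list of problems found in one image record."""
    problems = []
    name = record["file"]
    width, height = record["width"], record["height"]
    objects = record["objects"]
    if len(objects) > MAX_OBJECTS_PER_IMAGE:
        problems.append(f"{name}: {len(objects)} objects, limit is {MAX_OBJECTS_PER_IMAGE}")
    for index, obj in enumerate(objects):
        # Every object needs a non-empty brand, category, and SKU.
        for attribute in REQUIRED_ATTRIBUTES:
            if not obj.get(attribute):
                problems.append(f"{name}, object {index}: missing {attribute}")
        # Assumed bbox layout: [x, y, width, height] in pixels.
        x, y, w, h = obj["bbox"]
        if w <= 0 or h <= 0 or x < 0 or y < 0 or x + w > width or y + h > height:
            problems.append(f"{name}, object {index}: bounding box outside the image")
    return problems


def check_dataset(records):
    """Check every image, then check how often each SKU appears overall."""
    problems = []
    sku_counts = Counter()
    for record in records:
        problems.extend(check_image(record))
        sku_counts.update(obj["sku"] for obj in record["objects"] if obj.get("sku"))
    for sku, count in sorted(sku_counts.items()):
        if count < MIN_APPEARANCES_PER_SKU:
            problems.append(
                f"SKU {sku}: appears {count} time(s), expected at least {MIN_APPEARANCES_PER_SKU}"
            )
    return problems


# A tiny made-up annotation file with two deliberate mistakes in the second object.
sample = json.loads("""
[
  {"file": "img_0001.jpg", "width": 640, "height": 480,
   "objects": [
     {"sku": "SKU-1", "brand": "ExampleBrand", "category": "mug", "bbox": [10, 20, 100, 120]},
     {"sku": "SKU-2", "brand": "", "category": "plate", "bbox": [600, 400, 100, 100]}
   ]}
]
""")

problems = check_dataset(sample)
for problem in problems:
    print(problem)
```

A script like this catches only mechanical mistakes, such as empty fields, boxes that spill outside the image, or SKUs that appear too rarely. Whether the brand or category is actually correct still has to be verified by a person.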

Images from the real world

But all these efforts are worth it. There are several advantages to collecting data this way.

Diversity. If ten people take a picture of the same subject, you get ten different images. You don't have to worry about the same image appearing several times in your training dataset.

Size. You are not limited to whatever dataset you can get your hands on, so there is no need to think about web scraping or data augmentation. You simply get a dataset of the size you need to train the model well.

Quality. Images can be taken with professional cameras or regular smartphones.

Flexibility. You control every aspect: size, distance to the object, number of objects, angles, lighting, environment, and the position of objects.

Metadata. Finally, the dataset can be supplemented with whatever metadata you need to work with.

Perhaps this data collection method is exactly what you need. Contact us if you have any questions, and we will be happy to assist you.