Data preparation for machine learning

Machine learning (ML) uses models to make predictions. Building a successful AI model depends on three things: data, algorithms, and computation. Accurate algorithms and high-precision computation are important parts of the machine learning process, but data preparation is the foundation.

Steps of data preparation

Four steps come before a model can be used to make predictions:

  1. Defining the target, the problem, and the way to solve it

  2. Dataset creation

  3. Data transformation

  4. Model training on the data produced by the previous steps
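
As a rough illustration, the four steps above can be sketched as a few small Python functions. Everything here is an assumption made for the example: the target (predicting a price from an area), the field names, the made-up records, and the choice of a one-variable least-squares line as the model.

```python
from statistics import mean

# Step 1: target and problem (hypothetical): predict "price" from "area".
FEATURE = "area"
TARGET = "price"


# Step 2: dataset creation. These records are made up for illustration.
def create_dataset():
    return [
        {"area": "50", "price": "100"},
        {"area": "80", "price": "160"},
        {"area": "100", "price": "200"},
    ]


# Step 3: data transformation. Convert text fields into numeric pairs.
def transform(rows):
    return [(float(r[FEATURE]), float(r[TARGET])) for r in rows]


# Step 4: model training. Fit y = slope * x + intercept by least squares.
def train(pairs):
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in pairs) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Only after the four steps is the model used for predictions.
def predict(model, x):
    slope, intercept = model
    return slope * x + intercept


model = train(transform(create_dataset()))
```

Calling predict(model, 60) then returns an estimate for an unseen area. A real project would replace each function with something far more involved, but the order of the steps stays the same.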

The full process consists of the following phases (the arrows show the most important relations between the phases, and the largest circle shows that the process is cyclic):

Common tasks within data preparation

Data selection and collection – finding the best-suited data: extracting information from existing libraries and/or creating new data.

Data cleaning – checking the data for relevance to the task.

Data validation – ensuring that the data satisfies defined formats and other input criteria.

Data transformation – standardizing formats and unifying the data.

Privacy – removing personal information.

Data labeling – tagging, classification, segmentation, annotation, and so on, depending on the requirements.

Data storing and delivering – storing the data and defining its output format.

Data is a valuable resource once it has been cleaned up, tagged, annotated, and prepared. After the data has gone through various stages of testing, it is finally qualified for further processing. Various methods can be used to extract the processed data into business intelligence tools, databases, and data management tools, or to develop algorithms for analysis models.
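
Several of the tasks listed above can be sketched in a few lines of standard-library Python. This is a minimal sketch, not a reference implementation: the review records, the field names (user_email, text, rating), the validity rule, and the rating-based labeling rule are all assumptions chosen for the example.

```python
import csv
import io

# Hypothetical raw records; field names and values are illustrative only.
RAW = [
    {"user_email": "a@example.com", "text": "  Great product ", "rating": "5"},
    {"user_email": "b@example.com", "text": "", "rating": "4"},
    {"user_email": "c@example.com", "text": "Broke in a week", "rating": "1"},
    {"user_email": "d@example.com", "text": "Fine", "rating": "eleven"},
]


def is_valid(row):
    """Validation: rating must be an integer from 1 to 5, text non-empty."""
    rating = row["rating"].strip()
    return rating.isdecimal() and 1 <= int(rating) <= 5 and bool(row["text"].strip())


def transform(row):
    """Transformation: unify text formatting and convert rating to int."""
    return {**row, "text": row["text"].strip().lower(), "rating": int(row["rating"])}


def anonymize(row):
    """Privacy: drop the field that holds personal information."""
    return {k: v for k, v in row.items() if k != "user_email"}


def label(row):
    """Labeling: tag each record using an assumed rule based on the rating."""
    return {**row, "label": "positive" if row["rating"] >= 4 else "negative"}


def prepare(rows):
    """Keep only valid rows, then transform, anonymize, and label them."""
    return [label(anonymize(transform(r))) for r in rows if is_valid(r)]


def to_csv(rows):
    """Storing and delivering: serialize the prepared rows as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["text", "rating", "label"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


prepared = prepare(RAW)
csv_text = to_csv(prepared)
```

Here two of the four records fail validation and are dropped, and the email field never reaches the output. Keeping each task in its own small function makes it easy to reorder the steps or swap in stricter rules when the requirements change.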


Data preparation is often estimated to take about 80% of the time in an AI project. In the future, it will most likely be automated with the help of machine learning itself. Greater user-friendliness, transparency, and interactivity will also be major goals of future data preparation approaches. There is still a lot of progress to be made in ML.