Data sources may include online-store databases, sensor readings, activity logs from social media platforms, or synthetically generated data. It is always worth knowing the origin and collection method of data when analysing it. When collecting data, it is essential to inform the people concerned that their data is being collected; otherwise privacy or intellectual property rights may be at risk.
It may happen that, due to an inadequate data set, the trained model appears highly accurate during training but performs poorly in real-world tests on images it has never seen before. For example, if the task is to recognize horses but the data set contains only photographs of horses taken in a field under a clear blue sky, the model will fail on an image taken in a forest or in fog. In this case, the model has overfitted to the training data.
In addition to being diverse, with photographs that vary in cropping, angle and lighting conditions, a good dataset should contain roughly the same number of samples from each class. Such a dataset is called balanced, and balance also contributes to accuracy.
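Checking class balance is straightforward in practice: count how many samples belong to each class and compare the proportions. A minimal sketch, using a hypothetical label list (the class names are assumptions, not from the original):

```python
from collections import Counter

# Hypothetical labels for an image dataset; in practice these would
# come from the dataset's annotation files.
labels = ["horse", "horse", "cat", "dog", "horse", "cat", "dog", "dog"]

counts = Counter(labels)
total = sum(counts.values())

# Report each class's share; large deviations signal an imbalanced dataset.
for cls, n in counts.items():
    print(f"{cls}: {n} samples ({n / total:.0%})")
```

If one class dominates, techniques such as collecting more samples of the minority classes or downsampling the majority class can restore balance.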
Before training, the data should be divided into three sets. The training set is the largest, and is used by the algorithm to continuously update the weights and biases during training. The smaller validation set is used to measure accuracy on separate data in each iteration. The third set is reserved for testing only, and is never seen by the algorithm during training. It is usually sufficient to allocate a few samples per class for testing, and about 20% of the remaining data is typically set aside for validation.
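The split described above can be sketched with the standard library alone. This is a minimal illustration under assumed names (a synthetic dataset of 100 labelled items across three classes, two test samples per class, a 20% validation share):

```python
import random

random.seed(0)  # reproducible shuffle

# Hypothetical dataset of (sample, label) pairs; names are assumptions.
data = [(f"img_{i}.jpg", i % 3) for i in range(100)]

# Group samples by class so the test set gets a few samples per class.
by_class = {}
for item in data:
    by_class.setdefault(item[1], []).append(item)

per_class_test = 2
test_set, remainder = [], []
for cls, items in by_class.items():
    random.shuffle(items)
    test_set.extend(items[:per_class_test])      # held out, never trained on
    remainder.extend(items[per_class_test:])

# Of the remaining data, about 20% goes to validation, the rest to training.
random.shuffle(remainder)
n_val = int(0.2 * len(remainder))
val_set = remainder[:n_val]
train_set = remainder[n_val:]

print(len(train_set), len(val_set), len(test_set))
```

Libraries such as scikit-learn offer ready-made helpers (e.g. `train_test_split`) that handle shuffling and stratification, but the underlying idea is the same partitioning shown here.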