Pretrained models

Architecture

The architecture of a neural network defines how many layers the network consists of, what types of layers are used, and how they are connected to one another. Each architecture excels at certain tasks and recognizes certain kinds of patterns more efficiently than others. For example, ResNet, VGG, and MobileNet are popular in image processing, while YOLO and variants of Faster R-CNN are widely used for object detection. The U-Net architecture is frequently used for image segmentation tasks, but SAM models have also become popular recently.

Pretrained models

Training a neural network requires large amounts of data and computational power. That is why it is common to use models pre-trained by others. These models have already been trained on large-scale, general-purpose datasets (such as ImageNet or COCO), but models trained on very specific datasets are also available.

Backbone and head

The architecture of a neural network can usually be divided into two main parts. The backbone is responsible for feature extraction, and typically most of the network's layers belong to it. Successive layers recognize increasingly complex patterns. The last few layers of the neural network are called the head, which produces the output required for the specific task, for example, a probability distribution for classification, or object coordinates for localization.
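The split can be sketched with a hypothetical toy classifier in PyTorch (all layer sizes here are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Backbone: feature extraction, most of the layers live here.
backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),   # global pooling -> (N, 32, 1, 1)
    nn.Flatten(),              # -> (N, 32) feature vector
)
# Head: task-specific output, here 10 class logits.
head = nn.Linear(32, 10)

x = torch.randn(2, 3, 64, 64)
features = backbone(x)         # (2, 32)
logits = head(features)        # (2, 10)
```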

Two or more heads

A single neural network may be capable of performing multiple tasks simultaneously and producing multiple types of outputs. In such cases, a single shared backbone is connected to multiple separate heads. An object detection model, for example, simultaneously predicts the object class and the coordinates of the bounding box. One head performs the classification, while the other head handles the regression task. Both heads can utilize the features extracted by the shared backbone in different ways.
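A minimal sketch of this pattern (a toy model with assumed sizes, not a real detector) shows the shared backbone feeding both heads:

```python
import torch
import torch.nn as nn

class TwoHeadModel(nn.Module):
    """Toy example: one shared backbone, a classification head,
    and a bounding-box regression head."""
    def __init__(self, num_classes=5):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.cls_head = nn.Linear(32, num_classes)  # class logits
        self.box_head = nn.Linear(32, 4)            # (x, y, w, h)

    def forward(self, x):
        feats = self.backbone(x)   # shared features, used by both heads
        return self.cls_head(feats), self.box_head(feats)

model = TwoHeadModel()
logits, boxes = model(torch.randn(1, 3, 128, 128))
```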

Neck

When describing some models, a third component, called the neck, is distinguished due to the complexity of the model. A common example of this is the Feature Pyramid Network (FPN), which processes the features extracted by the backbone across multiple scales, enabling the detection of objects of varying sizes.
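The core FPN idea, greatly simplified, can be sketched in a few lines: 1x1 lateral convolutions align channel counts, then the coarse feature map is upsampled and added to the finer one (a two-level toy with assumed channel sizes, not the full FPN):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniFPN(nn.Module):
    """Simplified two-level feature pyramid sketch."""
    def __init__(self, c_low=64, c_high=128, out=64):
        super().__init__()
        self.lat_low = nn.Conv2d(c_low, out, 1)    # lateral 1x1 convs
        self.lat_high = nn.Conv2d(c_high, out, 1)

    def forward(self, feat_low, feat_high):
        p_high = self.lat_high(feat_high)
        # Top-down pathway: upsample the coarse map to the finer resolution.
        up = F.interpolate(p_high, size=feat_low.shape[-2:], mode="nearest")
        p_low = self.lat_low(feat_low) + up        # fuse the two scales
        return p_low, p_high

fpn = MiniFPN()
p_low, p_high = fpn(torch.randn(1, 64, 32, 32), torch.randn(1, 128, 16, 16))
```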

Transfer learning

Pre-trained models are already capable of recognizing basic patterns (edges, textures and shapes), so there is no need to train from scratch for a new task. By expanding the capabilities of a pre-trained model, the amount of data required and the training time are significantly reduced, and usually better accuracy can be achieved compared to training the same model from scratch. In transfer learning, the backbone is frozen, and only the new head designed for the new task is trained on the new data for a few epochs.
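The freeze-and-replace step looks roughly like this in PyTorch (the toy backbone stands in for a real pretrained one, and the 3-class head is an assumed example):

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained backbone; in practice this comes from a model zoo.
backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
for p in backbone.parameters():
    p.requires_grad = False        # freeze: no gradient updates

head = nn.Linear(16, 3)            # new head for the new 3-class task
model = nn.Sequential(backbone, head)

# The optimizer only receives the head's (trainable) parameters.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
```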

Fine-tuning

Fine-tuning is an extension of transfer learning. Here, not only is the head trained, but the final few layers of the backbone are also unlocked, allowing training to continue on more layers using the new dataset. During fine-tuning, the model can better adapt to the specific task at hand, and accuracy can improve further.
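Continuing the sketch above, unlocking the final backbone stage usually goes together with a smaller learning rate for those layers than for the head (the two-stage backbone and the learning rates are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Toy backbone with two stages; assume transfer learning has already run.
stage1 = nn.Conv2d(3, 16, 3, padding=1)     # earlier layers
stage2 = nn.Conv2d(16, 32, 3, padding=1)    # "last few layers"
head = nn.Linear(32, 3)

for p in stage1.parameters():
    p.requires_grad = False    # earlier layers stay frozen
for p in stage2.parameters():
    p.requires_grad = True     # unlock the final backbone stage

# Per-parameter groups: smaller learning rate for the unfrozen backbone
# layers, larger one for the head.
optimizer = torch.optim.Adam([
    {"params": stage2.parameters(), "lr": 1e-4},
    {"params": head.parameters(), "lr": 1e-3},
])
```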