Real-time operating system (RTOS)

Non-real-time OS disadvantages

Most general-purpose operating systems (Linux, Windows, macOS) do not guarantee that a given task will be executed within a fixed time limit. This is not a problem for most everyday applications, but it is generally unacceptable in embedded systems that have hard timing requirements. If a sensor must be sampled at regular intervals, for example every 0.01 seconds, a non-real-time system may process a sample late because another task is running or something unexpected happens (for example, garbage collection). The resulting timing jitter compromises the accuracy of the sampling, which can make the data unusable in many applications.

Tick rate

The RTOS keeps time with a tick interrupt: an ISR that is triggered periodically by a hardware timer. This is the basis for scheduling and task switching. The higher the tick rate, the finer the timing resolution, at the cost of more CPU time spent servicing the interrupt. If the tick rate is 100 Hz, for example, the system checks every 10 milliseconds whether any delays have expired, decides whether to switch tasks, and handles other RTOS services. Online multiplayer game servers work in a similar way, sending fresh data to players according to a tick rate (usually 32, 64, or 128 updates per second).
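The relationship between tick rate, tick period, and delay length can be sketched in a few lines. This is only an illustration: the function names tick_period_ms and ms_to_ticks are made up for this example and do not belong to any particular RTOS, and rounding a delay up to whole ticks is an assumed policy.

```python
def tick_period_ms(tick_rate_hz: int) -> float:
    """Length of one tick in milliseconds."""
    return 1000.0 / tick_rate_hz


def ms_to_ticks(delay_ms: int, tick_rate_hz: int) -> int:
    """Convert a delay in whole milliseconds to ticks.

    Rounds up (integer ceiling), so the converted delay is never
    shorter than the requested one. This rounding is an assumption
    made for the sketch.
    """
    return (delay_ms * tick_rate_hz + 999) // 1000


period_at_100_hz = tick_period_ms(100)   # 10.0 ms per tick
coarse = ms_to_ticks(25, 100)            # 25 ms needs 3 ticks at 100 Hz
fine = ms_to_ticks(25, 1000)             # 25 ms is exactly 25 ticks at 1000 Hz
```

At 100 Hz a requested 25 ms delay becomes 30 ms, while at 1000 Hz it is represented exactly, which is what "finer timing resolution" means in practice.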

Priority of tasks

In real-time systems, every task has a priority. A task with a higher priority can preempt (interrupt) tasks with a lower priority. For example, if a motor controller needs to update the position of a motor, it must not be delayed just because another task wants to blink a status LED.
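The following sketch simulates priority-based preemption one tick at a time. It is a toy model, not an RTOS scheduler: the Task fields, the run function, and the convention that a larger number means a higher priority are all assumptions made for the illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    priority: int      # larger number = higher priority (assumed convention)
    release_tick: int  # tick at which the task becomes ready
    work_ticks: int    # ticks of CPU time the task still needs


def run(tasks, total_ticks):
    """Each tick, run the highest-priority task that is ready."""
    timeline = []
    for tick in range(total_ticks):
        ready = [t for t in tasks
                 if t.release_tick <= tick and t.work_ticks > 0]
        if not ready:
            timeline.append("idle")
            continue
        current = max(ready, key=lambda t: t.priority)
        current.work_ticks -= 1
        timeline.append(current.name)
    return timeline


# The LED task starts first; the motor task becomes ready at tick 2.
timeline = run([Task("led", 1, 0, 4), Task("motor", 3, 2, 2)], 7)
print(timeline)
# ['led', 'led', 'motor', 'motor', 'led', 'led', 'idle']
```

The LED task is interrupted as soon as the motor task becomes ready and resumes only after the motor task has finished.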

Semaphore

A semaphore is an essential tool in every RTOS. It controls how many tasks can use a resource at the same time. A counting semaphore's value determines how many users can access a given resource at once. A binary semaphore can only be in the free or the busy state, and it is often used for simple signaling between tasks. If multiple tasks try to send data on a shared I2C bus, a binary semaphore can ensure that only one task accesses the bus at a time. Note that a semaphore does not keep track of who took it, so any task can take it or give it.
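Both uses can be tried on a desktop with Python's threading.Semaphore, which here only stands in for an RTOS semaphore. The worker function, the two-slot limit, and the variable names are assumptions made for this sketch; threads play the role of tasks.

```python
import threading
import time

# Counting semaphore: at most two workers may use the resource at once.
slots = threading.Semaphore(2)
stats_lock = threading.Lock()            # protects the bookkeeping below
stats = {"active": 0, "peak": 0, "finished": 0}


def worker():
    with slots:                          # take; blocks while both slots are in use
        with stats_lock:
            stats["active"] += 1
            stats["peak"] = max(stats["peak"], stats["active"])
        time.sleep(0.01)                 # pretend to use the resource
        with stats_lock:
            stats["active"] -= 1
            stats["finished"] += 1
    # leaving the "with slots" block gives the semaphore back


workers = [threading.Thread(target=worker) for _ in range(5)]
for w in workers:
    w.start()
for w in workers:
    w.join()

# Signaling: the semaphore starts at 0, one thread gives it, another takes it.
data_ready = threading.Semaphore(0)
signaler = threading.Thread(target=data_ready.release)
signaler.start()
got_signal = data_ready.acquire(timeout=1.0)   # taken by a different thread
signaler.join()
```

The second half relies on the property described above: the thread that gives the semaphore is not the one that takes it, and the semaphore does not care.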

Mutex

A mutex (mutual exclusion) is another RTOS synchronization tool, designed specifically to protect a critical section: code that only one task may execute at a time. If multiple tasks read and write a global variable, a mutex can prevent them from accessing it simultaneously. This avoids race conditions and data corruption. An important feature of mutexes is ownership: only the task that took a mutex can give it back. Mutexes usually also support priority inheritance. In essence, they are stricter and safer binary semaphores.
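Mutual exclusion and ownership can be illustrated with Python threads. In this sketch threading.RLock stands in for an RTOS mutex because it tracks its owner; it is also recursive and has no priority inheritance, so it is only an approximation. The names shared, add_many, and try_foreign_release are invented for the example.

```python
import threading

mutex = threading.RLock()        # owner-tracking lock used as a mutex stand-in
shared = {"total": 0}            # the "global variable" being protected


def add_many(n):
    for _ in range(n):
        with mutex:              # only one thread at a time runs this section
            shared["total"] += 1


threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Ownership: a thread that did not take the mutex is not allowed to give it.
errors = []


def try_foreign_release():
    try:
        mutex.release()          # this thread never acquired the mutex
    except RuntimeError as exc:
        errors.append(exc)


mutex.acquire()                  # the main thread is now the owner
intruder = threading.Thread(target=try_foreign_release)
intruder.start()
intruder.join()
mutex.release()                  # only the owner can release successfully
```

After the four threads finish, shared["total"] is exactly 40000, and the foreign release attempt ends up in errors instead of unlocking the mutex.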

Priority inversion

Consider three tasks: a high-priority motor control task, a medium-priority display update task, and a low-priority logging task. Suppose the logging task has taken a mutex to update a shared variable. If the motor control task now wants to use the variable, it must wait for the mutex to be released. However, the display update task has a higher priority than the logging task, so it gets the CPU instead, and the logging task cannot finish and release the mutex. The medium-priority task thus indirectly blocks the high-priority one for as long as it keeps running, which can prevent the system from meeting its deadlines.
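The scenario can be reproduced with a small tick-by-tick simulation. Everything here is a simplified model invented for illustration: the Task fields, the "cpu"/"crit" script entries (one tick of ordinary work, or one tick of work inside the mutex-protected section), the release times, and the convention that a larger number means a higher priority.

```python
from dataclasses import dataclass


@dataclass(eq=False)
class Task:
    name: str
    priority: int   # larger number = higher priority (assumed convention)
    release: int    # tick at which the task becomes ready
    script: list    # one entry per tick of work: "cpu" or "crit"
    pc: int = 0     # index of the next script entry


def simulate(tasks, ticks):
    owner = None    # task currently holding the mutex, if any
    timeline = []

    def blocked(t):
        # A task is blocked if it needs the mutex while another task holds it.
        return t.script[t.pc] == "crit" and owner is not None and owner is not t

    for now in range(ticks):
        ready = [t for t in tasks if t.release <= now and t.pc < len(t.script)]
        runnable = [t for t in ready if not blocked(t)]
        if not runnable:
            timeline.append("idle")
            continue
        task = max(runnable, key=lambda t: t.priority)
        if task.script[task.pc] == "crit":
            owner = task                      # take (or keep holding) the mutex
        task.pc += 1
        left_section = (task.pc == len(task.script)
                        or task.script[task.pc] != "crit")
        if owner is task and left_section:
            owner = None                      # give the mutex back
        timeline.append(task.name)
    return timeline


timeline = simulate([
    Task("logging", 1, 0, ["crit", "crit", "crit"]),
    Task("motor",   3, 1, ["crit"]),
    Task("display", 2, 2, ["cpu"] * 4),
], 8)
print(timeline)
# ['logging', 'logging', 'display', 'display', 'display', 'display',
#  'logging', 'motor']
```

The motor task becomes ready at tick 1 but runs only at tick 7: it waits for the logging task, which in turn cannot run while the display task occupies the CPU.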

Priority inheritance

The priority inheritance mechanism is designed to limit the effects of priority inversion. If a high-priority task is waiting for a mutex that is held by a lower-priority task, the priority of the lower-priority task is temporarily raised to match that of the waiting task. The low-priority task therefore gets CPU time ahead of any medium-priority tasks, releases the mutex, and then returns to its original priority.
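The bookkeeping behind this mechanism can be sketched as a small class. This is a deliberately simplified model, not the implementation of any real RTOS: InheritingMutex, its methods, and the Task fields are hypothetical, there is no scheduler, and restoring the base priority on release assumes the task holds only this one mutex.

```python
from dataclasses import dataclass


@dataclass(eq=False)
class Task:
    name: str
    base_priority: int   # larger number = higher priority (assumed convention)
    priority: int = 0    # effective priority, may be boosted temporarily

    def __post_init__(self):
        self.priority = self.base_priority


class InheritingMutex:
    def __init__(self):
        self.owner = None
        self.waiters = []

    def acquire(self, task) -> bool:
        """Return True if the mutex was free; otherwise queue the task
        and lend its priority to the current owner."""
        if self.owner is None:
            self.owner = task
            return True
        self.waiters.append(task)
        if task.priority > self.owner.priority:
            self.owner.priority = task.priority   # the owner inherits
        return False

    def release(self, task):
        if task is not self.owner:
            raise RuntimeError("only the owner may release a mutex")
        task.priority = task.base_priority        # drop the inherited priority
        self.owner = None
        if self.waiters:
            # Hand the mutex to the highest-priority waiter.
            nxt = max(self.waiters, key=lambda t: t.priority)
            self.waiters.remove(nxt)
            self.owner = nxt


logging_task = Task("logging", 1)
display_task = Task("display", 2)
motor_task = Task("motor", 3)

mutex = InheritingMutex()
took_free = mutex.acquire(logging_task)      # True: the mutex was free
motor_got_it = mutex.acquire(motor_task)     # False: motor has to wait
boosted = logging_task.priority              # 3, now above display's 2
mutex.release(logging_task)                  # priority back to 1, motor owns it
```

While the logging task holds the mutex its effective priority is 3, so a scheduler that always runs the highest-priority ready task would pick it over the display task until the mutex is released.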

The case of Mars Pathfinder (1997)

During the Mars Pathfinder mission in 1997, the lander's computer reset several times due to priority inversion in its software. A low-priority task that collected meteorological data took a mutex, while a high-priority task that managed the spacecraft's information bus waited for the same mutex to be given. Because medium-priority tasks kept running in the meantime, the low-priority task could not release the mutex, and the high-priority task was unable to run. A watchdog timer noticed that the high-priority task had missed its deadline and restarted the system. These restarts delayed the mission's work. Ultimately, NASA engineers enabled priority inheritance on the mutex remotely.

Queue

With queues, tasks can safely pass data to each other without communicating directly, which makes the system more modular and robust. For example, a task that generates measurement data writes to the queue at its own pace, and a processing task reads the measurements at its own pace. The queue buffers the data in between.
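The producer/consumer pattern described above can be tried with Python's standard queue.Queue, which stands in for an RTOS queue here. The queue length of four, the STOP sentinel, and the "processing" step (doubling each sample) are arbitrary choices made for this sketch.

```python
import queue
import threading

measurements = queue.Queue(maxsize=4)   # bounded, like a fixed-length RTOS queue
STOP = object()                         # sentinel that tells the consumer to finish
results = []


def producer():
    for sample in range(10):
        measurements.put(sample)        # blocks while the queue is full
    measurements.put(STOP)


def consumer():
    while True:
        item = measurements.get()       # blocks while the queue is empty
        if item is STOP:
            break
        results.append(item * 2)        # "process" the measurement


tasks = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in tasks:
    t.start()
for t in tasks:
    t.join()

print(results)
# [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

Neither function refers to the other; they share only the queue, and the order of the samples is preserved even when one side is temporarily faster than the other.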