Using CPU Guarantees to Build Secure, Survivable Embedded Systems
Traditionally, realtime operating systems (RTOSs) have used priority-based preemptive scheduling to determine which software task gets control of the CPU. While this approach helps ensure that higher-priority tasks always execute in preference to lower-priority tasks, it can lead to a condition known as task starvation.
For instance, let’s say a system contains two tasks, A and B, where A has a slightly higher priority than B. If A becomes swamped with work, it will lock out B (as well as any other lower-priority task) from accessing the CPU. In effect, the lower-priority task becomes starved of CPU time.
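The lockout is easy to reproduce in a toy model. The Python sketch below is an illustration only, not code from any RTOS: the function name `run_priority_scheduler`, the priority values, and the tick counts are assumptions chosen for the example. It models strict priority-based preemptive scheduling by letting the highest-priority ready task run on every tick.

```python
def run_priority_scheduler(tasks, ticks):
    """Simulate strict priority-based preemptive scheduling.

    tasks maps a task name to (priority, demand), where demand is the
    number of ticks of work the task wants and a larger priority wins.
    Returns the number of ticks each task actually executed.
    """
    remaining = {name: demand for name, (_prio, demand) in tasks.items()}
    executed = {name: 0 for name in tasks}
    for _ in range(ticks):
        ready = [name for name in tasks if remaining[name] > 0]
        if not ready:
            break  # nothing left to run; the CPU would idle
        # The highest-priority ready task always gets the CPU.
        winner = max(ready, key=lambda name: tasks[name][0])
        executed[winner] += 1
        remaining[winner] -= 1
    return executed


# Task A has only a slightly higher priority than B, but it is swamped.
swamped = run_priority_scheduler({"A": (11, 1000), "B": (10, 50)}, ticks=100)
print(swamped)  # {'A': 100, 'B': 0}

# When A has a modest workload, B gets the CPU once A is done.
normal = run_priority_scheduler({"A": (11, 30), "B": (10, 50)}, ticks=100)
print(normal)  # {'A': 30, 'B': 50}
```

In the swamped case, B never runs, even though the priority difference is a single level. The size of the gap does not matter; only the ordering does.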
This problem occurs in all fields of embedded software. In an automobile, task A might be the navigation display and task B the MP3 player – if the navigation system consumes too many CPU cycles when performing a route calculation, it can starve the MP3 player and cause MP3s to skip. In an industrial control system, task A could be the robot-arm control loop and task B the human-machine interface (HMI). In a network router, task A might be the TCP/IP stack and routing protocols, and task B the SNMP agent.
Task starvation makes embedded systems vulnerable to several problems, including:
Denial of Service (DoS) attacks – If the embedded system is connected to a network, a malicious user could bombard the system with requests that need to be handled by a high-priority task. That task will then overload the CPU and starve other tasks of CPU cycles, making the system unavailable to users or operators.
Integration headaches – Typically, the many subsystems, processes, and threads that make up a modern embedded system are developed in parallel by multiple development groups, each with its own performance goals, task-prioritization schemes, and approaches to runtime optimization. Given the parallel development paths, task starvation issues invariably arise when integrating the subsystems from each development group. Unfortunately, few designers or architects are capable of diagnosing and solving these problems at a system level. Designers must juggle task priorities, possibly change task behavior across the system, and then retest and refine their modifications. The entire process can easily take several calendar weeks.
Limited scalability and upgradeability – Often, adding new or upgraded software can push a system ‘over the brink’ and starve existing applications of CPU time. Applications or system services that were functioning in a timely manner no longer respond as expected or required. In many cases, the only solution is to either add new hardware or redesign the system’s software.
The case for partitioning
Recently, the concept of partitioning has gained mindshare as a way to address these problems. Briefly stated, this approach allows design teams to compartmentalize software into separate partitions, where each partition is allocated a guaranteed portion (or budget) of CPU time. Each partition provides a stable, known runtime environment that development teams can build and verify individually. This partitioning enforces resource budgets, either through hardware or software, to prevent tasks in any partition from monopolizing CPU cycles needed by tasks in other partitions.
Fixed partition schedulers
A few RTOSs offer a fixed partition scheduler that lets the system designer group tasks into partitions and allocate a percentage of CPU time to each partition. With this approach, no task in any given partition can consume more than the partition’s statically defined percentage of CPU time.
For instance, let’s say a partition is allocated 30% of the CPU. If a process in that partition subsequently becomes the target of a denial of service attack, it will consume no more than 30% of CPU time. This approach prevents the process under attack from consuming the entire CPU.
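The 30% cap can be expressed as simple arithmetic over a scheduling window. The sketch below is a simplified model, not the algorithm of any particular RTOS: the function `run_fixed_partitions`, the partition names, the 100-tick window, and the 50% and 20% budgets for the other two partitions are assumptions made for illustration.

```python
def run_fixed_partitions(budgets, demand, window=100):
    """Model one scheduling window under fixed partitioning.

    budgets maps a partition name to its percentage of CPU time.
    demand maps a partition name to the ticks of work it wants.
    Returns (ticks used per partition, idle ticks in the window).
    """
    used = {}
    for name, percent in budgets.items():
        limit = window * percent // 100  # statically defined cap
        used[name] = min(demand.get(name, 0), limit)
    idle = window - sum(used.values())
    return used, idle


budgets = {"network": 30, "control": 50, "hmi": 20}

# The network partition is under attack and wants far more than its share.
demand = {"network": 1000, "control": 40, "hmi": 10}

used, idle = run_fixed_partitions(budgets, demand)
print(used)  # {'network': 30, 'control': 40, 'hmi': 10}
print(idle)  # 20
```

The partition under attack is held to 30 ticks out of 100, so the other partitions receive everything they ask for. Note that 20 ticks of the window go unused in this run.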
Fixed partition schedulers have their drawbacks, however. Since the scheduling algorithm is fixed, a partition can never use more CPU cycles than its fixed budget, even if the system contains unused CPU cycles. Meanwhile, partitions that aren’t busy waste time in an idle state. This approach squanders valuable (and available) CPU cycles and prevents the system from handling peak demands.
Because of this ‘use it or lose it’ approach, fixed cyclical schedulers can achieve only 70% CPU utilization. Manufacturers must, as a result, use more-expensive processors, tolerate a slower system, or limit the amount of functionality that the system can support. As a further problem, developers must modify code to implement or change each partition.
Adaptive partitioning
Another approach, called adaptive partitioning, addresses these drawbacks by providing a more dynamic scheduling algorithm. Like fixed partitioning, adaptive partitioning allows the system designer to reserve CPU cycles for a process or a group of processes. The designer can thus guarantee that the load on one subsystem or partition won’t affect the availability of other subsystems.
Unlike fixed approaches, however, adaptive partitioning recognizes that CPU utilization is sporadic and that one or more partitions can often have idle time available. Consequently, an adaptive partitioning scheduler will dynamically reallocate those idle CPU cycles to partitions that can benefit from the extra processing time. This approach, which was pioneered by QNX Software Systems, offers the best of both worlds: it can enforce CPU guarantees when the system runs out of excess cycles (for guaranteed availability of lower-priority services) and can dispense free CPU cycles when they become available (for maximum CPU utilization and performance).
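The difference from a fixed scheme can be shown with a small allocation model. This is a hedged sketch and not QNX's actual scheduler: the function `allocate_adaptive`, the partition names, the 100-tick window, and the rule for sharing spare ticks (give the next tick to the partition that has used the least relative to its budget) are all assumptions made for illustration.

```python
def allocate_adaptive(budgets, demand, window=100):
    """Model one scheduling window under adaptive partitioning.

    Each partition first receives up to its guaranteed share. Ticks that
    would otherwise sit idle are then handed to partitions that still
    have work. Budgets are percentages and must be greater than zero.
    Returns (ticks used per partition, ticks still idle).
    """
    guaranteed = {p: window * pct // 100 for p, pct in budgets.items()}
    used = {p: min(demand[p], guaranteed[p]) for p in budgets}
    spare = window - sum(used.values())
    while spare > 0:
        hungry = [p for p in budgets if used[p] < demand[p]]
        if not hungry:
            break  # every partition is satisfied; the rest really is idle
        # Assumed sharing rule: favor the partition that has consumed
        # the least relative to its budget.
        pick = min(hungry, key=lambda p: used[p] / budgets[p])
        used[pick] += 1
        spare -= 1
    return used, spare


budgets = {"network": 30, "control": 50, "hmi": 20}

# Light load elsewhere: the busy network partition borrows the idle ticks.
used, idle = allocate_adaptive(
    budgets, {"network": 1000, "control": 40, "hmi": 10})
print(used, idle)  # {'network': 50, 'control': 40, 'hmi': 10} 0

# Every partition overloaded: each one is held to its guaranteed budget.
used, idle = allocate_adaptive(
    budgets, {"network": 1000, "control": 1000, "hmi": 1000})
print(used, idle)  # {'network': 30, 'control': 50, 'hmi': 20} 0
```

In the first run the network partition uses 50 ticks instead of 30 and no tick is wasted. In the second run there are no excess cycles, so the guarantees take over.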
Adaptive partitioning offers several advantages, including the ability to:
– provide CPU time guarantees when the system is heavily loaded – this ensures that all partitions receive their fair budget of CPU time
– use realtime, priority-based scheduling when the system is lightly loaded – this allows systems to use the same scheduling behavior that they do today
– make use of free CPU time from partitions that aren’t completely busy – this gives other partitions the extra processing time they need to handle peak demands and permits 100% processor utilization
– overlay the adaptive partitioning scheduler onto existing systems without code changes – tasks can simply be launched in a partition, and the scheduler will ensure that partitions receive their allocated budget
– guarantee that fault-recovery operations have the CPU cycles they need to repair system faults, thereby improving mean time to repair for high-availability systems
– stop external systems from stealing all the CPU time through a DoS attack.
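The first three advantages in the list above can be seen together in a tick-by-tick model. The sketch below is one possible interpretation, not the implementation of any shipping scheduler: the function `run_adaptive`, the task and partition names, the priorities, and the 100-tick window are assumptions. On each tick it runs the highest-priority ready task whose partition is still within budget; if every ready task belongs to a partition that has spent its budget, the tick counts as free time and goes to the highest-priority ready task.

```python
def run_adaptive(budgets, tasks, window=100):
    """Simulate one window of a simplified adaptive partitioning scheduler.

    budgets maps a partition name to its percentage of CPU time.
    tasks maps a task name to (partition, priority, demand in ticks).
    Returns the list of task names in the order they ran, one per tick.
    """
    limit = {p: window * pct // 100 for p, pct in budgets.items()}
    used = {p: 0 for p in budgets}
    remaining = {name: t[2] for name, t in tasks.items()}
    trace = []
    for _ in range(window):
        ready = [n for n in tasks if remaining[n] > 0]
        if not ready:
            break  # no work left in this window
        in_budget = [n for n in ready if used[tasks[n][0]] < limit[tasks[n][0]]]
        # Prefer tasks whose partition still has budget; otherwise the
        # tick is free time and any ready task may use it.
        candidates = in_budget or ready
        winner = max(candidates, key=lambda n: tasks[n][1])
        remaining[winner] -= 1
        used[tasks[winner][0]] += 1
        trace.append(winner)
    return trace


budgets = {"network": 30, "control": 50, "hmi": 20}

# Lightly loaded: the run order is plain priority order.
light = run_adaptive(budgets, {
    "net": ("network", 20, 10),
    "ctl": ("control", 15, 20),
    "hmi": ("hmi", 10, 5),
})
print(light == ["net"] * 10 + ["ctl"] * 20 + ["hmi"] * 5)  # True

# The high-priority network task is flooded with requests.
flood = run_adaptive(budgets, {
    "net": ("network", 20, 1000),
    "ctl": ("control", 15, 40),
    "hmi": ("hmi", 10, 10),
})
print({n: flood.count(n) for n in ("net", "ctl", "hmi")})
# {'net': 50, 'ctl': 40, 'hmi': 10}
```

Under light load the model behaves exactly like a priority scheduler. Under the flood, the lowest-priority task still completes its work, and the flooded task picks up the 20 ticks that the other partitions did not need.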
Partitioning plus performance
Embedded software is becoming so complex that, without some form of partitioning, system designers and software engineers will be hard-pressed to satisfy the conflicting demands for performance, security, time to market, innovative features, and system availability. An OS-controlled approach to partitioning goes a long way toward addressing these requirements by providing each subsystem with a guaranteed portion of CPU cycles, while still delivering the deterministic, realtime response that embedded systems require.
Manufacturers can, as a result, readily integrate subsystems from multiple software teams, add new software components without compromising the behavior of existing components, and protect their systems from denial of service attacks and other network-based exploits. If the partitioning model also provides a flexible, efficient scheduler that allows partitions under load to borrow unused CPU time from other partitions, then manufacturers can realize these various benefits without having to incur the cost of faster, more expensive hardware.
Kerry Johnson is a senior product manager, Jason Clarke is a senior field application engineer, and Paul Leroux is a technology analyst with QNX Software Systems.