..
   # *******************************************************************************
   # Copyright (c) 2025 Contributors to the Eclipse Foundation
   #
   # See the NOTICE file(s) distributed with this work for additional
   # information regarding copyright ownership.
   #
   # This program and the accompanying materials are made available under the
   # terms of the Apache License Version 2.0 which is available at
   # https://www.apache.org/licenses/LICENSE-2.0
   #
   # SPDX-License-Identifier: Apache-2.0
   # *******************************************************************************

Fixed execution order framework (FEO)
#####################################

.. document:: Fixed execution order framework
   :id: doc__frameworks_feo
   :status: valid
   :security: NO
   :safety: ASIL_B
   :tags: feature_request, frameworks_feo


.. toctree::
   :hidden:

   requirements/feature_req.rst
   requirements/aou_req.rst
   requirements/chklst_req_inspection.rst
   architecture/feature_architecture
   architecture/chklst_arch_inspection.rst
   safety_planning/index.rst
   safety_analysis/feature_fmea.rst
   safety_analysis/feature_dfa.rst
   safety_analysis/chklst_safety_analysis_inspection.rst

Feature flag
============

To activate this feature, use the following feature flag:

`experimental_feo`

Abstract
========

This contribution request describes the fixed execution order and reprocessing
framework (FEO), which is intended to support data-driven or time-driven
applications. It provides a fixed execution order for activities and the
necessary infrastructure to reprocess activities in a simulated environment.


Motivation
==========

There are several automotive use-cases that require a fixed and deterministic computation of tasks.
This is particularly crucial for safety-critical applications where the execution order
of tasks is essential for the correct operation of the system. The FEO framework is designed
for applications supporting data-driven and time-driven applications
mainly in the ADAS domain, ensuring a fixed execution order and supporting reprocessing.
(See also :need:`stkh_req__app_architectures__support_data`, :
need:`stkh_req__app_architectures__support_time`, `stkh_req__dev_experience__reprocessing`)

Key aspects of S-CORE and FEO framework are:

* a framework for applications (not for platform services)
* for data-driven and time-driven applications (mainly in the ADAS domain)
* support fixed execution order
* supporting reprocessing

In the following we will explain and argue how and with which major components
these aspects can be implemented.


Applications
============

* The framework is used to build applications
* Multiple applications based on
  the framework can run in parallel on the same host machine
* Applications based on the framework can run in parallel with other
  applications not based on the framework
* The framework does not support
  communication between different applications (except via service activities,
  see below)


Activities
==========

* Applications consist of activities
* Activities are a means to structure applications into building blocks
* Activities have init(), step() and shutdown() entry points
* The framework provides the following APIs to the activities running on it:

  - Read time (feo::time)
  - Communicate to other activities (feo::com)
  - Log (feo::log)
  - Configuration parameters (feo::param)
  - Persistency (feo::pers)
  - Tracing (feo::tracing)

* There are two types of activities:

  - Application activities
  - Service activities

Application Activities
======================

* Application activities must only use APIs provided by the framework as defined above
* Application activities are single threaded, they can not run outside of their entry points,
  they must not spawn other threads or process
* Activities can be implemented in C++ or Rust, mixed systems with both
  C++ and Rust activities shall be supported.


Service Activities
==================

* Service activities are a means to interact with the outside world, e.g. via
  network communication, direct sensor input or direct actuator output
* Service activities may also use APIs external to the framework
  (e.g. networking APIs, reading from external sensor devices, writing HW I/O, etc.)
* Service activities run at the beginning ("input service activity") and at the end
  ("output service activity") of a task chain (see below)
* Input service activities provide the input values to the application activities
  within the task chain, by means of communication
* All input service activities must finish execution before the first application activity
  is run. This can be achieved by proper setup of the chain dependencies (see below)
* There must be at least one input service activity
* Output service activities consume output values from the application activities
  calculated within the task chain an provide them to the outside world
* All output service activities must run after all application service activities have
  finished execution. This is achieved by proper setup of the chain dependencies (see below)
* There must be at least one output service activity


Communication
=============

* Application type activities can only communicate to other activities within
  the same application and using the provided communication API
* Communication consists of sending and receiving messages on named topics
* The receiver of a message on a topic does not know the sender, instead it only
  relies on the message itself independent of the source of the message
* There can only be one sender per topic but multiple receivers
* Optional: there can be multiple senders per topic
* There is no publish/subscribe mechanism accessible to activities, instead
  the set of known communication topics and the assignment of which activity
  sends and receives to/from which topic is "runtime static"
* "runtime static" means "static after the startup phase", i.e. during startup, the
  framework can configure or build up communication connections, but as soon as the
  run phase starts (where the activities' step() functions are called), the connections
  are fixed and will not change any more.
* Communication relations are typically configured in configuration files
* Messages/topics are statically typed
* Only messages of the matching type can be sent/received on a specific topic
* The binary representation of messages is defined by the framework in order
  to support communication between activities implemented in different
  languages (C++/Rust)
* Message types may be primitive types or complex (nested) types
* Complex types can be built by using structs and arrays of types
* Sending a message by an activity involves the following steps:

  - Call API to acquire a handle to a message buffer for a certain topic
  - Fill data into the provided memory buffer
  - Call API to send the message

* Reception of a message by an activity involves the following steps:

  - Use API to receive message from a certain topic, this returns a handle to a data buffer
  - Read message data from data buffer

* The receiver can not modify the message, the framework will enforce this,
  for example by using read-only types or by configuring memory protect of the OS

Queuing of topics:

* Queuing can be enabled per topic, a queue of length N means that the last N messages are
  kept for a specific topic
* Receivers have access to the last N elements, reading an element from the
  queue by a receiver doesn't change the queue, i.e. doesn't remove it from the queue.
  instead all receiver will always see the last N elements
* Optional: a queue pointer to the element last read is maintained per receiver.
  However, the queue with its buffers still only exists once per topic. If one receiver
  receives an element from the queue, its queue pointer is incremented so that next
  time it reads the next element, this does not affect the queue pointers of other receivers
* Queue enable and queue length are "runtime static" configuration settings


Process/Thread/Activity Mapping
===============================

* An application consists of one or more processes
* One of the processes is the primary process
* If there is more than one process, the other processes are secondary processes
* There can be one or more threads per process
* The number of processes and threads is statically defined and
  does not change once the application has been started (runtime static)
* Activities are statically mapped to threads within processes within the application
* There can be multiple activities mapped to the same thread

* There is one executable per process, so an application may consist of multiple executables
* Each executable contains part of this framework as well as the activities mapped to the
  corresponding process
* It is assumed that an external entity starts all the executables belonging to the
  same application. The reason for this is, that for security reasons, only very
  specific entities should have the ability to create processes
* The executables belonging to an application are grouped (e.g. in the filesystem) so that
  it's clear that they belong together
* One reason for having multiple processes per application is to
  achieve Freedom From Interference for safety relevant applications


Static mapping of activities to threads
'''''''''''''''''''''''''''''''''''''''

As pointed out above, FEO activities are required to be mapped to threads in a static way.
The rationale behind this requirement is:

* Calling activity functions init(), step() and shutdown() from a single pre-defined thread
  allows implementations to make use of thread-local optimizations such as thread-local variables.
* Calling an activity's step() function from different threads in different iterations of the
  task chain may cause execution time jitter e.g. from unpredictable cache misses or different
  properties of the processor cores the respective threads may be assigned to.
* Most importantly, a dynamic assignment of activities to threads may result in non-deterministic
  variations of the task-chain execution time.

To understand how a dynamic thread assignment can cause execution time variations, consider
the following example (sub-) task chain.

|example_task_chain|

Here, activity 6 depends on, i.e. must be executed after activities 1 to 5. The length of the bars
is intended to indicate the relative computation time needed by the respective activity on a
single processor core. It is assumed that all of these activities will be executed in the same process.

In a simple approach, each of the activities 1 to 5 could be assigned to its own thread and
activity 6 could be executed subsequently in one of these threads as shown in the figure below.
Each blue "lane" indicates one thread.

|example_task_chain_5_threads|

If each thread runs on a separate core and execution is not interrupted by other tasks, the length of
the blue box is related to the total execution time of the task chain.

Approximately the same total execution time can be achieved with only 3 threads (on three cores), if the
tasks are assigned in an optimized way:

|example_task_chain_3_threads_optimized|

If, on the other hand,  activities are assigned to the same 3 threads in a dynamic way, the execution
time may vary unpredictably, because of the possibly varying execution sequence of activities,
as can be seen below.

|example_task_chain_3_threads_dynamic|


Lifecycle
=========

* The lifecycle of an application consists of 3 phases:

  - startup phase
  - run phase
  - shutdown phase

* During startup phase, the primary process connects with the secondary processes
  (if present), in order to:

  - Build up connections for communication (e.g. find shared memory segments
    provided/consumed)
  - Connect to the parameter service
  - Coordinate the init and later the shutdown process
  - Coordinate the execution of the task chain (see below)

* During the shutdown phase, the primary process coordinates the shutdown of
  all secondary processes
* The connection between primary and secondary processes is kept up as long as the
  application is running
* If the connection breaks down unexpectedly while the application is running,
  the involved processes terminate (either by a command from the primary process
  or by detecting connection loss to the primary process)

Activity Init:

* At the end of the startup phase, the framework will invoke the init() entry point
  of each activity
* The init() method will run in the thread assigned to the activity.
* The order in which init() is called for different activities is arbitrary, it may happen in parallel or sequentially.

Activity Shutdown:

* At the beginning of the shutdown phase, the framework will invoke the shutdown()
  entry point of each application
* The shutdown() method will run in the thread assigned to the activity.
* The order of invoking the shutdown() entry points across activities is not defined,
  invocation may happen in parallel or sequentially


Scheduling
==========

* Activities are arranged in a task chain
* There is exactly one task chain per application
* The task chain describes the execution order of the activities in the run phase
* Task chains run cyclically, e.g. every 30ms
* Optional: task chains can be triggered on event
* All activities are executed once per task chain run
* All activities finish within a single task chain run
* Running an activity means that the framework is calling its step() function
  within the process/thread it has been mapped to
* The execution order is defined by a dependency model:

  - Each activity can depend on N other activities in the same task chain
  - An activity's step() function gets called as soon as the step()
    functions of the activities it depends on have been called

* The framework takes care to run the activities in this order,
  independent of the thread/process the activity is mapped to
* While the order is guaranteed, there is no guarantee that an activity is
  run immediately after all its dependencies have finished.
  For example if two activities mapped to the same thread are ready to run
  at the same time, they can still only run one after the other
* Note however, that for a particular (static) setup of threads, processes
  and activity mapping, the invocation delay is deterministic
  (apart from differences in the activity execution times)
* The execution order and timing of an activity are independent of any communication that activity may perform.
* The dependencies should be defined by the application developer in a way so that
  processing results passed via communication are available when they are needed
  (if an activity needs an output of another activity it sets that other
  activity as its dependency and therefore will only run once the other one
  is finished and therefore has produced the results the first one needs)


Executor and Agents
===================

* The coordinating entity in the primary process is the "executor"
* The executor coordinates the invocation of the activities in the
  order as described above
* As a central entity, the executor is able to trace and monitor the
  system behavior as sequence of activity invocations (see below)
* The actual activity invocation is done by an "agent"
* The agent exists in each process belonging to an application
* The agent connects to the executor during the startup phase
* The agent takes invocation commands sent by the executor and
  executes them in its local process on behalf of the executor


Tracing
=======

* The framework can make use of the tracing API (feo::tracing) to trace
  the program flow, mainly for debugging purposes.
* The tracing events generated by the tracing API can be recorded for later
  inspection e.g. using a UI like Google Perfetto or Eclipse TraceCompass.


Performance
===========

The framework is designed to ensure deterministic execution order and
timing of activities, supporting safety-critical applications
in the automotive domain. In this domain the footprint of the framework is crucial especially
w.r.t impact of computation load and latency.


.. |example_task_chain| image:: _assets/example_task_chain.png

.. |example_task_chain_5_threads| image:: _assets/example_task_chain_5_threads.png

.. |example_task_chain_3_threads_optimized| image:: _assets/example_task_chain_3_threads_optimized.png

.. |example_task_chain_3_threads_dynamic| image:: _assets/example_task_chain_3_threads_dynamic.png


Error Handling
==============

Possible error cases during the different FEO life cycle states shall be handled as follows. For now, the
descriptions are focussed on the intended implementation for S-CORE v0.5. Potential adaptations for
S-CORE v1.0 have been noted down in the next Section.

* Independent of state
    - If the primary process dies, the external lifecycle management shall kill all dependent processes.
    - If a secondary process dies, the lifecycle management shall send a termination signal to the primary process.
      The primary process shall call the shutdown function of all remaining activities in arbitrary sequence and
      terminate itself.

* State: Lifecycle Manager creates all processes (primary & secondaries)
    - If not all secondaries connect to the primary in time, the primary will terminate itself.
      The startup functions shall not be triggered.

* State: Lifecycle Manager has created all processes (primary & secondaries), all secondaries have connected to the primary
    - If an error occurs during the execution of a startup function, the primary process shall abort calling startup functions
          and terminate itself. For all of the activities whose startup functions have already been called successfully,
          the corresponding shutdown functions shall be executed in arbitrary sequence.
    - During initialization (i.e. in the startup function of an activity), activities shall check for resource allocation
      and report an error to the executor in case of failure.
    - If a timeout occurs during startup, stepping or shutdown of an activity, the primary process shall shutdown
      all successfully started activities in arbitrary sequence and terminate itself.
    - If not all activities reach their initialized state within a certain period of time (startup timeout),
      the primary process shall shutdown all successfully started activities in arbitrary sequence and terminate itself.

* State: Lifecycle Manager has created all processes  (primary & secondaries), all secondaries have connected to the primary, all activities have been started up successfully
    - If an activity fails in the step function, the primary process shall call shutdown for all activities in
      arbitrary sequence and terminate itself.
    - If activities do not meet their intermediate (time/memory/cpu-) budgets the issue shall be detected and handled
      outside of FEO. (Resource supervision and quotas will be defined in a separate feature request, if needed.)

* State: Shutdown of activities
    - If an activity fails in the shutdown function, the primary process shall shutdown all remaining activities
      and terminate itself.


Extended features for S-CORE v1.0
=================================

The following features will not be implemented as part of S-CORE v0.5, but have been noted down as potential extensions
for v1.0. They shall be considered as drafts only.


External state
''''''''''''''

* Depending on the reprocessing scenario (see below) it might be necessary
  to put the activities into a well defined state. This can either be done
  by providing all the input to the activities which they need to get
  into that state (which could involve many task chain invocations).
  Another way is to let the framework record activity state just as it
  records communication messages
* External state is a means to make activity state recordable
* Using external state, activities don't hold their state in activity local
  variables (like C++ member variables) but in a state storage provided
  by the framework. This way, they "do not remember anything" from the
  last task chain invocation. Instead, on every new task chain invocation,
  they first read in the external state from the framework provided storage,
  then potentially manipulate the state based on their inputs and then
  store it back for the next task chain invocation


Recording
'''''''''

* As a central entity, the executor is able to record the system behavior as sequence
  of activity invocations.
* The framework can record all messages going over its communication topics
* For each message the recording includes:

  - topic
  - data
  - timestamp
  - sender [optional]

* The framework can record certain execution events:

  - task chain start/end
  - init/step/shutdown() entry point enter per activity
  - init/step/shutdown() entry point leave per activity

* For each event the recording includes:

  - type (e.g. step_enter)
  - context (e.g. activity name of step() entered)
  - timestamp


Reprocessing
''''''''''''

* There are multiple possible reprocessing scenarios, for example:

  - replay of one or many executions of a task chain
  - replay of one or many executions of a single activity

* In a replay scenario, the framework is used to reproduce the communication messages
  and other API behavior (e.g. time, parameters, persistency) as was
  recorded in a previous run
* In case a whole task chain is reprocessed, the outputs of the input service activities
  will be reproduced
* In case only a single activity is reprocessed, the outputs of the predecessors
  in the task chain will be reproduced
* Outputs of application activities are typically not replayed but
  freshly calculated by the activities running during the replay
* The framework supports reprocessing by

  - Starting a task chain at the same point in time as recorded
  - Replaying communication data as recorded
  - Providing time via its time API as recorded


Extended Error Handling
'''''''''''''''''''''''

* Independent of state
    - If the primary process dies, the external lifecycle management shall kill all dependent processes.
    - If a secondary process dies, the lifecycle management shall send a termination signal to the primary process.
      The primary process shall call the shutdown function of all remaining activities in arbitrary sequence and
      terminate itself.

* State: Lifecycle Manager creates all processes (primary & secondaries)
    - If not all secondaries connect to the primary in time, the primary will not terminate, but report an error to the
      lifecycle/health management. The startup functions shall not be triggered.

* State: Lifecycle Manager has created all processes (primary & secondaries), all secondaries have connected to the primary
    - If an error occurs during the execution of a startup function, the primary process shall abort calling startup
      functions and terminate itself. For all of the activities whose startup functions have already been called successfully,
      the corresponding shutdown functions shall be executed in arbitrary sequence.
      In addition, the primary process shall report the issue to health management.
    - During initialization (i.e. in the startup function of an activity), activities shall check for resource allocation
      and report an error to the executor in case of failure.
    - If a timeout occurs during startup, stepping or shutdown of an activity, the primary process shall shutdown all
      successfully started activities in arbitrary sequence and terminate itself.
      In addition, the primary process shall report the issue to health management.
    - If not all activities reach their initialized state within a certain period of time (startup timeout), the
      primary process shall shutdown all successfully started activities in arbitrary sequence and terminate itself.
      In addition, the primary process shall report the issue to health management.

* State: Lifecycle Manager has created all processes  (primary & secondaries), all secondaries have connected to the primary, all activities have been started up successfully
    - If an activity fails in the step function, the primary process shall call shutdown for all activities in
      arbitrary sequence and terminate itself.
      In addition, a logical waypoint error shall be reported to health management.
    - If activities do not meet their intermediate (time/memory/cpu-) budgets the issue shall be detected and handled
      outside of FEO. (Resource supervision and quotas will be defined in a separate feature request, if needed.)

* State: Shutdown of activities
    - If an activity fails in the shutdown function, the primary process shall shutdown all remaining activities and
      terminate itself.
      In addition, a logical waypoint error shall be reported to health management.