..
   # *******************************************************************************
   # Copyright (c) 2024 Contributors to the Eclipse Foundation
   #
   # See the NOTICE file(s) distributed with this work for additional
   # information regarding copyright ownership.
   #
   # This program and the accompanying materials are made available under the
   # terms of the Apache License Version 2.0 which is available at
   # https://www.apache.org/licenses/LICENSE-2.0
   #
   # SPDX-License-Identifier: Apache-2.0
   # *******************************************************************************

Health Monitor
##############

The :term:`Health Monitor` is a library that, together with the :term:`Launch Manager`, provides a way to monitor
application health in a similar fashion to the AUTOSAR `Platform Health Manager` (PHM).

The main features of the :term:`Health Monitor` are the following monitoring functions:

- Alive supervision

  - Periodic monitoring of checkpoints: the number of notifications received in a given interval must match the
    pre-configured expectation (not too many, not too few).
  - Protects against checks running too often or too rarely.

- Deadline supervision

  - Enforces a timing requirement between two checkpoints.

- Logical supervision

  - Specifies the order in which two or more checkpoints must be called.

The :term:`Health Monitor` itself is monitored by the :term:`Launch Manager` via alive supervision only.

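The alive supervision listed above can be illustrated with a minimal sketch. The class ``AliveSupervision`` and its
parameters ``min_indications`` and ``max_indications`` are hypothetical names chosen for this example and are not
part of the :term:`Health Monitor` interface. The sketch only shows the counting rule, not the timing of the interval.

.. code-block:: python

   from dataclasses import dataclass


   @dataclass
   class AliveSupervision:
       """Hypothetical alive supervision for a single checkpoint."""

       min_indications: int  # fewer reports per interval means "too rarely"
       max_indications: int  # more reports per interval means "too often"
       count: int = 0

       def report_checkpoint(self) -> None:
           """Called by the application each time the checkpoint is reached."""
           self.count += 1

       def evaluate_interval(self) -> bool:
           """Called once at the end of each supervision interval.

           Returns True if the number of reports was within the configured
           range, and resets the counter for the next interval.
           """
           ok = self.min_indications <= self.count <= self.max_indications
           self.count = 0
           return ok


   supervision = AliveSupervision(min_indications=1, max_indications=2)

   supervision.report_checkpoint()
   first_interval_ok = supervision.evaluate_interval()   # one report: within range

   second_interval_ok = supervision.evaluate_interval()  # no report: too few

   for _ in range(3):
       supervision.report_checkpoint()
   third_interval_ok = supervision.evaluate_interval()   # three reports: too many

Resetting the counter on every evaluation keeps the intervals independent, so a burst in one interval cannot
compensate for missing reports in the next.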

Overview
========

- Inter-process communication (IPC) is needed only for the alive monitoring between the :term:`Health Monitor` and the :term:`Launch Manager`.
- Easier configuration

  - The monitoring rules can be configured dynamically, on demand.
  - The monitoring can be started and stopped dynamically.

- Potentially easier debugging

  - All events originate from a single process, so they do not have to be correlated between the monitored application and a separate monitor process.

Drawbacks over classical external process monitoring
----------------------------------------------------

- Harder safety argumentation. The following section describes the issues and possible solutions.

Safety
------

Because the :term:`Health Monitor` is linked into the monitored application, it raises the following safety concerns:

- How can it be ensured that the monitored application does not interfere with the monitoring functionality?
- How can it be ensured that the :term:`Health Monitor` does not incorrectly report alive to the :term:`Launch Manager` when it has detected
  a supervision error?


These concerns are valid, but can be addressed using the following techniques:

- Hiding the internal data from the user:

  - For example, if the monitoring is implemented in a thread, the thread ID must not be exposed to the calling application.
  - Hide the implementation, for example with the pImpl idiom.

- Protecting the memory of the library with guard pages adjacent to the application memory, and protecting it with ``mprotect()``

  - possibly with the help of a custom allocator

- Protecting the data with a checksum and possibly a sequence counter

  - The internal data of the library can be checksummed every operation cycle. Adding, for example, a sequence
    counter (or a more complex mathematical function) provides further checks for detecting misbehavior.

- Using a safe programming language that does not allow raw pointer access
- Testing the application with Valgrind etc.
- Redundant monitoring: if the self-monitoring described above is not sufficient, another library instance in a different memory location can detect sporadic corruption of the first.

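The checksum and sequence counter technique from the list above could look roughly like the following sketch.
The class ``ProtectedState``, the field ``deadline_ms`` and the choice of CRC32 are assumptions made for
illustration only. Python offers no real memory protection, so the example shows the detection logic and nothing more.

.. code-block:: python

   import struct
   import zlib


   class ProtectedState:
       """Hypothetical monitor data guarded by a CRC32 and a sequence counter."""

       def __init__(self, deadline_ms: int) -> None:
           self.deadline_ms = deadline_ms
           self.sequence = 0
           self.checksum = self._compute()

       def _compute(self) -> int:
           # Serialize the protected fields into a fixed layout and checksum them.
           payload = struct.pack("<QQ", self.deadline_ms, self.sequence)
           return zlib.crc32(payload)

       def update(self, deadline_ms: int) -> None:
           """Legitimate write path: bump the sequence counter and re-seal the data."""
           self.deadline_ms = deadline_ms
           self.sequence += 1
           self.checksum = self._compute()

       def verify(self, expected_sequence: int) -> bool:
           """Run every operation cycle; False indicates corruption or an unexpected update count."""
           return (self.sequence == expected_sequence
                   and self.checksum == self._compute())


   state = ProtectedState(deadline_ms=50)
   state.update(deadline_ms=40)
   intact = state.verify(expected_sequence=1)

   state.deadline_ms = 4000  # simulates a stray write that bypasses update()
   corruption_detected = not state.verify(expected_sequence=1)

The checksum catches writes that bypass the legitimate update path, while the sequence counter catches missing or
repeated updates that would leave the checksum itself consistent.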

Error Reactions
---------------

- When the :term:`Health Monitor` detects a failed supervision, it shall stop triggering alive notifications to the :term:`Launch Manager`.
- Additionally, when the error occurs, the :term:`Health Monitor` triggers a failure notification to the :term:`Launch Manager` to reduce the
  reaction time. This only works if the :term:`Health Monitor` is still running and correctly scheduled. Thus, the
  worst-case reaction time must be calculated from the monitoring rules specified in the :term:`Launch Manager` for the monitored application.

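The two error reactions can be sketched as follows. ``RecordingLink`` is a hypothetical stand-in for the IPC channel
to the :term:`Launch Manager`, and ``ErrorReaction`` with its ``on_cycle`` method is an invented name. The sketch
also assumes that a detected failure is latched, that is, alive notifications are not resumed afterwards.

.. code-block:: python

   class RecordingLink:
       """Hypothetical stand-in for the IPC channel to the Launch Manager."""

       def __init__(self) -> None:
           self.messages = []

       def send(self, message: str) -> None:
           self.messages.append(message)


   class ErrorReaction:
       """Latches the first failed supervision and withholds alive notifications afterwards."""

       def __init__(self, link: RecordingLink) -> None:
           self._link = link
           self.failed = False

       def on_cycle(self, supervisions_ok: bool) -> None:
           """Called once per monitoring cycle with the combined supervision result."""
           if not supervisions_ok and not self.failed:
               self.failed = True
               # Fast path: notify the Launch Manager immediately.
               self._link.send("failure")
           if not self.failed:
               self._link.send("alive")
           # After a failure nothing is sent any more. The alive timeout in the
           # Launch Manager remains the worst-case detection path.


   link = RecordingLink()
   reaction = ErrorReaction(link)
   for cycle_ok in (True, True, False, True):
       reaction.on_cycle(cycle_ok)

   sent = link.messages

In this run the failure notification is sent exactly once, in the third cycle, and no alive notification follows it
even though the fourth cycle reports no error.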

Deadline Monitor API
====================

Static Architecture
-------------------


.. logic_arc_int:: Deadline Monitor API
   :id: logic_arc_int__lifecycle__deadline_monitor_if
   :security: YES
   :safety: ASIL_B
   :status: valid
   :fulfils: feat_req__com__interfaces

   .. needarch::
      :scale: 50
      :align: center

      {{ draw_interface(need(), needs) }}

.. logic_arc_int_op:: configure_minimum_time
   :id: logic_arc_int_op__lifecycle__min_time
   :security: YES
   :safety: ASIL_B
   :status: valid
   :included_by: logic_arc_int__lifecycle__deadline_monitor_if

.. logic_arc_int_op:: configure_maximum_time
   :id: logic_arc_int_op__lifecycle__max_time
   :security: YES
   :safety: ASIL_B
   :status: valid
   :included_by: logic_arc_int__lifecycle__deadline_monitor_if

.. logic_arc_int_op:: link_condition
   :id: logic_arc_int_op__lifecycle__link_cond_dl
   :security: YES
   :safety: ASIL_B
   :status: valid
   :included_by: logic_arc_int__lifecycle__deadline_monitor_if

.. logic_arc_int_op:: mark_start
   :id: logic_arc_int_op__lifecycle__start
   :security: YES
   :safety: ASIL_B
   :status: valid
   :included_by: logic_arc_int__lifecycle__deadline_monitor_if

.. logic_arc_int_op:: mark_end
   :id: logic_arc_int_op__lifecycle__end
   :security: YES
   :safety: ASIL_B
   :status: valid
   :included_by: logic_arc_int__lifecycle__deadline_monitor_if

.. logic_arc_int_op:: on_timer_expiry
   :id: logic_arc_int_op__lifecycle__timer_expiry
   :security: YES
   :safety: ASIL_B
   :status: valid
   :included_by: logic_arc_int__lifecycle__deadline_monitor_if

.. logic_arc_int_op:: enable_monitoring
   :id: logic_arc_int_op__lifecycle__enable_mon
   :security: YES
   :safety: ASIL_B
   :status: valid
   :included_by: logic_arc_int__lifecycle__deadline_monitor_if

.. logic_arc_int_op:: disable_monitoring
   :id: logic_arc_int_op__lifecycle__disable_mon
   :security: YES
   :safety: ASIL_B
   :status: valid
   :included_by: logic_arc_int__lifecycle__deadline_monitor_if

.. logic_arc_int_op:: check_configuration
   :id: logic_arc_int_op__lifecycle__check_cfg
   :security: YES
   :safety: ASIL_B
   :status: valid
   :included_by: logic_arc_int__lifecycle__deadline_monitor_if

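To make the intent of these operations more concrete, the following sketch shows one possible shape of a deadline
monitor. Only the operation names are taken from the interface above. The signatures, the injected ``clock``
callable, the millisecond unit and the ``failed`` flag are assumptions. ``link_condition`` is left out because its
semantics are not described here.

.. code-block:: python

   class DeadlineMonitor:
       """Hypothetical deadline supervision between a start and an end checkpoint."""

       def __init__(self, clock) -> None:
           self._clock = clock      # callable returning the current time in milliseconds
           self._min_ms = None
           self._max_ms = None
           self._started_at = None  # None: no measurement in progress
           self._enabled = False
           self.failed = False

       def configure_minimum_time(self, min_ms: int) -> None:
           self._min_ms = min_ms

       def configure_maximum_time(self, max_ms: int) -> None:
           self._max_ms = max_ms

       def check_configuration(self) -> bool:
           """Both limits are set and describe a non-empty time window."""
           return (self._min_ms is not None and self._max_ms is not None
                   and 0 <= self._min_ms <= self._max_ms)

       def enable_monitoring(self) -> None:
           # Monitoring is only enabled with a consistent configuration.
           self._enabled = self.check_configuration()

       def disable_monitoring(self) -> None:
           self._enabled = False
           self._started_at = None

       def mark_start(self) -> None:
           if self._enabled:
               self._started_at = self._clock()

       def mark_end(self) -> None:
           if not self._enabled or self._started_at is None:
               return
           elapsed = self._clock() - self._started_at
           self._started_at = None
           if not (self._min_ms <= elapsed <= self._max_ms):
               self.failed = True

       def on_timer_expiry(self) -> None:
           """Periodic check: the end checkpoint is overdue once the maximum time has passed."""
           if self._enabled and self._started_at is not None:
               if self._clock() - self._started_at > self._max_ms:
                   self.failed = True


   now_ms = [0]  # simulated clock, advanced manually
   monitor = DeadlineMonitor(clock=lambda: now_ms[0])
   monitor.configure_minimum_time(10)
   monitor.configure_maximum_time(50)
   monitor.enable_monitoring()

   monitor.mark_start()
   now_ms[0] = 30
   monitor.mark_end()                      # 30 ms elapsed: inside the window
   failed_after_in_time_run = monitor.failed

   monitor.mark_start()                    # started at 30 ms
   now_ms[0] = 100                         # 70 ms later, end checkpoint never reported
   monitor.on_timer_expiry()
   failed_after_overdue_run = monitor.failed

Injecting the clock keeps the example deterministic. Checking in ``on_timer_expiry`` as well as in ``mark_end``
matters because an end checkpoint that never arrives would otherwise go unnoticed.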

Dynamic Architecture
--------------------

.. feat_arc_dyn:: Application health monitoring
   :id: feat_arc_dyn__lifecycle__app_health_moni
   :security: YES
   :status: invalid
   :safety: ASIL_B
   :fulfils: feat_req__lifecycle__process_monitoring

   .. uml:: _assets/application_health_monitoring_dynamic.puml
      :scale: 50
      :align: center

The most important interactions are the following:

.. list-table:: Sequence diagram description
   :widths: 10 90
   :header-rows: 1

   * - Sequence number
     - Description
   * - 001
     - The :term:`Launch Manager` configuration for the alive monitoring of the `Monitored application` is parsed. It contains, for example, the expected interval of alive notifications
       and the grace period granted before a missed (never received) alive notification is treated as a failure.
   * - 002
     - The startup grace period timer is started to give the application time to start up before a missed alive notification causes a timeout.
   * - 003
     - The `Monitored application` is started. (For simplicity, no startup checks are drawn here.)
   * - 004
     - The `Monitored application` instantiates and configures the HealthMonitor.
   * - 006
     - The application cyclically reports aliveness to the monitor.
   * - 007
     - The HealthMonitor wakes up and checks whether the checkpoint(s) have been called.
   * - 008
     - The HealthMonitor reports aliveness to the LM's application-specific supervision, which observes the health of the HealthMonitor itself.
   * - 009
     - A checkpoint is sent, but not on time.
   * - 010
     - The HealthMonitor wakes up and checks whether the checkpoint(s) have been triggered. In this case they were not, and thus actions 011 and 012 are triggered.
   * - 011
     - A failure event is triggered to the Launch Manager. This event allows a faster reaction than waiting for the timeout to expire.
   * - 012
     - Additionally, the triggering of alive notifications must be stopped.

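The :term:`Launch Manager` side of this sequence (steps 001, 002, 008 and 011) might be modelled as in the sketch
below. The class ``AliveWatchdog``, its parameters and the use of explicit millisecond timestamps are hypothetical.
The real configuration format and IPC mechanism are not shown.

.. code-block:: python

   class AliveWatchdog:
       """Hypothetical Launch Manager view of one monitored application."""

       def __init__(self, startup_grace_ms: int, alive_interval_ms: int,
                    start_ms: int = 0) -> None:
           self._alive_interval_ms = alive_interval_ms
           # Step 002: the first alive notification may take as long as the grace period.
           self._deadline_ms = start_ms + startup_grace_ms
           self.failed = False

       def on_alive(self, now_ms: int) -> None:
           """Step 008: an alive notification moves the deadline one interval ahead."""
           if not self.failed:
               self._deadline_ms = now_ms + self._alive_interval_ms

       def on_failure_event(self) -> None:
           """Step 011: explicit failure notification, no need to wait for the timeout."""
           self.failed = True

       def poll(self, now_ms: int) -> bool:
           """Return True while the application is considered healthy."""
           if now_ms > self._deadline_ms:
               self.failed = True
           return not self.failed


   watchdog = AliveWatchdog(startup_grace_ms=500, alive_interval_ms=100)

   healthy_during_startup = watchdog.poll(now_ms=400)   # still inside the grace period
   watchdog.on_alive(now_ms=450)                        # next deadline: 550 ms
   healthy_after_alive = watchdog.poll(now_ms=500)
   healthy_after_silence = watchdog.poll(now_ms=700)    # no alive since 450 ms: timed out

The timeout path works even if the HealthMonitor is no longer scheduled, which is why it determines the worst-case
reaction time, while ``on_failure_event`` only shortens the reaction in the good case.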

Logical Monitor API
===================

Static Architecture
-------------------

.. logic_arc_int:: Logical Monitor API
   :id: logic_arc_int__lifecycle__logical_monitor_if
   :security: YES
   :safety: ASIL_B
   :status: valid
   :fulfils: feat_req__com__interfaces

   .. needarch::
      :scale: 50
      :align: center

      {{ draw_interface(need(), needs) }}

.. logic_arc_int_op:: add_entry_point
   :id: logic_arc_int_op__lifecycle__entry_point
   :security: YES
   :safety: ASIL_B
   :status: valid
   :included_by: logic_arc_int__lifecycle__logical_monitor_if

.. logic_arc_int_op:: add_exit_point
   :id: logic_arc_int_op__lifecycle__exit_point
   :security: YES
   :safety: ASIL_B
   :status: valid
   :included_by: logic_arc_int__lifecycle__logical_monitor_if

.. logic_arc_int_op:: add_allowed_transition
   :id: logic_arc_int_op__lifecycle__allowed_trans
   :security: YES
   :safety: ASIL_B
   :status: valid
   :included_by: logic_arc_int__lifecycle__logical_monitor_if

.. logic_arc_int_op:: link_condition
   :id: logic_arc_int_op__lifecycle__link_cond_lg
   :security: YES
   :safety: ASIL_B
   :status: valid
   :included_by: logic_arc_int__lifecycle__logical_monitor_if

.. logic_arc_int_op:: record_checkpoint
   :id: logic_arc_int_op__lifecycle__rec_checkpoint
   :security: YES
   :safety: ASIL_B
   :status: valid
   :included_by: logic_arc_int__lifecycle__logical_monitor_if

.. logic_arc_int_op:: enable
   :id: logic_arc_int_op__lifecycle__enable
   :security: YES
   :safety: ASIL_B
   :status: valid
   :included_by: logic_arc_int__lifecycle__logical_monitor_if

.. logic_arc_int_op:: disable
   :id: logic_arc_int_op__lifecycle__disable
   :security: YES
   :safety: ASIL_B
   :status: valid
   :included_by: logic_arc_int__lifecycle__logical_monitor_if

.. logic_arc_int_op:: verify
   :id: logic_arc_int_op__lifecycle__verify
   :security: YES
   :safety: ASIL_B
   :status: valid
   :included_by: logic_arc_int__lifecycle__logical_monitor_if

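As an illustration of how these operations could fit together, the following sketch models logical supervision as
a graph of allowed transitions. Only the operation names come from the interface above. The signatures, the string
checkpoint identifiers, the ``failed`` flag and the interpretation of ``verify`` as a configuration check are
assumptions. ``link_condition`` is omitted because its semantics are not described here.

.. code-block:: python

   class LogicalMonitor:
       """Hypothetical checkpoint-order supervision modelled as a transition graph."""

       def __init__(self) -> None:
           self._entry_points = set()
           self._exit_points = set()
           self._allowed = set()   # (from_checkpoint, to_checkpoint) pairs
           self._current = None    # None: no sequence in progress
           self._enabled = False
           self.failed = False

       def add_entry_point(self, checkpoint: str) -> None:
           self._entry_points.add(checkpoint)

       def add_exit_point(self, checkpoint: str) -> None:
           self._exit_points.add(checkpoint)

       def add_allowed_transition(self, source: str, target: str) -> None:
           self._allowed.add((source, target))

       def verify(self) -> bool:
           """Assumed configuration check: at least one entry and one exit point exist."""
           return bool(self._entry_points) and bool(self._exit_points)

       def enable(self) -> None:
           self._enabled = self.verify()

       def disable(self) -> None:
           self._enabled = False
           self._current = None

       def record_checkpoint(self, checkpoint: str) -> None:
           if not self._enabled:
               return
           if self._current is None:
               valid = checkpoint in self._entry_points
           else:
               valid = (self._current, checkpoint) in self._allowed
           if not valid:
               self.failed = True
               return
           # Reaching an exit point ends the sequence; the next checkpoint
           # must then be an entry point again.
           self._current = None if checkpoint in self._exit_points else checkpoint


   monitor = LogicalMonitor()
   monitor.add_entry_point("read")
   monitor.add_exit_point("write")
   monitor.add_allowed_transition("read", "compute")
   monitor.add_allowed_transition("compute", "write")
   monitor.enable()

   for checkpoint in ("read", "compute", "write"):
       monitor.record_checkpoint(checkpoint)
   failed_after_valid_sequence = monitor.failed

   monitor.record_checkpoint("compute")  # not an entry point: order violated
   failed_after_invalid_start = monitor.failed

Storing the transitions as a set of pairs keeps the check for each reported checkpoint to a single lookup, which
suits a function that is called from the monitored application's own control flow.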

Dynamic Architecture
--------------------

.. feat_arc_dyn:: Logical control flow monitoring
   :id: feat_arc_dyn__lifecycle__app_ctrl_flow_moni
   :security: YES
   :status: invalid
   :safety: ASIL_B
   :fulfils: feat_req__lifecycle__process_monitoring

   .. uml:: _assets/logical_sup.puml
      :scale: 50
      :align: center