Health Monitor#

The Health Monitor is a library that, together with the Launch Manager, provides a way to monitor application health in a fashion similar to AUTOSAR Platform Health Management (PHM).

The main features of the Health Monitor are the following monitoring functions:

  • Alive supervision - Periodic monitoring of checkpoints: the number of notifications received in a given interval must match the pre-configured expected number (not too many, not too few). This protects against checks running too often or too rarely.

  • Deadline supervision - Timing requirement between two checkpoints.

  • Logical supervision - Specifies the order in which two or more checkpoints must be called.

The Health Monitor itself is monitored by the Launch Manager via alive supervision only.
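The "not too many, not too few" rule of alive supervision can be illustrated with a minimal Python sketch. The class `AliveSupervision` and its methods are hypothetical names chosen for this illustration and are not part of the Health Monitor API; the sketch only assumes that the monitor evaluates a notification counter once per interval.

```python
class AliveSupervision:
    """Counts checkpoint notifications and checks them once per interval.

    Hypothetical illustration; the names are not taken from the Health Monitor.
    """

    def __init__(self, min_count: int, max_count: int) -> None:
        if min_count > max_count:
            raise ValueError("min_count must not exceed max_count")
        self.min_count = min_count
        self.max_count = max_count
        self._count = 0

    def report_checkpoint(self) -> None:
        """Called by the monitored code each time the checkpoint is reached."""
        self._count += 1

    def evaluate_interval(self) -> bool:
        """Called by the monitor at the end of each interval.

        Returns True if the number of notifications was within the expected
        range, then resets the counter for the next interval.
        """
        ok = self.min_count <= self._count <= self.max_count
        self._count = 0
        return ok


supervision = AliveSupervision(min_count=1, max_count=2)

supervision.report_checkpoint()
print(supervision.evaluate_interval())  # True: one notification is in range

print(supervision.evaluate_interval())  # False: no notification, too few

for _ in range(3):
    supervision.report_checkpoint()
print(supervision.evaluate_interval())  # False: three notifications, too many
```

Resetting the counter at every evaluation keeps the intervals independent, so a burst in one interval cannot compensate for a missing notification in the next.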

Overview#

  • Inter-process communication (IPC) is needed only for the alive monitoring between the Health Monitor and the Launch Manager

  • Easier configuration
    • The monitoring rules can be configured dynamically, on demand

    • The monitoring can be started and stopped dynamically

  • Debugging of problems is possibly easier
    • Events can be mapped within a single process, instead of being correlated between the monitored application and a separate monitor

Drawbacks over classical external process monitoring#

  • Harder safety argumentation. The following chapter describes the issues and the possible solutions.

Safety#

As the Health Monitor is linked as part of the monitored application, it raises the following concerns with respect to safety:

  • How can it be ensured that the monitored application does not interfere with the monitoring functionality?

  • How can it be ensured that the Health Monitor does not incorrectly report alive to the Launch Manager when it has detected a supervision error?

These concerns are valid, but can be addressed using the following techniques:

  • Hiding of the internal data from the user:
    • For example, if the monitoring is implemented in a thread, the thread ID must not be exposed to the calling application.

    • Hide the implementation, for example with the pImpl idiom.

  • Protecting the memory of the library by placing guard pages where it borders the application memory, and protecting it with mprotect()
    • possibly with the help of a custom allocator

  • Protecting the data with a checksum and possibly a sequence counter
    • The internal data of the library can be checksummed every operation cycle. Adding, for example, a sequence counter (or a more complex mathematical function) provides further checks for detecting misbehavior

  • Using a safe programming language that does not allow raw pointer access

  • Testing the application with Valgrind etc.

  • Redundant monitoring: if the self-monitoring techniques above are not sufficient, another library instance in a different memory location can detect sporadic corruption of the first.
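The checksum and sequence counter technique from the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: the class `GuardedState`, its field `deadline_ms`, and the choice of CRC32 are all hypothetical. Only `zlib.crc32` and `struct.pack` from the Python standard library are used.

```python
import struct
import zlib


class GuardedState:
    """Internal monitor data sealed with a CRC32 checksum and a sequence counter.

    Hypothetical illustration of the technique; names are assumptions.
    """

    def __init__(self, deadline_ms: int) -> None:
        self.deadline_ms = deadline_ms
        self.sequence = 0
        self.checksum = self._compute()

    def _compute(self) -> int:
        # The checksum covers both the payload and the sequence counter.
        payload = struct.pack("<IQ", self.deadline_ms, self.sequence)
        return zlib.crc32(payload)

    def update(self, deadline_ms: int) -> None:
        """Legitimate write path: check the seal, change data, re-seal."""
        if self._compute() != self.checksum:
            raise RuntimeError("checksum mismatch: internal data corrupted")
        self.deadline_ms = deadline_ms
        self.sequence += 1
        self.checksum = self._compute()

    def verify(self, expected_sequence: int) -> None:
        """Run every operation cycle; raises if the data looks wrong."""
        if self._compute() != self.checksum:
            raise RuntimeError("checksum mismatch: internal data corrupted")
        if self.sequence != expected_sequence:
            raise RuntimeError("unexpected sequence counter: missed or extra update")


state = GuardedState(deadline_ms=100)
state.update(deadline_ms=150)
state.verify(expected_sequence=1)  # passes silently

state.deadline_ms = 999  # simulated stray write that bypasses update()
try:
    state.verify(expected_sequence=1)
except RuntimeError as error:
    print(error)
```

The checksum catches writes that bypass the legitimate update path, while the sequence counter, compared against a value tracked independently by the caller, catches updates that were skipped or executed more often than expected.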

Error Reactions#

  • When the Health Monitor detects a failed supervision, it shall stop triggering alive notifications to the Launch Manager.

  • Additionally, when the error occurs, the Health Monitor triggers a failure notification to the Launch Manager to reduce the time needed to react to the error. This only works if the Health Monitor is still running and correctly scheduled. Therefore, worst-case reaction time calculations must be based on the monitoring rules specified in the Launch Manager for the monitored application.
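The two error reactions can be sketched as a small Python model of the monitor's wake-up cycle. The class `MonitorCycle` and the callbacks `send_alive` and `send_failure` are hypothetical stand-ins for the link to the Launch Manager; the sketch assumes that a detected failure is latched, which is one way to guarantee that alive notifications never resume.

```python
from typing import Callable, Iterable


class MonitorCycle:
    """Evaluates all supervisions on each wake-up and reacts to failures.

    Hypothetical illustration; the callbacks model the Launch Manager link.
    """

    def __init__(
        self,
        supervisions: Iterable[Callable[[], bool]],
        send_alive: Callable[[], None],
        send_failure: Callable[[], None],
    ) -> None:
        self._supervisions = list(supervisions)
        self._send_alive = send_alive
        self._send_failure = send_failure
        self._failed = False  # latched: once set, alive is never sent again

    def wake_up(self) -> None:
        if self._failed:
            return  # stay silent so the Launch Manager's timeout expires
        if all(check() for check in self._supervisions):
            self._send_alive()
        else:
            self._failed = True
            self._send_failure()  # best effort, only shortens the reaction time


events = []
results = iter([True, False, True])
cycle = MonitorCycle(
    supervisions=[lambda: next(results)],
    send_alive=lambda: events.append("alive"),
    send_failure=lambda: events.append("failure"),
)
for _ in range(3):
    cycle.wake_up()
print(events)  # ['alive', 'failure']
```

In this model the third wake-up produces no event at all, even though the supervision would pass again. The missing alive notification is the reaction the Launch Manager can rely on; the failure notification is an optimization.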

Deadline Monitor API#

Static Architecture#

Name                     Status  Security  Safety
configure_minimum_time   valid   YES       ASIL_B
configure_maximum_time   valid   YES       ASIL_B
mark_start               valid   YES       ASIL_B
mark_end                 valid   YES       ASIL_B
on_timer_expiry          valid   YES       ASIL_B
enable_monitoring        valid   YES       ASIL_B
disable_monitoring       valid   YES       ASIL_B
check_configuration      valid   YES       ASIL_B
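To show how the interface names listed above could fit together, here is a toy Python model of a deadline supervision. Only the method names come from the list; the parameters, return values, the injected `clock`, and the exact semantics of each method are assumptions made for this sketch and may differ from the real API.

```python
from typing import Callable, Optional


class DeadlineMonitor:
    """Toy model of a deadline supervision between two checkpoints.

    Method names follow the interface list; everything else is assumed.
    """

    def __init__(self, clock: Callable[[], int]) -> None:
        self._clock = clock  # returns the current time in milliseconds
        self._min_ms: Optional[int] = None
        self._max_ms: Optional[int] = None
        self._start_ms: Optional[int] = None
        self._enabled = False

    def configure_minimum_time(self, milliseconds: int) -> None:
        self._min_ms = milliseconds

    def configure_maximum_time(self, milliseconds: int) -> None:
        self._max_ms = milliseconds

    def check_configuration(self) -> bool:
        return (
            self._min_ms is not None
            and self._max_ms is not None
            and 0 <= self._min_ms <= self._max_ms
        )

    def enable_monitoring(self) -> None:
        if not self.check_configuration():
            raise ValueError("invalid deadline configuration")
        self._enabled = True

    def disable_monitoring(self) -> None:
        self._enabled = False
        self._start_ms = None

    def mark_start(self) -> None:
        if self._enabled:
            self._start_ms = self._clock()

    def mark_end(self) -> bool:
        """Returns True if the elapsed time lies within [minimum, maximum]."""
        if not self._enabled or self._start_ms is None:
            return True  # nothing to supervise
        elapsed = self._clock() - self._start_ms
        self._start_ms = None
        return self._min_ms <= elapsed <= self._max_ms

    def on_timer_expiry(self) -> bool:
        """Periodic check: False if an open deadline is already overdue."""
        if not self._enabled or self._start_ms is None:
            return True
        return self._clock() - self._start_ms <= self._max_ms


now = [0]  # fake clock in milliseconds, advanced manually
monitor = DeadlineMonitor(clock=lambda: now[0])
monitor.configure_minimum_time(10)
monitor.configure_maximum_time(50)
monitor.enable_monitoring()

monitor.mark_start()
now[0] += 20
print(monitor.mark_end())  # True: 20 ms is within [10, 50]

monitor.mark_start()
now[0] += 80
print(monitor.on_timer_expiry())  # False: end checkpoint is overdue
```

The periodic `on_timer_expiry` check matters because an end checkpoint that is never reached would otherwise go unnoticed: `mark_end` alone can only judge deadlines that are eventually closed.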

Dynamic Architecture#

Application health monitoring
status: invalid
security: YES
safety: ASIL_B

The most important interactions are the following:

Table 11 Sequence diagram Description#

Sequence number  Description
001              The Launch Manager configuration for the alive monitoring of the monitored application is parsed. It contains, for example, the expected interval of alive notifications and the grace period granted before a missed (never received) alive notification counts as a failure.
002              The startup grace period timer is started, so that the application has time to start up before a missed alive notification causes a timeout.
003              The monitored application is started. (For simplicity, startup checks are not drawn here.)
004              The monitored application instantiates and configures the HealthMonitor.
006              Aliveness is reported cyclically to the monitor.
007              The HealthMonitor wakes up and checks whether the checkpoint(s) have been called.
008              Aliveness is reported to the LM's application-specific supervision, which observes the health of the HealthMonitor itself.
009              A checkpoint is sent, but not on time.
010              The HealthMonitor wakes up and checks whether the checkpoint(s) have been triggered. In this case they were not, and thus actions 011 and 012 are triggered.
011              A failure event is triggered to the Launch Manager. This event allows the monitor to react faster than waiting for the timeout to expire.
012              Additionally, the triggering of alive notifications is stopped.
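The Launch Manager side of this sequence can be sketched in Python as follows. The class `AliveTimeoutSupervision` and the configuration keys `alive_interval_ms`, `grace_period_ms` and `startup_grace_ms` are hypothetical; the source only states that the configuration contains the expected interval and a grace period. Times are passed in explicitly so that the sketch stays deterministic.

```python
class AliveTimeoutSupervision:
    """Launch Manager side: supervises the alive notifications of one process.

    Hypothetical illustration; class name and configuration keys are assumed.
    """

    def __init__(self, config: dict, now_ms: int) -> None:
        # Step 001: take the timing parameters from the parsed configuration.
        self._interval_ms = config["alive_interval_ms"]
        self._grace_ms = config["grace_period_ms"]
        # Step 002: the first alive notification is due after the startup grace period.
        self._deadline_ms = now_ms + config["startup_grace_ms"]
        self.failed = False

    def on_alive(self, now_ms: int) -> None:
        # Step 008: each alive notification pushes the deadline forward.
        if not self.failed:
            self._deadline_ms = now_ms + self._interval_ms + self._grace_ms

    def on_failure_event(self) -> None:
        # Step 011: an explicit failure event is acted on immediately.
        self.failed = True

    def poll(self, now_ms: int) -> bool:
        """Returns True while the supervised process is considered healthy."""
        if now_ms > self._deadline_ms:
            self.failed = True  # result of step 012: alive notifications stopped
        return not self.failed


config = {"alive_interval_ms": 100, "grace_period_ms": 20, "startup_grace_ms": 500}

slow = AliveTimeoutSupervision(config, now_ms=0)
print(slow.poll(400))  # True: still inside the startup grace period
slow.on_alive(450)     # next notification is due by 450 + 100 + 20 = 570
print(slow.poll(570))  # True: deadline not yet exceeded
print(slow.poll(571))  # False: detected by timeout only

fast = AliveTimeoutSupervision(config, now_ms=0)
fast.on_alive(450)
fast.on_failure_event()
print(fast.poll(451))  # False: detected via the failure event
```

In this model the worst-case reaction time after the last alive notification is the interval plus the grace period (120 ms here), regardless of whether the failure event arrives.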

Logical Monitor API#

Static Architecture#

Name                     Status  Security  Safety
add_entry_point          valid   YES       ASIL_B
add_exit_point           valid   YES       ASIL_B
add_allowed_transition   valid   YES       ASIL_B
record_checkpoint        valid   YES       ASIL_B
enable                   valid   YES       ASIL_B
disable                  valid   YES       ASIL_B
verify                   valid   YES       ASIL_B
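A logical supervision built from the interface names listed above could look like the following Python sketch, which models the allowed checkpoint order as a graph of transitions. Only the method names come from the list; the parameters, the return value of `verify`, and the decision to latch a violation until it is read are assumptions made for this illustration.

```python
from typing import Optional, Set, Tuple


class LogicalMonitor:
    """Toy model of logical (control flow) supervision as a transition graph.

    Method names follow the interface list; everything else is assumed.
    """

    def __init__(self) -> None:
        self._entry_points: Set[str] = set()
        self._exit_points: Set[str] = set()
        self._transitions: Set[Tuple[str, str]] = set()
        self._current: Optional[str] = None  # None means "outside the graph"
        self._enabled = False
        self._violated = False

    def add_entry_point(self, checkpoint: str) -> None:
        self._entry_points.add(checkpoint)

    def add_exit_point(self, checkpoint: str) -> None:
        self._exit_points.add(checkpoint)

    def add_allowed_transition(self, source: str, target: str) -> None:
        self._transitions.add((source, target))

    def enable(self) -> None:
        self._enabled = True

    def disable(self) -> None:
        self._enabled = False
        self._current = None

    def record_checkpoint(self, checkpoint: str) -> None:
        if not self._enabled:
            return
        if self._current is None:
            # A new pass through the graph must begin at an entry point.
            if checkpoint not in self._entry_points:
                self._violated = True
        elif (self._current, checkpoint) not in self._transitions:
            self._violated = True
        # Reaching an exit point ends the pass.
        self._current = None if checkpoint in self._exit_points else checkpoint

    def verify(self) -> bool:
        """Returns True if no order violation has been recorded so far."""
        return not self._violated


def build() -> LogicalMonitor:
    monitor = LogicalMonitor()
    monitor.add_entry_point("init")
    monitor.add_exit_point("done")
    monitor.add_allowed_transition("init", "process")
    monitor.add_allowed_transition("process", "done")
    monitor.enable()
    return monitor


good = build()
for checkpoint in ("init", "process", "done"):
    good.record_checkpoint(checkpoint)
print(good.verify())  # True: checkpoints arrived in the configured order

bad = build()
for checkpoint in ("init", "done"):
    bad.record_checkpoint(checkpoint)
print(bad.verify())  # False: "process" was skipped
```

Separating `record_checkpoint` from `verify` lets the monitored code report checkpoints cheaply, while the monitor evaluates the result on its own wake-up cycle.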

Dynamic Architecture#

Logical control flow monitoring
status: invalid
security: YES
safety: ASIL_B