Health Monitor

The Health Monitor is a library that, together with the Launch Manager, provides a way to monitor application health in a fashion similar to the AUTOSAR Platform Health Management (PHM).

The main features of the Health Monitor are the following monitoring functions:

Alive supervision - Periodic monitoring of a checkpoint: the number of notifications received in a given interval must lie within the pre-configured expected range (not too many, not too few). This protects against code running too often or too rarely.
Deadline supervision - Monitors the timing requirement between two checkpoints.
Logical supervision - Specifies the order in which two or more checkpoints must be reached.
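
The counting rule behind alive supervision can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the library's implementation. The class name, the field names and the explicit per-interval evaluation call are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class AliveSupervision:
    """Counts checkpoint notifications per supervision interval (hypothetical helper)."""

    min_indications: int  # fewer than this in one interval is a failure
    max_indications: int  # more than this in one interval is a failure
    count: int = 0

    def report_checkpoint(self) -> None:
        # Called by the monitored code each time it reaches the checkpoint.
        self.count += 1

    def evaluate_interval(self) -> bool:
        # Called by the monitor once per interval; True if the count was in range.
        ok = self.min_indications <= self.count <= self.max_indications
        self.count = 0  # start counting for the next interval
        return ok


alive = AliveSupervision(min_indications=1, max_indications=2)

alive.report_checkpoint()
first_interval_ok = alive.evaluate_interval()   # one notification: in range

second_interval_ok = alive.evaluate_interval()  # no notification: too few

for _ in range(3):
    alive.report_checkpoint()
third_interval_ok = alive.evaluate_interval()   # three notifications: too many
```

In this sketch only the first interval passes. The second fails because the checkpoint was reached too rarely, the third because it was reached too often.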

The Health Monitor itself is monitored by the Launch Manager via alive supervision only.

Overview

Because the Health Monitor runs inside the monitored process, it offers the following advantages over classical external process monitoring:

Inter-process communication (IPC) is needed only for the alive monitoring between the Health Monitor and the Launch Manager
Easier configuration
- The monitoring rules can be configured dynamically, on demand
- The monitoring can be started and stopped dynamically
Debugging of problems is possibly easier
- Events are mapped within a single process rather than between the monitored application and a separate monitor

Drawbacks compared to classical external process monitoring

The safety argumentation is harder. The following section describes the issues and possible solutions.

Safety

Because the Health Monitor is linked into the monitored application, the following safety concerns arise:

How can it be ensured that the monitored application does not interfere with the monitoring functionality?
How can it be ensured that the Health Monitor does not incorrectly report alive to the Launch Manager when it has detected a supervision error?

These concerns are valid, but can be addressed using the following techniques:

Hiding the internal data from the user:
- For example, if the monitoring is implemented in a thread, the thread ID must not be exposed to the calling application.
- Hide the implementation, for example with the pImpl idiom.
Protecting the memory of the library by placing guard pages between it and the application memory and protecting them with mprotect()
- possibly with the help of a custom allocator
Protecting the data with a checksum and possibly a sequence counter
- The internal data of the library can be checksummed every operation cycle. Adding, for example, a sequence counter (or a more complex mathematical function) provides further checks for detecting misbehavior.
Using a safe programming language that does not allow raw pointer access
Testing the application with Valgrind or similar tools
Redundant monitoring: if the self-monitoring techniques above are not sufficient, a second library instance in another memory location can detect sporadic corruption of the first.
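
The checksum and sequence counter technique from the list above can be sketched as follows. This is an illustration in Python under stated assumptions, not the library's implementation. The class `GuardedState`, its single `deadline_ms` field and the choice of CRC32 are hypothetical. A real implementation would cover all internal data and might use a different checksum.

```python
import struct
import zlib


class GuardedState:
    """Keeps a CRC32 and a sequence counter next to the protected data (illustrative only)."""

    def __init__(self, deadline_ms: int) -> None:
        self._deadline_ms = deadline_ms  # must fit into an unsigned 32-bit value
        self._sequence = 0
        self._checksum = self._compute()

    def _compute(self) -> int:
        # The checksum covers both the payload and the sequence counter.
        return zlib.crc32(struct.pack("<II", self._deadline_ms, self._sequence))

    def update(self, deadline_ms: int) -> None:
        # Legitimate writes go through this method and refresh the checksum.
        self._deadline_ms = deadline_ms
        self._sequence = (self._sequence + 1) & 0xFFFFFFFF
        self._checksum = self._compute()

    def is_consistent(self, expected_sequence: int) -> bool:
        # Intended to run once per operation cycle before the data is trusted.
        return (self._sequence == expected_sequence
                and self._checksum == self._compute())


state = GuardedState(deadline_ms=100)
state.update(deadline_ms=120)
intact = state.is_consistent(expected_sequence=1)

state._deadline_ms = 999  # simulate a stray write that bypasses update()
corruption_detected = not state.is_consistent(expected_sequence=1)
```

The sequence counter catches a different class of faults than the checksum: a stale but internally consistent copy of the data passes the checksum test, yet fails the comparison against the expected sequence number.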

Error Reactions

When the Health Monitor detects a failed supervision, it shall stop triggering alive notifications to the Launch Manager.
Additionally, when the error occurs, the Health Monitor triggers a failure notification to the Launch Manager to reduce the reaction time. This only works if the Health Monitor itself is still running and correctly scheduled. The worst-case reaction time must therefore be calculated from the monitoring rules that the Launch Manager specifies for the monitored application.
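
The two reactions can be pictured with a small sketch of the per-cycle decision. It is written in Python for illustration only. The class name, the string messages and the `send` callback are hypothetical, and latching the failure permanently is an assumption of this example rather than something stated above.

```python
from typing import Callable, List


class HealthMonitorCycle:
    """Decides per cycle what to send to the Launch Manager (hypothetical sketch)."""

    def __init__(self, send: Callable[[str], None]) -> None:
        self._send = send
        self._failed = False  # assumption: latched, alive is never sent again

    def run_cycle(self, supervisions_ok: bool) -> None:
        if self._failed:
            return  # stay silent so the alive timeout in the Launch Manager expires
        if supervisions_ok:
            self._send("alive")
        else:
            self._failed = True
            self._send("failure")  # fast path; the alive timeout remains the fallback


sent: List[str] = []
monitor = HealthMonitorCycle(send=sent.append)
for ok in (True, True, False, True):
    monitor.run_cycle(ok)
# sent is now ["alive", "alive", "failure"]
```

If the failure message is lost or the monitor is no longer scheduled, nothing more is sent, and the Launch Manager still detects the problem through the missing alive notifications. This is why the worst-case reaction time depends on the alive timeout and not on the failure notification.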

Deadline Monitor API

Static Architecture

Deadline Monitor API		status: valid security: YES safety: ASIL_B
fulfils: feat_req__com__interfaces implemented by: comp__lifecycle_healthmonitor included by: feat__lifecycle includes: logic_arc_int_op__lifecycle__max_time, logic_arc_int_op__lifecycle__link_cond_dl, logic_arc_int_op__lifecycle__start, logic_arc_int_op__lifecycle__end, logic_arc_int_op__lifecycle__timer_expiry, logic_arc_int_op__lifecycle__enable_mon, logic_arc_int_op__lifecycle__disable_mon, logic_arc_int_op__lifecycle__check_cfg

logic_arc_int__lifecycle__deadline_monitor_if		Logical Interface & Feature Interface View

configure_minimum_time		status: valid security: YES safety: ASIL_B
included by: logic_arc_int__lifecycle__alive_if

logic_arc_int_op__lifecycle__min_time		Logical Interface Operation

configure_maximum_time		status: valid security: YES safety: ASIL_B
included by: logic_arc_int__lifecycle__deadline_monitor_if

logic_arc_int_op__lifecycle__max_time		Logical Interface Operation

link_condition		status: valid security: YES safety: ASIL_B
included by: logic_arc_int__lifecycle__deadline_monitor_if

logic_arc_int_op__lifecycle__link_cond_dl		Logical Interface Operation

mark_start		status: valid security: YES safety: ASIL_B
included by: logic_arc_int__lifecycle__deadline_monitor_if

logic_arc_int_op__lifecycle__start		Logical Interface Operation

mark_end		status: valid security: YES safety: ASIL_B
included by: logic_arc_int__lifecycle__deadline_monitor_if

logic_arc_int_op__lifecycle__end		Logical Interface Operation

on_timer_expiry		status: valid security: YES safety: ASIL_B
included by: logic_arc_int__lifecycle__deadline_monitor_if

logic_arc_int_op__lifecycle__timer_expiry		Logical Interface Operation

enable_monitoring		status: valid security: YES safety: ASIL_B
included by: logic_arc_int__lifecycle__deadline_monitor_if

logic_arc_int_op__lifecycle__enable_mon		Logical Interface Operation

disable_monitoring		status: valid security: YES safety: ASIL_B
included by: logic_arc_int__lifecycle__deadline_monitor_if

logic_arc_int_op__lifecycle__disable_mon		Logical Interface Operation

check_configuration		status: valid security: YES safety: ASIL_B
included by: logic_arc_int__lifecycle__deadline_monitor_if

logic_arc_int_op__lifecycle__check_cfg		Logical Interface Operation
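
To show how some of the operations listed above might fit together, here is a toy model in Python. Only the operation names are taken from the interface above. All signatures, the millisecond units, the explicit `now_ms` arguments and the return values are assumptions made for this sketch. The operations link_condition and on_timer_expiry are left out.

```python
from typing import Optional


class DeadlineMonitorSketch:
    """Toy model of a deadline supervision; signatures are assumptions, not the real API."""

    def __init__(self) -> None:
        self._min_ms: Optional[int] = None
        self._max_ms: Optional[int] = None
        self._enabled = False
        self._start_ms: Optional[int] = None

    def configure_minimum_time(self, min_ms: int) -> None:
        self._min_ms = min_ms

    def configure_maximum_time(self, max_ms: int) -> None:
        self._max_ms = max_ms

    def check_configuration(self) -> bool:
        # Both bounds must be set, non-negative and correctly ordered.
        return (self._min_ms is not None and self._max_ms is not None
                and 0 <= self._min_ms <= self._max_ms)

    def enable_monitoring(self) -> None:
        if not self.check_configuration():
            raise ValueError("deadline bounds are missing or inconsistent")
        self._enabled = True

    def disable_monitoring(self) -> None:
        self._enabled = False
        self._start_ms = None

    def mark_start(self, now_ms: int) -> None:
        if self._enabled:
            self._start_ms = now_ms

    def mark_end(self, now_ms: int) -> bool:
        # True if the elapsed time is within [min, max] or nothing is being measured.
        if not self._enabled or self._start_ms is None:
            return True
        elapsed = now_ms - self._start_ms
        self._start_ms = None
        return self._min_ms <= elapsed <= self._max_ms


monitor = DeadlineMonitorSketch()
monitor.configure_minimum_time(10)
monitor.configure_maximum_time(50)
config_ok = monitor.check_configuration()
monitor.enable_monitoring()

monitor.mark_start(now_ms=1000)
in_time = monitor.mark_end(now_ms=1030)       # 30 ms elapsed: within bounds

monitor.mark_start(now_ms=2000)
too_late = not monitor.mark_end(now_ms=2080)  # 80 ms elapsed: above the maximum
```

The sketch only evaluates the deadline when mark_end is called. A case in which the end checkpoint never arrives would need a timer, which is presumably the role of on_timer_expiry.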

Dynamic Architecture

Application health monitoring		status: invalid security: YES safety: ASIL_B
belongs to: feat__lifecycle fulfils: feat_req__lifecycle__process_monitoring

feat_arc_dyn__lifecycle__app_health_moni		Feature Sequence Diagram

The most important interactions are the following:

Table 11 Sequence diagram description
Sequence number	Description
001	The Launch Manager configuration for the alive monitoring of the Monitored application is parsed. It contains, for example, the expected interval of alive notifications and the grace period granted before a missed (never received) alive notification is treated as a failure.
002	The startup grace period timer is started so that the application can start up before a missed alive notification causes a timeout.
003	The Monitored application is started. (For simplicity, no startup checks are drawn here.)
004	The Monitored application instantiates and configures the HealthMonitor.
006	The Monitored application cyclically reports aliveness to the monitor.
007	The HealthMonitor wakes up and checks whether the checkpoint(s) have been called.
008	The HealthMonitor reports aliveness to the LM’s application-specific supervision, which observes the health of the HealthMonitor itself.
009	A checkpoint is sent, but not on time.
010	The HealthMonitor wakes up and checks whether the checkpoint(s) have been triggered. In this case they were not, so actions 011 and 012 are triggered.
011	A failure event is triggered to the Launch Manager. This event allows the monitor to react faster than by waiting for the timeout to expire.
012	Additionally, the triggering of alive notifications is stopped.
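
The Launch Manager side of this sequence (steps 001, 002, 008 and 011) can be approximated by a simple deadline model. The following Python sketch is illustrative only. The class name, the parameter names and the polling style are assumptions, and the real Launch Manager configuration format is not shown here.

```python
class AliveTimeoutSketch:
    """Launch-Manager-side view of one supervised process (hypothetical model)."""

    def __init__(self, startup_grace_ms: int, alive_timeout_ms: int,
                 started_at_ms: int) -> None:
        self._alive_timeout_ms = alive_timeout_ms
        # Step 002: until the first alive arrives, the grace period sets the deadline.
        self._deadline_ms = started_at_ms + startup_grace_ms
        self.failed = False

    def poll(self, now_ms: int) -> bool:
        # Returns True while the process is still considered healthy.
        if now_ms > self._deadline_ms:
            self.failed = True
        return not self.failed

    def on_alive(self, now_ms: int) -> None:
        # Step 008: an alive notification in time pushes the deadline forward.
        if self.poll(now_ms):
            self._deadline_ms = now_ms + self._alive_timeout_ms

    def on_failure_event(self) -> None:
        # Step 011: an explicit failure event avoids waiting for the timeout.
        self.failed = True


lm = AliveTimeoutSketch(startup_grace_ms=500, alive_timeout_ms=100, started_at_ms=0)
healthy_during_startup = lm.poll(now_ms=400)  # still inside the grace period
lm.on_alive(now_ms=450)                       # deadline moves to 550
healthy_after_alive = lm.poll(now_ms=540)
healthy_after_silence = lm.poll(now_ms=700)   # no alive since 450: timeout
```

With these example values the process is considered healthy at 400 ms and 540 ms and failed at 700 ms. A late alive notification does not revive a supervision that has already failed.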

Logical Monitor API

Static Architecture

Logical Monitor API		status: valid security: YES safety: ASIL_B
fulfils: feat_req__com__interfaces implemented by: comp__lifecycle_healthmonitor included by: feat__lifecycle includes: logic_arc_int_op__lifecycle__entry_point, logic_arc_int_op__lifecycle__exit_point, logic_arc_int_op__lifecycle__allowed_trans, logic_arc_int_op__lifecycle__link_cond_lg, logic_arc_int_op__lifecycle__rec_checkpoint, logic_arc_int_op__lifecycle__enable, logic_arc_int_op__lifecycle__disable, logic_arc_int_op__lifecycle__verify

logic_arc_int__lifecycle__logical_monitor_if		Logical Interface & Feature Interface View

add_entry_point		status: valid security: YES safety: ASIL_B
included by: logic_arc_int__lifecycle__logical_monitor_if

logic_arc_int_op__lifecycle__entry_point		Logical Interface Operation

add_exit_point		status: valid security: YES safety: ASIL_B
included by: logic_arc_int__lifecycle__logical_monitor_if

logic_arc_int_op__lifecycle__exit_point		Logical Interface Operation

add_allowed_transition		status: valid security: YES safety: ASIL_B
included by: logic_arc_int__lifecycle__logical_monitor_if

logic_arc_int_op__lifecycle__allowed_trans		Logical Interface Operation

link_condition		status: valid security: YES safety: ASIL_B
included by: logic_arc_int__lifecycle__logical_monitor_if

logic_arc_int_op__lifecycle__link_cond_lg		Logical Interface Operation

record_checkpoint		status: valid security: YES safety: ASIL_B
included by: logic_arc_int__lifecycle__logical_monitor_if

logic_arc_int_op__lifecycle__rec_checkpoint		Logical Interface Operation

enable		status: valid security: YES safety: ASIL_B
included by: logic_arc_int__lifecycle__logical_monitor_if

logic_arc_int_op__lifecycle__enable		Logical Interface Operation

disable		status: valid security: YES safety: ASIL_B
included by: logic_arc_int__lifecycle__logical_monitor_if

logic_arc_int_op__lifecycle__disable		Logical Interface Operation

verify		status: valid security: YES safety: ASIL_B
included by: logic_arc_int__lifecycle__logical_monitor_if

logic_arc_int_op__lifecycle__verify		Logical Interface Operation
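
The operations listed above suggest a graph of checkpoints with entry points, exit points and allowed transitions. The following Python sketch shows one way such a check could work. Only the operation names come from the interface above. The signatures, the use of strings as checkpoint identifiers and the latched violation flag are assumptions, and link_condition is left out.

```python
from typing import Optional, Set, Tuple


class LogicalMonitorSketch:
    """Toy model of logical supervision; signatures are assumptions, not the real API."""

    def __init__(self) -> None:
        self._entry_points: Set[str] = set()
        self._exit_points: Set[str] = set()
        self._transitions: Set[Tuple[str, str]] = set()
        self._enabled = False
        self._current: Optional[str] = None  # last checkpoint of the running sequence
        self._violated = False

    def add_entry_point(self, checkpoint: str) -> None:
        self._entry_points.add(checkpoint)

    def add_exit_point(self, checkpoint: str) -> None:
        self._exit_points.add(checkpoint)

    def add_allowed_transition(self, source: str, target: str) -> None:
        self._transitions.add((source, target))

    def enable(self) -> None:
        self._enabled = True

    def disable(self) -> None:
        self._enabled = False
        self._current = None

    def record_checkpoint(self, checkpoint: str) -> None:
        if not self._enabled:
            return
        if self._current is None:
            # A new sequence has to begin at an entry point.
            if checkpoint not in self._entry_points:
                self._violated = True
        elif (self._current, checkpoint) not in self._transitions:
            self._violated = True
        # Reaching an exit point completes the sequence.
        self._current = None if checkpoint in self._exit_points else checkpoint

    def verify(self) -> bool:
        # True as long as no ordering violation has been recorded.
        return not self._violated


monitor = LogicalMonitorSketch()
monitor.add_entry_point("read_sensor")
monitor.add_exit_point("write_actuator")
monitor.add_allowed_transition("read_sensor", "compute")
monitor.add_allowed_transition("compute", "write_actuator")
monitor.enable()

for checkpoint in ("read_sensor", "compute", "write_actuator"):
    monitor.record_checkpoint(checkpoint)
order_ok = monitor.verify()

monitor.record_checkpoint("compute")  # the sequence does not start at an entry point
violation_detected = not monitor.verify()
```

The checkpoint names read_sensor, compute and write_actuator are made up for the example.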

Dynamic Architecture

Logical control flow monitoring		status: invalid security: YES safety: ASIL_B
belongs to: feat__lifecycle fulfils: feat_req__lifecycle__process_monitoring

feat_arc_dyn__lifecycle__app_ctrl_flow_moni		Feature Sequence Diagram