Health Monitor#
The Health Monitor is a library that, together with the Launch Manager, provides a way to monitor application health in a fashion similar to AUTOSAR Platform Health Management (PHM).
The main features of the Health Monitor are the following monitoring functions:
- Alive supervision - Periodic monitoring of checkpoints. The number of checkpoint notifications in a given interval must match the pre-configured expectation (not too many, not too few), which protects against the supervised code running too often or too rarely.
- Deadline supervision - A timing requirement between two checkpoints.
- Logical supervision - Specifies the order in which two or more checkpoints must be called.
The Health Monitor itself is monitored by the Launch Manager using alive supervision only.
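As an illustration of the alive supervision rule described above, the following minimal Python sketch counts checkpoint notifications per interval and compares the count against a configured range. The class name `AliveSupervision` and the parameters `min_indications` and `max_indications` are hypothetical and are not taken from the Health Monitor API.

```python
from dataclasses import dataclass


@dataclass
class AliveSupervision:
    """Hypothetical model: counts checkpoint notifications per interval."""

    min_indications: int  # fewest notifications allowed per interval (assumed name)
    max_indications: int  # most notifications allowed per interval (assumed name)
    count: int = 0        # notifications seen in the current interval

    def checkpoint(self) -> None:
        """Called by the supervised code each time it reaches the checkpoint."""
        self.count += 1

    def evaluate_interval(self) -> bool:
        """Called once per interval; True if the count was within range."""
        ok = self.min_indications <= self.count <= self.max_indications
        self.count = 0  # start counting for the next interval
        return ok
```

With `min_indications=1` and `max_indications=2`, an interval with no notification fails as "too few", and an interval with three notifications fails as "too many".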
Overview#
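Compared with classical monitoring from an external process, linking the monitor into the application as a library has the following advantages:
- Inter-process communication (IPC) is needed only for the alive monitoring between the Health Monitor and the Launch Manager.
- Easier configuration
  - The monitoring rules can be configured dynamically, on demand.
  - The monitoring can be started and stopped dynamically.
- Debugging of problems is possibly easier
  - Events come from a single process, so they do not have to be correlated between the monitored application and a separate monitor.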
Drawbacks over classical external process monitoring#
Harder safety argumentation. The following chapter describes the issues and the possible solutions.
Safety#
Because the Health Monitor is linked into the monitored application, it raises the following safety concerns:
- How can it be ensured that the monitored application does not interfere with the monitoring functionality?
- How can it be ensured that the Health Monitor does not incorrectly report alive to the Launch Manager after it has detected a supervision error?
These concerns are valid, but can be addressed using the following techniques:
- Hiding the internal data from the user
  - For example, if the monitoring is implemented in a thread, the thread ID must not be exposed to the calling application.
  - Hide the implementation, for example with the pImpl approach.
- Protecting the memory of the library by placing guard pages where it borders the application memory and protecting it with mprotect()
  - Possibly with the help of a custom allocator.
- Protecting the data with a checksum and possibly a sequence counter
  - The internal data of the library can be checksummed every operation cycle. Adding, for example, a sequence counter (or a more complex mathematical function) provides further checkpoints for detecting misbehavior.
- Using a safe programming language that does not allow raw pointer access
- Testing the application with Valgrind etc.
- Redundant monitoring: if the self-monitoring described above is not sufficient, another library instance in another memory location can detect sporadic corruption of the first.
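The checksum and sequence counter technique from the list above can be sketched as follows. This is a minimal illustration, not the Health Monitor implementation: the class `GuardedState`, its layout, and the choice of CRC32 are assumptions. Every legitimate write verifies the data, increments the counter, and re-seals the checksum, so a write that bypasses this path is detected at the next verification.

```python
import struct
import zlib


class GuardedState:
    """Hypothetical container: integer values sealed with a counter and CRC32."""

    def __init__(self, values):
        self._values = list(values)
        self._sequence = 0
        self._crc = self._compute_crc()

    def _compute_crc(self) -> int:
        # Serialize the sequence counter and all values into one byte string.
        fmt = f"<I{len(self._values)}q"
        payload = struct.pack(fmt, self._sequence, *self._values)
        return zlib.crc32(payload)

    def verify(self) -> None:
        """Raise if the data no longer matches its stored checksum."""
        if self._compute_crc() != self._crc:
            raise RuntimeError("internal monitoring data corrupted")

    def update(self, index: int, value: int) -> None:
        """Legitimate write path: verify, modify, bump the counter, re-seal."""
        self.verify()
        self._values[index] = value
        self._sequence = (self._sequence + 1) & 0xFFFFFFFF
        self._crc = self._compute_crc()

    @property
    def sequence(self) -> int:
        return self._sequence
```

Calling `verify()` once per operation cycle corresponds to the per-cycle check described above. An independent observer could additionally compare `sequence` against the number of updates it expects.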
Error Reactions#
When the Health Monitor detects a failed supervision, it shall stop triggering alive notifications to the Launch Manager.
Additionally, when the error occurs, the Health Monitor triggers a failure notification to the Launch Manager to reduce the reaction time. This only works if the Health Monitor is still running and correctly scheduled. Therefore, the worst-case reaction time calculations must be based on the monitoring rules specified in the Launch Manager for the monitored application.
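The two reactions described above can be sketched as a small latch. This is an illustrative model under stated assumptions: the class `ErrorReaction` and the callbacks `send_alive` and `send_failure` are hypothetical stand-ins for the real notification channel to the Launch Manager.

```python
from typing import Callable


class ErrorReaction:
    """Hypothetical latch: the first failure stops alive reporting for good."""

    def __init__(self, send_alive: Callable[[], None],
                 send_failure: Callable[[], None]) -> None:
        self._send_alive = send_alive      # assumed callback to the Launch Manager
        self._send_failure = send_failure  # assumed callback to the Launch Manager
        self._failed = False

    def report_supervision_failure(self) -> None:
        """Called when any supervision (alive, deadline, logical) fails."""
        if not self._failed:
            self._failed = True    # latched, never cleared in this sketch
            self._send_failure()   # best-effort fast notification

    def cycle(self) -> None:
        """Called once per monitoring cycle; reports alive only while healthy."""
        if not self._failed:
            self._send_alive()
```

The failure notification is only an optimization. If it is never delivered, the missing alive notifications are what the Launch Manager eventually reacts to, which is why the worst-case reaction time depends on the Launch Manager rules.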
Deadline Monitor API#
Static Architecture#
| Element | Status | Security | Safety |
|---|---|---|---|
| Deadline Monitor API | valid | YES | ASIL_B |
| configure_minimum_time | valid | YES | ASIL_B |
| configure_maximum_time | valid | YES | ASIL_B |
| link_condition | valid | YES | ASIL_B |
| mark_start | valid | YES | ASIL_B |
| mark_end | valid | YES | ASIL_B |
| on_timer_expiry | valid | YES | ASIL_B |
| enable_monitoring | valid | YES | ASIL_B |
| disable_monitoring | valid | YES | ASIL_B |
| check_configuration | valid | YES | ASIL_B |
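To make the role of these operations more concrete, the following Python sketch models a deadline supervision between a start and an end checkpoint. Only the operation names are taken from the listing above. The signatures, the millisecond time arguments, the `violated` attribute, and the behavior of each method are assumptions made for illustration, and `link_condition` is left out because its semantics are not described here.

```python
from typing import Optional


class DeadlineMonitor:
    """Hypothetical model; times are integer milliseconds passed in by the caller."""

    def __init__(self) -> None:
        self._min_ms: Optional[int] = None
        self._max_ms: Optional[int] = None
        self._start: Optional[int] = None
        self._enabled = False
        self.violated = False  # assumed result flag

    def configure_minimum_time(self, ms: int) -> None:
        self._min_ms = ms

    def configure_maximum_time(self, ms: int) -> None:
        self._max_ms = ms

    def check_configuration(self) -> bool:
        """True if both limits are set and form a valid range."""
        return (self._min_ms is not None and self._max_ms is not None
                and 0 <= self._min_ms <= self._max_ms)

    def enable_monitoring(self) -> None:
        if not self.check_configuration():
            raise ValueError("invalid deadline configuration")
        self._enabled = True

    def disable_monitoring(self) -> None:
        self._enabled = False
        self._start = None

    def mark_start(self, now_ms: int) -> None:
        if self._enabled:
            self._start = now_ms

    def mark_end(self, now_ms: int) -> None:
        if not self._enabled or self._start is None:
            return
        elapsed = now_ms - self._start
        self._start = None
        if not (self._min_ms <= elapsed <= self._max_ms):
            self.violated = True  # end checkpoint came too early or too late

    def on_timer_expiry(self, now_ms: int) -> None:
        """Detects an end checkpoint that never arrives."""
        if (self._enabled and self._start is not None
                and now_ms - self._start > self._max_ms):
            self.violated = True
            self._start = None
```

Passing the current time as an argument keeps the sketch deterministic. A real implementation would presumably read a monotonic clock instead.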
Dynamic Architecture#
| Element | Status | Security | Safety |
|---|---|---|---|
| Application health monitoring | invalid | YES | ASIL_B |
The most important interactions are the following:
| Sequence number | Description |
|---|---|
| 001 | The Launch Manager configuration for the alive monitoring of the monitored application is parsed. It contains, for example, the expected interval of alive notifications and the grace period granted before a missed (never received) alive notification is treated as a failure. |
| 002 | The startup grace period timer is started to allow the application to start up before a missed alive notification causes a timeout. |
| 003 | The monitored application is started. (For simplicity, no startup checks are drawn here.) |
| 004 | The monitored application instantiates and configures the HealthMonitor. |
| 006 | The monitored application cyclically reports aliveness to the monitor. |
| 007 | The HealthMonitor wakes up and checks whether the checkpoint(s) have been called. |
| 008 | The HealthMonitor reports aliveness to the LM's application-specific supervision, which observes the health of the HealthMonitor itself. |
| 009 | A checkpoint is sent, but not on time. |
| 010 | The HealthMonitor wakes up and checks whether the checkpoint(s) have been triggered. In this case they were not, and thus actions 011 and 012 are triggered. |
| 011 | A failure event is triggered to the Launch Manager. This event allows a faster reaction than waiting for the timeout to expire. |
| 012 | Additionally, triggering alive must be stopped. |
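The Launch Manager side of this sequence can be illustrated with a small watchdog model. It is a sketch under stated assumptions: the class `AliveWatchdog`, its parameters, and the rule that the next alive notification is due within the expected interval plus the grace period are hypothetical and not taken from the Launch Manager.

```python
class AliveWatchdog:
    """Hypothetical Launch Manager view of one process; times in milliseconds."""

    def __init__(self, start_ms: int, startup_grace_ms: int,
                 alive_interval_ms: int, grace_ms: int) -> None:
        # Steps 001 and 002: parsed configuration and startup grace period.
        self._deadline = start_ms + startup_grace_ms
        self._alive_interval_ms = alive_interval_ms
        self._grace_ms = grace_ms
        self.failed = False

    def on_alive(self, now_ms: int) -> None:
        """Step 008: alive notification received from the HealthMonitor."""
        if not self.failed:
            self._deadline = now_ms + self._alive_interval_ms + self._grace_ms

    def on_failure_event(self) -> None:
        """Step 011: explicit failure event, no need to wait for the timeout."""
        self.failed = True

    def poll(self, now_ms: int) -> bool:
        """Periodic check: a missed deadline marks the process as failed."""
        if now_ms > self._deadline:
            self.failed = True
        return self.failed
```

In this model the timeout path (step 012, no more alive notifications) bounds the reaction time, while the failure event of step 011 only shortens it when it arrives.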
Logical Monitor API#
Static Architecture#
| Element | Status | Security | Safety |
|---|---|---|---|
| Logical Monitor API | valid | YES | ASIL_B |
| add_entry_point | valid | YES | ASIL_B |
| add_exit_point | valid | YES | ASIL_B |
| add_allowed_transition | valid | YES | ASIL_B |
| link_condition | valid | YES | ASIL_B |
| record_checkpoint | valid | YES | ASIL_B |
| enable | valid | YES | ASIL_B |
| disable | valid | YES | ASIL_B |
| verify | valid | YES | ASIL_B |
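Logical supervision checks the order in which checkpoints are called, and one way to picture that is as a graph of allowed transitions. The Python sketch below borrows the operation names from the listing above, but the signatures and behavior are assumptions made for illustration. `link_condition` is omitted because its semantics are not described here.

```python
class LogicalMonitor:
    """Hypothetical model of control flow supervision over named checkpoints."""

    def __init__(self) -> None:
        self._entry = set()     # checkpoints that may start a sequence
        self._exit = set()      # checkpoints that end a sequence
        self._allowed = set()   # allowed (source, destination) pairs
        self._current = None    # last checkpoint, None when outside a sequence
        self._enabled = False
        self._ok = True

    def add_entry_point(self, checkpoint: str) -> None:
        self._entry.add(checkpoint)

    def add_exit_point(self, checkpoint: str) -> None:
        self._exit.add(checkpoint)

    def add_allowed_transition(self, source: str, destination: str) -> None:
        self._allowed.add((source, destination))

    def enable(self) -> None:
        self._enabled = True

    def disable(self) -> None:
        self._enabled = False
        self._current = None

    def record_checkpoint(self, checkpoint: str) -> None:
        if not self._enabled:
            return
        if self._current is None:
            if checkpoint not in self._entry:
                self._ok = False  # sequence started at a non-entry checkpoint
        elif (self._current, checkpoint) not in self._allowed:
            self._ok = False      # transition is not in the configured graph
        # Reaching an exit point ends the sequence.
        self._current = None if checkpoint in self._exit else checkpoint

    def verify(self) -> bool:
        """True as long as no ordering violation has been recorded."""
        return self._ok
```

For example, with entry point `A`, exit point `C`, and the transitions `A` to `B` and `B` to `C`, the sequence `A`, `B`, `C` passes, while `A` followed directly by `C` is recorded as a violation.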
Dynamic Architecture#
| Element | Status | Security | Safety |
|---|---|---|---|
| Logical control flow monitoring | invalid | YES | ASIL_B |