Disaster Detection
Singularity can be configured to automatically detect 'disaster' scenarios based on a number of indicators and can react in such a way as to limit further stress on the cluster while it is recovering.
Disaster detection can be enabled by adding enabled: true to the disasterDetection portion of your Singularity config yaml. There are a number of other fields, explained below, that can control the behavior and thresholds for disaster detection.
enabled
- Set to true to start running disaster detection and collecting stats about lost tasks, lost agents, and task lag
runEveryMillis
- Run the poller on this interval (defaults to every 30 seconds)
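As a minimal sketch, the two general fields above might appear in the config like this. The nesting under a top-level disasterDetection key is taken from the description above; the millisecond value is simply 30 seconds written out.

```yaml
# Fragment of a Singularity config yaml (illustrative sketch only)
disasterDetection:
  enabled: true           # turn on the disaster detection poller
  runEveryMillis: 30000   # poll every 30 seconds (the documented default)
```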
Task Lag
checkLateTasks
- Use late tasks (aka task lag) as a metric for determining if a disaster is in progress (defaults to true)
criticalAvgTaskLagMillis
- If the average time past due for all pending tasks is greater than this, a disaster is in progress (likely due to a severe lack of resources in the cluster). Defaults to 4 minutes (240000)
criticalOverdueTaskPortion
- If the portion of tasks that are considered overdue is this fraction of the total running tasks in the cluster, a disaster is in progress. Defaults to 0.1, or one tenth of tasks pending and overdue
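The task lag settings could be written out as follows. This is a sketch using the documented defaults, not a recommended tuning.

```yaml
# Task lag thresholds (documented defaults, shown explicitly)
disasterDetection:
  enabled: true
  checkLateTasks: true
  criticalAvgTaskLagMillis: 240000   # 4 minutes average time past due
  criticalOverdueTaskPortion: 0.1    # one tenth of running tasks
```

Assuming the comparison works as described above, a cluster with 200 running tasks would reach the overdue threshold at 20 overdue pending tasks (0.1 x 200).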
Lost Agents
checkLostSlaves
- Use lost agents as a metric for determining if a disaster is in progress. Disaster detection only counts agents that have transitioned from ACTIVE to DEAD. Agents that are gracefully decommissioned and removed won't trigger a disaster. (defaults to true)
criticalLostSlavePortion
- If, during the past run of the poller, this portion of the total active agents in the cluster have transitioned from ACTIVE to DEAD, a disaster is in progress. Defaults to 0.2, or one fifth of the agents in the cluster
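For illustration, the lost agent check with its documented defaults would look roughly like this. Note that the field names keep the older "Slave" wording even though the prose refers to agents.

```yaml
# Lost agent thresholds (documented defaults, shown explicitly)
disasterDetection:
  enabled: true
  checkLostSlaves: true
  criticalLostSlavePortion: 0.2   # one fifth of active agents
```

Under these values, and assuming the threshold is applied as described, a cluster of 50 active agents would be considered in a disaster once 10 of them go from ACTIVE to DEAD within a single poller run.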
Lost Tasks
checkLostTasks
- Use lost tasks as a metric for determining if a disaster is in progress (defaults to true)
lostTaskReasons
- Consider status updates matching these reasons towards the lost tasks for disaster detection. This is a list of Mesos Reason enum values (org.apache.mesos.Protos.TaskStatus.Reason) and defaults to [REASON_INVALID_OFFERS, REASON_AGENT_UNKNOWN, REASON_AGENT_REMOVED, REASON_AGENT_RESTARTED, REASON_MASTER_DISCONNECTED]
criticalLostTaskPortion
- If this portion of the total active tasks in the cluster have transitioned to LOST for one of the above reasons in the last run of the poller, a disaster is in progress. Defaults to 0.2
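A sketch of the lost task settings, with the default reason list written as a YAML sequence. The values are the defaults listed above; trimming the list to narrow which LOST updates count is one possible use, shown here only as a comment.

```yaml
# Lost task thresholds (documented defaults, shown explicitly)
disasterDetection:
  enabled: true
  checkLostTasks: true
  lostTaskReasons:                # remove entries to ignore those reasons
    - REASON_INVALID_OFFERS
    - REASON_AGENT_UNKNOWN
    - REASON_AGENT_REMOVED
    - REASON_AGENT_RESTARTED
    - REASON_MASTER_DISCONNECTED
  criticalLostTaskPortion: 0.2    # one fifth of active tasks
```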
Disabled Actions
Singularity also supports globally disabling certain actions, which can aid in maintenance or cluster recovery after an outage. These can be added and removed manually on the /disasters UI page. You can also specify a list of actions to be disabled automatically when a disaster is detected by setting disableActionsOnDisaster in the disasterDetection portion of your Singularity config yaml. When a disaster is detected, every action in that list is automatically disabled, and is enabled again when the disaster has cleared. If during runtime you want to stop the disaster detector from disabling actions (for example, it keeps detecting a false positive), you can disable the automated actions in the UI or POST to the /api/disasters/disable endpoint.
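A configuration along these lines might look like the sketch below. The two action names are assumptions used for illustration and are not confirmed by this page; check the list of actions offered on the /disasters UI page for the values your Singularity version accepts.

```yaml
# Actions to disable automatically while a disaster is in progress
disasterDetection:
  enabled: true
  disableActionsOnDisaster:
    - TASK_RECONCILIATION   # assumed action name, verify before use
    - RUN_HEALTH_CHECKS     # assumed action name, verify before use
```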