Disaster Detection
Singularity can be configured to automatically detect 'disaster' scenarios based on a number of indicators, and can react in a way that limits further stress on the cluster while it is recovering.
Disaster detection can be enabled by adding enabled: true to the disasterDetection portion of your Singularity config yaml. A number of other fields, explained below, control the behavior and thresholds for disaster detection.
enabled - Set to true to start running disaster detection and collecting stats about lost tasks, lost agents, and task lag.
runEveryMillis - Run the poller on this interval (defaults to every 30 seconds).
Task Lag
checkLateTasks - Use late tasks (aka task lag) as a metric for determining if a disaster is in progress (defaults to true).
criticalAvgTaskLagMillis - If the average time past due for all pending tasks is greater than this, a disaster is in progress (likely due to a severe lack of resources in the cluster). Defaults to 4 minutes (240000).
criticalOverdueTaskPortion - If the number of tasks that are considered overdue reaches this fraction of the total running tasks in the cluster, a disaster is in progress. Defaults to 0.1, i.e. one tenth of tasks pending and overdue.
Lost Agents
checkLostSlaves - Use lost agents as a metric for determining if a disaster is in progress (defaults to true). Disaster detection only counts agents that have transitioned from ACTIVE to DEAD. Agents that are gracefully decommissioned and removed won't trigger a disaster.
criticalLostSlavePortion - If, during the last run of the poller, this portion of the total active agents in the cluster have transitioned from ACTIVE to DEAD, a disaster is in progress. Defaults to 0.2, or one fifth of the agents in the cluster.
Lost Tasks
checkLostTasks - Use lost tasks as a metric for determining if a disaster is in progress (defaults to true).
lostTaskReasons - Count status updates matching these reasons towards the lost tasks for disaster detection. This is a list of mesos Reason enum values (org.apache.mesos.Protos.TaskStatus.Reason) and defaults to [REASON_INVALID_OFFERS, REASON_AGENT_UNKNOWN, REASON_AGENT_REMOVED, REASON_AGENT_RESTARTED, REASON_MASTER_DISCONNECTED].
criticalLostTaskPortion - If this portion of the total active tasks in the cluster have transitioned to LOST for one of the above reasons in the last run of the poller, a disaster is in progress. Defaults to 0.2.
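
Putting the settings from the sections above together, a disasterDetection block might look like the following. This is a sketch rather than an authoritative reference: the field names and default values are taken from the descriptions above, while the exact placement within your config file is assumed to be a top-level disasterDetection key.

```yaml
# Sketch of the disasterDetection portion of a Singularity config yaml.
# Values are the defaults described above, written out explicitly.
disasterDetection:
  enabled: true
  runEveryMillis: 30000              # poll every 30 seconds

  # Task lag
  checkLateTasks: true
  criticalAvgTaskLagMillis: 240000   # 4 minutes
  criticalOverdueTaskPortion: 0.1

  # Lost agents
  checkLostSlaves: true
  criticalLostSlavePortion: 0.2

  # Lost tasks
  checkLostTasks: true
  lostTaskReasons:
    - REASON_INVALID_OFFERS
    - REASON_AGENT_UNKNOWN
    - REASON_AGENT_REMOVED
    - REASON_AGENT_RESTARTED
    - REASON_MASTER_DISCONNECTED
  criticalLostTaskPortion: 0.2
```

Since every value except enabled matches its documented default, only enabled: true is strictly needed; the other fields are listed to show where each threshold would be tuned.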
Disabled Actions
Singularity also supports globally disabling certain actions, which can aid in maintenance or cluster recovery after an outage. These can be added and removed manually on the /disasters UI page. You can also have actions disabled automatically when a disaster is detected by listing them under disableActionsOnDisaster in the disasterDetection portion of your Singularity config yaml. When a disaster is detected, every action listed is disabled, and is enabled again once the disaster has cleared. If at runtime you want to stop the disaster detector from disabling actions (for example, it keeps detecting a false positive), you can disable the automated actions in the UI or POST to the /api/disasters/disable endpoint.
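
As a rough illustration of the automatic case, disableActionsOnDisaster takes a list of action names. The two names below are assumptions used only for illustration and are not confirmed by this page; check the action names your Singularity version accepts (for example, those offered on the /disasters UI page) before using them.

```yaml
# Sketch only: the action names are assumed, not taken from this page.
disasterDetection:
  enabled: true
  disableActionsOnDisaster:
    - BOUNCE_REQUEST   # assumed action name
    - DEPLOY           # assumed action name
```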