Singularity Configuration
Singularity (Service) is configured by DropWizard via a YAML file referenced on the command line. Top-level configuration elements reside at the root of the configuration file alongside DropWizard configuration.
Root Configuration
Common Configuration
These are settings that are more likely to be altered.
General
| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| allowRequestsWithoutOwners | true | If false, submitting a request without at least one owner will return a 400 | boolean |
| commonHostnameSuffixToOmit | null | If specified, will remove this hostname suffix from all taskIds | string |
| defaultSlavePlacement | GREEDY | See Agent Placement | enum / string [GREEDY, OPTIMISTIC, SEPARATE (deprecated), SEPARATE_BY_DEPLOY, SEPARATE_BY_REQUEST, SPREAD_ALL_SLAVES] |
| defaultValueForKillTasksOfPausedRequests | true | When a request is paused, the API allows the tasks of that request to optionally not be killed. If that parameter is not set in the pause request, this value is used | boolean |
| deltaAfterWhichTasksAreLateMillis | 30000 (30 seconds) | The amount of time after a task's schedule time that Singularity will classify it (in state API and dashboard) as a late task | long |
| deployHealthyBySeconds | 120 | Default amount of time to allow pending deploys to run for before transitioning them into active deploys. If more than this time passes before a deploy can be considered healthy (all of its tasks either make it to TASK_RUNNING or pass healthchecks), then the deploy will be rejected | long |
| killNonLongRunningTasksInCleanupAfterSeconds | 86400 (1 day) | Kills scheduled and one-off tasks after this amount of time if they have been scheduled for cleaning (for example, because a new deploy succeeded or the underlying agent was decommissioned) | long |
| hostname | null | Hostname of this Singularity instance | string |
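As a rough illustration, the general settings above sit at the root of the YAML file next to the DropWizard settings. The sketch below is not taken from a real deployment: the hostname and suffix are hypothetical placeholders, and the `server` block only stands in for whatever DropWizard configuration a deployment already has.

```yaml
# Hypothetical root-level configuration; all values are illustrative.
server:                                    # DropWizard settings live alongside Singularity's
  type: simple
  applicationContextPath: /singularity
  connector:
    type: http
    port: 7099

hostname: singularity.example.com          # placeholder hostname
commonHostnameSuffixToOmit: .example.com   # placeholder suffix stripped from taskIds
allowRequestsWithoutOwners: false          # ownerless requests now return a 400
defaultSlavePlacement: SEPARATE_BY_REQUEST
deployHealthyBySeconds: 180                # raised from the default of 120
```

Any parameter left out of the file keeps the default listed in the table.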
Healthchecks and New Task Checks
| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| considerTaskHealthyAfterRunningForSeconds | 5 | Tasks which make it to TASK_RUNNING and run for at least this long (that are not health-checked) are considered healthy | long |
| healthcheckIntervalSeconds | 5 | Default amount of time to wait in between attempting task healthchecks | int |
| healthcheckTimeoutSeconds | 5 | Default amount of time to wait for healthchecks to return before considering them failed | int |
| killAfterTasksDoNotRunDefaultSeconds | 600 (10 minutes) | Amount of time after which new tasks (that are not part of a deploy) will be killed if they do not enter TASK_RUNNING | long |
| healthcheckMaxRetries | | Default max number of times to retry a failed healthcheck for a task before considering the task to be unhealthy | int |
| startupDelaySeconds | | By default, wait this long before starting any healthchecks on a task | int |
| startupTimeoutSeconds | 45 | If a healthchecked task has not responded with a valid HTTP response within startupTimeoutSeconds, consider it unhealthy | int |
| startupIntervalSeconds | 2 | In the startup period (before a valid HTTP response has been received) wait this long between healthcheck attempts | int |
| healthcheckFailureStatusCodes | [] | If any of these status codes is received during a healthcheck, immediately consider the task unhealthy and do not retry the check | List |
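A minimal sketch of how these healthcheck settings might be combined at the root of the configuration file. The numbers are illustrative assumptions, not recommendations; in particular, the values for healthcheckMaxRetries and startupDelaySeconds are made up, since no default is listed for them.

```yaml
# Illustrative values only; see the table above for the documented defaults.
healthcheckIntervalSeconds: 10     # wait 10s between healthcheck attempts
healthcheckTimeoutSeconds: 5       # a check that takes longer than 5s counts as failed
healthcheckMaxRetries: 3           # assumed value; no default is listed above
startupDelaySeconds: 15            # assumed value; no default is listed above
startupTimeoutSeconds: 60          # no valid HTTP response within 60s => unhealthy
startupIntervalSeconds: 2          # poll every 2s during the startup period
healthcheckFailureStatusCodes:     # fail immediately on these codes, without retrying
  - 500
  - 503
```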
Deploys
| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| defaultDeployStepWaitTimeMs | 0 | If using an incremental deploy, wait this long between deploy steps if not specified in the deploy | int |
| defaultDeployMaxTaskRetries | 0 | Allow this many tasks to fail and be retried before failing a new deploy | int |
| allowDeployOfPausedRequests | false | If true, paused requests can be deployed without unpausing or starting new tasks at deploy time | boolean |
Limits
| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| maxDeployIdSize | 50 | Deploy ids over this size will cause deploy requests to fail with 400 | int |
| maxRequestIdSize | 100 | Request ids over this size will cause new requests to fail with 400 | int |
Cooldown
Cooldown comes in two types, fast and slow. These are two sets of thresholds: the fast set acts quickly when failures are rapid, while the slow set still provides a notification/signal when failures are slow but repeated.
| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| fastFailureCooldownCount/slowFailureCooldownCount | 3/5 | The number of sequential failures after which a request is placed into system cooldown | int |
| fastFailureCooldownMs/slowFailureCooldownMs | 30000/600000 | The time window during which ...CooldownCount failures must occur | long |
| fastCooldownExpiresMinutesWithoutFailure/slowCooldownExpiresMinutesWithoutFailure | 5/5 | If there are no failures after this time period, the request will exit cooldown | int |
| cooldownMinScheduleSeconds | 120 | When a request enters cooldown, new tasks are delayed by at least this long | long |
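To make the fast/slow pairing concrete, here are the defaults from the table written out as individual root-level keys. The comments are a reading of the descriptions above, not a statement of the exact scheduler logic.

```yaml
# Defaults from the table above, written out explicitly.
fastFailureCooldownCount: 3        # 3 sequential failures ...
fastFailureCooldownMs: 30000       # ... within a 30 second window (fast thresholds)
slowFailureCooldownCount: 5        # 5 sequential failures ...
slowFailureCooldownMs: 600000      # ... within a 10 minute window (slow thresholds)
fastCooldownExpiresMinutesWithoutFailure: 5   # exit cooldown after 5 failure-free minutes
slowCooldownExpiresMinutesWithoutFailure: 5
cooldownMinScheduleSeconds: 120    # while in cooldown, delay new tasks by at least 2 minutes
```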
Load Balancer API
| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| loadBalancerQueryParams | null | Additional query parameters to pass to the Load Balancer API | Map |
| loadBalancerRequestTimeoutMillis | 2000 | The timeout for making API calls to the Load Balancer API (these will be retried) | long |
| loadBalancerUri | null | The URI of the Load Balancer API (Baragon) | string |
| deleteRemovedRequestsFromLoadBalancer | false | If a request is removed from Singularity, issue a DELETE to the load balancer for that service | boolean |
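A sketch of a load balancer block, assuming a Baragon instance is reachable at a made-up address. The URI, its path, and the query parameter name and value are all hypothetical placeholders; substitute whatever your load balancer API actually expects.

```yaml
# Hypothetical load balancer settings; the URI and query parameter are placeholders.
loadBalancerUri: http://baragon.example.com:8080/load-balancer-api
loadBalancerRequestTimeoutMillis: 2000       # default; timed-out calls are retried
loadBalancerQueryParams:
  exampleParam: example-value                # hypothetical extra query parameter
deleteRemovedRequestsFromLoadBalancer: true  # issue a DELETE when a request is removed
```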
User Interface
| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| sandboxDefaultsToTaskId | false | If true, the Singularity API will return the sandbox view of root/taskId when queried without a path (useful when using SingularityExecutor) | boolean |
| enableCorsFilter | false | If true, provides a Bundle which will enable CORS | boolean |
Internal Scheduler Configuration
These settings are less likely to be changed, but were included in the configuration instead of hardcoding values.
Pollers
| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| checkDeploysEverySeconds | 5 | Check the status (health) of pending deploys, promoting them to active or removing them, on this interval | long |
| checkNewTasksEverySeconds | 5 | Check the health of new (non-deployed, non-healthchecked) tasks on this interval to make sure they eventually get to running | long |
| checkSchedulerEverySeconds | 5 | Runs scheduler checks (processes decommissions and pending queue) on this interval (these tasks also run when an offer is received) | long |
| checkWebhooksEveryMillis | 10000 (10 seconds) | Will check for and send new queued webhooks on this interval | long |
| cleanupEverySeconds | 5 | Will clean up request, task, and other queues on this interval | long |
| persistHistoryEverySeconds | 3600 (1 hour) | Moves stale historical task data from ZooKeeper into the database on this interval; setting this to 0 disables history persistence | long |
| saveStateEverySeconds | 60 | State about this Singularity instance is saved (available over API) on this interval | long |
| checkJobsEveryMillis | 600000 (10 minutes) | Check for jobs running longer than the expected time on this interval | long |
| checkExpiringUserActionEveryMillis | 45000 | Check for actions that are due to expire on this interval | long |
Mesos
| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| checkReconcileWhenRunningEveryMillis | 30000 (30 seconds) | When reconciling tasks, will re-request task updates on this interval until reconciliation finishes | long |
| startNewReconcileEverySeconds | 600 (10 minutes) | Starts a new reconciliation cycle (if one is not currently running) on this interval (a relatively costly operation that detects updates Mesos failed to deliver) | long |
| askDriverToKillTasksAgainAfterMillis | 300000 (5 minutes) | Amount of time to wait before instructing Mesos to kill a task which has been killed by Singularity but is still running | long |
Thread Pools
| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| checkNewTasksScheduledThreads | 3 | Max number of threads to use to check new tasks | int |
| healthcheckStartThreads | 3 | Max number of threads to use to start healthchecks | int |
| logFetchMaxThreads | 15 | Max number of threads to use to fetch log directories from Mesos REST API | int |
Operational
| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| closeWaitSeconds | 5 | Will wait at least this many seconds when shutting down thread pools | long |
| compressLargeDataObjects | true | Will compress larger objects inside of ZooKeeper and the database | boolean |
| maxHealthcheckResponseBodyBytes | 8192 | Number of bytes to save from healthcheck responses (displayed in UI) | int |
| maxQueuedUpdatesPerWebhook | 50 | Max number of updates to queue for a given webhook url, after which some webhooks will not be delivered | int |
| zookeeperAsyncTimeout | 5000 | Milliseconds for ZooKeeper timeout. Calls to ZooKeeper which take over this timeout will cause the operations to fail and Singularity to abort | long |
| cacheStateForMillis | 30000 (30 seconds) | Amount of time to cache internal state for when requested over API | long |
| sandboxHttpTimeoutMillis | 5000 (5 seconds) | Sandbox HTTP calls will time out after this amount of time (fetching logs for emails / UI) | |
| newTaskCheckerBaseDelaySeconds | 1 | Added to the amount of time to wait before checking a new task | long |
| allowTestResourceCalls | false | If true, allows calls to be made to the test resource, which can test internal methods | boolean |
| deleteDeploysFromZkWhenNoDatabaseAfterHours | 336 (14 days) | Delete deploys from zk when they are older than this if we are not using a database | long |
| maxStaleDeploysPerRequestInZkWhenNoDatabase | infinite (disabled) | Delete oldest deploys from zk when there are more than this number for a given request, if we're not already persisting them to a database | int |
| deleteStaleRequestsFromZkWhenNoDatabaseAfterHours | 336 (14 days) | Delete stale requests after this amount of time if we are not using a database | long |
| maxRequestsWithHistoryInZkWhenNoDatabase | infinite (disabled) | Delete history of oldest requests from zk when there are more than this number of requests, if we're not already persisting them to a database | int |
| deleteTasksFromZkWhenNoDatabaseAfterHours | 168 (7 days) | Delete old tasks from zk after this amount of time if we are not using a database | long |
| maxStaleTasksPerRequestInZkWhenNoDatabase | infinite (disabled) | Delete oldest tasks from zk when there are more than this number for a given request, if we're not already persisting them to a database | int |
| taskPersistAfterStartupBufferMillis | 60000 (1 minute) | Wait this long after a task starts before persisting it in history | long |
| deleteDeadSlavesAfterHours | 168 (7 days) | Remove dead agents from the list after this amount of time | long |
| deleteUndeliverableWebhooksAfterHours | 168 (7 days) | Delete (and stop retrying) failed webhooks after this amount of time | long |
| waitForListeners | true | If true, the event system waits for all listeners to finish processing an event | boolean |
| warnIfScheduledJobIsRunningForAtLeastMillis | 86400000 (1 day) | Warn if a scheduled job has been running for this long | long |
| warnIfScheduledJobIsRunningPastNextRunPct | 200 | Warn if a scheduled job has run this much past its next scheduled run time (e.g. 200 => ran through next two run times) | int |
| pendingDeployHoldTaskDuringDecommissionMillis | 600000 (10 minutes) | Don't kill tasks on a decommissioning agent that are part of a pending deploy for this amount of time to allow the deploy to complete | long |
| defaultBounceExpirationMinutes | 60 | Expire a bounce after this many minutes if an expiration is not provided in the request to bounce | int |
| cacheOffers | false | Hold on to unused offers for up to cacheOffersForMillis | boolean |
| cacheOffersForMillis | | If cacheOffers is true, decline offers after this amount of time if they have not been used | long |
| offerCacheSize | | The maximum number of offers to cache at once | int |
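The three offer-caching parameters work together, so a combined fragment may help. This is only a sketch: no defaults are listed above for cacheOffersForMillis or offerCacheSize, so both values below are assumptions chosen for illustration.

```yaml
# Offer caching sketch; both numeric values are assumed, not documented defaults.
cacheOffers: true
cacheOffersForMillis: 60000   # assumed: decline cached offers left unused for 1 minute
offerCacheSize: 100           # assumed: hold at most 100 offers at once
```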
Mesos Configuration
These settings should live under the "mesos" field inside the root configuration.
Framework
| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| master | null | A comma-separated list of Mesos masters in the form http(s)://user:password@host:port. The user and password are optional, and http is used if no protocol is provided | String |
| frameworkName | null | | String |
| frameworkId | null | | String |
| frameworkFailoverTimeout | 0.0 | | double |
| frameworkRole | null | Specify framework's desired role when Singularity registers with the master | String |
| checkpoint | true | | boolean |
| credentialPrincipal | | Used to enable authorization based on the authenticated principal | String |
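A minimal sketch of the framework settings nested under the "mesos" field. The master hostnames, framework name and id, and failover timeout are hypothetical values chosen for illustration.

```yaml
# Hypothetical framework settings; hosts, names, and timeout are placeholders.
mesos:
  # Comma-separated masters; user:password@ is optional, http is assumed without a protocol.
  master: http://mesos-master1.example.com:5050,http://mesos-master2.example.com:5050
  frameworkName: Singularity
  frameworkId: Singularity
  frameworkFailoverTimeout: 1000000.0   # illustrative value (default is 0.0)
  checkpoint: true
```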
Resource Limits
| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| defaultCpus | 1 | Number of CPUs to request for a task if none are specified | int |
| defaultMemory | 64 | MB of memory to request for a task if none is specified | int |
| defaultDisk | 1024 | MB of disk to request for a task if none is specified | int |
| maxNumInstancesPerRequest | 25 | Max instances (tasks) to allow for a request (requests using over this will return a 400) | int |
| maxNumCpusPerInstance | 50 | Max number of CPUs allowed on a given task | int |
| maxNumCpusPerRequest | 900 | Max number of CPUs allowed for a given request (cpus per task * task instances) | int |
| maxMemoryMbPerInstance | 24000 | Max MB of memory allowed on a given task | int |
| maxMemoryMbPerRequest | 450000 | Max MB of memory allowed for a given request (memoryMb per task * task instances) | int |
Racks
| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| rackIdAttributeKey | rackid | The Mesos agent attribute to denote a rack | string |
| defaultRackId | DEFAULT | The rackId to assign to an agent if no rackId attribute value is present | string |
Agents
| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| agentHttpPort | 5051 | The port to talk to agents on | int |
| agentHttpsPort | absent | The HTTPS port to talk to agents on | Integer (Optional) |
Offers
| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| allocatedResourceWeight | 0.5 | This portion of an offer's score depends on the amount of resources currently allocated by Mesos on the Mesos agent | double |
| inUseResourceWeight | 0.5 | This portion of an offer's score depends on the currently used resources on a Mesos agent as reported by the agent statistics endpoint | double |
| cpuWeight | 0.4 | The weight the agent's cpu carries when scoring an offer | double |
| memWeight | 0.4 | The weight the agent's memory carries when scoring an offer | double |
| diskWeight | 0.2 | The weight the agent's disk carries when scoring an offer | double |
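A sketch of tuning the offer-scoring weights, assuming (as this section's placement suggests) that they live under the "mesos" field. The defaults happen to sum to 1.0 within each group; this example preserves that, though whether it is required is not stated here. The specific split is an illustrative assumption.

```yaml
# Illustrative weights; placement under "mesos" is assumed from the section layout.
mesos:
  # Lean on measured usage more than on Mesos allocation when scoring an offer.
  allocatedResourceWeight: 0.3
  inUseResourceWeight: 0.7
  # Per-resource weights, kept summing to 1.0 like the defaults (0.4 / 0.4 / 0.2).
  cpuWeight: 0.5
  memWeight: 0.4
  diskWeight: 0.1
```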
Database
Network Configuration
These settings should live under the "network" field of the root configuration.
| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| defaultPortMapping | false | If no port mapping is provided, map all Mesos-provided ports to the host | boolean |
History Purging
These settings live under the "historyPurging" field in the root configuration.
| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| deleteTaskHistoryAfterDays | 365 | Purge tasks older than this many days | int |
| deleteTaskHistoryAfterTasksPerRequest | 10000 | Purge oldest tasks when there are more than this many associated with a single request | int |
| deleteTaskHistoryBytesInsteadOfEntireRow | true | Only delete the taskHistoryBytes instead of the entire record of the task (e.g. to save space) | boolean |
| checkTaskHistoryEveryHours | 24 | Run the purge every x hours | int |
| enabled | false | Whether to run the database purge | boolean |
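A sketch of enabling the purge, assuming the field is spelled `historyPurging`. The retention numbers are illustrative and tighter than the defaults.

```yaml
# Illustrative purge settings; the field name historyPurging is assumed.
historyPurging:
  enabled: true                                   # the purge is off by default
  deleteTaskHistoryAfterDays: 90                  # purge tasks older than 90 days
  deleteTaskHistoryAfterTasksPerRequest: 5000     # keep at most 5000 tasks per request
  deleteTaskHistoryBytesInsteadOfEntireRow: true  # drop only taskHistoryBytes, keep the row
  checkTaskHistoryEveryHours: 12                  # run the purge twice a day
```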
S3
These settings live under the "s3" field in the root configuration. If using the SingularityS3Uploader, this section must be provided in order to list and download S3 logs from the SingularityUI.
| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| maxS3Thread | 3 | Max threads to run for fetching logs from S3 | int |
| waitForS3ListSeconds | 5 | Timeout in seconds for fetching list of S3 logs | int |
| waitForS3LinksSeconds | 1 | Timeout in seconds for creating new S3 links | int |
| expireS3LinksAfterMillis | 86400000 (1 day) | Expire generated S3 log links after this amount of time | long |
| s3Bucket | | S3 bucket to search for logs | String |
| groupOverrides | | Extra S3 configurations provided such that individual requests may use separate S3 buckets. Each S3GroupOverrideConfiguration has a name specified by the Map key and consists of an s3Bucket, s3AccessKey, and s3SecretKey | Map |
| s3KeyFormat | | Search for logs with keys in this format; should be the same as the key format set in the SingularityS3Uploader | String |
| s3AccessKey | | AWS access key for the specified S3 bucket | String |
| s3SecretKey | | AWS secret key for the specified S3 bucket | String |
| missingTaskDefaultS3SearchPeriodMillis | 259200000 (3 days) | Search this far back for S3 logs when no task data is found | long |
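A sketch of an "s3" block with one group override. Bucket names, the override name `analytics`, and all credentials are hypothetical placeholders, and the key format is left as a placeholder because it must match whatever the SingularityS3Uploader is configured with.

```yaml
# Hypothetical S3 settings; buckets, keys, and the override name are placeholders.
s3:
  s3Bucket: example-singularity-logs
  s3AccessKey: EXAMPLE_ACCESS_KEY
  s3SecretKey: EXAMPLE_SECRET_KEY
  s3KeyFormat: "REPLACE_WITH_UPLOADER_KEY_FORMAT"   # copy from the SingularityS3Uploader config
  expireS3LinksAfterMillis: 86400000                # default: links expire after 1 day
  groupOverrides:
    analytics:                                      # map key is the override's name
      s3Bucket: example-analytics-logs
      s3AccessKey: EXAMPLE_ANALYTICS_ACCESS_KEY
      s3SecretKey: EXAMPLE_ANALYTICS_SECRET_KEY
```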
Sentry
These settings live under the "sentry" field in the root config and enable Singularity error reporting to sentry.
| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| dsn | | Sentry DSN (Data Source Name) | String |
| prefix | "" | Prefix string for event culprit naming and messages | String |
SMTP
These settings live under the "smtp" field in the root config.
| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| username | | SMTP username | String |
| password | | SMTP password | String |
| taskLogLength | 512 | Send this many lines of a task's log in emails | int |
| host | localhost | Host for SMTP session | String |
| port | 25 | Port for SMTP session | int |
| from | "singularity-no-reply@example.com" | Send emails from this address | String |
| mailMaxThreads | 3 | Max threads for the email sending process | int |
| admins | [] | List of admin user emails | List<String> |
| rateLimitAfterNotifications | 5 | Rate limit email sending after this many notifications have been sent in rateLimitPeriodMillis | int |
| rateLimitPeriodMillis | 600000 (10 minutes) | Time period for rateLimitAfterNotifications | long |
| rateLimitCooldownMillis | 3600000 (1 hour) | Cooldown time before rate limiting is removed | long |
| taskEmailTailFiles | [stdout, stderr] | Send the tail of these files in messages about tasks | List<String> |
| emails | See below | See below | Map<EmailType, List<EmailDestination>> |
| subjectPrefix | unset | String prepended to the email subject line | String |
| ssl | false | Connect to SMTP host over SSL | boolean |
You may need libmail-java installed on your Singularity master host in order to connect to your SMTP server.
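A sketch of the connection-related SMTP settings, kept separate from the emails map described in the next section. The host, credentials, addresses, and subject prefix are hypothetical placeholders, and port 465 is simply a common choice for SMTP over SSL, not a value from this document.

```yaml
# Hypothetical SMTP connection settings; host, credentials, and addresses are placeholders.
smtp:
  host: smtp.example.com
  port: 465                     # assumed SSL port; the documented default is 25
  ssl: true
  username: singularity
  password: change-me
  from: singularity-no-reply@example.com
  subjectPrefix: "[singularity]"
  admins:                       # recipients used for the ADMINS destination
    - ops@example.com
  taskEmailTailFiles:
    - stdout
    - stderr
```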
Emails List
The emails list determines which recipients are notified and for which events. You can specify a map of EmailType to a list of EmailDestinations.
EmailType corresponds to the different events that could trigger emails, such as TASK_LOST or TASK_FAILED.
EmailDestination corresponds to one of OWNERS (as listed on the Singularity Request), ACTION_TAKER (the user who triggered the action causing the email update), or ADMINS (specified in the config as seen above).
An email list might look something like:
smtp:
  emails:
    TASK_LOST:
      - OWNERS
    TASK_FAILED:
      - OWNERS
    TASK_FAILED_DECOMISSIONED:
      - OWNERS
    TASK_KILLED:
      - OWNERS
    TASK_KILLED_DECOMISSIONED:
      - OWNERS
    TASK_KILLED_UNHEALTHY:
      - OWNERS
    TASK_SCHEDULED_OVERDUE_TO_FINISH:
      - OWNERS
    TASK_FINISHED_ON_DEMAND:
      - OWNERS
    TASK_FINISHED_RUN_ONCE:
      - OWNERS
    TASK_FINISHED_SCHEDULED:
      - OWNERS
    TASK_FINISHED_LONG_RUNNING:
      - OWNERS
UI Configuration
These settings live under the "ui" field in the root config.
| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| title | "Singularity" | Title shown in the left of the menu bar in the UI | String |
| navColor | "" | Color for nav bar | String |
| baseUrl | | Base URL where the UI will be hosted (e.g. http://localhost:7099/singularity) | String |
| runningTaskLogPath | stdout | Generate link to this log for running tasks on the request page | String |
| finishedTaskLogPath | stdout | Generate link to this log for finished tasks on the request page | String |
| hideNewDeployButton | false | Don't show the 'New Deploy' button | boolean |
| hideNewRequestButton | false | Don't show the 'New Request' button | boolean |
| rootUrlMode | INDEX_CATCHALL | INDEX_CATCHALL: UI is served off of / using a catchall resource. UI_REDIRECT: UI is served off of /ui, path and index redirects there. DISABLED: UI is served off of /ui and the root resource is not served at all | enum / String [INDEX_CATCHALL, UI_REDIRECT, DISABLED] |
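A minimal sketch of a "ui" block. The title is a made-up value, and the base URL reuses the example address from the table.

```yaml
# Illustrative UI settings; the title is a placeholder.
ui:
  title: Singularity (staging)
  baseUrl: http://localhost:7099/singularity
  runningTaskLogPath: stdout
  finishedTaskLogPath: stderr      # link finished tasks to stderr instead of the default stdout
  hideNewDeployButton: true        # hide the 'New Deploy' button
  rootUrlMode: UI_REDIRECT         # serve the UI from /ui and redirect the root there
```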
Zookeeper
These settings live under the "zookeeper" field in the root config.
| Parameter | Default | Description | Type |
|-----------|---------|-------------|------|
| quorum | | Comma-separated host:port list of zk hosts | String |
| sessionTimeoutMillis | 600000 | ZooKeeper session timeout | int |
| connectTimeoutMillis | 60000 | ZooKeeper connection timeout | int |
| retryBaseSleepTimeMilliseconds | 1000 | Wait time between ZooKeeper connection retries | int |
| retryMaxTries | 3 | Max retries to obtain a ZooKeeper connection before aborting | int |
| zkNamespace | | Path under which to store Singularity data in zk (e.g. /singularity) | String |
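A sketch of a "zookeeper" block for a three-node ensemble. The hostnames are hypothetical placeholders, the namespace follows the example given in the table, and the remaining values are the listed defaults written out explicitly.

```yaml
# Hypothetical ZooKeeper settings; hostnames are placeholders.
zookeeper:
  quorum: zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181
  zkNamespace: /singularity              # as in the example in the table above
  sessionTimeoutMillis: 600000           # default
  connectTimeoutMillis: 60000            # default
  retryBaseSleepTimeMilliseconds: 1000   # default
  retryMaxTries: 3                       # default
```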