As with any distributed computation system, taking full advantage of Dask distributed sometimes requires configuration. Some options can be passed as API parameters and/or command line options to the various Dask executables. However, some options can also be entered in the Dask configuration file.
Dask accepts some configuration options in a configuration file, which by default is a .dask/config.yaml file located in your home directory. The file path can be overridden using the DASK_CONFIG environment variable.
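The lookup rule described above can be sketched in a few lines of Python. This is only an illustration: the helper name resolve_config_path is hypothetical and is not part of Dask, and the way Dask itself treats an empty DASK_CONFIG value is not covered here.

```python
import os


def resolve_config_path(environ=None):
    """Return the configuration file path (hypothetical helper, not a Dask API).

    Uses DASK_CONFIG when it is set, otherwise ~/.dask/config.yaml.
    """
    environ = os.environ if environ is None else environ
    default = os.path.join(os.path.expanduser("~"), ".dask", "config.yaml")
    return environ.get("DASK_CONFIG", default)


# With the variable set, its value is used as the path.
print(resolve_config_path({"DASK_CONFIG": "/etc/dask/config.yaml"}))
# Without it, the path falls back to the file in the home directory.
print(resolve_config_path({}))
```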
The file is written in the YAML format, which allows for a human-readable hierarchical key-value configuration. All keys in the configuration file are optional, though Dask will create a default configuration file for you on its first launch.
Here is a synopsis of the configuration file:
logging:
    distributed: info
    distributed.client: warning
    bokeh: critical

# Scheduler options
bandwidth: 100000000    # 100 MB/s estimated worker-worker bandwidth
allowed-failures: 3     # number of retries before a task is considered bad
pdb-on-err: False       # enter debug mode on scheduling error
transition-log-length: 100000

# Worker options
multiprocessing-method: forkserver

# Communication options
compression: auto
tcp-timeout: 30         # seconds delay before calling an unresponsive connection dead
default-scheme: tcp
require-encryption: False   # whether to require encryption on non-local comms
tls:
    ca-file: myca.pem
    scheduler:
        cert: mycert.pem
        key: mykey.pem
    worker:
        cert: mycert.pem
        key: mykey.pem
    client:
        cert: mycert.pem
        key: mykey.pem
    #ciphers:
        #ECDHE-ECDSA-AES128-GCM-SHA256

# Bokeh web dashboard
bokeh-export-tool: False
We will review some of those options hereafter.
The compression key configures the desired compression scheme when transferring data over the network. The default value, “auto”, applies heuristics to try and select the best compression scheme for each piece of data.
The default-scheme key sets the communication scheme used by default. You can override the default (“tcp”) here, but it is recommended to use explicit URIs for the various endpoints instead (for example tls:// if you want to enable TLS communications).
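To illustrate the difference between relying on the default scheme and writing an explicit URI, here is a minimal sketch. The function with_default_scheme is a hypothetical helper written for this example, not the way Dask parses addresses internally, and the address 10.0.0.1:8786 is made up.

```python
def with_default_scheme(address, default_scheme="tcp"):
    """Prefix an address with the default scheme unless it already has one.

    Hypothetical helper for illustration only.
    """
    if "://" in address:
        return address  # explicit URI: the default scheme is not applied
    return f"{default_scheme}://{address}"


print(with_default_scheme("10.0.0.1:8786"))        # tcp://10.0.0.1:8786
print(with_default_scheme("tls://10.0.0.1:8786"))  # tls://10.0.0.1:8786
```

An explicit tls:// URI keeps working as intended even if the default scheme is changed later, which is one reason explicit URIs are the recommended option.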
The require-encryption key controls whether all non-local communications must be encrypted. If true, Dask will refuse to establish any clear-text communications (for example over TCP without TLS), forcing you to use a secure transport such as TLS.
The tcp-timeout key sets the default timeout on TCP sockets. If a remote endpoint is unresponsive (at the TCP layer, not at the distributed layer) for at least the specified number of seconds, the communication is considered closed. This helps detect endpoints that have been killed or have disconnected abruptly.
The tls key configures TLS communications. Several sub-keys are recognized:
ca-file configures the CA certificate file used to authenticate and authorize all endpoints.
ciphers restricts allowed ciphers on TLS communications.
Each kind of endpoint has a dedicated endpoint sub-key: scheduler, worker and client. Each endpoint sub-key also supports several sub-keys:
cert configures the certificate file for the endpoint.
key configures the private key file for the endpoint.
The allowed-failures key sets the number of retries before a “suspicious” task is considered bad. A task is considered “suspicious” if the worker died while executing it.
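The retry rule can be pictured with a small counter. This is a sketch under stated assumptions: the function is_bad is hypothetical, and the exact boundary (whether the task is marked bad once the count exceeds the limit or once it reaches it) is an assumption, not something stated in the text above.

```python
def is_bad(worker_deaths, allowed_failures=3):
    """Decide whether a suspicious task should be considered bad.

    Hypothetical helper. ASSUMPTION: the task is retried up to
    `allowed_failures` times, so it becomes bad only once the number of
    worker deaths observed while running it exceeds that limit.
    """
    return worker_deaths > allowed_failures


for deaths in range(1, 6):
    print(deaths, is_bad(deaths))
```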
The bandwidth key gives the estimated network bandwidth, in bytes per second, from worker to worker. This value is used to estimate the time it takes to ship data from one node to another, and to balance tasks and data accordingly.
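As a worked example of how a bandwidth estimate translates into a transfer-time estimate, the arithmetic is simply size divided by bandwidth. The function below is a hypothetical illustration, not the scheduler's actual cost model, which may take other factors into account.

```python
def estimated_transfer_seconds(nbytes, bandwidth=100_000_000):
    """Estimate the seconds needed to move `nbytes` between two workers.

    Hypothetical helper: `bandwidth` is in bytes per second, and the
    default matches the 100 MB/s value shown in the synopsis.
    """
    return nbytes / bandwidth


# Moving 1 GB at an estimated 100 MB/s takes about ten seconds.
print(estimated_transfer_seconds(1_000_000_000))  # 10.0
```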
The logging key configures the logging settings. There are two possible formats.
The simple, recommended format configures the desired verbosity level for each logger. It also sets default values for several loggers, such as distributed, unless explicitly configured.
A more extended format is possible, following the configuration dictionary schema of Python's logging module. To enable this extended format, there must be a version sub-key, as mandated by the schema. The extended format does not set any default values.
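For reference, this is what a dictionary following that schema looks like when handed directly to Python's standard logging.config.dictConfig. It is a sketch of the schema only: the handler name console and the chosen levels are illustrative assumptions, and how Dask loads such a dictionary from the configuration file is not shown.

```python
import logging
import logging.config

# The "version" key is required by the schema and must be 1.
extended = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {
        # Illustrative handler name; any name can be used here.
        "console": {"class": "logging.StreamHandler", "level": "DEBUG"},
    },
    "loggers": {
        "distributed": {"level": "INFO", "handlers": ["console"]},
        "distributed.client": {"level": "WARNING"},
    },
}

logging.config.dictConfig(extended)
print(logging.getLevelName(logging.getLogger("distributed").level))
```

Because the extended format sets no defaults, every logger, handler and level that matters has to be spelled out in the dictionary.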
Python's standard logging module uses a hierarchical logger tree. For example, configuring the logging level for the distributed logger will also affect its children, such as distributed.client, unless explicitly overridden.
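The inheritance behaviour can be checked with the standard logging module alone. In this sketch the logger name distributed.scheduler is only an illustrative child name chosen for the example.

```python
import logging

parent = logging.getLogger("distributed")
overridden_child = logging.getLogger("distributed.client")
plain_child = logging.getLogger("distributed.scheduler")  # illustrative name

parent.setLevel(logging.INFO)
overridden_child.setLevel(logging.WARNING)  # explicit override on one child

# A child without its own level inherits the parent's level.
print(logging.getLevelName(plain_child.getEffectiveLevel()))       # INFO
# A child with an explicit level keeps it.
print(logging.getLevelName(overridden_child.getEffectiveLevel()))  # WARNING
```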