Information about the current state of the network helps to track progress, identify performance issues, and debug failures.
Dask.distributed includes a web interface to help deliver this information over
a normal web page in real time. This web interface is launched by default
wherever the scheduler is launched if the scheduler machine has Bokeh
conda install bokeh -c bokeh).
List of Servers¶
There are a few sets of diagnostic pages served at different ports:
- Main Scheduler pages at
http://scheduler-address:8787. These pages, particularly the
/statuspage are the main page that most people associate with Dask. These pages are served from a separate standalone Bokeh server application running in a separate process.
- Debug Scheduler pages at
http://scheduler-address:8788. These pages have more detailed diagnostic information about the scheduler. They are more often used by developers than by users, but may still be of interest to the performance-conscious. These pages run from inside the scheduler process, and so compete for resources with the main scheduler.
- Debug Worker pages for each worker at
http://worker-address:8789. These pages have detailed diagnostic information about the worker. Like the diagnostic scheduler pages they are of more utility to developers or to people looking to understand the performance of their underlying cluster. If port 8789 is unavailable (for example it is in use by another worker) then a random port is chosen. A list of all ports can be obtained from looking at the service ports for each worker in the result of calling
The rest of this document will be about the main pages at
The available pages are
<page> is one of
status: a stream of recently run tasks, progress bars, resource use
tasks: a larger stream of the last 100k tasks
workers: basic information about workers and their current load
The following plots show a trace of the following computation:
from distributed import Client from time import sleep import random def inc(x): sleep(random.random() / 10) return x + 1 def dec(x): sleep(random.random() / 10) return x - 1 def add(x, y): sleep(random.random() / 10) return x + y client = Client('127.0.0.1:8786') incs = client.map(inc, range(100)) decs = client.map(dec, range(100)) adds = client.map(add, incs, decs) total = client.submit(sum, adds) del incs, decs, adds total.result()
The interface shows the progress of the various computations as well as the exact number completed.
Each bar is assigned a color according to the function being run. Each bar has a few components. On the left the lighter shade is the number of tasks that have both completed and have been released from memory. The darker shade to the right corresponds to the tasks that are completed and whose data still reside in memory. If errors occur then they appear as a black colored block to the right.
Typical computations may involve dozens of kinds of functions. We handle this visually with the following approaches:
- Functions are ordered by the number of total tasks
- The colors are assigned in a round-robin fashion from a standard palette
- The progress bars shrink horizontally to make space for more functions
- Only the largest functions (in terms of number of tasks) are displayed
Counts of tasks processing, waiting for dependencies, processing, etc.. are displayed in the title bar.
The interface shows the relative memory use of each function with a horizontal bar sorted by function name.
The title shows the number of total bytes in use. Hovering over any bar tells you the specific function and how many bytes its results are actively taking up in memory. This does not count data that has been released.
The task stream plot shows when tasks complete on which workers. Worker cores are on the y-axis and time is on the x-axis. As a worker completes a task its start and end times are recorded and a rectangle is added to this plot accordingly.
If data transfer occurs between workers a red bar appears preceding the task bar showing the duration of the transfer. If an error occurs than a black bar replaces the normal color. This plot show the last 1000 tasks. It resets if there is a delay greater than 10 seconds.
For a full history of the last 100,000 tasks see the
The resources plot show the average CPU and Memory use over time as well as
average network traffic. More detailed information on a per-worker basis is
available in the
Connecting to Web Interface¶
dask-scheduler prints out the address of the web interface:
INFO - Bokeh UI at: http://10.129.39.91:8787/status ... INFO - Starting Bokeh server on port 8787 with applications at paths ['/status', '/tasks']
The machine hosting the scheduler runs an HTTP server serving at that address.
Some clusters restrict the ports that are visible to the outside world. These
ports may include the default port for the web interface,
8787. There are
a few ways to handle this:
- Open port
8787to the outside world. Often this involves asking your cluster administrator.
- Use a different port that is publicly accessible using the
--bokeh-port PORToption on the
- Use fancier techniques, like Port Forwarding
Running distributed on a remote machine can cause issues with viewing the web UI – this depends on the remote machines network configuration.
If you have SSH access then one way to gain access to a blocked port is through SSH port forwarding. A typical use case looks like the following:
local$ ssh -L 8000:localhost:8787 user@remote remote$ dask-scheduler # now, the web UI is visible at localhost:8000 remote$ # continue to set up dask if needed -- add workers, etc
It is then possible to go to
localhost:8000 and see Dask Web UI. This same approach is
not specific to dask.distributed, but can be used by any service that operates over a
network, such as Jupyter notebooks. For example, if we chose to do this we could
forward port 8888 (the default Jupyter port) to port 8001 with
ssh -L 8001:localhost:8888 user@remote.