7.3. Profiling and logging

7.3.1. Profiling memory consumption with memray

MIRGE-Com automatically tracks overall memory consumption on host and devices via logpyle, but memray can be used to gain a finer-grained understanding of how much memory is allocated in which parts of the code.

MIRGE-Com allocates two types of memory during execution:

  1. Python host memory for numpy data, Python lists and dicts, etc. This memory is always heap-allocated via malloc() calls.

  2. OpenCL device memory for the mesh, etc. At the time of this writing, this memory is by default allocated via OpenCL’s Shared Virtual Memory (SVM) mechanism and managed through a pyopencl memory pool (see the sketch after this list). When running with pocl on the CPU, the SVM memory is allocated via malloc() calls. When running with pocl on Nvidia GPUs, the SVM memory is allocated using CUDA’s managed (unified) memory, via cuMemAllocManaged().
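
For illustration, here is a minimal sketch of how such an SVM memory pool can be set up with pyopencl; the exact setup inside MIRGE-Com may differ in its details:

import pyopencl as cl
import pyopencl.tools as cl_tools

cl_ctx = cl.create_some_context()
queue = cl.CommandQueue(cl_ctx)

# SVM allocations are tied to a context and (optionally) a queue...
alloc = cl_tools.SVMAllocator(cl_ctx, alignment=0, queue=queue)
# ...and a pool on top of the allocator amortizes allocation cost.
pool = cl_tools.SVMPool(alloc)

The resulting pool can then be passed as the allocator argument of the array context.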

After installing memray (e.g. via $ conda install memray), memory consumption can be profiled on Linux or macOS as follows:

# Collect the trace:
$ python -m memray run --native -m mpi4py examples/wave.py --lazy
[...]
# Create a flamegraph HTML
$ python -m memray flamegraph memray-wave.py.44955.bin
[...]
# Open the HTML file
$ open memray-flamegraph-wave.py.44955.html

Note

The flamegraph analysis (as well as other analysis tools) needs to be run on the same system where the trace was collected, as it needs access to the symbols from the machine’s binaries. The resulting HTML files can be opened on any system.

Note

Although tracing the allocations has a low performance overhead, the resulting trace files and flamegraphs can reach sizes of hundreds of MBytes. memray releases after 1.6.0 will include an option (--aggregate) to reduce the sizes of these files.

Warning

For the reasons outlined in the next subsection, we strongly recommend profiling memory consumption on CPUs rather than on GPUs.

7.3.1.1. Common issues

  1. Incorrectly low memory consumption when running with pocl-cuda on GPUs

    When running with pocl-cuda on Nvidia GPUs, the memory consumption will appear much lower than when running the same analysis on the CPU. This is because we use unified memory on Nvidia GPUs, where SVM allocations are counted against the CUDA driver and runtime instead of the application, hiding them from tools such as ps or memray. The overall consumption can still be estimated by looking at the system memory, e.g. via free.

  2. High virtual memory consumption with an installed pocl-cuda

    When pocl-cuda initializes, it consumes a large amount of virtual memory (~100 GByte). Since memray sizes the flamegraph according to the virtual memory consumed, we recommend disabling or uninstalling pocl-cuda when profiling memory consumption (e.g. via $ conda uninstall pocl-cuda) to make the memray output easier to interpret.

7.3.2. Profiling kernel execution

You can use mirgecom.profiling.PyOpenCLProfilingArrayContext instead of PyOpenCLArrayContext to profile kernel executions. In addition to using this array context, you also need to enable profiling in the underlying pyopencl.CommandQueue, like this:

import pyopencl as cl

queue = cl.CommandQueue(cl_ctx,
         properties=cl.command_queue_properties.PROFILING_ENABLE)

Note that profiling has a performance impact (~20% at the time of this writing).
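
Building on the queue created above, a minimal usage sketch looks as follows (the simulation code itself is elided):

from mirgecom.profiling import PyOpenCLProfilingArrayContext

actx = PyOpenCLProfilingArrayContext(queue)

# ... run simulation steps that use actx ...

# Print a pytools.Table with the accumulated profiling results.
print(actx.tabulate_profiling_data())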

class mirgecom.profiling.PyOpenCLProfilingArrayContext(queue, allocator=None, logmgr=None)[source]

An array context that profiles OpenCL kernel executions.

Parameters:

logmgr (LogManager | None)

tabulate_profiling_data()[source]

Return a pytools.Table with the profiling results.

Return type:

Table

call_loopy(t_unit, **kwargs)[source]

Execute the loopy kernel and profile it.

Return type:

dict

get_profiling_data_for_kernel(kernel_name)[source]

Return profiling data for kernel kernel_name.

Parameters:

kernel_name (str)

Return type:

MultiCallKernelProfile

reset_profiling_data_for_kernel(kernel_name)[source]

Reset profiling data for kernel kernel_name.

Parameters:

kernel_name (str)

Return type:

None

Inherits from arraycontext.PyOpenCLArrayContext.

Note

Profiling of pyopencl kernels (that is, kernels that do not get called through call_loopy()) is restricted to a single instance of this class. If there are multiple instances, only the first one created will be able to profile these kernels.
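
As a sketch, per-kernel results can also be queried and reset between measurement intervals; the kernel name "diff" below is a hypothetical placeholder, since actual names depend on the generated code:

# "diff" is a hypothetical kernel name.
profile = actx.get_profiling_data_for_kernel("diff")
print(profile.num_calls, profile.time, profile.flops,
      profile.bytes_accessed)

# Discard the accumulated data to start a fresh measurement.
actx.reset_profiling_data_for_kernel("diff")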

class mirgecom.profiling.SingleCallKernelProfile(time, flops, bytes_accessed, footprint_bytes)[source]

Class to hold the results of a single kernel execution.

Parameters:
  • time (int)

  • flops (int)

  • bytes_accessed (int)

  • footprint_bytes (int | None)

class mirgecom.profiling.MultiCallKernelProfile(num_calls, time, flops, bytes_accessed, footprint_bytes)[source]

Class to hold the results of multiple kernel executions.

7.3.3. Time series logging

MIRGE-Com supports logging of simulation and profiling quantities with the help of logpyle. logpyle requires classes that describe how the quantities to be logged are computed; the MIRGE-Com-specific quantity classes are documented below.
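
As a minimal setup sketch, assuming comm is an mpi4py communicator, queue is the pyopencl.CommandQueue in use, and the log file name is illustrative:

from mirgecom.logging_quantities import (initialize_logmgr,
    logmgr_add_cl_device_info, logmgr_add_device_memory_usage,
    PythonMemoryUsage)

logmgr = initialize_logmgr(True, filename="sim-log.sqlite", mode="wu",
                           mpi_comm=comm)

# Record static device information and per-step memory usage.
logmgr_add_cl_device_info(logmgr, queue)
logmgr_add_device_memory_usage(logmgr, queue)
logmgr.add_quantity(PythonMemoryUsage())

# In the time stepping loop:
logmgr.tick_before()
# ... perform one time step ...
logmgr.tick_after()

# After the simulation finishes:
logmgr.close()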

class mirgecom.logging_quantities.StateConsumer(extract_vars_for_logging)[source]

Base class for quantities that require a state for logging.

Parameters:

extract_vars_for_logging (Callable)

__init__(extract_vars_for_logging)[source]

Store the function to extract state variables.

Parameters:

extract_vars_for_logging (Callable) – Callable with the signature extract_vars_for_logging(dim, state, eos) that returns a dict(quantity_name: values) of the state variables for a particular state.

set_state_vars(state_vars)[source]

Update the state vector of the object.

Parameters:

state_vars (ndarray)

Return type:

None

class mirgecom.logging_quantities.DiscretizationBasedQuantity(dcoll, quantity, op, extract_vars_for_logging, units_logging, name=None, axis=None, dd=DOFDesc(domain_tag=VolumeDomainTag(tag=<class 'grudge.dof_desc.VTAG_ALL'>), discretization_tag=<class 'grudge.dof_desc.DISCR_TAG_BASE'>))[source]

Logging support for physical quantities.

Possible rank aggregation operations (op) are: min, max, L2_norm.
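
For example, a per-rank minimum pressure can be tracked as follows. This is a sketch that assumes dcoll, extract_vars_for_logging (producing a "pressure" entry) and logmgr exist in the surrounding setup; the quantity and unit names are illustrative:

from mirgecom.logging_quantities import DiscretizationBasedQuantity

# Track the minimum of the "pressure" state variable on this rank.
logmgr.add_quantity(DiscretizationBasedQuantity(
    dcoll, "pressure", "min", extract_vars_for_logging, "P",
    name="min_pressure"))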

class mirgecom.logging_quantities.KernelProfile(actx, kernel_name)[source]

Logging support for statistics of the OpenCL kernel profiling (num_calls, time, flops, bytes_accessed, footprint).

All statistics except num_calls are averages.

Parameters:
  • actx (PyOpenCLArrayContext) – The array context from which to collect statistics. Must have profiling enabled in the OpenCL command queue.

  • kernel_name (str) – Name of the kernel to profile.
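
For example, assuming actx is a PyOpenCLProfilingArrayContext with profiling enabled and "diff" is a (hypothetical) kernel name:

from mirgecom.logging_quantities import KernelProfile

# "diff" is a hypothetical kernel name; actual names depend on the
# kernels generated for the simulation.
logmgr.add_quantity(KernelProfile(actx, "diff"))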

class mirgecom.logging_quantities.PythonMemoryUsage(name=None)[source]

Logging support for Python memory usage (RSS, host).

Uses psutil to track memory usage. Virtually no overhead.

Parameters:

name (str | None)

class mirgecom.logging_quantities.DeviceMemoryUsage(name=None)[source]

Logging support for GPU memory usage (currently Nvidia only).

Parameters:

name (str | None)

mirgecom.logging_quantities.initialize_logmgr(enable_logmgr, filename=None, mode='wu', mpi_comm=None)[source]

Create and initialize a mirgecom-specific logpyle.LogManager.

Parameters:
  • enable_logmgr (bool)

  • filename (str | None)

  • mode (str)

  • mpi_comm – MPI communicator to attach to the LogManager (optional)

Return type:

LogManager | None

mirgecom.logging_quantities.logmgr_add_cl_device_info(logmgr, queue)[source]

Add information about the OpenCL device to the log.

Parameters:
  • logmgr (LogManager)

  • queue (CommandQueue)

Return type:

None

mirgecom.logging_quantities.logmgr_add_device_memory_usage(logmgr, queue)[source]

Add the OpenCL device memory usage to the log.

Parameters:
  • logmgr (LogManager)

  • queue (CommandQueue)

Return type:

None

mirgecom.logging_quantities.logmgr_add_many_discretization_quantities(logmgr, dcoll, dim, extract_vars_for_logging, units_for_logging, dd=DOFDesc(domain_tag=VolumeDomainTag(tag=<class 'grudge.dof_desc.VTAG_ALL'>), discretization_tag=<class 'grudge.dof_desc.DISCR_TAG_BASE'>))[source]

Add default discretization quantities to the logmgr.

Parameters:

logmgr (LogManager)

Return type:

None

mirgecom.logging_quantities.logmgr_add_mempool_usage(logmgr, pool)[source]

Add the memory pool usage to the log.

Parameters:
  • logmgr (LogManager)

  • pool (MemoryPool | SVMPool)

Return type:

None

mirgecom.logging_quantities.add_package_versions(mgr, path_to_version_sh=None)[source]

Add the output of the emirge version.sh script to the log.

Parameters:
  • mgr (LogManager) – The logpyle.LogManager to add the versions to.

  • path_to_version_sh (str | None) – Path to emirge’s version.sh script. The function will attempt to find this script automatically if this argument is not specified.

Return type:

None

mirgecom.logging_quantities.set_sim_state(mgr, dim, state, eos)[source]

Update the simulation state of all StateConsumer of the log manager.

Parameters:

mgr (LogManager) – The logpyle.LogManager whose StateConsumer quantities will receive state.

Return type:

None

mirgecom.logging_quantities.logmgr_set_time(mgr, steps, time)[source]

Set the current (or initial) time and step count explicitly, e.g. when restarting a simulation.

Parameters:
  • mgr (LogManager)

  • steps (int)

  • time (float)

Return type:

None
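
For example, when restarting a simulation (step and t are assumed to have been read from the restart data; the names are illustrative):

from mirgecom.logging_quantities import logmgr_set_time

# Resume the time series at the restored step count and time.
logmgr_set_time(logmgr, step, t)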

An overview of how to use logpyle is given in the logpyle documentation.