
Efficient Metric Collection in PyTorch: Avoiding the Performance Pitfalls of TorchMetrics
Metric collection is an essential part of every machine learning project, enabling us to track model performance and monitor training progress. Ideally, metrics should be collected and computed without introducing any additional overhead to the training process. However, just like other components of the training loop, inefficient metric computation can introduce unnecessary overhead, increase training-step times, and inflate training costs.
This post is the seventh in our series on performance profiling and optimization in PyTorch. The series has aimed to emphasize the critical role of performance analysis and optimization in machine learning development. Each post has focused on a different stage of the training pipeline, demonstrating practical tools and techniques for analyzing and boosting resource utilization and runtime efficiency.
In this post, we focus on metric collection. We will demonstrate how a naive implementation of metric collection can negatively impact runtime performance and explore the tools and techniques available for analyzing and optimizing it.
To implement our metric collection, we will use TorchMetrics, a popular library designed to simplify and standardize metric computation in PyTorch. Our goals will be to:
- Demonstrate the runtime overhead caused by a naive implementation of metric collection.
- Use PyTorch Profiler to identify the performance bottlenecks introduced by metric computation.
- Demonstrate optimization techniques to reduce the overhead of metric collection.
To facilitate our discussion, we will define a toy PyTorch model and assess how metric collection can impact its runtime performance. We will run our experiments on an NVIDIA A40 GPU, with a PyTorch 2.5.1 Docker image and TorchMetrics 1.6.1.
It's important to note that metric collection behavior can vary greatly depending on the hardware, runtime environment, and model architecture. The code snippets provided in this post are intended for demonstrative purposes only. Please do not interpret our mention of any tool or library as an endorsement of its use.
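For readers who want to reproduce the setup, the short check below is our own addition (not part of the original scripts); it simply verifies that the installed library versions and the visible GPU match the configuration described above.
import torch
import torchmetrics

# Sanity-check the environment against the versions used in this post
assert torch.__version__.startswith("2.5"), torch.__version__
assert torchmetrics.__version__.startswith("1.6"), torchmetrics.__version__
print(torch.cuda.get_device_name(0))  # "NVIDIA A40" in our setup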
Toy ResNet Model
In the code block below, we define a simple image classification model with a ResNet-18 backbone.
import time
import torch
import torchvision
device = "cuda"
model = torchvision.models.resnet18().to(device)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters())
We define a synthetic dataset which we will use to train our toy model.
from torch.utils.data import Dataset, DataLoader

# A dataset with random images and labels
class FakeDataset(Dataset):
    def __len__(self):
        return 100000000

    def __getitem__(self, index):
        rand_image = torch.randn([3, 224, 224], dtype=torch.float32)
        label = torch.tensor(data=index % 1000, dtype=torch.int64)
        return rand_image, label

train_set = FakeDataset()

batch_size = 128
num_workers = 12

train_loader = DataLoader(
    dataset=train_set,
    batch_size=batch_size,
    num_workers=num_workers,
    pin_memory=True
)
We define a collection of standard metrics from TorchMetrics, along with a control flag to enable or disable metric calculation.
from torchmetrics import (
    MeanMetric,
    Accuracy,
    Precision,
    Recall,
    F1Score,
)

# toggle to enable/disable metric collection
capture_metrics = False

if capture_metrics:
    metrics = {
        "avg_loss": MeanMetric(),
        "accuracy": Accuracy(task="multiclass", num_classes=1000),
        "precision": Precision(task="multiclass", num_classes=1000),
        "recall": Recall(task="multiclass", num_classes=1000),
        "f1_score": F1Score(task="multiclass", num_classes=1000),
    }

    # Move all metrics to the device
    metrics = {name: metric.to(device) for name, metric in metrics.items()}
Next, we define a PyTorch Profiler instance, along with a control flag that allows us to enable or disable profiling. For a detailed tutorial on using PyTorch Profiler, please refer to the first post in this series.
from torch import profiler

# toggle to enable/disable profiling
enable_profiler = True

if enable_profiler:
    prof = profiler.profile(
        schedule=profiler.schedule(wait=10, warmup=2, active=3, repeat=1),
        on_trace_ready=profiler.tensorboard_trace_handler("./logs/"),
        profile_memory=True,
        with_stack=True
    )
    prof.start()
Lastly, we define a standard training step:
model.train()

t0 = time.perf_counter()
total_time = 0
count = 0

for idx, (data, target) in enumerate(train_loader):
    data = data.to(device, non_blocking=True)
    target = target.to(device, non_blocking=True)
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()

    if capture_metrics:
        # update metrics
        metrics["avg_loss"].update(loss)
        for name, metric in metrics.items():
            if name != "avg_loss":
                metric.update(output, target)

        if (idx + 1) % 100 == 0:
            # compute metrics
            metric_results = {
                name: metric.compute().item()
                for name, metric in metrics.items()
            }
            # print metrics
            print(f"Step {idx + 1}: {metric_results}")
            # reset metrics
            for metric in metrics.values():
                metric.reset()

    elif (idx + 1) % 100 == 0:
        # print last loss value
        print(f"Step {idx + 1}: Loss = {loss.item():.4f}")

    batch_time = time.perf_counter() - t0
    t0 = time.perf_counter()
    if idx > 10:  # skip first steps
        total_time += batch_time
        count += 1

    if enable_profiler:
        prof.step()

    if idx > 200:
        break

if enable_profiler:
    prof.stop()

avg_time = total_time/count
print(f'Average step time: {avg_time}')
print(f'Throughput: {batch_size/avg_time:.2f} images/sec')
Metric Collection Overhead
To measure the impact of metric collection on training step time, we ran our training script both with and without metric computation. The results are summarized in the table below.

Our naive metric collection resulted in a nearly 10% drop in runtime performance!! Although metric collection is essential for machine learning development, it usually involves relatively simple mathematical operations and hardly warrants such a significant overhead. What is going on?!!
Identifying Performance Issues with PyTorch Profiler
To better understand the source of the performance degradation, we reran the training script with PyTorch Profiler enabled. The resulting trace is shown below:

The trace reveals recurring "cudaStreamSynchronize" operations that coincide with noticeable drops in GPU utilization. These types of "CPU-GPU sync" events were discussed in detail in part two of our series. In a typical training step, the CPU and GPU work in parallel: the CPU manages tasks like data transfers to the GPU and kernel loading, while the GPU executes the model on the input data and updates its weights. Ideally, we would like to minimize the points of synchronization between the CPU and GPU in order to maximize performance. Here, however, we can see that the metric collection has triggered a sync event by performing a CPU-to-GPU data copy. This requires the CPU to suspend its processing until the GPU catches up, which, in turn, causes the GPU to wait for the CPU to resume loading the subsequent kernel operations. The bottom line is that these synchronization points lead to inefficient utilization of both the CPU and the GPU. Our naive metric collection adds eight such synchronization events to each training step.
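To make the mechanism concrete, the minimal sketch below (our own illustration, not part of the training script) reproduces two common triggers of CPU-GPU sync events: reading a GPU result on the host, and creating a GPU tensor from a Python scalar, which is exactly what happens inside the metric update, as we will see next.
import torch

x = torch.randn(1000, device="cuda")

# 1. Reading a GPU result on the host (e.g., loss.item()) forces the CPU
#    to wait until the GPU has finished computing the value.
loss_value = x.sum().item()

# 2. Creating a GPU tensor from a Python scalar issues a CPU-to-GPU copy,
#    another point at which the two processors must synchronize.
w = torch.as_tensor(1.0, device="cuda")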
A closer inspection of the trace shows that the sync events are coming from the update call of our MeanMetric (the avg_loss metric). For an experienced profiling expert, this may already be sufficient to identify the root cause, but we will go a step further and use the torch.profiler.record_function utility to identify the exact offending line of code.
Profiling with record_function
To pinpoint the exact source of the sync event, we extended the MeanMetric class and overrode the update method using record_function context blocks. This approach allows us to profile individual operations within the method and identify performance bottlenecks.
class ProfileMeanMetric(MeanMetric):
    def update(self, value, weight=1.0):
        # broadcast weight to value shape
        with profiler.record_function("process value"):
            if not isinstance(value, torch.Tensor):
                value = torch.as_tensor(value, dtype=self.dtype,
                                        device=self.device)
        with profiler.record_function("process weight"):
            if weight is not None and not isinstance(weight, torch.Tensor):
                weight = torch.as_tensor(weight, dtype=self.dtype,
                                         device=self.device)
        with profiler.record_function("broadcast weight"):
            weight = torch.broadcast_to(weight, value.shape)
        with profiler.record_function("cast_and_nan_check"):
            value, weight = self._cast_and_nan_check_input(value, weight)

        if value.numel() == 0:
            return

        with profiler.record_function("update value"):
            self.mean_value += (value * weight).sum()
        with profiler.record_function("update weight"):
            self.weight += weight.sum()
We then updated our avg_loss metric to use the newly created ProfileMeanMetric and reran the training script.

The updated trace reveals that the sync event originates from the following line:
weight = torch.as_tensor(weight, dtype=self.dtype, device=self.device)
This operation converts the default scalar value weight=1.0 into a PyTorch tensor and places it on the GPU. The sync event occurs because this action triggers a CPU-to-GPU data copy, which requires the CPU to wait until the GPU has processed the copied value.
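As an aside, PyTorch also provides a lightweight way to surface such synchronizing calls without inspecting a full profiler trace. The sketch below uses torch.cuda.set_sync_debug_mode; we did not use it in the experiments above, so consider it a complementary debugging aid rather than part of the original methodology.
import torch

# Warn whenever an operation forces the CPU to synchronize with the GPU
torch.cuda.set_sync_debug_mode("warn")

# ... run a few training steps; blocking calls such as tensor.item() or
# synchronizing copies will emit warnings ...

# Restore the default (silent) behavior
torch.cuda.set_sync_debug_mode("default")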
Optimization 1: Specify the weight value
Now that we have found the source of the issue, we can overcome it easily by specifying a weight value in our update call. This prevents the runtime from converting the default scalar weight=1.0 into a tensor on the GPU, avoiding the sync event:
# update metrics
if capture_metrics:
    metrics["avg_loss"].update(loss, weight=torch.ones_like(loss))
After applying this change, rerunning the script reveals that we have succeeded in eliminating the initial sync event… only to have uncovered a new one, this time coming from the _cast_and_nan_check_input function:

Profiling with record_function – Part 2
To explore our new sync event, we extended our custom metric with additional profiling probes and reran our script.
class ProfileMeanMetric(MeanMetric):
    def update(self, value, weight=1.0):
        # broadcast weight to value shape
        with profiler.record_function("process value"):
            if not isinstance(value, torch.Tensor):
                value = torch.as_tensor(value, dtype=self.dtype,
                                        device=self.device)
        with profiler.record_function("process weight"):
            if weight is not None and not isinstance(weight, torch.Tensor):
                weight = torch.as_tensor(weight, dtype=self.dtype,
                                         device=self.device)
        with profiler.record_function("broadcast weight"):
            weight = torch.broadcast_to(weight, value.shape)
        with profiler.record_function("cast_and_nan_check"):
            value, weight = self._cast_and_nan_check_input(value, weight)

        if value.numel() == 0:
            return

        with profiler.record_function("update value"):
            self.mean_value += (value * weight).sum()
        with profiler.record_function("update weight"):
            self.weight += weight.sum()

    def _cast_and_nan_check_input(self, x, weight=None):
        """Convert input ``x`` to a tensor and check for NaNs."""
        with profiler.record_function("process x"):
            if not isinstance(x, torch.Tensor):
                x = torch.as_tensor(x, dtype=self.dtype,
                                    device=self.device)
        with profiler.record_function("process weight"):
            if weight is not None and not isinstance(weight, torch.Tensor):
                weight = torch.as_tensor(weight, dtype=self.dtype,
                                         device=self.device)

        nans = torch.isnan(x)
        if weight is not None:
            nans_weight = torch.isnan(weight)
        else:
            nans_weight = torch.zeros_like(nans).bool()
            weight = torch.ones_like(x)

        with profiler.record_function("any nans"):
            anynans = nans.any() or nans_weight.any()

        with profiler.record_function("process nans"):
            if anynans:
                if self.nan_strategy == "error":
                    raise RuntimeError("Encountered `nan` values in tensor")
                if self.nan_strategy in ("ignore", "warn"):
                    if self.nan_strategy == "warn":
                        print("Encountered `nan` values in tensor."
                              " Will be removed.")
                    x = x[~(nans | nans_weight)]
                    weight = weight[~(nans | nans_weight)]
                else:
                    if not isinstance(self.nan_strategy, float):
                        raise ValueError(f"`nan_strategy` shall be float"
                                         f" but you pass {self.nan_strategy}")
                    x[nans | nans_weight] = self.nan_strategy
                    weight[nans | nans_weight] = self.nan_strategy

        with profiler.record_function("return value"):
            retval = x.to(self.dtype), weight.to(self.dtype)
        return retval
The resulting trace is captured below:

The trace points directly to the offending line:
anynans = nans.any() or nans_weight.any()
This operation checks for NaN values in the input tensors, but it introduces a costly CPU-GPU synchronization event, since the operation involves copying data from the GPU to the CPU.
Upon closer inspection of the TorchMetrics BaseAggregator class, we find several options for handling NaN values, all of which pass through the offending line of code. However, for our use case – calculating an average loss metric – this check is unnecessary and does not justify the runtime performance penalty.
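To spell out why this line synchronizes: nans.any() produces a zero-dimensional boolean tensor that lives on the GPU, and evaluating it with the Python or operator (or in the subsequent if anynans: statement) requires converting it to a Python bool, which forces a device-to-host copy. The snippet below is our own illustration of this behavior, not code from TorchMetrics.
import torch

x = torch.randn(1024, device="cuda")
nans = torch.isnan(x)

any_nans = nans.any()      # zero-dim boolean tensor, still on the GPU
print(any_nans.device)     # cuda:0

# Using the tensor in a Python boolean context requires its value on the
# host, so the CPU must block here until the GPU produces the result.
if any_nans:
    print("found NaN values")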
Optimization 2: Disable the NaN value checks
To eliminate the overhead, we propose disabling the NaN value checks by overriding the _cast_and_nan_check_input function. Instead of a static override, we implement a dynamic solution that can be applied flexibly to any descendant of the BaseAggregator class.
from torchmetrics.aggregation import BaseAggregator

def suppress_nan_check(MetricClass):
    assert issubclass(MetricClass, BaseAggregator), MetricClass

    class DisableNanCheck(MetricClass):
        def _cast_and_nan_check_input(self, x, weight=None):
            if not isinstance(x, torch.Tensor):
                x = torch.as_tensor(x, dtype=self.dtype,
                                    device=self.device)
            if weight is not None and not isinstance(weight, torch.Tensor):
                weight = torch.as_tensor(weight, dtype=self.dtype,
                                         device=self.device)
            if weight is None:
                weight = torch.ones_like(x)
            return x.to(self.dtype), weight.to(self.dtype)

    return DisableNanCheck

NoNanMeanMetric = suppress_nan_check(MeanMetric)

metrics["avg_loss"] = NoNanMeanMetric().to(device)
Post-Optimization Results: Success
After implementing the two optimizations – specifying the weight value and disabling the NaN checks – we find the step time and GPU utilization to match those of our baseline experiment. In addition, the resulting PyTorch Profiler trace shows that all of the added cudaStreamSynchronize events associated with the metric collection have been eliminated. With a few small changes, we have reduced the cost of training by ~10% without any change to the behavior of the metric collection.
In the next section, we will explore an additional metric collection optimization.
Example 2: Optimizing Metric Device Placement
In the previous section, the metric values resided on the GPU, making it logical to store and compute the metrics on the GPU. However, in scenarios where the values we wish to aggregate reside on the CPU, it may be preferable to store the metrics on the CPU to avoid unnecessary device transfers.
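As a quick illustration of what device placement means for a TorchMetrics metric, the sketch below (our own addition, reusing the NoNanMeanMetric defined earlier) shows that a metric's internal state tensors live on whatever device the metric has been moved to.
step_time_metric = NoNanMeanMetric()
print(step_time_metric.device)   # cpu - metric state stays on the CPU by default

loss_metric = NoNanMeanMetric().to(device)
print(loss_metric.device)        # cuda:0 - metric state now lives on the GPU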
In the code block below, we modify our script to calculate the average step time using a MeanMetric on the CPU. This change has no impact on the runtime performance of our training step:
avg_time = NoNanMeanMetric()
t0 = time.perf_counter()

for idx, (data, target) in enumerate(train_loader):
    # move data to device
    data = data.to(device, non_blocking=True)
    target = target.to(device, non_blocking=True)

    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()

    if capture_metrics:
        metrics["avg_loss"].update(loss)
        for name, metric in metrics.items():
            if name != "avg_loss":
                metric.update(output, target)

        if (idx + 1) % 100 == 0:
            # compute metrics
            metric_results = {
                name: metric.compute().item()
                for name, metric in metrics.items()
            }
            # print metrics
            print(f"Step {idx + 1}: {metric_results}")
            # reset metrics
            for metric in metrics.values():
                metric.reset()

    elif (idx + 1) % 100 == 0:
        # print last loss value
        print(f"Step {idx + 1}: Loss = {loss.item():.4f}")

    batch_time = time.perf_counter() - t0
    t0 = time.perf_counter()
    if idx > 10:  # skip first steps
        avg_time.update(batch_time)

    if enable_profiler:
        prof.step()

    if idx > 200:
        break

if enable_profiler:
    prof.stop()

avg_time = avg_time.compute().item()
print(f'Average step time: {avg_time}')
print(f'Throughput: {batch_size/avg_time:.2f} images/sec')
The problem arises when we attempt to extend our script to support distributed training. To demonstrate the problem, we modified our model definition to use DistributedDataParallel (DDP):
# toggle to enable/disable ddp
use_ddp = True

if use_ddp:
    import os
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=0, world_size=1)
    torch.cuda.set_device(0)
    model = DDP(torchvision.models.resnet18().to(device))
else:
    model = torchvision.models.resnet18().to(device)

# insert training loop

# append to the end of the script:
if use_ddp:
    # destroy the process group
    dist.destroy_process_group()
The DDP modification results in the following error:
RuntimeError: No backend type associated with device type cpu
By default, metrics in a distributed training setting are programmed to synchronize across all of the devices in use. However, the NCCL synchronization backend used by our DDP setup does not support metrics stored on the CPU.
One way to solve this is to disable the cross-device metric synchronization:
avg_time = NoNanMeanMetric(sync_on_compute=False)
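Another option, which we did not evaluate in our experiments (a sketch that relies on the process_group argument of TorchMetrics metrics and on torch.distributed.new_group; treat it as an assumption rather than a recommendation), is to keep the metric on the CPU but have it synchronize over a secondary Gloo process group, since the Gloo backend supports CPU tensors:
import torch.distributed as dist

# Create a Gloo-backed group alongside the default NCCL group; Gloo can
# all-reduce CPU tensors, so the CPU-resident metric can still synchronize.
cpu_group = dist.new_group(backend="gloo")
avg_time = NoNanMeanMetric(process_group=cpu_group)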
In our case, where we are measuring the average step time, disabling synchronization is acceptable. However, in some cases metric synchronization is required, in which case we may prefer to simply move the metric onto the GPU:
avg_time = NoNanMeanMetric().to(device)
Unfortunately, this gives rise to a new CPU-GPU sync event coming from the update function.

This sync event should hardly come as a surprise – after all, we are updating a GPU metric with a value residing on the CPU, which should necessitate a memory copy. However, in the case of a scalar metric, this data transfer can be completely avoided with a simple optimization.
Optimization 3: Perform metric updates with tensors instead of scalars
The solution is straightforward: instead of updating the metric with a Python float value, we convert it to a tensor before calling update:
batch_time = torch.as_tensor(batch_time)
avg_time.update(batch_time, torch.ones_like(batch_time))
This minor change bypasses the problematic line of code, eliminates the sync event, and restores the step time to the baseline performance.
At first glance, this result may seem surprising: we would expect that updating a GPU metric with a CPU tensor should still require a memory copy. However, PyTorch optimizes operations on scalar tensors by using a dedicated kernel that performs the addition without an explicit data transfer. This avoids the expensive synchronization event that would otherwise occur.
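The sketch below (our own illustration, not part of the training script) demonstrates this special case: binary operations between a CUDA tensor and a zero-dimensional CPU tensor are permitted, because PyTorch passes the scalar value directly into the CUDA kernel, whereas the same operation with a one-dimensional CPU tensor raises a device-mismatch error.
import torch

gpu_acc = torch.zeros((), device="cuda")

scalar_cpu = torch.tensor(0.05)    # zero-dim CPU tensor (e.g., a step time)
gpu_acc += scalar_cpu              # allowed: the value is passed to the kernel

vector_cpu = torch.tensor([0.05])  # one-dim CPU tensor
try:
    gpu_acc += vector_cpu          # fails: tensors must be on the same device
except RuntimeError as err:
    print(err)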
Summary
In this post, we explored how a naive approach to TorchMetrics can introduce CPU-GPU synchronization events and significantly degrade PyTorch training performance. Using PyTorch Profiler, we identified the lines of code responsible for these sync events and applied targeted optimizations to eliminate them:
- Explicitly specify a weight tensor when calling the MeanMetric.update function instead of relying on the default value.
- Disable the NaN checks in the base Aggregator class or replace them with a more efficient alternative.
- Carefully control the device placement of each metric to minimize unnecessary transfers.
- Disable cross-device metric synchronization when it is not required.
- When a metric resides on a GPU, convert floating-point scalars to tensors before passing them to the update function in order to avoid the implicit synchronization.
We have created a dedicated pull request on the TorchMetrics GitHub page covering some of the optimizations discussed in this post. Please feel free to contribute your own improvements and optimizations!