@xyuzh xyuzh commented Dec 8, 2025

This PR adds a new example for running GPU health and communication diagnostics on Ray clusters.

Overview

This example demonstrates how to use independent Ray actors to diagnose GPU and NCCL communication issues in distributed training systems. It's designed to run "stop-time diagnostics" when training jobs encounter failures.

Features

  • GPU Health Checks: Validates GPU accessibility via nvidia-smi and runs simple CUDA compute tests
  • Intra-node Communication Tests: Tests GPU-to-GPU communication within a single node using NCCL all-to-all
  • Inter-node Communication Tests: Tests GPU communication across multiple nodes using NCCL all-gather
  • Independent Diagnostic Actors: Ray actors that can be spawned on any GPU without depending on application state
  • Detailed Reporting: Comprehensive test results including node IPs, GPU IDs, metrics, and error traces
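
As an illustration of the GPU health check described above, here is a minimal sketch. It is not the code in diagnostics/actor.py; the function names (parse_gpu_rows, nvidia_smi_check, cuda_compute_check), the result dictionary layout, and the use of PyTorch for the compute test are all assumptions made for this example. It queries nvidia-smi for the visible GPUs and then runs a small matrix multiply on one device.

```python
import csv
import io
import shutil
import subprocess
import traceback

# Standard nvidia-smi --query-gpu properties requested by this sketch.
QUERY_FIELDS = ["index", "name", "memory.used", "memory.total"]


def parse_gpu_rows(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(record) != len(QUERY_FIELDS):
            continue  # skip blank or malformed lines
        rows.append(dict(zip(QUERY_FIELDS, (f.strip() for f in record))))
    return rows


def nvidia_smi_check(timeout_s=30):
    """Return {"ok", "gpus", "error"} describing what nvidia-smi reports."""
    if shutil.which("nvidia-smi") is None:
        return {"ok": False, "gpus": [], "error": "nvidia-smi not found on PATH"}
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except (subprocess.TimeoutExpired, OSError):
        return {"ok": False, "gpus": [], "error": traceback.format_exc()}
    if proc.returncode != 0:
        return {"ok": False, "gpus": [], "error": proc.stderr.strip()}
    gpus = parse_gpu_rows(proc.stdout)
    return {"ok": bool(gpus), "gpus": gpus, "error": None if gpus else "no GPUs reported"}


def cuda_compute_check(device_index=0):
    """Run a small matmul on one GPU (assumes PyTorch; reports if it is missing)."""
    try:
        import torch
    except ImportError:
        return {"ok": False, "error": "torch is not installed"}
    if not torch.cuda.is_available():
        return {"ok": False, "error": "CUDA is not available"}
    try:
        x = torch.randn(1024, 1024, device=f"cuda:{device_index}")
        y = x @ x
        torch.cuda.synchronize(device_index)  # surface asynchronous CUDA errors here
        return {"ok": bool(torch.isfinite(y).all().item()), "error": None}
    except Exception:
        return {"ok": False, "error": traceback.format_exc()}
```

Returning a dictionary instead of raising keeps a single bad GPU from aborting the whole run, so the error trace can be reported alongside the results from healthy devices.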

Components

  • diagnostics/actor.py: Independent Ray actor for running GPU and NCCL tests
  • diagnostics/runner.py: Orchestrator that spawns diagnostic actors and aggregates results
  • main.py: Standalone entry point for running diagnostics as a Ray job
  • job.yaml: Ray job configuration for multi-node GPU clusters
  • pyproject.toml: Dependencies managed via uv
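
To make the split between the actor and the runner concrete, here is a rough sketch of the aggregation step, written without Ray so it runs anywhere. The DiagnosticResult record and the aggregate function are hypothetical names for this example and may not match what diagnostics/runner.py defines; the fields mirror the reporting items listed under Features (node IP, GPU ID, metrics, error trace).

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DiagnosticResult:
    """One test outcome reported by one diagnostic actor (hypothetical schema)."""
    node_ip: str
    gpu_id: int
    test_name: str
    passed: bool
    metrics: dict = field(default_factory=dict)
    error: Optional[str] = None


def aggregate(results):
    """Summarize results and group the failures by node IP."""
    failures_by_node = defaultdict(list)
    for r in results:
        if not r.passed:
            failures_by_node[r.node_ip].append(r)
    passed = sum(1 for r in results if r.passed)
    return {
        "total": len(results),
        "passed": passed,
        "failed": len(results) - passed,
        "failures_by_node": dict(failures_by_node),
    }


# Example with made-up results from two nodes.
sample = [
    DiagnosticResult("10.0.0.1", 0, "gpu_health", True),
    DiagnosticResult("10.0.0.1", 1, "gpu_health", True),
    DiagnosticResult("10.0.0.2", 0, "gpu_health", False, error="CUDA error"),
]
summary = aggregate(sample)
print(summary["failed"], sorted(summary["failures_by_node"]))
```

Grouping failures by node makes it easier to tell a single bad host apart from a cluster-wide problem.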

Usage

# Run diagnostics on 4 GPUs (default)
uv run --isolated main.py

# Run on 8 GPUs with custom timeout
uv run --isolated main.py --num-gpus 8 --timeout 180
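
For reference, the two flags used above could be parsed along these lines. This is a sketch, not the parser in main.py: build_parser is a hypothetical name, the default of 4 GPUs comes from the comment above, and the timeout default of 120 and its unit (seconds) are assumptions.

```python
import argparse


def build_parser():
    """Build a CLI parser for the two flags shown in the usage examples."""
    parser = argparse.ArgumentParser(
        description="Run GPU and NCCL diagnostics on a Ray cluster."
    )
    parser.add_argument(
        "--num-gpus",
        type=int,
        default=4,  # default stated in the usage comment
        help="Number of GPUs to run diagnostics on.",
    )
    parser.add_argument(
        "--timeout",
        type=float,
        default=120.0,  # ASSUMPTION: the real default and unit are not stated
        help="Time limit for the diagnostics run (assumed to be seconds).",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"num_gpus={args.num_gpus} timeout={args.timeout}")
```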

Use Case

This diagnostic tool helps identify:

  • Faulty GPUs that fail CUDA operations
  • NCCL communication issues between GPUs
  • Node-specific hardware or network problems
  • Root causes of distributed training failures
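
One way to read the three test tiers together is sketched below. The triage function and its wording are a hypothetical heuristic for this example, not logic taken from the PR: it checks the tiers in order (GPU health, then intra-node, then inter-node) and reports the first one that fails, since a failure in an earlier tier usually explains failures in the later ones.

```python
def triage(health_ok, intra_node_ok, inter_node_ok):
    """Map the pass/fail outcome of each test tier to a likely problem area."""
    if not health_ok:
        return "gpu: device failed the nvidia-smi or CUDA compute check"
    if not intra_node_ok:
        return "intra-node: NCCL communication between GPUs on one node failed"
    if not inter_node_ok:
        return "inter-node: NCCL communication across nodes failed"
    return "none: all diagnostics passed"


print(triage(health_ok=True, intra_node_ok=True, inter_node_ok=False))
```

A clean result here does not rule out an application-level bug; it only indicates that the hardware and NCCL paths exercised by the tests responded correctly.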

@xyuzh force-pushed the add_stop_time_diagnostics branch 6 times, most recently from 03b823c to 66ba3f5 (December 9, 2025 19:03)
@xyuzh force-pushed the add_stop_time_diagnostics branch from 66ba3f5 to 4121179 (December 9, 2025 21:58)