Multi-GPU Parallel Processing in SD.Next

This document describes the modifications made to enable parallel GPU processing across multiple graphics cards in SD.Next. These changes allow you to utilize all available GPUs as a single logical unit, combining their VRAM and processing power for generating high-resolution images.

💡 Tip
If you run into issues, check out the official FAQ as well as Troubleshooting and Debugging guides.

Overview

The modifications enhance SD.Next's device management to support data parallelism using PyTorch's `DataParallel`. When enabled, models (particularly the UNet and VAE) are automatically distributed across all available GPUs. This approach:

- Treats all GPUs as a single logical unit, combining their VRAM and compute power.
- Splits each generation batch across the cards, with PyTorch handling scatter/gather and synchronization.
- Activates automatically at startup, with no manual configuration required.

Modified Files & Key Changes

1. modules/devices.py Additions

The following functions and variables were added to the end of the file, preserving all original functions:

New Global Variables

```python
_parallel_enabled = False       # Tracks if parallel mode is active
_parallel_device_ids = []       # List of GPU IDs being used
_parallel_models = {}           # Stores wrapped parallel models
```

Core Control Functions

| Function | Description |
| --- | --- |
| `enable_parallel_gpus(device_ids=None)` | Activates parallel mode. If `device_ids` is `None`, all detected GPUs are used. Returns `True` on success. |
| `disable_parallel_gpus()` | Deactivates parallel mode and clears GPU memory. |
| `is_parallel_enabled()` | Returns the current status of parallel mode. |
| `get_parallel_device_ids()` | Returns the list of GPU IDs currently used in parallel. |
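A minimal sketch of how these control functions can manage their shared state. This is illustrative, not the actual SD.Next code (which also logs GPU names/VRAM and clears caches); `get_device_count()` is stubbed here so the sketch runs without CUDA:

```python
_parallel_enabled = False
_parallel_device_ids = []

def get_device_count():
    # In SD.Next this wraps torch.cuda.device_count(); stubbed for illustration.
    return 2

def enable_parallel_gpus(device_ids=None):
    """Activate parallel mode; default to all detected GPUs."""
    global _parallel_enabled, _parallel_device_ids
    if device_ids is None:
        device_ids = list(range(get_device_count()))
    if not device_ids:
        return False  # no GPUs available, nothing to enable
    _parallel_device_ids = device_ids
    _parallel_enabled = True
    return True

def disable_parallel_gpus():
    """Deactivate parallel mode and forget the device list."""
    global _parallel_enabled, _parallel_device_ids
    _parallel_enabled = False
    _parallel_device_ids = []

def is_parallel_enabled():
    return _parallel_enabled

def get_parallel_device_ids():
    return list(_parallel_device_ids)
```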

Model Parallelization

| Function | Description |
| --- | --- |
| `parallelize_model(model, model_name="default", device_ids=None)` | Wraps a given PyTorch model (`torch.nn.Module`) with `DataParallel`, distributing it across the specified (or all) GPUs. |
| `parallelize_unet(sd_model)` | Specifically targets a loaded Stable Diffusion model, automatically parallelizing its UNet and VAE components. |

Extended Original Functions

In addition, the existing `torch_gc()` function was modified so that garbage collection clears VRAM on all parallel GPUs rather than only the primary device.

2. webui.py Integration

Two key sections were added to the main launch script to automate the process:

A. Automatic Activation on Startup

Inside the start_common() function, we added a block that detects available GPUs and enables parallel mode if two or more are found:

```python
def start_common():
    log.debug('Entering start sequence')

    # === AUTO-ENABLE PARALLEL GPUS ===
    try:
        gpu_count = modules.devices.get_device_count()
        if gpu_count >= 2:
            modules.devices.enable_parallel_gpus()
            log.info(f"🎮 {gpu_count} GPU parallel mode active!")
        else:
            log.info(f"🎮 {gpu_count} GPU(s) detected, running in single-GPU mode")
    except Exception as e:
        log.debug(f"Parallel GPU initialization failed: {e}")
    # === END OF PARALLEL GPU SETUP ===

    # ... rest of the original function ...
```

📝 Note: The threshold is set to >= 2 GPUs. If you have 5, all 5 will be used.

B. Model Parallelization After Loading

After the model is loaded in the webui() function, we automatically parallelize its core components:

```python
def webui(restart=False):
    # ... existing code ...
    load_model()

    # === PARALLELIZE LOADED MODEL ===
    try:
        if modules.devices.is_parallel_enabled() and shared.sd_model is not None:
            modules.devices.parallelize_unet(shared.sd_model)
            log.info("Model UNet/VAE converted to parallel mode")
    except Exception as e:
        log.debug(f"Model parallelization error: {e}")
    # === END OF MODEL PARALLELIZATION ===

    # ... rest of the original function ...
```

How It Works Technically

  1. Detection: During SD.Next startup, get_device_count() queries PyTorch for the number of available CUDA GPUs.
  2. Activation: If the count meets the threshold, enable_parallel_gpus() is called. This function:
    • Stores the list of GPU IDs.
    • Performs an initial cache cleanup on each GPU.
    • Logs the name and VRAM of each detected GPU.
  3. Model Wrapping: When a model is loaded (e.g., a Stable Diffusion checkpoint), parallelize_unet(sd_model) is invoked. It:
    • Extracts the UNet and VAE submodules.
    • Moves them to the primary GPU (the first one in the list).
    • Wraps them with torch.nn.DataParallel, which handles splitting the input data across GPUs and gathering the results.
  4. Inference: During image generation, the wrapped UNet automatically distributes its workload. PyTorch handles the communication and synchronization between GPUs.
  5. Memory Management: The modified torch_gc function ensures that when garbage collection is triggered (e.g., after generation), it clears VRAM on all parallel GPUs, preventing memory fragmentation.
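As a sketch, the extended `torch_gc` described in step 5 can iterate over every device and clear each one's CUDA cache (assumed shape; the real SD.Next function handles additional backends and bookkeeping):

```python
import gc
import torch

def torch_gc():
    # Run Python's collector first so unreferenced tensors are actually released.
    gc.collect()
    if torch.cuda.is_available():
        # Clear the CUDA caching allocator on every GPU, not just the current one.
        for device_id in range(torch.cuda.device_count()):
            with torch.cuda.device(device_id):
                torch.cuda.empty_cache()
                torch.cuda.ipc_collect()

torch_gc()  # safe no-op on CPU-only machines
```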

Benefits & Expected Outcomes

- Combined VRAM: larger latent tensors can be split across the cards, making higher-resolution generation feasible.
- Higher batch throughput: each batch is divided among the GPUs, so multi-image generations scale with the number of cards.
- Zero configuration: parallel mode is detected and enabled automatically at startup.

Important Considerations

⚠️ VRAM Usage: While total available memory increases, the model weights are replicated on each GPU, so there is no per-card saving on the model itself. The extra capacity comes from activations: larger tensors (such as bigger latent images) are split across the cards during the forward pass.
⚠️ Communication Overhead: DataParallel scatters inputs and gathers outputs across the PCIe bus on every forward pass (gradient synchronization only applies during training). For very fast single-GPU inference, this can introduce a small overhead. The benefit is most pronounced for large images or batches.
⚠️ Batch Size: To fully utilize parallel GPUs, increase your generation batch size. The batch is automatically split among the GPUs. A batch size of 5 on 5 GPUs would send one item per GPU.
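To illustrate the split, DataParallel's scatter step behaves like `torch.chunk` along the batch dimension. The sketch below assumes a hypothetical 5-GPU machine and a 5-item batch of 64x64 latents:

```python
import torch

batch = torch.randn(5, 4, 64, 64)  # 5 latent images, 4 channels each
num_gpus = 5                       # hypothetical 5-GPU machine
shards = torch.chunk(batch, num_gpus, dim=0)
# Each shard holds one item; DataParallel would dispatch one shard per GPU
# and gather the results back on the primary device after the forward pass.
```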

Further Reading & Resources

For more general information about SD.Next features, configuration, and advanced usage, please refer to the official documentation:

➡️ SD.Next Official Documentation

Key sections that might be helpful after enabling multi-GPU: