Multi-GPU Parallel Processing in SD.Next

This document describes the modifications made to enable parallel GPU processing across multiple graphics cards in SD.Next. These changes allow you to utilize all available GPUs as a single logical unit, combining their VRAM and processing power for generating high-resolution images.

💡 Tip
If you run into issues, check out the official FAQ as well as Troubleshooting and Debugging guides.

Overview

The modifications enhance SD.Next's device management to support data parallelism using PyTorch's `DataParallel`. When enabled, models (particularly the UNet and VAE) are automatically distributed across all available GPUs. This approach:

- Treats all GPUs as a single logical unit, combining their VRAM and compute power.
- Splits each generation batch across the cards, with PyTorch handling scatter/gather and synchronization.
- Activates automatically at startup, with no manual configuration required.

Modified Files & Key Changes

1. modules/devices.py Additions

The following functions and variables were added to the end of the file, preserving all original functions:

New Global Variables

```python
_parallel_enabled = False       # Tracks if parallel mode is active
_parallel_device_ids = []       # List of GPU IDs being used
_parallel_models = {}           # Stores wrapped parallel models
```

Core Control Functions

| Function | Description |
| --- | --- |
| `enable_parallel_gpus(device_ids=None)` | Activates parallel mode. If `device_ids` is `None`, all detected GPUs are used. Returns `True` on success. |
| `disable_parallel_gpus()` | Deactivates parallel mode and clears GPU memory. |
| `is_parallel_enabled()` | Returns the current status of parallel mode. |
| `get_parallel_device_ids()` | Returns the list of GPU IDs currently used in parallel. |
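A minimal sketch of how these control functions can manage their shared state. This is illustrative, not the actual SD.Next code (which also logs GPU names/VRAM and clears caches); `get_device_count()` is stubbed here so the sketch runs without CUDA:

```python
_parallel_enabled = False
_parallel_device_ids = []

def get_device_count():
    # In SD.Next this wraps torch.cuda.device_count(); stubbed for illustration.
    return 2

def enable_parallel_gpus(device_ids=None):
    """Activate parallel mode; default to all detected GPUs."""
    global _parallel_enabled, _parallel_device_ids
    if device_ids is None:
        device_ids = list(range(get_device_count()))
    if not device_ids:
        return False  # no GPUs available, nothing to enable
    _parallel_device_ids = device_ids
    _parallel_enabled = True
    return True

def disable_parallel_gpus():
    """Deactivate parallel mode and forget the device list."""
    global _parallel_enabled, _parallel_device_ids
    _parallel_enabled = False
    _parallel_device_ids = []

def is_parallel_enabled():
    return _parallel_enabled

def get_parallel_device_ids():
    return list(_parallel_device_ids)
```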

Model Parallelization

| Function | Description |
| --- | --- |
| `parallelize_model(model, model_name="default", device_ids=None)` | Wraps a given PyTorch model (`torch.nn.Module`) with `DataParallel`, distributing it across the specified (or all) GPUs. |
| `parallelize_unet(sd_model)` | Specifically targets a loaded Stable Diffusion model, automatically parallelizing its UNet and VAE components. |

Extended Original Functions

In addition, the existing `torch_gc()` function was modified so that garbage collection clears VRAM on all parallel GPUs rather than only the primary device.

2. webui.py Integration

Two key sections were added to the main launch script to automate the process:

A. Automatic Activation on Startup

Inside the start_common() function, we added a block that detects available GPUs and enables parallel mode if two or more are found:

```python
def start_common():
    log.debug('Entering start sequence')

    # === AUTO-ENABLE PARALLEL GPUS ===
    try:
        gpu_count = modules.devices.get_device_count()
        if gpu_count >= 2:
            modules.devices.enable_parallel_gpus()
            log.info(f"🎮 {gpu_count} GPU parallel mode active!")
        else:
            log.info(f"🎮 {gpu_count} GPU(s) detected, running in single-GPU mode")
    except Exception as e:
        log.debug(f"Parallel GPU initialization failed: {e}")
    # === END OF PARALLEL GPU SETUP ===

    # ... rest of the original function ...
```

📝 Note: The threshold is set to >= 2 GPUs. If you have 5, all 5 will be used.

B. Model Parallelization After Loading

After the model is loaded in the webui() function, we automatically parallelize its core components:

```python
def webui(restart=False):
    # ... existing code ...
    load_model()

    # === PARALLELIZE LOADED MODEL ===
    try:
        if modules.devices.is_parallel_enabled() and shared.sd_model is not None:
            modules.devices.parallelize_unet(shared.sd_model)
            log.info("Model UNet/VAE converted to parallel mode")
    except Exception as e:
        log.debug(f"Model parallelization error: {e}")
    # === END OF MODEL PARALLELIZATION ===

    # ... rest of the original function ...
```

How It Works Technically

  1. Detection: During SD.Next startup, get_device_count() queries PyTorch for the number of available CUDA GPUs.
  2. Activation: If the count meets the threshold, enable_parallel_gpus() is called. This function:
    • Stores the list of GPU IDs.
    • Performs an initial cache cleanup on each GPU.
    • Logs the name and VRAM of each detected GPU.
  3. Model Wrapping: When a model is loaded (e.g., a Stable Diffusion checkpoint), parallelize_unet(sd_model) is invoked. It:
    • Extracts the UNet and VAE submodules.
    • Moves them to the primary GPU (the first one in the list).
    • Wraps them with torch.nn.DataParallel, which handles splitting the input data across GPUs and gathering the results.
  4. Inference: During image generation, the wrapped UNet automatically distributes its workload. PyTorch handles the communication and synchronization between GPUs.
  5. Memory Management: The modified torch_gc function ensures that when garbage collection is triggered (e.g., after generation), it clears VRAM on all parallel GPUs, preventing memory fragmentation.
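As a sketch, the extended `torch_gc` described in step 5 can iterate over every device and clear each one's CUDA cache (assumed shape; the real SD.Next function handles additional backends and bookkeeping):

```python
import gc
import torch

def torch_gc():
    # Run Python's collector first so unreferenced tensors are actually released.
    gc.collect()
    if torch.cuda.is_available():
        # Clear the CUDA caching allocator on every GPU, not just the current one.
        for device_id in range(torch.cuda.device_count()):
            with torch.cuda.device(device_id):
                torch.cuda.empty_cache()
                torch.cuda.ipc_collect()

torch_gc()  # safe no-op on CPU-only machines
```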

Benefits & Expected Outcomes

- Combined VRAM: larger latent tensors can be split across the cards, making higher-resolution generation feasible.
- Higher batch throughput: each batch is divided among the GPUs, so multi-image generations scale with the number of cards.
- Zero configuration: parallel mode is detected and enabled automatically at startup.

Important Considerations

⚠️ VRAM Usage: While total available memory increases, the model weights are replicated on each GPU, so there is no per-card saving on the model itself. The extra capacity comes from activations: larger tensors (such as bigger latent images) are split across the cards during the forward pass.
⚠️ Communication Overhead: DataParallel scatters inputs and gathers outputs across the PCIe bus on every forward pass (gradient synchronization only applies during training). For very fast single-GPU inference, this can introduce a small overhead. The benefit is most pronounced for large images or batches.
⚠️ Batch Size: To fully utilize parallel GPUs, increase your generation batch size. The batch is automatically split among the GPUs. A batch size of 5 on 5 GPUs would send one item per GPU.
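To illustrate the split, DataParallel's scatter step behaves like `torch.chunk` along the batch dimension. The sketch below assumes a hypothetical 5-GPU machine and a 5-item batch of 64x64 latents:

```python
import torch

batch = torch.randn(5, 4, 64, 64)  # 5 latent images, 4 channels each
num_gpus = 5                       # hypothetical 5-GPU machine
shards = torch.chunk(batch, num_gpus, dim=0)
# Each shard holds one item; DataParallel would dispatch one shard per GPU
# and gather the results back on the primary device after the forward pass.
```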

Further Reading & Resources

For more general information about SD.Next features, configuration, and advanced usage, please refer to the official documentation:

➡️ SD.Next Official Documentation

Key sections that might be helpful after enabling multi-GPU: