Getting Started ================ This guide will walk you through creating a complete training setup using AccelHydra, using the MNIST classification task as an example. Project Structure ----------------- The example project has the following structure: .. code-block:: text my_project/ ├── train_launcher.py # Optional: Custom TrainLauncher (if needed) ├── model.py # Model definition ├── trainer.py # Custom Trainer class ├── data.py # Dataset definitions └── configs/ ├── train.yaml # Main configuration file └── basic.yaml # Base configuration Only ``configs/train.yaml`` is necessary to launch training. You can use the built-in :class:`accel_hydra.TrainLauncher` directly, or create a custom one if you need to customize the training setup. For other components like models and datasets, the definition location is not important. The only requirement is that their configuration in ``configs/train.yaml`` can be instantiated by ``hydra.utils.instantiate()``. Step 1: Define Your Model ------------------------- Create ``model.py`` and define your PyTorch model: .. code-block:: python import torch import torch.nn as nn class Model(nn.Module): def __init__(self) -> None: super().__init__() self.conv1 = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3) self.pool1 = nn.MaxPool2d(kernel_size=2, stride=2) self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=5) self.pool2 = nn.MaxPool2d(kernel_size=2, stride=2) self.dropout = nn.Dropout2d(p=0.1) self.adaptive_pool = nn.AdaptiveMaxPool2d((1, 1)) self.flatten = nn.Flatten() self.linear1 = nn.Linear(64, 128) self.relu = nn.ReLU() self.linear2 = nn.Linear(128, 10) def forward(self, x: torch.Tensor) -> torch.Tensor: x = self.conv1(x) x = self.pool1(x) x = self.conv2(x) x = self.pool2(x) x = self.dropout(x) x = self.adaptive_pool(x) x = self.flatten(x) x = self.linear1(x) x = self.relu(x) x = self.linear2(x) return x Step 2: Define Your Datasets ----------------------------- Create ``data.py`` to define datasets. Here we directly use ``torchvision.datasets.MNIST`` for convenience: .. code-block:: python import torchvision from torchvision import transforms def train_dataset(data_root: str): transform = transforms.Compose([transforms.ToTensor()]) dataset = torchvision.datasets.MNIST( root=data_root, train=True, download=True, transform=transform ) return dataset def val_dataset(data_root: str): transform = transforms.Compose([transforms.ToTensor()]) dataset = torchvision.datasets.MNIST( root=data_root, train=False, download=True, transform=transform ) return dataset .. _getting-started-step-3: Step 3: Define Your Trainer ---------------------------- Create ``trainer.py`` and define your Trainer class. You must implement ``training_step`` and ``validation_step``: .. code-block:: python from dataclasses import dataclass from accel_hydra import Trainer @dataclass(kw_only=True) class MnistTrainer(Trainer): def training_step(self, batch, batch_idx): features, labels = batch preds = self.model(features) loss = self.loss_fn(preds, labels) # Log learning rate lr = self.optimizer.param_groups[0]["lr"] self.accelerator.log({"train/lr": lr}, step=self.step) return loss def on_validation_start(self): # Initialize metric accumulators self.validation_stats = {"accurate": 0, "num_elems": 0} def validation_step(self, batch, batch_idx): features, labels = batch preds = self.model(features) predictions = preds.argmax(dim=-1) # Gather predictions from all processes (important for distributed training) output = {"predictions": predictions, "labels": labels} output = self.accelerator.gather_for_metrics(output) # Accumulate metrics accurate_preds = (output["predictions"] == output["labels"]) self.validation_stats["accurate"] += accurate_preds.long().sum() self.validation_stats["num_elems"] += accurate_preds.shape[0] def get_val_metrics(self): return { "accuracy": self.validation_stats["accurate"].item() / self.validation_stats["num_elems"] } def on_validation_end(self): eval_metric = self.validation_stats["accurate"].item() / \ self.validation_stats["num_elems"] self.accelerator.print( f"epoch[{self.epoch}] --> eval_metric= {100 * eval_metric:.2f}%" ) self.accelerator.log({"val/accuracy": eval_metric}, step=self.step) Step 4: Write Configuration YAML(s) ------------------------------------ Create ``configs/basic.yaml`` for base configuration: .. code-block:: yaml defaults: - _self_ - override hydra/hydra_logging: disabled - override hydra/job_logging: disabled data_root: YOUR_DATA_ROOT hydra: output_subdir: null run: dir: . This is optional for better disentanglement. You can also put everything in the following ``configs/train.yaml``. Create ``configs/train.yaml`` for the main training configuration: .. code-block:: yaml defaults: - basic - _self_ seed: 42 exp_name: mnist exp_dir: experiments/${exp_name} epochs: 10 train_dataloader: _target_: torch.utils.data.DataLoader dataset: _target_: data.train_dataset data_root: ${data_root} batch_size: 128 shuffle: true num_workers: 4 val_dataloader: _target_: torch.utils.data.DataLoader dataset: _target_: data.val_dataset data_root: ${data_root} batch_size: 128 optimizer: _target_: torch.optim.AdamW lr: 0.0001 lr_scheduler: _target_: torch.optim.lr_scheduler.OneCycleLR max_lr: 0.0025 epochs: ${epochs} model: _target_: model.Model loss_fn: _target_: torch.nn.CrossEntropyLoss trainer: _target_: trainer.MnistTrainer project_dir: ${exp_dir} logging_config: _target_: accel_hydra.trainer.LoggingConfig report_to: tensorboard project: runs save_dir: ${exp_dir} name: ${exp_name} gradient_accumulation_steps: 4 epochs: ${epochs} epoch_length: Null save_every_n_steps: 500 save_every_n_epochs: 1 save_last_k: 2 metric_monitor: _target_: accel_hydra.MetricMonitor metric_name: accuracy mode: max The configuration uses Hydra's ``_target_`` pattern to specify class instantiation. This allows flexible configuration without hardcoding imports. However, this also makes it difficult to figure out returned classes in the code, so we recommend using typing annotations to improve the readability of the code. Step 5: Launch Training (Optional: Customize with TrainLauncher) ----------------------------------------------------------------- AccelHydra provides a common training launcher that handles the standard training launch workflow (reading configuration, instantiating datasets and models, start training). You can directly use it for basic projects. Option 1: Use the Default TrainLauncher (Recommended for Simple Cases) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If your project follows the standard training workflow, you can directly use the built-in :class:`accel_hydra.TrainLauncher`: .. code-block:: bash accelerate launch \ --num_processes 2 \ --num_machines 1 \ --mixed_precision fp16 \ -m accel_hydra.train_entry \ -c configs/train.yaml Option 2: Create a Custom TrainLauncher (For Custom Requirements) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If you need to customize the training setup (e.g., custom dataloader initialization, custom resolver registration), you can create your own launcher by inheriting from :class:`accel_hydra.TrainLauncher`: Create ``train_launcher.py``: .. code-block:: python from typing import Callable from accel_hydra import TrainLauncher def register_my_resolvers(): # Your custom OmegaConf resolver register function ... class MyCustomLauncher(TrainLauncher): @staticmethod def get_register_resolver_fn() -> Callable: """Override to register custom OmegaConf resolvers.""" return register_my_resolvers def get_dataloaders(self): """Override if you need custom dataloader initialization logic.""" # You can customize dataloader creation here ... Then launch training with your custom launcher: .. code-block:: bash accelerate launch \ --num_processes 2 \ --num_machines 1 \ --mixed_precision fp16 \ -m accel_hydra.train_entry \ --launcher train_launcher.MyCustomLauncher \ -c configs/train.yaml The :class:`accel_hydra.TrainLauncher` base class handles: - Configuration loading from command line - Model, optimizer, LR scheduler, and loss function instantiation - Trainer setup and training launch - Resume from checkpoint support By inheriting from :class:`accel_hydra.TrainLauncher`, you can override specific methods to customize the parts you need. Command Line Overrides ~~~~~~~~~~~~~~~~~~~~~~ You can override any configuration value from the command line, following Hydra syntax: .. code-block:: bash accelerate launch \ --num_processes 2 \ --num_machines 1 \ --mixed_precision fp16 \ -m accel_hydra.train_entry \ -c configs/train.yaml \ -o data_root=/path/to/custom/data/root epochs=20 optimizer.lr=0.0005 For more details on Hydra syntax and overrides, see the :doc:`Configuration Guide `. Next Steps ----------- - Learn about :doc:`Trainer hooks and customization ` - Understand :doc:`Hydra configuration patterns ` - Explore :doc:`helper functions and utilities `