DDLSim-Lab is an open-source research project designed to simulate, analyze, and optimize distributed deep learning systems using AI-driven control mechanisms.
The project enables experimentation with large-scale, heterogeneous, and failure-prone environments, including edge-cloud hybrid infrastructures. It provides researchers with a reproducible, cost-free environment to test novel algorithms, scheduling strategies, and fault-tolerance mechanisms without requiring access to expensive physical testbeds.
To produce scientifically valid results, DDLSim-Lab must run on bare-metal infrastructure. Virtualization layers introduce non-deterministic timing noise (for example, from hypervisor scheduling and contention for shared physical resources) that undermines the fidelity of network and performance measurements. Bare-metal deployment ensures: