In our latest research, we've enhanced DDLSim-Lab's fault injection capabilities to support more sophisticated failure models. Beyond traditional latency and packet loss, we now simulate Byzantine failures where nodes exhibit arbitrary behavior, network partitions that split the cluster into isolated segments, and cascading failures where an initial fault triggers a chain reaction across the system.
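For concreteness, the three fault models can be sketched as composable injectors. The class names and parameters below (`ByzantineFault`, `NetworkPartition`, `CascadingFailure`, `spread_prob`) are illustrative assumptions, not DDLSim-Lab's actual API:

```python
import random

# Illustrative sketch of composable fault models; class names and
# signatures are hypothetical, not DDLSim-Lab's actual API.

class ByzantineFault:
    """Affected nodes return arbitrary (here: randomly scaled) gradients."""
    def __init__(self, node_ids):
        self.node_ids = set(node_ids)

    def apply(self, node_id, gradient):
        if node_id in self.node_ids:
            return [g * random.uniform(-10, 10) for g in gradient]
        return gradient  # honest nodes pass gradients through unchanged

class NetworkPartition:
    """Messages crossing the partition boundary are dropped."""
    def __init__(self, group_a, group_b):
        self.group_a, self.group_b = set(group_a), set(group_b)

    def allows(self, src, dst):
        # A message is blocked iff its endpoints lie in different segments.
        crosses = ({src, dst} & self.group_a) and ({src, dst} & self.group_b)
        return not crosses

class CascadingFailure:
    """An initial fault overloads neighbours, which may fail in turn."""
    def __init__(self, topology, spread_prob=0.3):
        self.topology = topology          # node -> list of neighbour nodes
        self.spread_prob = spread_prob    # chance a failure spreads per link

    def propagate(self, initial_failed):
        failed, frontier = set(initial_failed), list(initial_failed)
        while frontier:
            node = frontier.pop()
            for nbr in self.topology.get(node, []):
                if nbr not in failed and random.random() < self.spread_prob:
                    failed.add(nbr)
                    frontier.append(nbr)
        return failed
```

With `spread_prob=1.0`, `propagate` returns every node reachable from the initial failure set, which makes the chain-reaction behavior easy to unit-test deterministically.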
These advanced fault models enable researchers to test the resilience of distributed training algorithms under realistic adverse conditions. Our evaluation shows that traditional synchronization-based approaches fail under Byzantine faults, while newer asynchronous and gradient correction methods demonstrate improved robustness.
DDLSim-Lab reports detailed metrics on training convergence, accuracy degradation, and recovery time, enabling a comprehensive comparison of fault-tolerance strategies.
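The latter two metrics can be computed post hoc from a per-round accuracy trace. The helper functions below are a minimal sketch under that assumption; the names and the recovery tolerance are illustrative, not the simulator's reporting API:

```python
# Hypothetical post-run metrics over a per-round accuracy trace;
# function names and the tolerance default are illustrative.

def accuracy_degradation(trace, fault_round):
    """Drop from pre-fault accuracy to the worst accuracy after the fault."""
    baseline = trace[fault_round - 1]
    return baseline - min(trace[fault_round:])

def recovery_time(trace, fault_round, tolerance=0.01):
    """Rounds after the fault until accuracy is back within `tolerance`
    of the pre-fault baseline, or None if it never recovers."""
    baseline = trace[fault_round - 1]
    for i, acc in enumerate(trace[fault_round:]):
        if acc >= baseline - tolerance:
            return i
    return None
```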
We've implemented a novel reinforcement learning framework within DDLSim-Lab that dynamically allocates computational and network resources across heterogeneous edge-cloud infrastructures. The system treats each node as an agent in a multi-agent reinforcement learning environment, where agents learn to optimize task placement based on real-time measurements of compute availability, network latency, and energy consumption.
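A heavily simplified version of this idea is independent Q-learning per node: each agent discretizes its real-time measurements into a state and learns placement values via temporal-difference updates. The sketch below makes several assumptions for illustration (the state discretization, the epsilon-greedy policy, and all names are hypothetical; the actual agents are more elaborate):

```python
import random
from collections import defaultdict

# Simplified sketch of an independent Q-learning placement agent.
# State discretization, hyperparameters, and names are assumptions
# made for illustration, not DDLSim-Lab's actual implementation.

class PlacementAgent:
    def __init__(self, targets, epsilon=0.1, alpha=0.5, gamma=0.9):
        self.targets = targets              # candidate nodes for the next task
        self.q = defaultdict(float)         # (state, target) -> learned value
        self.epsilon, self.alpha, self.gamma = epsilon, alpha, gamma

    def observe(self, compute_free, latency_ms, energy_j):
        # Coarsely discretize real-time measurements into a state tuple.
        return (round(compute_free, 1), latency_ms // 10, energy_j // 5)

    def choose(self, state):
        if random.random() < self.epsilon:
            return random.choice(self.targets)   # explore
        return max(self.targets, key=lambda t: self.q[(state, t)])

    def update(self, state, target, reward, next_state):
        # Standard Q-learning temporal-difference update.
        best_next = max(self.q[(next_state, t)] for t in self.targets)
        td = reward + self.gamma * best_next - self.q[(state, target)]
        self.q[(state, target)] += self.alpha * td
```

In a real deployment the reward would blend completion time and energy cost, which is what drives the trade-off reported below.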
Our experiments show that the AI-driven scheduler reduces average training completion time by 35% compared to static partitioning, while decreasing energy consumption by 22% through intelligent workload placement. The scheduler adapts rapidly to changing network conditions, demonstrating robustness in volatile environments typical of wireless edge networks.
This work bridges the gap between theoretical AI scheduling algorithms and practical deployment in real-world distributed AI systems.
DDLSim-Lab now supports realistic simulation of edge AI scenarios where thousands of constrained devices with intermittent connectivity participate in federated learning. The simulator models device-level characteristics including limited computational power, memory constraints, battery levels, and volatile wireless connections.
Researchers can define custom device profiles that represent real-world IoT deployments, from smart sensors with millisecond-range processing capabilities to more powerful gateways. The simulation captures the straggler effect where slow devices delay global model aggregation, and evaluates various mitigation strategies such as asynchronous updates, device clustering, and prediction-based aggregation.
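A device profile and the straggler effect can be sketched as follows; the fields, thresholds, and function names are assumptions for illustration, not DDLSim-Lab's profile schema:

```python
from dataclasses import dataclass
import random

# Illustrative device profile and straggler-aware round simulation;
# fields and thresholds are assumptions, not DDLSim-Lab's schema.

@dataclass
class DeviceProfile:
    name: str
    flops: float            # sustained compute throughput, FLOP/s
    memory_mb: int
    battery_pct: float
    dropout_prob: float     # chance of losing connectivity mid-round

def round_time(profile, workload_flops, rng=random):
    """Seconds a device needs for one local round, or None if it drops out."""
    if profile.battery_pct <= 5 or rng.random() < profile.dropout_prob:
        return None
    return workload_flops / profile.flops

def synchronous_round(profiles, workload_flops, rng=random):
    """Synchronous aggregation waits for the slowest surviving device,
    which is exactly the straggler effect."""
    times = [t for p in profiles
             if (t := round_time(p, workload_flops, rng)) is not None]
    return max(times) if times else None
```

Running this with one slow sensor and one fast gateway shows the round time pinned to the sensor, the baseline against which asynchronous updates, device clustering, and prediction-based aggregation are evaluated.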
Our findings show that traditional federated averaging (FedAvg) performs poorly under high device heterogeneity, while adaptive weighting schemes and momentum-based corrections significantly improve convergence speed and final model accuracy.
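To make the contrast concrete, the sketch below implements plain FedAvg next to one possible adaptive weighting scheme that down-weights stale updates. The staleness-decay rule is an illustrative assumption, not the specific scheme evaluated here:

```python
# Minimal sketch contrasting FedAvg with an adaptive weighting scheme;
# the staleness-based decay is one illustrative choice, not the exact
# scheme evaluated in the text.

def fedavg(updates, num_samples):
    """Classic FedAvg: weight each client update by its sample count."""
    total = sum(num_samples)
    dim = len(updates[0])
    return [sum(u[i] * n for u, n in zip(updates, num_samples)) / total
            for i in range(dim)]

def adaptive_avg(updates, num_samples, staleness, decay=0.5):
    """Down-weight stale updates: weight = samples * decay**staleness."""
    weights = [n * decay ** s for n, s in zip(num_samples, staleness)]
    total = sum(weights)
    dim = len(updates[0])
    return [sum(u[i] * w for u, w in zip(updates, weights)) / total
            for i in range(dim)]
```

Under high heterogeneity, stale stragglers pull the FedAvg average toward outdated models; the adaptive variant shrinks their influence, which is the mechanism behind the improved convergence reported above.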