Block device incidents are frequent and frightening in production. To achieve high availability, distributed databases usually employ fault-tolerance strategies to handle these incidents. As systems grow more complicated, however, it becomes hard even to verify that these strategies work. For example, if executing and distributing scheduling commands also depends on the hanging storage, the strategy will have no effect.
To help developers verify how their databases behave under storage disasters, we designed IOEM, a Linux IO scheduler that emulates the properties of block devices.
IOEM allows developers to specify latency and IOPS per block device and per process, making it possible to emulate a complex cluster with limited resources. Developers can emulate a wide range of storage device properties with little overhead. Beyond emulating incidents, IOEM can also be used to measure the performance of databases on low-end storage devices.
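To make the idea of per-device, per-process limits concrete, here is a minimal user-space sketch in Python. It is not IOEM's implementation or interface: the names `Rule`, `Emulator`, and `complete_at` are hypothetical, and the model (a fixed added delay plus simple pacing at `1 / iops` seconds between dispatches, on a simulated clock) is an assumption chosen only to illustrate how a latency and an IOPS cap might combine.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    """Hypothetical emulation rule for one (device, pid) pair."""
    latency_s: float = 0.0  # extra delay added to every request, in seconds
    iops: float = 0.0       # maximum requests per second; 0 means unlimited


@dataclass
class Emulator:
    """Toy model on a simulated clock; it performs no real IO."""
    rules: dict = field(default_factory=dict)      # (device, pid) -> Rule
    next_slot: dict = field(default_factory=dict)  # (device, pid) -> earliest dispatch time

    def complete_at(self, device: str, pid: int, submit_s: float) -> float:
        """Return the simulated completion time of a request submitted at submit_s."""
        key = (device, pid)
        rule = self.rules.get(key)
        if rule is None:
            # No rule for this device/process: the request is not delayed.
            return submit_s
        # A request is dispatched no earlier than the next free pacing slot.
        dispatch = max(submit_s, self.next_slot.get(key, 0.0))
        if rule.iops > 0:
            self.next_slot[key] = dispatch + 1.0 / rule.iops
        # The configured latency is added on top of the dispatch time.
        return dispatch + rule.latency_s


if __name__ == "__main__":
    # Assumed example values: process 100 on "sda" gets 0.5 s latency and 2 IOPS.
    emu = Emulator(rules={("sda", 100): Rule(latency_s=0.5, iops=2)})
    print([emu.complete_at("sda", 100, 0.0) for _ in range(3)])
    print(emu.complete_at("sda", 200, 0.0))  # another process on the same device
```

In this sketch, three requests submitted at the same instant by the limited process complete progressively later, while a different process on the same device is unaffected. That isolation is what makes it plausible to emulate several nodes with different storage characteristics on one machine.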
In this talk, Yang Keao will show a real-world example in which the fault-tolerance strategies failed, and explain how the failure was reproduced in a development environment with the help of IOEM. He will also introduce the structure and implementation of IOEM, compare it with other implementations of block device latency injection, and describe the convenience it brings to database development.