why does disk I/O matter?
disk I/O performance is the speed at which data is read from or written to storage. poor disk performance directly affects response times and can cause bottlenecks even in high-end systems.
as systems scale, slow response times aren't always caused by high CPU usage or memory pressure. often, the root cause is disk I/O latency.
Netflix, in the early days of its cloud migration to AWS, faced challenges with disk I/O performance. specifically, disk latency and IOPS weren't sufficient to handle peak streaming demand.
some of the challenges they encountered were:
🔴 I/O bottlenecks in the cloud - before moving to an SSD instance, limited disk throughput was a major hurdle for running I/O-intensive apps in the cloud.
🔴 strained disk performance - Netflix's Cassandra on earlier EC2 types (m2.4xlarge) struggled with I/O limits.
🔴 reliance on caching - to keep up with read demand, they deployed a large Memcached tier and leveraged ample RAM on Cassandra nodes to cache data, reducing disk hits.
so they took a benchmarking approach: they first ran low-level disk benchmarks on AWS's then-new hi1.4xlarge SSD instance to measure what it could actually deliver. the tests showed that the instance could sustain over 100,000 IOPS and ~1 GB/s of throughput with very low latency (20 to 60 microseconds).
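to make the two headline metrics concrete, here is a minimal sketch of a random-read micro-benchmark that reports IOPS and latency. it is not the method Netflix used; the function names, block size, and file size are assumptions chosen for illustration. it also reads through the OS page cache, so on a warm file the numbers reflect memory rather than the device. a serious benchmark would use direct I/O and a purpose-built tool such as fio.

```python
import os
import random
import tempfile
import time


def make_test_file(size_bytes=8 * 1024 * 1024):
    """Create a temporary file of random bytes to read from (hypothetical helper)."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(os.urandom(size_bytes))
        return f.name


def random_read_benchmark(path, block_size=4096, num_reads=2000, seed=0):
    """Issue random block-aligned reads; return IOPS and latency in microseconds."""
    blocks = os.path.getsize(path) // block_size
    if blocks == 0:
        raise ValueError("file is smaller than one block")

    rng = random.Random(seed)
    latencies = []
    fd = os.open(path, os.O_RDONLY)
    try:
        start = time.perf_counter()
        for _ in range(num_reads):
            offset = rng.randrange(blocks) * block_size
            t0 = time.perf_counter()
            os.pread(fd, block_size, offset)  # POSIX-only positional read
            latencies.append(time.perf_counter() - t0)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)

    latencies.sort()
    p99 = latencies[min(len(latencies) - 1, int(0.99 * len(latencies)))]
    return {
        "iops": num_reads / elapsed,                        # operations per second
        "mean_us": sum(latencies) / len(latencies) * 1e6,   # average latency
        "p99_us": p99 * 1e6,                                # tail latency
    }
```

calling `random_read_benchmark(make_test_file())` returns a dict with `iops`, `mean_us`, and `p99_us`. the single-threaded loop keeps only one I/O in flight, so it measures latency more faithfully than peak IOPS, which usually needs many concurrent requests.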
key findings from the benchmark:
✅ SSDs lived up to the hype, and the results validated AWS's claims
✅ ample I/O headroom
✅ bottleneck shifted to CPU - the SSDs provided so much I/O capacity that the cluster's performance was now constrained by processing power rather than disk speed
✅ dramatically lower latency
sources:
1️⃣ (Benchmarking High Performance I/O with SSD for Cassandra on AWS) https://lnkd.in/dvdBmdtE
2️⃣ (JMeter Plugin for Cassandra) https://lnkd.in/dFYc9p4N
we still face disk I/O challenges today, perhaps even more than before, due to rapidly growing data volumes and rising user expectations.
modern disk I/O challenges include:
🔴 cloud variability - even with high-performance SSD-based storage, disk performance can fluctuate significantly in shared multi-tenant environments
🔴 microservices and containerization - modern stacks built on microservices often multiply disk I/O operations
🔴 database and analytics workloads - heavy database operations (for example, analytics with Snowflake) can still create significant disk I/O bottlenecks
so don't stop monitoring, don't assume, and stay curious...
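as one way to keep monitoring, here is a minimal sketch that turns two snapshots of Linux's /proc/diskstats into per-device IOPS and throughput. it assumes a Linux host; the function names are hypothetical, and only the first ten fields of each line are used (completed reads and writes, plus sectors read and written, which the kernel counts in 512-byte units).

```python
import time

SECTOR_BYTES = 512  # /proc/diskstats reports sectors in 512-byte units


def parse_diskstats(text):
    """Map device name -> (reads, writes, sectors_read, sectors_written)."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 10:
            continue  # skip blank or malformed lines
        # fields: major, minor, name, reads completed, reads merged,
        # sectors read, ms reading, writes completed, writes merged,
        # sectors written, ...
        stats[fields[2]] = (
            int(fields[3]), int(fields[7]), int(fields[5]), int(fields[9])
        )
    return stats


def io_rates(before, after, interval_s):
    """Per-device IOPS and MB/s between two parsed snapshots."""
    rates = {}
    for name, (r1, w1, sr1, sw1) in after.items():
        if name not in before:
            continue  # device appeared between snapshots
        r0, w0, sr0, sw0 = before[name]
        rates[name] = {
            "read_iops": (r1 - r0) / interval_s,
            "write_iops": (w1 - w0) / interval_s,
            "read_mb_s": (sr1 - sr0) * SECTOR_BYTES / 1e6 / interval_s,
            "write_mb_s": (sw1 - sw0) * SECTOR_BYTES / 1e6 / interval_s,
        }
    return rates


def sample(interval_s=1.0, path="/proc/diskstats"):
    """Read the counters twice, interval_s apart, and return the rates."""
    with open(path) as f:
        before = parse_diskstats(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_diskstats(f.read())
    return io_rates(before, after, interval_s)
```

calling `sample(5.0)` in a loop and logging the result gives a rough baseline to compare against when response times degrade. it is a starting point rather than a replacement for tools like iostat or a proper metrics pipeline.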