Tech-Gen

why does disk I/O matter?

disk I/O is the reading of data from, and writing of data to, storage. its performance is usually described in terms of latency, IOPS, and throughput. poor disk performance directly affects response times and can cause bottlenecks even in high-end systems.

as systems scale, slow response times aren't always caused by high CPU usage or memory bottlenecks. often, the root cause is disk I/O latency.
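to make "disk I/O latency" concrete, here is a minimal sketch that times small synced writes to a temporary file. the function names, the 4 KiB block size, and the write count are illustrative assumptions, not a standard benchmark, and results depend heavily on the filesystem (on a RAM-backed temp directory, the sync costs almost nothing).

```python
import os
import statistics
import tempfile
import time


def measure_write_latency(num_writes=50, block_size=4096):
    """Return per-write latencies (seconds) for synced writes of block_size bytes.

    num_writes and block_size are illustrative defaults, not tuned values.
    """
    payload = os.urandom(block_size)
    latencies = []
    with tempfile.NamedTemporaryFile() as tmp:
        fd = tmp.fileno()
        for _ in range(num_writes):
            start = time.perf_counter()
            os.write(fd, payload)
            # Ask the OS to flush to the device, not just the page cache.
            os.fsync(fd)
            latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Report the median and worst-case latency in microseconds."""
    return {
        "median_us": statistics.median(latencies) * 1e6,
        "max_us": max(latencies) * 1e6,
    }


if __name__ == "__main__":
    print(summarize(measure_write_latency()))
```

for real measurements a dedicated tool such as fio is a better choice. this sketch only shows what is being measured.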

during the early days of its cloud migration to AWS, Netflix faced challenges with disk I/O performance: disk latency and IOPS weren't sufficient to handle peak streaming demands.

some of the challenges they encountered were:

🔴 I/O bottlenecks in the cloud - before moving to SSD-backed instances, limited disk throughput was a major hurdle for running I/O-intensive apps in the cloud.

🔴 strained disk performance - Netflix's Cassandra on earlier EC2 types (m2.4xlarge) struggled with I/O limits.

🔴 reliance on caching - to keep up with read demand, they deployed a large Memcached tier and leveraged ample RAM on Cassandra nodes to cache data, reducing disk hits.

so they took a benchmarking approach: they first ran low-level disk benchmarks on AWS's then-new hi1.4xlarge SSD instance to measure its raw performance. the tests showed that the instance could sustain over 100,000 IOPS and ~1 GB/s of throughput with very low latency (20 to 60 microseconds).
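the three numbers above are related by simple arithmetic, sketched below. throughput is IOPS times request size, and Little's law gives the average number of requests in flight as IOPS times latency. the 4 KiB request size is an assumption for illustration (the request sizes used in the benchmark aren't stated here), and the second calculation assumes the IOPS and latency figures describe the same workload.

```python
def throughput_mb_per_s(iops, block_size_bytes):
    """Throughput in MB/s implied by an IOPS rate at a fixed request size."""
    return iops * block_size_bytes / 1_000_000


def outstanding_ios(iops, latency_seconds):
    """Little's law: average requests in flight = rate * average latency."""
    return iops * latency_seconds


if __name__ == "__main__":
    iops = 100_000

    # Assumed 4 KiB requests: 100,000 IOPS is well under 1 GB/s,
    # so the ~1 GB/s figure implies larger requests.
    print(throughput_mb_per_s(iops, 4096), "MB/s at 4 KiB")

    # At 20 to 60 microseconds per request, only a handful of
    # concurrent requests are needed to sustain that rate.
    for latency_us in (20, 60):
        in_flight = outstanding_ios(iops, latency_us / 1_000_000)
        print(f"{latency_us} us -> about {in_flight:.0f} requests in flight")
```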

key findings from the benchmark:

✅ SSDs lived up to the hype and validated AWS's claims

✅ ample I/O headroom

✅ bottleneck shifted to CPU - the SSDs provided so much I/O capacity that the cluster's performance was now constrained by processing power rather than disk speed

✅ dramatically lower latency

📘 sources:

1️⃣ (Benchmarking High Performance I/O with SSD for Cassandra on AWS) https://lnkd.in/dvdBmdtE

2️⃣ (JMeter Plugin for Cassandra) https://lnkd.in/dFYc9p4N

we still face disk I/O challenges today -- perhaps even more than before due to rapidly increasing data volumes and high user expectations.

modern disk I/O challenges include:

🔴 cloud variability - even with high-performance SSD-based storage, disk performance can fluctuate significantly due to shared multi-tenant environments

🔴 microservices and containerization - modern stacks built on microservices often multiply disk I/O operations

🔴 database and analytics workloads - heavy database operations (for example, analytics with Snowflake) can still create significant disk I/O bottlenecks
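one way to watch for the fluctuations described above on a Linux host is to compare two snapshots of /proc/diskstats. the sketch below parses the completed-operation counts and cumulative milliseconds for one device and derives the average latency over the interval. the class and function names and the "sda" device name are hypothetical, and this assumes the standard Linux diskstats field layout.

```python
import time
from dataclasses import dataclass


@dataclass
class DiskSample:
    reads_completed: int
    ms_reading: int
    writes_completed: int
    ms_writing: int


def parse_diskstats(text, device):
    """Extract cumulative counters for one device from /proc/diskstats text."""
    for line in text.splitlines():
        fields = line.split()
        # Layout: major, minor, name, then the per-device counters.
        if len(fields) >= 11 and fields[2] == device:
            return DiskSample(
                reads_completed=int(fields[3]),
                ms_reading=int(fields[6]),
                writes_completed=int(fields[7]),
                ms_writing=int(fields[10]),
            )
    raise ValueError(f"device {device!r} not found")


def avg_latency_ms(before, after):
    """Average (read, write) latency in ms between two snapshots."""
    reads = after.reads_completed - before.reads_completed
    writes = after.writes_completed - before.writes_completed
    read_ms = (after.ms_reading - before.ms_reading) / reads if reads else 0.0
    write_ms = (after.ms_writing - before.ms_writing) / writes if writes else 0.0
    return read_ms, write_ms


def watch(device="sda", interval_s=5.0):
    """Take two snapshots interval_s apart (Linux only; device is hypothetical)."""
    with open("/proc/diskstats") as f:
        before = parse_diskstats(f.read(), device)
    time.sleep(interval_s)
    with open("/proc/diskstats") as f:
        after = parse_diskstats(f.read(), device)
    return avg_latency_ms(before, after)
```

calling `watch()` on a schedule and alerting when the averages drift is a rough starting point. tools such as iostat report the same counters with more detail.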

so don't stop monitoring, don't assume, and stay curious...
