Back-of-the-Envelope Estimation
Jeff Dean: back-of-the-envelope calculations are estimates you create using a combination of thought experiments and common performance numbers to get a good feel for which designs will meet your requirements.
Power of two
To get these calculations right, it is critical to know the data volume units expressed as powers of 2.
| Power | Approximate Value | Full Name | Short Name |
|---|---|---|---|
| 2^10 | 1 thousand | 1 kilobyte | 1 KB |
| 2^20 | 1 million | 1 megabyte | 1 MB |
| 2^30 | 1 billion | 1 gigabyte | 1 GB |
| 2^40 | 1 trillion | 1 terabyte | 1 TB |
| 2^50 | 1 quadrillion | 1 petabyte | 1 PB |
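A quick way to sanity-check these units is to compute them. This minimal Python sketch prints each power of 2 next to its decimal approximation:

```python
# Exact byte counts behind each power of 2 and its decimal approximation.
UNITS = [
    (10, "KB", "thousand"),
    (20, "MB", "million"),
    (30, "GB", "billion"),
    (40, "TB", "trillion"),
    (50, "PB", "quadrillion"),
]

for power, short, approx in UNITS:
    print(f"2^{power} = {2 ** power:,} bytes  (~1 {approx}, 1 {short})")
```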
Latency numbers
Approximate latency numbers (based on modern systems):
| Operation | Latency | Notes |
|---|---|---|
| L1 cache reference | ~1 ns | Fastest memory access |
| Branch mispredict | ~5 ns | CPU pipeline stall |
| L2 cache reference | ~7 ns | Still very fast |
| Mutex lock/unlock | ~25 ns | Synchronization overhead |
| Main memory reference | ~100 ns | 100x slower than L1 |
| Compress 1KB with Zippy | ~3 μs (3,000 ns) | Snappy compression |
| Send 2KB over 1 Gbps network | ~20 μs (20,000 ns) | Network bandwidth limit |
| Read 1MB sequentially from network | ~10 ms | Varies with network quality |
| Read 1MB sequentially from disk | ~1 ms (SSD) to ~20 ms (HDD) | SSD is 20x faster |
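Plugged into code, the table becomes reusable constants for quick estimates. The sketch below uses the per-MB latencies from the table; the 1 GB payload size is an illustrative assumption, chosen to make the SSD/HDD/network gap visible:

```python
# Per-MB sequential read latencies (nanoseconds) from the table above.
READ_1MB_SSD_NS = 1_000_000        # ~1 ms
READ_1MB_HDD_NS = 20_000_000       # ~20 ms
READ_1MB_NETWORK_NS = 10_000_000   # ~10 ms

def sequential_read_ms(per_mb_ns: int, size_mb: int) -> float:
    """Back-of-the-envelope sequential read time in milliseconds."""
    return per_mb_ns * size_mb / 1_000_000

SIZE_MB = 1024  # 1 GB, an illustrative payload size
print(f"1 GB from SSD:     ~{sequential_read_ms(READ_1MB_SSD_NS, SIZE_MB):,.0f} ms")
print(f"1 GB from HDD:     ~{sequential_read_ms(READ_1MB_HDD_NS, SIZE_MB):,.0f} ms")
print(f"1 GB over network: ~{sequential_read_ms(READ_1MB_NETWORK_NS, SIZE_MB):,.0f} ms")
```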
Key takeaways for distributed systems:
- Memory hierarchy matters: L1 -> L2 -> RAM spans roughly 100x in latency (~1 ns -> ~7 ns -> ~100 ns)
- Network is expensive: even fast networks are far slower than RAM; sending 2KB over a 1 Gbps link (~20 μs) takes ~200x as long as a main memory reference (~100 ns)
- Sequential disk reads: SSDs make a huge difference (20x improvement over HDDs)
- Cache locality: Keeping data in L1/L2 cache can dramatically improve performance
Availability numbers
High availability is the ability of a system to remain continuously operational for a desirably long period of time. It is usually measured in nines: the more nines, the better.
| Availability % | Downtime per day | Downtime per year |
|---|---|---|
| 99% (2 nines) | 14.40 minutes | 3.65 days |
| 99.9% (3 nines) | 1.44 minutes | 8.77 hours |
| 99.99% (4 nines) | 8.64 seconds | 52.56 minutes |
| 99.999% (5 nines) | 864.00 milliseconds | 5.26 minutes |
| 99.9999% (6 nines) | 86.40 milliseconds | 31.56 seconds |
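Every row follows from one formula: allowed downtime = (1 - availability) × period. A minimal Python sketch that regenerates the table from that formula:

```python
# Downtime budget implied by an availability target:
# downtime = (1 - availability) * period.
SECONDS_PER_DAY = 24 * 3600
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

def downtime_seconds(availability_pct: float, period_seconds: int) -> float:
    """Allowed downtime within the period for the given availability."""
    return (1 - availability_pct / 100) * period_seconds

for pct in (99, 99.9, 99.99, 99.999, 99.9999):
    per_day = downtime_seconds(pct, SECONDS_PER_DAY)
    per_year = downtime_seconds(pct, SECONDS_PER_YEAR)
    print(f"{pct}%: {per_day:.2f} s/day, {per_year / 60:.2f} min/year")
```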
Example
Estimate Twitter’s QPS and storage requirements
- Make assumptions
- 500M daily active users (DAU)
- Each posting user creates 2 tweets per day on average
- Average tweet size: 300 characters
- 20% of users post, 80% just read
- 10% of tweets have media (images/videos)
- Data is stored for 5 years
- Calculate tweets per day
- Active posters: 500M * 20% = 100M users
- Tweets per day = 100M * 2 = 200M tweets/day
- Average QPS: 200M / (24 hours * 3600 seconds) ≈ 2,300
- Peak QPS: ~3 * average QPS ≈ 7,000
- Calculate tweet storage
- Metadata per tweet
- User ID: 8 bytes
- Tweet ID: 8 bytes
- Timestamp: 8 bytes
- Likes/retweet counts: 8 bytes
- Other metadata: ~32 bytes
- Total: ~64 bytes
- Tweet
- 300 characters * ~2 bytes per character (UTF-8 average) = 600 bytes
- Total: (64 + 600) bytes * 200M tweets = 132.8 GB/day
- Calculate media storage
- Tweets with media: 200M * 10% = 20M/day
- Average media size: 200 KB (compressed image)
- Media storage: 20M * 200 KB = 4 TB/day
- Total daily storage
- 132.8 GB/day + 4 TB/day ≈ 4.13 TB/day
- Calculate 5-year storage
- Storage: 4.13 TB/day * 365 days * 5 years ≈ 7.5 PB
- Add 30% for replication and backups: total ≈ 10 PB (the sketch below reproduces the full calculation)
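Putting the whole walkthrough into code makes every assumption explicit and easy to tweak. This is a minimal sketch of the same arithmetic; all inputs (the 3x peak factor, per-field sizes, media ratio) are the assumptions stated above, not measured values:

```python
# Twitter QPS and storage estimate; every input is an assumption from
# the walkthrough above, so change a constant and rerun to explore.
DAU = 500_000_000            # daily active users
POSTER_RATIO = 0.20          # 20% of users post, 80% only read
TWEETS_PER_POSTER = 2        # tweets per posting user per day
METADATA_BYTES = 64          # IDs, timestamp, counts, misc metadata
TEXT_BYTES = 300 * 2         # 300 chars * ~2 bytes/char (UTF-8 average)
MEDIA_RATIO = 0.10           # 10% of tweets carry media
MEDIA_BYTES = 200_000        # ~200 KB compressed image
RETENTION_YEARS = 5
OVERHEAD = 1.3               # +30% for replication/backups
PEAK_FACTOR = 3              # assume peak traffic is ~3x the average

tweets_per_day = DAU * POSTER_RATIO * TWEETS_PER_POSTER
avg_qps = tweets_per_day / (24 * 3600)

daily_bytes = tweets_per_day * (METADATA_BYTES + TEXT_BYTES
                                + MEDIA_RATIO * MEDIA_BYTES)
five_year_bytes = daily_bytes * 365 * RETENTION_YEARS * OVERHEAD

# Decimal units keep the arithmetic simple; at this precision the
# difference from binary units is noise.
print(f"Tweets/day:     {tweets_per_day / 1e6:.0f}M")
print(f"Average QPS:    ~{avg_qps:,.0f}")
print(f"Peak QPS:       ~{PEAK_FACTOR * avg_qps:,.0f}")
print(f"Daily storage:  ~{daily_bytes / 1e12:.2f} TB")
print(f"5-year storage: ~{five_year_bytes / 1e15:.1f} PB")
```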