1. Fixing Twitter
... and Finding your own Fail Whale
John Adams
Twitter Operations
<jna@twitter.com>
2. Operations
• Small team, growing rapidly.
• What do we do?
• Software Performance (back-end)
• Availability
• Capacity Planning (metrics-driven)
• Configuration Management
• We don’t deal with the physical plant.
3. Managed Services
• Dedicated team (NTTA)
• 24/7 hands-on remote support
• No clouds. We tried that!
• We need raw processing power; latency is too high in existing cloud offerings
• Frees us to deal with real, intellectual, computer science problems.
4. 752%
2008 Growth
[Chart: Unique Visitors (in Millions), Dec 07 – Dec 08]
13. Monitoring
• Graph and report critical metrics in as near
real time as possible
• You already have the tools.
• RRD
• Ganglia + custom gMetric scripts
• MRTG
14. Dashboards
• “Criticals” view
• Smokeping/MRTG
• Google Analytics - not just for HTTP 200s/SEO
• XML Feeds from managed services
• Data Porn!
15. Analyze
• Turn data into information
• Where is the code base going?
• Are things worse than they were?
• Understand the impact of the last software deploy
• Run check scripts during and after deploys
• Capacity Planning, not Fire Fighting!
16. Forecasting
• Curve-fitting for capacity planning (R, fityk, Mathematica, CurveFit)
[Chart: status_id growth curve (fit r² = 0.99), with the signed and unsigned 32-bit int limits each marked “Twitpocolypse”]
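The curve-fitting step above can be done without any of those tools: fitting exponential growth y = a·e^(b·t) reduces to ordinary least squares on ln(y). A minimal sketch (the function names are illustrative, not from the talk):

```ruby
# Fit y = a * exp(b * t) by linear least squares on ln(y) -- the same
# idea curve-fitting tools apply for capacity planning.
def fit_exponential(ts, ys)
  logs = ys.map { |y| Math.log(y) }
  n = ts.size.to_f
  mean_t = ts.sum / n
  mean_l = logs.sum / n
  cov = ts.zip(logs).sum { |t, l| (t - mean_t) * (l - mean_l) }
  var = ts.sum { |t| (t - mean_t)**2 }
  b = cov / var                       # growth rate
  a = Math.exp(mean_l - b * mean_t)   # initial value
  [a, b]
end

def predict(a, b, t)
  a * Math.exp(b * t)
end

# Project when a counter (e.g. status_id) crosses a 32-bit limit:
def time_until(a, b, limit)
  Math.log(limit / a) / b
end
```

Feeding `time_until` the signed 32-bit ceiling (2**31) is exactly the kind of projection that predicts a “Twitpocolypse” date in advance.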
17. Deploys
• Graph time-of-deploy alongside server CPU and latency
• Display time-of-last-deploy on the dashboard
18. Whale-Watcher
• Simple shell script, MASSIVE WIN.
• Whale = HTTP 503 (timeout)
• Robot = HTTP 500 (error)
• Examines last 100,000 lines of aggregated daemon / www logs
• If “Whales per Second” > W_threshold: Thar be whales! Call in ops.
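The idea is small enough to re-create. A sketch of the Whale-Watcher logic in Ruby (the original was a shell script; names and log format here are assumptions):

```ruby
# Scan the tail of an aggregated log and alert when the rate of
# HTTP 503s ("whales") per second crosses a threshold.
WHALE = /\s503\s/   # HTTP 503 = timeout ("whale")
ROBOT = /\s500\s/   # HTTP 500 = error ("robot")

def error_counts(lines)
  { whales: lines.count { |l| l =~ WHALE },
    robots: lines.count { |l| l =~ ROBOT } }
end

def whales_per_second(lines, window_seconds)
  error_counts(lines)[:whales].to_f / window_seconds
end

def whale_alert?(lines, window_seconds, threshold)
  whales_per_second(lines, window_seconds) > threshold
end
```

In production this would run over the last 100,000 lines of the aggregated www/daemon logs, with the window derived from their timestamps.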
20. Feature “Darkmode”
• Specific site controls to enable and disable computationally or IO-heavy site functions
• The “Emergency Stop” button
• Changes logged and reported to all teams
• Around 60 switches we can throw
• Static / Read-only mode
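A darkmode switchboard might look like the following sketch; the class and method names are assumptions, not Twitter's code:

```ruby
# A registry of feature switches that can disable heavy site
# functions, logging every change so all teams can see it.
class Darkmode
  def initialize(logger = $stdout)
    @disabled = {}
    @logger = logger
  end

  def disable!(feature)
    @disabled[feature] = true
    @logger.puts("DARKMODE ON: #{feature}")   # changes logged and reported
  end

  def enable!(feature)
    @disabled.delete(feature)
    @logger.puts("DARKMODE OFF: #{feature}")
  end

  def disabled?(feature)
    @disabled.fetch(feature, false)
  end
end
```

Call sites then guard the expensive path, e.g. `render_search unless darkmode.disabled?(:search)` — the “Emergency Stop” button is just flipping one of these switches.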
21. Configuration Management
• Start automated configuration management EARLY in your company.
• Don’t wait until it’s too late.
• Twitter started within the first few months.
22. Configuration Management
• Complex Environment
• Multiple Admins
• Unknown Interactions
• Solution: a 2nd set of eyes.
24. Reviewboard
www.review-board.org
• SVN pre-commit hook fails the commit if the log message doesn’t include ‘reviewed’
• SVN post-commit hook informs people of what changed via email
• Watches the entire SVN tree
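The pre-commit check could be as small as this sketch (hypothetical Ruby version; real hooks often just shell out, reading the pending message with `svnlook log -t TXN REPO` and failing the commit on a nonzero exit):

```ruby
# Reject any commit whose log message doesn't note a review.
def reviewed?(log_message)
  log_message.downcase.include?("reviewed")
end

def precommit(log_message)
  unless reviewed?(log_message)
    warn "Commit rejected: log message must contain 'reviewed'"
    return 1   # nonzero exit aborts the commit
  end
  0
end
```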
27. Many limiting factors in the request pipeline
• Apache: MPM model, MaxClients, TCP listen queue depth
• Rails (mongrel): 2:1 oversubscribed to cores
• Memcached: # connections
• MySQL: # db connections
• Varnish (search): # threads
29. CPU: More with Less
• 40% reduction in CPU by replacing dual- and quad-core machines with 8-core
• Switching from AMD to Intel Xeon = 30% gain
• Saved data center space, power, and cost per month.
• Not the best option if you own machines: capital expenditure makes it hard to realize new-technology gains.
30. Rails
• Stop blaming Rails.
• Analysis found:
• Caching + Cache invalidation problems
• Bad queries generated by ActiveRecord,
resulting in slow queries against the db
• Queue Latency
• Memcache / Page Cache Corruption
• Replication Lag
31. Disk is the new Tape.
• Social Networking application profile has many O(ny) operations.
• Page requests have to happen in < 500 ms or users start to notice. Goal: 250–300 ms
• Web 2.0 isn’t possible without lots of RAM
• What to do?
32. Caching
• We’re the real-time web, but there’s lots of caching opportunity
• Most caching strategies rely on long TTLs (> 60 s)
• Separate memcache pools for different data types to prevent eviction
• Optimized the Ruby Gem to libmemcached + FNV Hash instead of Ruby + MD5
• Twitter is now the largest contributor to libmemcached
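FNV is attractive here because it is a few multiplies and XORs per byte rather than a full cryptographic digest. A 32-bit FNV-1a sketch for illustration (libmemcached supports several hash functions; this is not its internal code, and the `pick_server` helper is a simplified modulo scheme, not consistent hashing):

```ruby
# 32-bit FNV-1a: fast, non-cryptographic, good distribution for
# picking a memcached server from a key.
FNV_OFFSET = 2166136261
FNV_PRIME  = 16777619

def fnv1a_32(key)
  key.each_byte.reduce(FNV_OFFSET) do |h, b|
    ((h ^ b) * FNV_PRIME) & 0xffffffff   # keep it a 32-bit value
  end
end

def pick_server(key, servers)
  servers[fnv1a_32(key) % servers.size]
end
```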
33. Caching
• 50% decrease in load with the native C gem + libmemcached
34. Cache Money!
• Active Record Plugin
• Cache when reading from the DB
• Cache when writing to the DB
• Transparently provides caching
• Removes need for set/get cache code
• Open Source!
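The read-through / write-through idea behind Cache Money can be sketched in a few lines (a hypothetical stand-in class, not the plugin's real API):

```ruby
# Reads check the cache before the DB; writes update both. Call sites
# never touch get/set cache code directly.
class CachedStore
  def initialize(db, cache)
    @db, @cache = db, cache
  end

  def read(key)
    @cache[key] ||= @db[key]   # cache when reading from the DB
  end

  def write(key, value)
    @db[key] = value
    @cache[key] = value        # cache when writing to the DB
  end
end
```

The plugin hooks the same two interception points into ActiveRecord, which is what makes the caching transparent.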
35. Caching
• “Cache Everything!” not the best policy
• Invalidating caches at the right time is
difficult.
• Cold Cache problem
• Network Memory Bus != Infinite
36. Memcached
• memcached isn’t perfect.
• Memcached SEGVs hurt us early on.
• Evictions make the cache unreliable for important configuration data (loss of darkmode flags, for example)
• Data and Hash Corruption (even in 1.2.6)
• Exposed a corruption issue where specific inputs caused SEGVs and unexpected behavior
37. API + Caching (search)
• Cache and control abusive clients
• Varnish between two Apache Virtual Hosts (failover to another backend if Varnish dies)
• Remove cache-busting query strings before applying the hash algorithm
• Using ESI to cache jQuery requests when specifying a callback= parameter - big win.
38. Relational Databases: not a Panacea
• Good for:
• Users, Relational Data, Transactions
• Bad:
• Queues. Polling operations. Caching.
• You don’t need ACID for everything.
• Enter the message queue...
39. Queues
• Many message queue solutions on the market
• At high loads, most perform poorly when used in ‘durable’ mode.
• Erlang-based queues work well (RabbitMQ), but you need in-house Erlang experience.
• We wrote our own.
• Kestrel to the rescue!
40. Kestrel
Falco tinnunculus
• Works like memcache (same protocol)
• SET = enqueue | GET = dequeue
• No strict ordering of jobs
• No shared state between servers
• Written in Scala.
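Because Kestrel speaks the memcache protocol, any memcache client doubles as a queue client. A toy in-memory illustration of the semantics (Kestrel itself is Scala; this Ruby class only mimics the interface):

```ruby
# SET = enqueue, GET = dequeue -- the memcache verbs reinterpreted
# as queue operations, one FIFO per key.
class ToyKestrel
  def initialize
    @queues = Hash.new { |h, k| h[k] = [] }
  end

  def set(queue, value)   # SET enqueues
    @queues[queue] << value
    true
  end

  def get(queue)          # GET dequeues; nil when the queue is empty
    @queues[queue].shift
  end
end
```

“No shared state between servers” falls out of this design: each Kestrel node owns its queues, and clients simply spread load across nodes, which is also why job ordering is not strict.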
41. Asynchronous Requests
• Inbound traffic consumes a mongrel
• Outbound traffic consumes a mongrel
• The request pipeline should not be used to handle 3rd-party communications or back-end work.
• Daemons, Daemons, Daemons.
42. Don’t make services dependent
• Move operations out of the synchronous request cycle
• Email
• Complex object generation (timelines)
• 3rd-party services (bit.ly, sms, etc.)
43. Daemons
• Many different types at Twitter.
• The # of daemons has to match the workload
• Early Kestrel would crash if queues filled
• “Seppaku” patch: kill daemons after n requests
• Long-running daemons = low memory
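The “kill after n requests” pattern is a few lines of worker loop. A sketch (hypothetical worker, not Twitter's daemon code):

```ruby
# A daemon that stops after max_jobs so leaks never accumulate;
# a supervisor (monit, runit, etc.) restarts a fresh process.
class Worker
  def initialize(queue, max_jobs)
    @queue, @max_jobs = queue, max_jobs
  end

  def run
    handled = 0
    while handled < @max_jobs && (job = @queue.shift)
      job.call
      handled += 1
    end
    handled   # a real daemon would exit here
  end
end
```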
44. MySQL Challenges
• Replication Delay
• Single threaded. Slow.
• Social Networking not good for RDBMS
• N x N relationships and social graph / tree traversal
• Sharding importance
• Disk issues (FS choice, noatime, scheduling algorithm)
45. MySQL
• Replication delay and cache eviction produce inconsistent results for the end user.
• Locks create resource contention for popular data
46. Database Replication
• Major issues around the users and statuses tables
• Multiple functional masters (FRP, FWP)
• Make sure your code reads and writes to the right DBs. Reading from the master = slow death
• Monitor the DB. Find slow / poorly designed queries
• Kill long-running queries before they kill you (mkill)
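An mkill-style reaper boils down to walking the process list and killing old SELECTs. A sketch (the `db` adapter interface here is assumed; in MySQL the data comes from `SHOW FULL PROCESSLIST` and `KILL <id>`):

```ruby
# Kill any SELECT that has run longer than max_seconds before it
# drags the whole database down. Writes are left alone.
def kill_long_queries(db, max_seconds)
  killed = []
  db.processlist.each do |row|           # rows like SHOW FULL PROCESSLIST
    next unless row[:info].to_s =~ /\ASELECT/i
    next unless row[:time] > max_seconds
    db.kill(row[:id])
    killed << row[:id]
  end
  killed
end
```

Run from cron or a watchdog daemon, this keeps one runaway timeline query from starving every other connection.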
47. status.twitter.com
• Keep users in the loop, or suffer.
• Hosted on a different service (Tumblr)
• No matter how little information you have available.
48. Key Points
• Databases not always the best store.
• Instrument everything.
• Use metrics to make decisions, not guesses.
• Don’t make services dependent
• Process asynchronously when possible