Microservices Gone Wrong at DoorDash

In 2020, amid the pandemic, DoorDash saw a massive traffic spike as delivery services became essential. To keep up with the demand, DoorDash migrated from a monolithic Python application to a microservices-oriented architecture. The decision was well-founded for scalability and for managing the increased load, but the migration was fraught with challenges that showcase the complexities and pitfalls of adopting microservices.

Background

Before the switch, DoorDash operated a Python monolith. It was functional, but scalability and code maintainability suffered as the user base expanded. As more users flocked to the platform, the company recognized the need for a more distributed architecture that would allow independent deployments and, where appropriate, different programming languages for different services, improving performance and maintainability.

Challenges Faced

Despite the initial promise of microservices, DoorDash encountered various issues:

  1. Cascading Failures: This occurs when one failing service causes others to fail as well. A simple dependency chain means that if one service experiences increased latency, that impact cascades through the system, leading to timeout issues and complete failures. This is more challenging to debug than in a monolithic system, especially for a team new to microservices.

  2. Retry Storms: Retry logic inside microservices can backfire when services are overwhelmed. If a service fails because it is overloaded, automatic retries multiply the number of incoming requests and further choke the already struggling service (see the sketch after this list).

  3. Death Spiral: Autoscaling, a primary benefit of the move to microservices, turned problematic. If too many instances of a service fail, the remaining instances become overloaded and fail in turn, a self-reinforcing loop that requires manual intervention to break.

  4. Metastable Failures: Even after the root cause of a problem is mitigated, the system can remain in a failing state due to previous overloads or failures. Manual developer intervention is often necessary to restore stability.
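
To make the retry-storm failure mode concrete, here is a minimal Python sketch (a hypothetical illustration, not DoorDash's actual code): ten callers hit an already overloaded dependency at the same moment, and naive immediate retries turn 10 user actions into 25 requests, deepening the overload.

    import time

    class OverloadedService:
        """Toy downstream dependency with a fixed request budget per time window."""

        def __init__(self, capacity=5, window_seconds=1.0):
            self.capacity = capacity
            self.window_seconds = window_seconds
            self.window_start = time.monotonic()
            self.handled = 0

        def handle(self):
            now = time.monotonic()
            if now - self.window_start >= self.window_seconds:
                self.window_start, self.handled = now, 0
            self.handled += 1
            # Requests beyond the budget fail, as they would on an overloaded service.
            return self.handled <= self.capacity

    def call_with_naive_retries(service, max_attempts=4):
        """Retry immediately on failure; returns how many requests were actually sent."""
        for attempt in range(1, max_attempts + 1):
            if service.handle():
                return attempt
        return max_attempts

    if __name__ == "__main__":
        service = OverloadedService(capacity=5)
        # Ten callers hit the struggling service at the same moment: naive retries
        # turn 10 user actions into 25 requests, deepening the overload.
        sent = sum(call_with_naive_retries(service) for _ in range(10))
        print(f"{sent} requests sent for 10 user actions")
        # Standard mitigations (not covered in this article): capped exponential
        # backoff with jitter and per-client retry budgets.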

Countermeasures

To address these challenges, DoorDash implemented several strategies:

  • Load Shedding: This technique involves intelligently dropping less critical requests to preserve essential functionality. By monitoring resource utilization, the system can prioritize important operations like payment processing over less critical tasks like fetching images (a sketch follows this list).

  • Circuit Breakers: This approach lets a service stop sending requests to a downstream service that is experiencing issues. By halting requests for less crucial information and failing fast instead, the circuit breaker helps stabilize the system (also sketched after this list).

  • Predictive Autoscaling: Rather than only reacting to increased load, predictive autoscaling anticipates traffic patterns (e.g., more traffic during the day than at night) and scales resources ahead of time. This proactive approach can blunt the effects of sudden spikes in traffic (a small scheduling sketch follows the list).
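
To illustrate load shedding, here is a minimal Python sketch, hypothetical rather than DoorDash's implementation: a utilization signal gates low-priority requests (the request names and the 80% threshold are invented for the example), while critical operations such as payment processing always get through.

    import random

    # Illustrative priorities; anything not listed is treated as low priority.
    CRITICAL, LOW = "critical", "low"
    REQUEST_PRIORITY = {
        "process_payment": CRITICAL,
        "create_order": CRITICAL,
        "fetch_restaurant_images": LOW,
    }

    SHED_THRESHOLD = 0.80  # start shedding when utilization exceeds 80%

    def current_utilization():
        """Placeholder for a real signal such as CPU, queue depth, or in-flight requests."""
        return random.uniform(0.5, 1.0)

    def should_shed(request_type):
        """Drop low-priority work when the service is running hot."""
        if REQUEST_PRIORITY.get(request_type, LOW) == CRITICAL:
            return False  # critical operations are never shed
        return current_utilization() > SHED_THRESHOLD

    def handle(request_type):
        if should_shed(request_type):
            return "503: shed under load"
        return f"200: handled {request_type}"

    if __name__ == "__main__":
        for request_type in ("process_payment", "fetch_restaurant_images"):
            print(request_type, "->", handle(request_type))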
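
The circuit-breaker idea can likewise be sketched in a few lines of Python. This is a generic, minimal version, not DoorDash's production breaker: after a configurable number of consecutive failures it opens and callers fail fast or fall back, and after a cooldown it lets a trial request through. The image_client and item_id names in the usage comment are hypothetical.

    import time

    class CircuitBreaker:
        """Minimal circuit breaker: opens after N consecutive failures, retries after a cooldown."""

        def __init__(self, failure_threshold=5, reset_timeout=30.0):
            self.failure_threshold = failure_threshold
            self.reset_timeout = reset_timeout
            self.failures = 0
            self.opened_at = None  # None means the breaker is closed (traffic flows)

        def call(self, func, fallback=None):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_timeout:
                    return fallback  # fail fast: do not hit the struggling downstream
                self.opened_at = None  # cooldown elapsed: allow a trial request (half-open)
            try:
                result = func()
            except Exception:
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self.opened_at = time.monotonic()  # trip the breaker
                return fallback
            self.failures = 0  # a success closes the breaker again
            return result

    # Usage: wrap calls to a non-critical downstream, e.g. a hypothetical image client.
    breaker = CircuitBreaker(failure_threshold=3, reset_timeout=10.0)
    # image_url = breaker.call(lambda: image_client.get_url(item_id), fallback=None)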
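
Predictive autoscaling is mostly an infrastructure concern, but the core idea fits in a small scheduling function. The forecast and capacity numbers below are invented for illustration: pick a replica count from a traffic forecast for the upcoming hour rather than waiting for load to arrive.

    import math

    # Hypothetical hourly forecast of peak requests per second, e.g. built from historical traffic.
    HOURLY_FORECAST_RPS = {hour: 200 if 10 <= hour <= 21 else 60 for hour in range(24)}

    RPS_PER_REPLICA = 25   # measured capacity of a single instance
    HEADROOM = 1.3         # over-provision 30% to absorb surprises
    MIN_REPLICAS = 3

    def desired_replicas(current_hour):
        """Scale for the coming hour's forecast instead of reacting after load arrives."""
        forecast = HOURLY_FORECAST_RPS[(current_hour + 1) % 24]
        return max(MIN_REPLICAS, math.ceil(forecast * HEADROOM / RPS_PER_REPLICA))

    if __name__ == "__main__":
        print("replicas ahead of the daytime peak:", desired_replicas(17))   # 11
        print("replicas ahead of the overnight lull:", desired_replicas(1))  # 4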

In conclusion, while DoorDash's decision to switch from a monolithic to a microservices architecture was rooted in valid reasoning and necessity, the challenges faced during and after the migration illustrate the complexity of operating a distributed system. Successful adoption requires more than a technology shift; effective monitoring, intelligent decision-making, and timely intervention are vital.


Keywords

Keywords: DoorDash, microservices, monolithic architecture, traffic spike, cascading failures, retry storms, death spiral, metastable failures, load shedding, circuit breakers, predictive autoscaling.


FAQ

Q: Why did DoorDash switch to microservices?
A: The switch was primarily to handle the increased user traffic and improve scalability, allowing for independent service deployment and the use of different programming languages.

Q: What is a cascading failure in microservices?
A: It occurs when a failure in one service propagates through dependencies, causing subsequent services to fail, which can lead to a complete breakdown.

Q: What is a retry storm?
A: A phenomenon in which callers of a failing service automatically retry their requests, unintentionally overloading the system and worsening the issue instead of resolving it.

Q: What is the death spiral in the context of microservices?
A: It occurs when autoscaling cannot keep up with load and failing instances shift their traffic onto the remaining healthy ones, which become overwhelmed and fail in turn, putting ever more strain on the system.

Q: How did DoorDash cope with these challenges?
A: They implemented strategies like load shedding, circuit breakers, and predictive autoscaling to stabilize their services and manage loads effectively.