How To Avoid Losing Money And Users During RDS Engine Upgrade: Insights from the SafeBoda DevOps Team
By: Ahmad Kraiz, DevOps Engineer, SafeBoda
From the DevOps perspective, maintaining high availability and minimizing disruptions during system upgrades are non-negotiable. Recently, we faced the challenge of upgrading our SafeBoda RDS engines to newer versions to stay compliant and avoid increased costs.
SafeBoda has thousands of concurrent users during business hours, and any downtime will put us at risk of losing a significant amount of revenue and users, not to mention losing my job 😅.
To tackle this, we turned to Amazon RDS Blue/Green Deployments, a powerful feature that ensures a smooth transition with minimal downtime. This approach not only helps us avoid disruptions but also shows how committed we are to providing a great user experience and maximizing operational efficiency in every aspect of our tech ecosystem.
The Unavoidable Challenge: Downtime and Risk
A traditional in-place RDS engine upgrade involves significant downtime, typically 5 to 20 minutes depending on database size. Even a few minutes of downtime can have far-reaching effects on our critical applications. In addition, a manual in-place upgrade has no built-in rollback, which raises the stakes of the operation.
Blue & Green: the Amazon RDS Deployment Magic
Amazon RDS Blue/Green Deployments offer a solution to this challenge by allowing us to release new database versions with a structured rollout approach, minimizing disruptions to SafeBoda end users and safeguarding against downtime-related risks. Here’s why this approach was instrumental:
1. Setting Up a Staging Environment
- Easily create a production-ready staging environment to test database changes without affecting the live environment
2. Automated Replication
- Automatically replicate database changes from production to staging, ensuring consistency and accuracy
3. Reducing Risk
- Stay current with patches and system updates
4. Safe Deployment with Minimal Downtime
- Switch over to the new environment without changing your application. The built-in switchover typically completes in under a minute and is designed to prevent data loss and keep the system stable
5. Safe Testing
- Test database changes in a production-like environment without affecting the production environment, reducing the risk of errors and issues
Step-by-Step Deployment Process
To execute a successful RDS engine upgrade using Blue/Green Deployments, follow these essential steps:
1. Configure Parameter Groups
- Create a new parameter group to enable logical replication and adjust parameter values
2. Restart RDS Instance
- Restart the RDS instance to apply the new parameter group
3. Initiate Blue/Green Deployment
- Set up Blue/Green deployment to manage the transition seamlessly
4. Address Incompatibilities
- Identify and upgrade incompatible extensions before the major engine upgrade
5. Perform Major Engine Upgrade
- Upgrade the Green main instance, creating a new parameter group specific to the new engine version
6. Switch Over
- Safely switch over to the Green environment, which becomes the new production. The endpoint stays the same, but the IP address behind it changes
7. Finalize and Restore Default Settings
- Once the upgrade is complete, revert to the default parameter group and restart the RDS instance for finalization
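Steps 1 to 5 above could be scripted roughly as follows with the boto3 RDS client. This is a sketch under assumptions, not our exact runbook: every identifier, parameter group family, and engine version passed in is a hypothetical placeholder, and the functions take the client as an argument so they can be dry-run against a stub. In a real run you would also wait for the instance to return to the available state between calls.

```python
"""Sketch of steps 1-5 using boto3 RDS client calls.

All names passed to these functions (instance IDs, parameter group
names, families, versions) are hypothetical placeholders.
"""


def prepare_blue(rds, instance_id, group_name, family):
    """Steps 1-2: enable logical replication on the blue instance, then reboot."""
    rds.create_db_parameter_group(
        DBParameterGroupName=group_name,
        DBParameterGroupFamily=family,
        Description="Enable logical replication for Blue/Green",
    )
    rds.modify_db_parameter_group(
        DBParameterGroupName=group_name,
        Parameters=[{
            "ParameterName": "rds.logical_replication",
            "ParameterValue": "1",
            "ApplyMethod": "pending-reboot",  # static parameter: needs a reboot
        }],
    )
    rds.modify_db_instance(
        DBInstanceIdentifier=instance_id,
        DBParameterGroupName=group_name,
        ApplyImmediately=True,
    )
    # In practice, wait until the instance is available before rebooting.
    rds.reboot_db_instance(DBInstanceIdentifier=instance_id)


def create_deployment(rds, name, source_arn):
    """Step 3: create the Blue/Green deployment and return its identifier."""
    resp = rds.create_blue_green_deployment(
        BlueGreenDeploymentName=name,
        Source=source_arn,
    )
    return resp["BlueGreenDeployment"]["BlueGreenDeploymentIdentifier"]


def upgrade_green(rds, green_instance_id, target_version, new_group_name):
    """Step 5: major engine upgrade of the green instance with its new parameter group."""
    rds.modify_db_instance(
        DBInstanceIdentifier=green_instance_id,
        EngineVersion=target_version,
        AllowMajorVersionUpgrade=True,
        DBParameterGroupName=new_group_name,
        ApplyImmediately=True,
    )
```

With a real client this would be called as, for example, `prepare_blue(boto3.client("rds"), "my-db", "my-db-logical", "postgres13")`, where the last three arguments are illustrative values only. Step 4 (extension upgrades) is done inside the database and is not covered here.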
Key Features of Switchover
- Timeout Setting: You can set a switchover timeout ranging from 30 seconds to one hour. The default is five minutes. If the switchover exceeds this limit, all changes are rolled back.
- Guardrails: Before initiating a switchover, RDS performs a series of checks on both the blue and green environments to ensure their readiness. These include checks on replication health, replication lag, and active writes. Any inconsistencies will halt the switchover.
- Switchover Actions: During the switchover, RDS runs the guardrail checks, stops new write operations, drops existing connections and disallows new ones, waits for replication to catch up, and renames the DB instances and endpoints in both environments.
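As an illustration of the timeout setting, the switchover call could be wrapped as below. This is a minimal sketch: the range check mirrors the 30-second to one-hour limits described above, the helper names are our own, and the deployment identifier is a hypothetical placeholder. RDS still runs its own guardrail checks regardless of this client-side validation.

```python
MIN_TIMEOUT_S = 30        # lower bound described above
MAX_TIMEOUT_S = 3600      # one hour
DEFAULT_TIMEOUT_S = 300   # five minutes


def checked_timeout(seconds=DEFAULT_TIMEOUT_S):
    """Return the timeout unchanged if it is within the allowed range, else raise."""
    if not MIN_TIMEOUT_S <= seconds <= MAX_TIMEOUT_S:
        raise ValueError(
            f"switchover timeout must be {MIN_TIMEOUT_S}-{MAX_TIMEOUT_S}s, got {seconds}"
        )
    return seconds


def switch_over(rds, deployment_id, timeout_seconds=DEFAULT_TIMEOUT_S):
    """Request the switchover and return the status reported in the response, if any."""
    resp = rds.switchover_blue_green_deployment(
        BlueGreenDeploymentIdentifier=deployment_id,
        SwitchoverTimeout=checked_timeout(timeout_seconds),
    )
    return resp.get("BlueGreenDeployment", {}).get("Status")
```

Here `rds` is expected to be a boto3 RDS client, for example `boto3.client("rds")`.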
Limitations of Logical Replication
1. The database schema and DDL commands are not replicated.
2. Sequence data is not replicated.
3. Large objects are not replicated.
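4. Only tables are replicated. Other relations, such as views, materialized views, and foreign tables, are not.
5. Every table should have a primary key so that updates and deletes can be replicated.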
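The primary key requirement can be checked before starting. The sketch below assumes a PostgreSQL source and any DB-API cursor (for instance one from psycopg2); the function name is our own, and the query simply lists ordinary and partitioned tables outside the system schemas that have no primary key constraint.

```python
# Tables (ordinary 'r' and partitioned 'p') with no primary key constraint.
TABLES_WITHOUT_PK_SQL = """
SELECT n.nspname, c.relname
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('r', 'p')
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
  AND NOT EXISTS (
      SELECT 1
      FROM pg_constraint con
      WHERE con.conrelid = c.oid
        AND con.contype = 'p'
  )
ORDER BY n.nspname, c.relname
"""


def tables_without_primary_key(cursor):
    """Return 'schema.table' names for every table lacking a primary key."""
    cursor.execute(TABLES_WITHOUT_PK_SQL)
    return [f"{schema}.{table}" for schema, table in cursor.fetchall()]
```

An empty list means every table can be replicated safely with respect to this limitation; anything returned should be fixed before creating the deployment.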
Valuable Insights Gained from Experience
1. To minimize disruption, avoid performing the upgrade during peak hours. Instead, choose a time when traffic is minimal.
2. Because the whole process can take more than an hour, create the Blue/Green Deployment and run the major upgrade on the green instance ahead of time, then hold the cutover for a low-traffic window (e.g., overnight) to minimize the impact.
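A small guard in the cutover script can enforce the low-traffic window. The sketch below is illustrative only: the 01:00 to 04:00 default is a hypothetical window, not SafeBoda's actual one, and the check also handles windows that wrap past midnight.

```python
from datetime import time


def in_cutover_window(now, start=time(1, 0), end=time(4, 0)):
    """Return True if `now` (a datetime.time) falls inside [start, end).

    Windows that wrap past midnight (e.g. 23:00 to 02:00) are supported.
    """
    if start <= end:
        return start <= now < end
    return now >= start or now < end
```

A script could call `in_cutover_window(datetime.now().time())` and refuse to trigger the switchover when it returns False.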
Conclusion
Upgrading RDS engines at SafeBoda was a significant and highly sensitive challenge. With Amazon RDS Blue/Green Deployments, we kept the service available and minimized disruption during a critical upgrade. Ultimately, this method aligns with our DevOps philosophy of optimizing efficiency while maintaining high performance and reliability standards.
Additionally, it’s important to note that while RDS Blue/Green deployment doesn’t guarantee zero downtime deployment, it significantly reduces downtime. During the switchover process and DNS propagation, there may be a temporary interruption in the database connection (typically under a minute) within your application.
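On the application side, that brief interruption can be absorbed by reconnecting with a backoff. The following is a minimal sketch, assuming a `connect` callable supplied by the application that opens a new connection through the unchanged RDS endpoint; the function and parameter names are our own, and the driver-specific error type (for example `psycopg2.OperationalError`) would be passed in `retry_on`.

```python
import time


def connect_with_retry(connect, attempts=5, base_delay=0.5,
                       retry_on=(OSError,), sleep=time.sleep):
    """Call connect() until it succeeds, doubling the delay after each failure.

    Re-raises the last error once `attempts` tries have failed.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return connect()
        except retry_on:
            if attempt == attempts - 1:
                raise
            # Back off: base_delay, then 2x, 4x, ... between tries.
            sleep(base_delay * 2 ** attempt)
```

Because the endpoint name is kept while the IP address changes, `connect` should resolve the endpoint's hostname on every attempt rather than reuse a cached IP address.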