Automating RDS Instance Scaling and Failover with AWS Lambda and Step Functions

Arya P B

Managing RDS (Relational Database Service) instances efficiently is crucial for maintaining database performance and cost-effectiveness. Automating the process of scaling up and down RDS instances based on your needs can save time and resources. In this blog, I’ll guide you through setting up an automated system using AWS Lambda and Step Functions to handle RDS instance scaling and failover seamlessly.

Overview

The solution consists of:

  • Lambda Functions: A trigger function decides which action to take (scale up or down) based on scheduled times, and a handler function carries out the RDS changes.
  • Step Functions: Orchestrates the sequence of tasks, including scaling and failover, and ensures that the actions are executed in the correct order.

Prerequisites

To follow this guide, you will need:

  • An AWS account with access to RDS, Lambda, and Step Functions
  • Basic knowledge of AWS services and Python
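
The Lambda code in Step 1 reads its configuration from tags on the Aurora cluster: reader-scale, downgrade, upgrade, downgrade-time and upgrade-time. As a minimal sketch, the tags could be prepared like this. The helper names, the instance classes and the zero-padded HH:MM check are assumptions made for illustration; only the tag keys come from the code in this post.

```python
import re

# Tag keys read by the trigger Lambda in Step 1.
REQUIRED_TAG_KEYS = ("reader-scale", "downgrade", "upgrade", "downgrade-time", "upgrade-time")

# The trigger compares against strftime("%H:%M"), so times must be zero-padded.
HHMM = re.compile(r"([01]\d|2[0-3]):[0-5]\d")


def build_scaling_tags(downgrade_class, upgrade_class, downgrade_time, upgrade_time):
    """Return the tag list in the Key/Value shape the RDS API expects."""
    for label, value in (("downgrade-time", downgrade_time), ("upgrade-time", upgrade_time)):
        if not HHMM.fullmatch(value):
            raise ValueError(f"{label} must be zero-padded HH:MM, got {value!r}")
    values = {
        "reader-scale": "yes",
        "downgrade": downgrade_class,
        "upgrade": upgrade_class,
        "downgrade-time": downgrade_time,
        "upgrade-time": upgrade_time,
    }
    return [{"Key": key, "Value": values[key]} for key in REQUIRED_TAG_KEYS]


def tag_cluster(cluster_arn, tags):
    """Apply the tags to the cluster. Needs boto3 and AWS credentials."""
    import boto3  # imported here so build_scaling_tags works without boto3 installed

    boto3.client("rds").add_tags_to_resource(ResourceName=cluster_arn, Tags=tags)


# Example values only: pick instance classes and times that suit your workload.
tags = build_scaling_tags("db.r6g.large", "db.r6g.2xlarge", "22:00", "06:00")
```

Calling tag_cluster with the cluster ARN and this list would attach the tags. Validating the time format up front matters because the trigger function does an exact string comparison.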

Step 1: Creating the Lambda Functions

Trigger Lambda Function

The first Lambda function determines the action to take (downgrade or upgrade) based on the time and initiates a Step Function execution.

import boto3
from datetime import datetime
import json

step_functions_client = boto3.client('stepfunctions')
rds_client = boto3.client('rds')

def get_clusters_with_reader_scale():
    try:
        response = rds_client.describe_db_clusters()
        clusters = response['DBClusters']
        for cluster in clusters:
            tags_response = rds_client.list_tags_for_resource(
                ResourceName=cluster['DBClusterArn']
            )
            tags = {tag['Key']: tag['Value'] for tag in tags_response['TagList']}
            if tags.get('reader-scale') == 'yes':
                return cluster
        return None
    except Exception as e:
        raise Exception(f"Failed to describe DB clusters: {e}")

def lambda_handler(event, context):
    state_machine_arn = "<state machine ARN>"  # replace with your state machine ARN
    
    try:
        # Get cluster details
        cluster_details = get_clusters_with_reader_scale()
        if not cluster_details:
            return {'error': "No cluster found with reader-scale = 'yes'"}
        
        cluster_identifier = cluster_details['DBClusterIdentifier']
        
        # Find current writer and reader instance IDs
        current_writer_instance_id = None
        current_reader_instance_id = []
        
        for member in cluster_details['DBClusterMembers']:
            if member['IsClusterWriter']:
                current_writer_instance_id = member['DBInstanceIdentifier']
            else:
                current_reader_instance_id.append(member['DBInstanceIdentifier'])
        
        if not current_writer_instance_id or not current_reader_instance_id:
            return {'error': "Could not determine writer or reader instance IDs."}
        
        # Fetch tags for the cluster
        cluster_tags = {tag['Key']: tag['Value'] for tag in cluster_details['TagList']}
        
        # Get the downgrade and upgrade instance classes from tags
        downgrade_class = cluster_tags.get('downgrade')
        upgrade_class = cluster_tags.get('upgrade')
        
        if not downgrade_class or not upgrade_class:
            return {'error': "Downgrade or upgrade class not specified in tags."}
        
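        # Get the current time as HH:MM. Lambda runs in UTC by default,
        # so the downgrade-time and upgrade-time tags must be given in UTC.
        current_time = datetime.now().strftime("%H:%M")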
        
        # Determine action based on time
        downgrade_time = cluster_tags.get('downgrade-time')
        upgrade_time = cluster_tags.get('upgrade-time')
        
        if current_time == downgrade_time:
            action = 'downgrade'
            instance_id = current_writer_instance_id
            target_instance_id = current_reader_instance_id
        elif current_time == upgrade_time:
            action = 'upgrade'
            instance_id = current_reader_instance_id
            target_instance_id = current_writer_instance_id
        else:
            return {'error': "Current time does not match downgrade or upgrade time."}
        
        # Define parameters for Step Function
        input_params = {
            'cluster_identifier': cluster_identifier,
            'instance_id': instance_id,
            'downgrade_class' : downgrade_class,
            'upgrade_class' : upgrade_class,
            'target_instance_id': target_instance_id,
            'action': action
        }
    
        # Start the Step Function execution
        response = step_functions_client.start_execution(
            stateMachineArn=state_machine_arn,
            input=json.dumps(input_params)
        )
        
        # Convert datetime object to string in response
        response['startDate'] = response['startDate'].isoformat()
        
        return response
    
    except Exception as e:
        return {'error': str(e)}
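
The trigger function above only acts when the current minute is exactly equal to one of the tagged times, so an invocation that arrives a minute late does nothing. One possible alternative, sketched here under assumptions, is to accept a small window after the scheduled time. The function name pick_action and the tolerance_minutes parameter are hypothetical and not part of the original code.

```python
from datetime import datetime, timedelta, timezone


def pick_action(now, downgrade_time, upgrade_time, tolerance_minutes=5):
    """Return 'downgrade', 'upgrade' or None for a timezone-aware UTC `now`.

    Times are 'HH:MM' strings as stored in the cluster tags. Windows that
    cross midnight are not handled in this sketch.
    """

    def within_window(hhmm):
        hour, minute = map(int, hhmm.split(":"))
        scheduled = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
        # Accept invocations from the scheduled minute up to the tolerance.
        return timedelta(0) <= now - scheduled <= timedelta(minutes=tolerance_minutes)

    if within_window(downgrade_time):
        return "downgrade"
    if within_window(upgrade_time):
        return "upgrade"
    return None


# Example: an invocation three minutes after the tagged downgrade time.
example_now = datetime(2024, 1, 1, 22, 3, tzinfo=timezone.utc)
example_action = pick_action(example_now, "22:00", "06:00")
```

A window like this assumes the function is invoked once around each scheduled time. If it runs every minute, the same action would be started several times, so the exact match used above may be the safer choice in that setup.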

Handler Lambda Function

The handler function performs the actual scaling and failover actions based on the input received from Step Functions.

import boto3
from botocore.exceptions import ClientError

rds_client = boto3.client('rds')

def lambda_handler(event, context):
    print("Received event:", event)
    action = event['action']
    cluster_identifier = event.get('cluster_identifier')
    instance_id = event.get('instance_id')
    downgrade_class = event.get('downgrade_class')
    upgrade_class = event.get('upgrade_class')
    target_instance = event.get('target_instance_id')
    resource_type = event.get('resource_type')
    resource_id = event.get('resource_id')
    target_status = event.get('target_status')

    print("resource_id:", resource_id)
    print("target_instance", target_instance)
    print("instance_id", instance_id)

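    # Defaults so that later branches never reference an undefined name
    first_value = None
    target_instance_id = None

    if 'downgrade' in action:
        # target_instance holds the reader IDs: the first is removed, the second is kept
        if not target_instance or len(target_instance) < 2:
            return {'error': "Downgrade requires at least two reader instances."}
        first_value = target_instance[0]
        print("first_value", first_value)
        target_instance_id = target_instance[1]
        print("second_value", target_instance_id)
    elif 'upgrade' in action:
        # instance_id holds the reader IDs: the first one is upgraded
        target_instance_id = target_instance
        instance_id = instance_id[0]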
        
    print("instance_id", instance_id)
    if 'modify' in action:
        if 'upgrade' in action:
            try:
                # Retrieve the details of the existing instance
                instance_details = rds_client.describe_db_instances(DBInstanceIdentifier=instance_id)
                availability_zone = instance_details['DBInstances'][0]['AvailabilityZone']
                engine = instance_details['DBInstances'][0]['Engine']
                print(f"Availability zone of {instance_id}: {availability_zone}")
                print(f"Engine of {instance_id}: {engine}")

                print(f"Modifying instance: {instance_id} to class: {upgrade_class}")
                response = rds_client.modify_db_instance(
                    DBInstanceIdentifier=instance_id,
                    DBInstanceClass=upgrade_class,
                    ApplyImmediately=True
                )

                # Adding a new reader instance
                new_reader_instance_id = f"{cluster_identifier}-instance-3"
                print(f"Adding reader instance: {new_reader_instance_id} with class: {downgrade_class} in availability zone: {availability_zone}")
                response = rds_client.create_db_instance(
                    DBInstanceIdentifier=new_reader_instance_id,
                    DBInstanceClass=downgrade_class,
                    Engine=engine,
                    DBClusterIdentifier=cluster_identifier,
                    AvailabilityZone=availability_zone
                )
                return {
                    'status': 'modification_initiated',
                    'instance_id': instance_id,
                    'instance_class': upgrade_class,
                    'cluster_identifier': cluster_identifier,
                    'target_instance_id': target_instance_id
                }
            except ClientError as e:
                print(f"Modify instance error: {e}")
                return {'error': str(e)}
        
        elif 'downgrade' in action:
            try:
                print(f"Modifying instance: {instance_id} to class: {downgrade_class}")
                response = rds_client.modify_db_instance(
                    DBInstanceIdentifier=instance_id,
                    DBInstanceClass=downgrade_class,
                    ApplyImmediately=True
                )
                # Removing a reader instance
                print(f"Removing reader instance: {first_value}")
                response = rds_client.delete_db_instance(
                    DBInstanceIdentifier=first_value,
                    SkipFinalSnapshot=True
                )
                return {
                    'status': 'modification_initiated',
                    'instance_id': instance_id,
                    'instance_class': downgrade_class,
                    'cluster_identifier': cluster_identifier,
                    'target_instance_id': target_instance_id
                }
            except ClientError as e:
                print(f"Modify instance error: {e}")
                return {'error': str(e)}

    elif action == 'failover':
        try:
            print(f"Initiating failover for cluster: {cluster_identifier} to instance: {instance_id}")
            response = rds_client.failover_db_cluster(
                DBClusterIdentifier=cluster_identifier,
                TargetDBInstanceIdentifier=instance_id
            )
            print(f"Failover response: {response}")
            return {'status': 'failover_initiated', 'cluster_identifier': cluster_identifier}
        except ClientError as e:
            print(f"Failover error: {e}")
            return {'error': str(e)}
    
    elif action == 'check_status':
        try:
            if resource_type == 'instance':
                print(f"Checking instance status: {resource_id} in cluster: {cluster_identifier}")
                response = rds_client.describe_db_instances(DBInstanceIdentifier=resource_id)
                status = response['DBInstances'][0]['DBInstanceStatus']
                if status == 'modifying' or status == 'configuring-enhanced-monitoring':
                    print(f"Instance {resource_id} in progress: current status={status}")
                    return {'status': 'in_progress', 'instance_id': resource_id, 'current_status': status, 'cluster_identifier': cluster_identifier}
            elif resource_type == 'cluster':
                print(f"Checking cluster status: {resource_id}")
                response = rds_client.describe_db_clusters(DBClusterIdentifier=resource_id)
                status = response['DBClusters'][0]['Status']
                if status == 'modifying' or status == 'configuring-enhanced-monitoring':
                    print(f"Cluster {resource_id} in progress: current status={status}")
                    return {'status': 'in_progress', 'instance_id': resource_id, 'current_status': status, 'cluster_identifier': cluster_identifier}
            else:
                return {'error': 'Invalid resource type'}
            
            if status == target_status:
                print(f"Resource {resource_id} is now available")
                return {'status': 'success', 'instance_id': resource_id, 'cluster_identifier': cluster_identifier}
            else:
                print(f"Resource {resource_id} still in progress: current status={status}")
                return {'status': 'in_progress', 'instance_id': resource_id, 'current_status': status, 'cluster_identifier': cluster_identifier}
        except ClientError as e:
            print(f"Check status error: {e}")
            return {'error': str(e)}
    
    return {'status': 'no_action'}
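
The handler receives different shapes for instance_id and target_instance_id depending on the action: for a downgrade, target_instance_id is the list of readers, while for an upgrade, instance_id is the list of readers. As a rough sketch, that selection could be pulled into one small function that is easy to test without AWS. The name resolve_targets and the tuple it returns are assumptions for illustration.

```python
def resolve_targets(action, instance_id, target_instance):
    """Mirror the handler's instance selection, failing loudly on bad input.

    Returns (instance_to_modify, reader_to_remove, target_instance_id).
    """
    if "downgrade" in action:
        # target_instance is the list of readers: first is removed, second is kept.
        readers = list(target_instance or [])
        if len(readers) < 2:
            raise ValueError("downgrade expects at least two reader instance IDs")
        return instance_id, readers[0], readers[1]
    if "upgrade" in action:
        # instance_id is the list of readers: the first one is upgraded.
        readers = list(instance_id or [])
        if not readers:
            raise ValueError("upgrade expects at least one reader instance ID")
        return readers[0], None, target_instance
    # failover and check_status pass instance_id through unchanged.
    return instance_id, None, None


# Example with made-up instance identifiers.
downgrade_plan = resolve_targets("modify/downgrade", "writer-1", ["reader-1", "reader-2"])
upgrade_plan = resolve_targets("modify/upgrade", ["reader-1"], "writer-1")
```

Keeping this logic separate from the boto3 calls means the edge cases, such as a cluster with a single reader, can be checked locally before anything is modified.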

Step 2: Creating the Step Functions

Here is the Step Functions definition to orchestrate the scaling and failover process.

{
  "Comment": "RDS Downgrade/Upgrade State Machine",
  "StartAt": "DetermineAction",
  "States": {
    "DetermineAction": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.action",
          "StringEquals": "downgrade",
          "Next": "ModifyInstanceForDowngrade"
        },
        {
          "Variable": "$.action",
          "StringEquals": "upgrade",
          "Next": "ModifyInstanceForUpgrade"
        }
      ],
      "Default": "NoActionRequired"
    },
    "ModifyInstanceForDowngrade": {
      "Type": "Task",
      "Resource": "<Handler Lambda ARN>",
      "Parameters": {
        "action": "modify/downgrade",
        "instance_id.$": "$.instance_id",
        "downgrade_class.$": "$.downgrade_class",
        "upgrade_class.$": "$.upgrade_class",
        "cluster_identifier.$": "$.cluster_identifier",
        "target_instance_id.$": "$.target_instance_id"
      },
      "Next": "EndState"
    },
    "ModifyInstanceForUpgrade": {
      "Type": "Task",
      "Resource": "<Handler Lambda ARN>",
      "Parameters": {
        "action": "modify/upgrade",
        "instance_id.$": "$.instance_id",
        "downgrade_class.$": "$.downgrade_class",
        "upgrade_class.$": "$.upgrade_class",
        "cluster_identifier.$": "$.cluster_identifier",
        "target_instance_id.$": "$.target_instance_id"
      },
      "Next": "WaitForModification"
    },
    "WaitForModification": {
      "Type": "Wait",
      "Seconds": 300,
      "Next": "CheckInstanceStatus"
    },
    "CheckInstanceStatus": {
      "Type": "Task",
      "Resource": "<Handler Lambda ARN>",
      "Parameters": {
        "action": "check_status",
        "resource_type": "instance",
        "resource_id.$": "$.instance_id",
        "cluster_identifier.$": "$.cluster_identifier",
        "target_status": "available"
      },
      "Next": "IsInstanceAvailable"
    },
    "IsInstanceAvailable": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.status",
          "StringEquals": "success",
          "Next": "InitiateFailover"
        },
        {
          "Variable": "$.status",
          "StringEquals": "in_progress",
          "Next": "WaitForModification"
        },
        {
          "Variable": "$.current_status",
          "StringEquals": "configuring-enhanced-monitoring",
          "Next": "WaitForModification"
        }
      ],
      "Default": "WaitForModification"
    },
    "InitiateFailover": {
      "Type": "Task",
      "Resource": "<Handler Lambda ARN>",
      "Parameters": {
        "action": "failover",
        "cluster_identifier.$": "$.cluster_identifier",
        "instance_id.$": "$.instance_id"
      },
      "Next": "EndState"
    },
    "EndState": {
      "Type": "Succeed"
    },
    "NoActionRequired": {
      "Type": "Succeed"
    }
  }
}
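
When editing a definition like the one above by hand, it is easy to rename a state and leave a stale Next or Default pointing at the old name. The following is a small, hedged sketch of a local check that can be run before deploying. The function name find_dangling_transitions is hypothetical, and the check covers only the StartAt, Next, Default and Choices fields used in this definition.

```python
def find_dangling_transitions(definition):
    """Return (state, target) pairs whose target is not a defined state."""
    states = definition["States"]
    dangling = []

    def check(source, target):
        if target is not None and target not in states:
            dangling.append((source, target))

    check("StartAt", definition.get("StartAt"))
    for name, state in states.items():
        check(name, state.get("Next"))
        check(name, state.get("Default"))
        for choice in state.get("Choices", []):
            check(name, choice.get("Next"))
    return dangling


# A trimmed-down definition with one deliberate mistake: EndState is missing.
example = {
    "StartAt": "DetermineAction",
    "States": {
        "DetermineAction": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.action", "StringEquals": "upgrade", "Next": "ModifyInstanceForUpgrade"}
            ],
            "Default": "NoActionRequired",
        },
        "ModifyInstanceForUpgrade": {"Type": "Task", "Next": "EndState"},
        "NoActionRequired": {"Type": "Succeed"},
    },
}

print(find_dangling_transitions(example))
# prints [('ModifyInstanceForUpgrade', 'EndState')]
```

To check the real definition, load it with json.load and pass the resulting dictionary to the function. An empty list means every transition points at a state that exists.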

How It Works:

  1. Trigger Lambda Function: Determines the current action (upgrade or downgrade) based on the time and cluster tags.
  2. Step Functions: Manages the orchestration of scaling actions and failover.
  3. Handler Lambda Function: Executes the RDS instance modifications and failover operations.
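
This post does not show how the trigger function gets invoked at the scheduled times. One common option, assumed here rather than taken from the original setup, is an Amazon EventBridge rule with a cron schedule for each tagged time. The sketch below converts an HH:MM value into an EventBridge cron expression; the function names and the target Id are placeholders.

```python
def hhmm_to_cron(hhmm):
    """Convert 'HH:MM' (UTC) into an EventBridge cron expression that fires daily."""
    hour, minute = (int(part) for part in hhmm.split(":"))
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError(f"invalid time: {hhmm!r}")
    # EventBridge cron fields: minutes hours day-of-month month day-of-week year
    return f"cron({minute} {hour} * * ? *)"


def schedule_trigger(rule_name, hhmm, lambda_arn):
    """Create a daily rule that targets the trigger Lambda. Needs boto3 and credentials."""
    import boto3  # imported here so hhmm_to_cron works without boto3 installed

    events = boto3.client("events")
    events.put_rule(Name=rule_name, ScheduleExpression=hhmm_to_cron(hhmm), State="ENABLED")
    events.put_targets(Rule=rule_name, Targets=[{"Id": "trigger-lambda", "Arn": lambda_arn}])


downgrade_cron = hhmm_to_cron("22:30")
```

The Lambda function also needs a resource-based permission that allows EventBridge to invoke it, which is not shown in this sketch.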

Key Steps in the State Machine

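  • DetermineAction: Routes the execution to the downgrade or upgrade branch based on the action field of the input.
  • ModifyInstanceForDowngrade/Upgrade: Calls the handler function to change the instance class. The downgrade branch ends here; the upgrade branch continues to the wait loop.
  • WaitForModification: Pauses for 300 seconds before each status check.
  • CheckInstanceStatus: Calls the handler function to read the current status of the instance.
  • IsInstanceAvailable: Moves on to the failover once the instance is available; otherwise returns to WaitForModification.
  • InitiateFailover: Fails the cluster over to the upgraded instance.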
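
The wait-and-check loop in the state machine has no upper limit: as long as the instance is not available, the execution keeps returning to WaitForModification. The same loop with a cap on the number of attempts can be sketched in plain Python. This is an illustration of the idea only, not a replacement for the state machine; the function name, the max_attempts parameter and the injected fetch_status and sleep callables are all assumptions.

```python
import time


def wait_until_available(fetch_status, target_status="available",
                         max_attempts=12, delay_seconds=300, sleep=time.sleep):
    """Wait, then check, up to max_attempts times.

    Returns the number of checks it took. Raises TimeoutError if the
    target status is never reached.
    """
    for attempt in range(1, max_attempts + 1):
        sleep(delay_seconds)        # corresponds to WaitForModification
        status = fetch_status()     # corresponds to CheckInstanceStatus
        if status == target_status:
            return attempt          # the state machine would go to InitiateFailover
    raise TimeoutError(f"status was still {status!r} after {max_attempts} checks")


# Simulated run: the instance becomes available on the third check.
statuses = iter(["modifying", "modifying", "available"])
checks_needed = wait_until_available(lambda: next(statuses), sleep=lambda _: None)
```

In Step Functions itself, a similar bound could be added by tracking an attempt counter in the state input or by setting the top-level TimeoutSeconds field on the state machine.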

Conclusion

By leveraging AWS Lambda and Step Functions, you can automate the scaling and failover of RDS instances, ensuring efficient resource utilization and minimizing manual intervention. This approach not only saves time but also enhances the reliability and availability of your database services.
Start automating your RDS instance scaling and failover with AWS Lambda and Step Functions today! Streamline operations, reduce costs, and improve database reliability.
