EC2 Instance Management (Start and Stop) Based on Memory Utilization Exceeding Threshold Value

In cloud environments, managing EC2 instances effectively is essential to ensure system reliability and performance. One of the key aspects of EC2 instance management is monitoring resource usage, particularly memory utilization. If memory utilization exceeds a threshold, it may be necessary to take corrective actions, such as stopping and restarting the instance. This article demonstrates how to automate the process of monitoring EC2 memory usage, checking for connectivity, and managing instances using AWS Lambda and Step Functions.

Overview of the Solution

The solution automates the following workflow:

Monitor EC2 Instances: Periodically check the memory utilization of EC2 instances.
Check Memory Utilization: If memory usage exceeds a defined threshold, the workflow waits five minutes and checks again before acting.
SSM Connection Check: If memory is still high but the instance is reachable via AWS Systems Manager (SSM), no action is taken. If it is not reachable, the instance is stopped and then started again.
Use AWS Step Functions: The entire process is orchestrated using AWS Step Functions, allowing the automation of the flow from memory checking to stopping/restarting EC2 instances.
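
The decision at the heart of this workflow can be sketched as a small pure function. This is an illustrative reduction, not the Lambda code itself: the name decide_action and its return values are hypothetical, the threshold is passed in rather than read from configuration, and a reading equal to the threshold is treated as acceptable.

```python
from typing import Optional


def decide_action(memory_percent: float, threshold: float,
                  ssm_reachable: Optional[bool] = None) -> str:
    """Return the next step for one instance.

    ssm_reachable is None on the first pass (before the wait) and a
    bool on the second pass, once the SSM connection check has run.
    """
    if memory_percent <= threshold:
        return "none"      # memory is fine, nothing to do
    if ssm_reachable is None:
        return "wait"      # high memory: wait, then look again
    if ssm_reachable:
        return "none"      # still high, but the instance responds
    return "restart"       # high and unreachable: stop, then start
```

Keeping the decision separate from the AWS calls makes it easy to check the branches without touching a real instance.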

Architecture

The solution involves the following AWS services:

AWS Lambda: Used for writing custom code to monitor EC2 instances, check memory utilization, manage instance states (stop/start), and interact with AWS Systems Manager.
AWS Step Functions: Orchestrates the flow. It handles the waits, polls the instance state, and decides when the instance is started again.
Amazon EC2: The core compute resource being monitored and managed.
Amazon CloudWatch: Supplies the metrics for the EC2 instances. CPU utilization is published by EC2 itself, but memory utilization is not; it has to be collected on the instance, for example by the CloudWatch agent.
AWS Systems Manager (SSM): Used to check if the instance is accessible remotely before performing any shutdown or restart.

Lambda Function to Monitor EC2 Instances

The Lambda function is responsible for:

Checking Memory Utilization: It checks the memory utilization of an EC2 instance.
Stopping or Starting Instances: If memory utilization is still above the threshold after the wait and the instance cannot be reached via SSM, it stops the instance. Once the instance reports the stopped state, it starts it again.
Handling EC2 Instances Based on Action: The action field of the incoming event selects what the function does, and the function in turn starts Step Functions executions to schedule the wait and the stop/start sequence.
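
The handler is driven by the shape of its input event. The examples below illustrate the four cases it distinguishes; the instance ID is a made-up placeholder, and missing_instance_id is a hypothetical helper that mirrors the validation at the top of the handler.

```python
# Illustrative events; "i-0123456789abcdef0" is a placeholder ID.
SCAN_EVENT = {}  # no action: scan all running instances
RECHECK_EVENT = {"action": "wait5min", "instance_id": "i-0123456789abcdef0"}
STATUS_EVENT = {"action": "check_status", "instance_id": "i-0123456789abcdef0"}
START_EVENT = {"action": "start", "instance_id": "i-0123456789abcdef0"}


def missing_instance_id(event: dict) -> bool:
    """True when an action is requested without an instance_id."""
    return bool(event.get("action")) and not event.get("instance_id")
```

An event without an action, such as one sent by a scheduled trigger, takes the scan path; the other three are produced by the state machine or by the function itself.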

import boto3
import time
from datetime import datetime, timedelta

ec2_client = boto3.client('ec2')
ssm_client = boto3.client('ssm')
cloudwatch = boto3.client('cloudwatch')
stepfunctions_client = boto3.client('stepfunctions')

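MEMORY_THRESHOLD = 4  # memory utilization threshold, in percent
WAIT_TIME = 300  # seconds; matches the Wait states in the state machine
STEP_FUNCTION_ARN = 'arn'  # placeholder for the state machine ARN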

def lambda_handler(event, context):
    print("Received event:", event)
    action = event.get('action')

    if action:
        # Every explicit action targets a single instance.
        instance_id = event.get('instance_id')
        if not instance_id:
            print("Instance ID is required for specific actions.")
            return {'error': "Instance ID is required for this action."}

        if action == 'check_status':
            instance_status = check_instance_status(instance_id)
            # This execution carries the status as its action, so it takes
            # the Default branch of the state machine and ends.
            trigger_step_function(instance_id, instance_status)
            return {'status': instance_status, 'instance_id': instance_id}

        elif action == 'wait5min':
            # Second look, after the state machine has waited.
            memory_utilization = get_memory_utilization(instance_id)

            if memory_utilization > MEMORY_THRESHOLD:
                if can_connect_via_ssm(instance_id):
                    print(f"Connection via SSM successful for instance {instance_id}, no reboot needed.")
                    return {'message': 'Instance reachable via SSM, no reboot needed'}

                ec2_client.stop_instances(InstanceIds=[instance_id])
                print(f"Memory utilization remains high, and instance {instance_id} is unreachable, stopping instance.")
                action = 'stop'
                trigger_step_function(instance_id, action)
                return {'action': 'stop', 'instance_id': instance_id}
            else:
                print(f"Memory utilization has decreased to {memory_utilization}% for instance {instance_id}")
                return {'message': 'Memory utilization below threshold, no stop action taken'}

        elif action == 'start':
            ec2_client.start_instances(InstanceIds=[instance_id])
            return {'action': 'start', 'instance_id': instance_id}

        # Any other action value is not handled.
        return {'error': f"Unknown action: {action}"}

    else:
        # No action given: scan all running instances.
        instances = get_all_instances()
        if not instances:
            print("No running EC2 instances found.")
            return {'message': "No action required"}

        for instance_id in instances:
            print(f"Checking instance: {instance_id}")
            memory_utilization = get_memory_utilization(instance_id)

            if memory_utilization < MEMORY_THRESHOLD:
                print(f"Memory utilization is below threshold ({memory_utilization}%) for instance {instance_id}")
            else:
                print(f"Memory utilization is high ({memory_utilization}%) for instance {instance_id}, waiting for {WAIT_TIME} seconds...")
                action = 'wait5min'
                trigger_step_function(instance_id, action)

def get_all_instances():
    """Return the IDs of all running instances in the current region."""
    try:
        # Paginate so that accounts with many instances are fully covered.
        paginator = ec2_client.get_paginator('describe_instances')
        pages = paginator.paginate(
            Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
        )
        return [
            instance['InstanceId']
            for page in pages
            for reservation in page['Reservations']
            for instance in reservation['Instances']
        ]
    except Exception as e:
        print(f"Error retrieving instances: {e}")
        return []

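def get_memory_utilization(instance_id):
    # Fetch the memory utilization metric from CloudWatch.
    # EC2 publishes no memory metric in AWS/EC2, so the values below ASSUME
    # the CloudWatch agent with its default namespace and Linux metric name,
    # and InstanceId as the only dimension. Adjust them to your agent setup.
    now = datetime.utcnow()
    response = cloudwatch.get_metric_statistics(
        Namespace='CWAgent',
        MetricName='mem_used_percent',
        Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
        StartTime=now - timedelta(minutes=10),
        EndTime=now,
        Period=60,
        Statistics=['Average']
    )
    datapoints = response['Datapoints']
    if datapoints:
        # Datapoints are not returned in time order, so pick the newest.
        latest = max(datapoints, key=lambda point: point['Timestamp'])
        return latest['Average']
    else:
        print(f"No memory utilization data found for instance {instance_id}.")
        return 0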

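def can_connect_via_ssm(instance_id):
    """Run a trivial shell command through SSM and report whether it succeeded."""
    try:
        # AWS-RunShellScript targets Linux instances.
        response = ssm_client.send_command(
            InstanceIds=[instance_id],
            DocumentName="AWS-RunShellScript",
            Parameters={'commands': ["echo 'test connection'"]}
        )
        command_id = response['Command']['CommandId']
        # Poll a few times: the command can still be pending after one short
        # sleep. The Lambda timeout must allow for up to 15 seconds here.
        for _ in range(3):
            time.sleep(5)
            command_status = ssm_client.get_command_invocation(
                CommandId=command_id,
                InstanceId=instance_id
            )
            if command_status['Status'] not in ('Pending', 'InProgress', 'Delayed'):
                break
        return command_status['Status'] == 'Success'
    except Exception as e:
        print(f"Error checking SSM connection for instance {instance_id}: {e}")
        return False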

def check_instance_status(instance_id):
    try:
        response = ec2_client.describe_instance_status(
            InstanceIds=[instance_id],
            IncludeAllInstances=True
        )
        instance_status = response['InstanceStatuses'][0]['InstanceState']['Name']
        return instance_status
    except Exception as e:
        print(f"Error checking instance status for {instance_id}: {e}")
        return "Unknown"

def trigger_step_function(instance_id, action):
    try:
        response = stepfunctions_client.start_execution(
            stateMachineArn=STEP_FUNCTION_ARN,
            input=f'{{"instance_id": "{instance_id}", "action": "{action}"}}'
        )
        print("Step Function started successfully:", response['executionArn'])
    except Exception as e:
        print("Error starting Step Function:", e)
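
Building the execution input with an f-string works for well-formed instance IDs, but any value containing a quote or backslash would produce invalid JSON. A sketch of the same payload built with json.dumps follows; build_execution_input is a hypothetical helper name, and its result would be passed as the input argument of start_execution.

```python
import json


def build_execution_input(instance_id: str, action: str) -> str:
    """Serialize the Step Functions execution input as a JSON string."""
    return json.dumps({"instance_id": instance_id, "action": action})
```

The serializer handles escaping, so the state machine always receives a parseable document regardless of the values passed in.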

AWS Step Function for Orchestrating Actions

Each execution starts at a Choice state that routes on the action field. The wait5min action waits five minutes and re-invokes the Lambda function. The stop action waits, polls the instance status until it reads stopped, and then starts the instance. The checkInstanceStatus task sets OutputPath to $.Payload so that the checkstate Choice can read $.status and $.instance_id from the function's return value. The Lambda ARNs are left truncated here and must be replaced with the ARN of your function.

{
  "Comment": "EC2 instance memory and status check workflow",
  "StartAt": "Choice",
  "States": {
    "Choice": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.action",
          "StringEquals": "stop",
          "Next": "Wait"
        },
        {
          "Variable": "$.action",
          "StringEquals": "wait5min",
          "Next": "Wait 5 minutes"
        }
      ],
      "Default": "End"
    },
    "Wait 5 minutes": {
      "Type": "Wait",
      "Seconds": 300,
      "Next": "Lambda Invoke"
    },
    "Lambda Invoke": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "arn:aws:lambda:",
        "Payload.$": "$"
      },
      "End": true
    },
    "Wait": {
      "Type": "Wait",
      "Seconds": 300,
      "Next": "checkInstanceStatus"
    },
    "checkInstanceStatus": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "arn:aws:lambda:",
        "Payload": {
          "action": "check_status",
          "instance_id.$": "$.instance_id"
        }
      },
      "OutputPath": "$.Payload",
      "Next": "checkstate"
    },
    "checkstate": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.status",
          "StringEquals": "stopping",
          "Next": "Wait"
        },
        {
          "Variable": "$.status",
          "StringEquals": "stopped",
          "Next": "startInstance"
        }
      ],
      "Default": "End"
    },
    "startInstance": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:",
      "Parameters": {
        "action": "start",
        "instance_id.$": "$.instance_id"
      },
      "End": true
    },
    "End": {
      "Type": "Succeed"
    }
  }
}
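
When a Task state uses the arn:aws:states:::lambda:invoke integration, the function's return value arrives under a Payload key next to invocation metadata, so a following Choice state that reads $.status only sees it once the payload has been unwrapped (for instance with OutputPath set to $.Payload). The sketch below simulates that shape in plain Python to make the difference visible. It is a toy model, not AWS code: wrap_like_lambda_invoke, unwrap_payload, and next_state are hypothetical names, and only the fields relevant here are modeled.

```python
def wrap_like_lambda_invoke(result: dict) -> dict:
    """Mimic the envelope the lambda:invoke integration returns."""
    return {"StatusCode": 200, "Payload": result}


def unwrap_payload(task_output: dict) -> dict:
    """Mimic the effect of OutputPath = "$.Payload"."""
    return task_output["Payload"]


def next_state(state_input: dict) -> str:
    """Mimic the checkstate Choice: route on the top-level status field."""
    routes = {"stopping": "Wait", "stopped": "startInstance"}
    return routes.get(state_input.get("status"), "End")


lambda_result = {"status": "stopped", "instance_id": "i-0123456789abcdef0"}
wrapped = wrap_like_lambda_invoke(lambda_result)

print(next_state(wrapped))                  # status is hidden under Payload
print(next_state(unwrap_payload(wrapped)))  # status is visible at the top level
```

In this toy model the wrapped output simply falls through to End. A real state machine is stricter: a Choice rule whose Variable path cannot be resolved would typically fail the execution rather than take the Default branch.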

Conclusion

By integrating AWS Lambda and Step Functions, we can automate the management of EC2 instances based on memory utilization. The solution detects high memory usage, checks whether the instance still responds via SSM, and stops and starts it when it does not. This reduces manual intervention, provided that the instances publish a memory metric to CloudWatch and are registered with Systems Manager.