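[DevOps Cloud Practice] IaaC: Monitoring Servers with the CloudWatch Agent and Custom Metrics
There are many third-party tools for monitoring EC2 instances. Even so, I want to give you all the information you need to quickly set up EC2 monitoring using AWS-native resources and tools (AWS CloudWatch, the CloudWatch Agent, and CloudFormation).
First, you should know that after you launch an EC2 instance, only the following metrics are available in CloudWatch by default: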
- CPUCreditBalance
- NetworkPacketsIn
- NetworkOut
- DiskReadOps
- StatusCheckFailed_Instance
- DiskReadBytes
- NetworkIn
- StatusCheckFailed
- NetworkPacketsOut
- DiskWriteBytes
- CPUSurplusCreditsCharged
- CPUCreditUsage
- CPUSurplusCreditBalance
- DiskWriteOps
- StatusCheckFailed_System
- CPUUtilization
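If you want to monitor metrics such as disk space and memory, you must install the CloudWatch Agent on the instance. The agent periodically sends these custom metrics to AWS CloudWatch.
Below you will find all the necessary scripts and CloudFormation templates, which should let you set everything up in a few minutes. Even if you are new to AWS and have never used AWS CloudWatch, you can deploy with these templates in one click.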
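Steps
First, let's briefly describe each step:
Attach an IAM role to the EC2 instance. This allows the CloudWatch agent installed on the instance to send custom metrics to AWS CloudWatch.
Using the examples provided, create the JSON config files and save them to S3. These config files are used during the CloudWatch agent installation and tell the agent which custom metrics to send from the instance to AWS CloudWatch.
Run the provided bash or PowerShell script to install the CloudWatch agent.
Create SNS topics and add subscribers, who will be notified when a metric crosses its threshold.
Create the CloudWatch alarms with the CloudFormation templates.
Step #1:
Make sure the IAM role attached to your EC2 instance includes the AWS managed policy CloudWatchAgentServerPolicy.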
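As a rough illustration of what Step #1 involves, the sketch below builds the two pieces an instance role needs: a trust policy that lets the EC2 service assume the role, and the ARN of the managed policy to attach. The function and constant names are hypothetical, and generating these documents in Python is an assumption made for illustration; you would still create the role through the console, the CLI, or CloudFormation.

```python
import json

# ARN of the AWS managed policy named in Step #1 (standard "aws" partition).
CLOUDWATCH_AGENT_POLICY_ARN = "arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"


def ec2_trust_policy() -> dict:
    """Return a trust policy that lets the EC2 service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ec2.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


if __name__ == "__main__":
    # Print both pieces so they can be pasted into whatever tool creates the role.
    print(json.dumps(ec2_trust_policy(), indent=2))
    print(CLOUDWATCH_AGENT_POLICY_ARN)
```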
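Step #2:
Create the JSON config files using the information provided below. There are two files because the Linux and Windows syntax differs slightly. Save both files to an S3 bucket and make sure they are publicly accessible. Keep the URL of each file; you will need it later.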
cloudwatchagent-linux-config.json:
{
  "metrics": {
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}"
    },
    "metrics_collected": {
      "disk": {
        "measurement": [
          "used_percent"
        ],
        "metrics_collection_interval": 300,
        "resources": [
          "*"
        ]
      },
      "mem": {
        "measurement": [
          "mem_used_percent"
        ],
        "metrics_collection_interval": 300
      }
    }
  }
}
cloudwatchagent-windows-config.json:
{
  "metrics": {
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}"
    },
    "metrics_collected": {
      "LogicalDisk": {
        "measurement": [
          "% Free Space"
        ],
        "metrics_collection_interval": 300,
        "resources": [
          "*"
        ]
      },
      "Memory": {
        "measurement": [
          "% Committed Bytes In Use"
        ],
        "metrics_collection_interval": 300
      }
    }
  }
}
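Before uploading either file, it can help to confirm that it is valid JSON and actually lists the measurements you expect. The following is a minimal sketch of such a check, assuming only the `metrics` / `metrics_collected` / `measurement` layout shown above; the function name is hypothetical and this is not part of the agent's own tooling.

```python
import json


def collected_measurements(config_text: str) -> dict:
    """Parse an agent config and return {section name: [measurements]}.

    Raises ValueError if the text is not valid JSON or lacks the
    metrics.metrics_collected layout used in this article.
    """
    config = json.loads(config_text)  # JSONDecodeError is a ValueError
    try:
        sections = config["metrics"]["metrics_collected"]
    except (KeyError, TypeError) as exc:
        raise ValueError("config has no metrics.metrics_collected block") from exc

    result = {}
    for name, body in sections.items():
        measurements = body.get("measurement", [])
        if not measurements:
            raise ValueError(f"section {name!r} lists no measurements")
        result[name] = list(measurements)
    return result


if __name__ == "__main__":
    sample = """
    {"metrics": {"metrics_collected": {
        "disk": {"measurement": ["used_percent"], "resources": ["*"]},
        "mem": {"measurement": ["mem_used_percent"]}}}}
    """
    print(collected_measurements(sample))
```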
Step #3:
To install the CloudWatch Agent on Linux (Ubuntu, CentOS, Amazon Linux) and Windows, use the following user data scripts.
For the Linux user data scripts, replace https://YOUR-PUBLIC-URL-HERE-WITH-CONFIG-FILE (see line 9) with the S3 URL of your Linux config file, cloudwatchagent-linux-config.json.
install-cloudwatchagent-ubuntu-userdata.sh
#!/bin/bash
mkdir tempcloudwatch
cd tempcloudwatch
apt install wget -y
apt install unzip -y
wget https://s3.amazonaws.com/amazoncloudwatch-agent/linux/amd64/latest/AmazonCloudWatchAgent.zip
unzip AmazonCloudWatchAgent.zip
sudo ./install.sh
wget https://YOUR-PUBLIC-URL-HERE-WITH-CONFIG-FILE -O config.json
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:config.json -s
install-cloudwatchagent-centos-and-amazonlin-userdata.sh
#!/bin/bash
mkdir tempcloudwatch
cd tempcloudwatch
yum install wget -y
yum install unzip -y
wget https://s3.amazonaws.com/amazoncloudwatch-agent/linux/amd64/latest/AmazonCloudWatchAgent.zip
unzip AmazonCloudWatchAgent.zip
sudo ./install.sh
wget https://YOUR-PUBLIC-URL-HERE-WITH-CONFIG-FILE -O config.json
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:config.json -s
For the Windows user data script, replace https://YOUR-PUBLIC-URL-HERE-WITH-CONFIG-FILE (see line 13) with the S3 URL of your Windows config file, cloudwatchagent-windows-config.json.
<powershell>
mkdir "c:\cwagent"
wget "https://s3.amazonaws.com/amazoncloudwatch-agent/windows/amd64/latest/AmazonCloudWatchAgent.zip" -OutFile "C:\cwagent\cwagent.zip"
Add-Type -AssemblyName System.IO.Compression.FileSystem
function Unzip
{
param([string]$zipfile, [string]$outpath)
[System.IO.Compression.ZipFile]::ExtractToDirectory($zipfile, $outpath)
}
Unzip "C:\cwagent\cwagent.zip" "C:\cwagent"
cd "C:\cwagent"
.\install.ps1
wget https://YOUR-PUBLIC-URL-HERE-WITH-CONFIG-FILE -OutFile "C:\Program Files\Amazon\AmazonCloudWatchAgent\config.json"
cd "C:\Program Files\Amazon\AmazonCloudWatchAgent"
.\amazon-cloudwatch-agent-ctl.ps1 -a fetch-config -m ec2 -c file:config.json -s
</powershell>
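Editing the placeholder by hand in every script is easy to get wrong, so one option is to substitute it programmatically before pasting the script into the instance's user data. The sketch below shows that substitution under stated assumptions: the function name is hypothetical, and the bucket URL in the demo is a made-up example, not a real location.

```python
# Placeholder string used in the user data scripts in this article.
PLACEHOLDER = "https://YOUR-PUBLIC-URL-HERE-WITH-CONFIG-FILE"


def render_user_data(template: str, config_url: str) -> str:
    """Return the user data script with the placeholder replaced by config_url."""
    if PLACEHOLDER not in template:
        raise ValueError("template does not contain the placeholder URL")
    if not config_url.startswith("https://"):
        raise ValueError("config_url should be an https:// URL")
    return template.replace(PLACEHOLDER, config_url)


if __name__ == "__main__":
    # Hypothetical URL, for illustration only.
    url = "https://example-bucket.s3.amazonaws.com/cloudwatchagent-linux-config.json"
    line = f"wget {PLACEHOLDER} -O config.json"
    print(render_user_data(line, url))
```

The same function works for the Linux and Windows scripts, since both use the same placeholder string.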
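Step #4: Create two SNS topics
Create two AWS SNS topics: one for CRITICAL alarms and one for WARNING alarms.
Step #5: Create the CloudWatch alarms with the CloudFormation templates.
The following CloudFormation templates let you create CloudWatch alarms for CPU, memory, system status, instance status, and disk space. Each alarm comes in two types, WARNING and CRITICAL. The difference is simple: each type can have its own SNS topic, and the WARNING threshold is lower than the CRITICAL threshold. Essentially, critical alarms should page the on-call engineer, while warning alarms should only send an email.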
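To make the WARNING / CRITICAL split concrete, here is a simplified model of how the two CPU alarms in this article's templates relate: WARNING fires at an average of 90% or more for one period, CRITICAL at 95% or more for two consecutive periods. This is only a sketch with hypothetical function names; it ignores missing data and other details of how CloudWatch actually evaluates datapoints.

```python
from typing import Sequence


def breaching(values: Sequence[float], threshold: float, evaluation_periods: int) -> bool:
    """True when the most recent `evaluation_periods` values are all >= threshold."""
    if len(values) < evaluation_periods:
        return False
    recent = values[-evaluation_periods:]
    return all(v >= threshold for v in recent)


def cpu_severity(period_averages: Sequence[float]) -> str:
    """Classify CPU averages (oldest first) as OK, WARNING, or CRITICAL."""
    # CRITICAL alarm: Threshold 95, EvaluationPeriods 2.
    if breaching(period_averages, 95, 2):
        return "CRITICAL"
    # WARNING alarm: Threshold 90, EvaluationPeriods 1.
    if breaching(period_averages, 90, 1):
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    for history in ([40, 60], [50, 96], [96, 97]):
        print(history, cpu_severity(history))
```

A single spike to 96% therefore only warns, while two such periods in a row escalate to CRITICAL.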
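As stack parameters, you need to provide the instance ID, the instance name, and the ARNs of the SNS topics from Step #4. For the disk space stack, you also need to provide the name of the disk. For the stack name, I suggest using the instance name, because the CloudWatch alarm names will then follow the format EC2 instance name, metric type, random suffix (for example: SERVER01-DiskSpaceWARNING-ABCDEFEFEFAE).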
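If you create the stacks from a script rather than the console, the parameters are typically passed as a list of ParameterKey / ParameterValue pairs. The sketch below assembles that list for the CPU / memory / status check templates, using the parameter names from the templates in this article; the function name and all values in the demo (instance ID, account number, topic names) are hypothetical.

```python
import json


def stack_parameters(instance_id: str, instance_name: str,
                     critical_sns_arn: str, warning_sns_arn: str) -> list:
    """Build the parameter list for the CPU / memory / status check stacks."""
    # The templates limit instancename to 1-50 characters.
    if not 1 <= len(instance_name) <= 50:
        raise ValueError("instance name must be 1-50 characters long")
    for arn in (critical_sns_arn, warning_sns_arn):
        parts = arn.split(":")
        # Expected shape: arn:<partition>:sns:<region>:<account>:<topic>
        if len(parts) < 6 or parts[0] != "arn" or parts[2] != "sns":
            raise ValueError(f"not an SNS topic ARN: {arn}")
    values = {
        "instanceid": instance_id,
        "instancename": instance_name,
        "criticalsnsarn": critical_sns_arn,
        "warningsnsarn": warning_sns_arn,
    }
    return [{"ParameterKey": k, "ParameterValue": v} for k, v in values.items()]


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    params = stack_parameters(
        "i-0123456789abcdef0",
        "SERVER01",
        "arn:aws:sns:us-east-1:123456789012:critical-alerts",
        "arn:aws:sns:us-east-1:123456789012:warning-alerts",
    )
    print(json.dumps(params, indent=2))
```

The disk space templates take additional parameters (such as the volume), which would be appended to the same list.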
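Finally
Because every CloudFormation resource has its DeletionPolicy set to Retain, you can delete the stack once its status changes to CREATE_COMPLETE. None of the CloudWatch alarms will be deleted.
CloudFormation template for Linux - CPU / Memory / Status Checks: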
AWSTemplateFormatVersion: '2010-09-09'
Description: Linux CloudWatch Alarms - CPU Memory Instance and System Status
#------------------------------------------------------------------------------
Parameters:
#------------------------------------------------------------------------------
  instanceid:
    Description: "Choose an instance id"
    Type: AWS::EC2::Instance::Id
  instancename:
    Description: "Please provide EC2 instance name"
    Type: "String"
    MinLength: '1'
    MaxLength: '50'
  criticalsnsarn:
    Description: "Please provide an ARN of SNS topic - CRITICAL Type"
    Type: "String"
  warningsnsarn:
    Description: "Please provide an ARN of SNS topic - WARNING Type"
    Type: "String"
#------------------------------------------------------------------------------
Resources:
#------------------------------------------------------------------------------
  CPUAlarmWARNING:
    Type: AWS::CloudWatch::Alarm
    DeletionPolicy: Retain
    Properties:
      AlarmDescription: !Sub "${instancename} - ${instanceid} - High CPU Usage 90%"
      AlarmActions:
        - !Ref warningsnsarn
      OKActions:
        - !Ref warningsnsarn
      MetricName: CPUUtilization
      Namespace: AWS/EC2
      Statistic: Average
      Period: '900'
      EvaluationPeriods: '1'
      Threshold: '90'
      ComparisonOperator: GreaterThanOrEqualToThreshold
      Dimensions:
        - Name: InstanceId
          Value: !Ref instanceid
#------------------------------------------------------------------------------
  CPUAlarmCRITICAL:
    Type: AWS::CloudWatch::Alarm
    DeletionPolicy: Retain
    Properties:
      AlarmDescription: !Sub "${instancename} - ${instanceid} - High CPU Usage 95%"
      AlarmActions:
        - !Ref criticalsnsarn
      OKActions:
        - !Ref criticalsnsarn
      MetricName: CPUUtilization
      Namespace: AWS/EC2
      Statistic: Average
      Period: '900'
      EvaluationPeriods: '2'
      Threshold: '95'
      ComparisonOperator: GreaterThanOrEqualToThreshold
      Dimensions:
        - Name: InstanceId
          Value: !Ref instanceid
#------------------------------------------------------------------------------
  MemoryAlarmWARNING:
    Type: AWS::CloudWatch::Alarm
    DeletionPolicy: Retain
    Properties:
      AlarmDescription: !Sub "${instancename} - ${instanceid} - High Memory Usage 90%"
      AlarmActions:
        - !Ref warningsnsarn
      OKActions:
        - !Ref warningsnsarn
      MetricName: "mem_used_percent"
      Namespace: CWAgent
      Statistic: Average
      Period: '900'
      EvaluationPeriods: '1'
      Threshold: '90'
      ComparisonOperator: GreaterThanOrEqualToThreshold
      Dimensions:
        - Name: InstanceId
          Value: !Ref instanceid
#------------------------------------------------------------------------------
  MemoryAlarmCRITICAL:
    Type: AWS::CloudWatch::Alarm
    DeletionPolicy: Retain
    Properties:
      AlarmDescription: !Sub "${instancename} - ${instanceid} - High Memory Usage 95%"
      AlarmActions:
        - !Ref criticalsnsarn
      OKActions:
        - !Ref criticalsnsarn
      MetricName: "mem_used_percent"
      Namespace: CWAgent
      Statistic: Average
      Period: '900'
      EvaluationPeriods: '2'
      Threshold: '95'
      ComparisonOperator: GreaterThanOrEqualToThreshold
      Dimensions:
        - Name: InstanceId
          Value: !Ref instanceid
#------------------------------------------------------------------------------
  SystemStatusAlarmCRITICAL:
    Type: AWS::CloudWatch::Alarm
    DeletionPolicy: Retain
    Properties:
      AlarmDescription: !Sub "${instancename} - ${instanceid} - instance recovery process has been triggered because of failed System Status Check"
      Namespace: AWS/EC2
      MetricName: StatusCheckFailed_System
      Statistic: Minimum
      Period: '60'
      EvaluationPeriods: '2'
      ComparisonOperator: GreaterThanThreshold
      Threshold: '0'
      AlarmActions:
        - !Sub "arn:aws:automate:${AWS::Region}:ec2:recover"
        - !Ref warningsnsarn
      OKActions:
        - !Ref warningsnsarn
      Dimensions:
        - Name: InstanceId
          Value: !Ref instanceid
#------------------------------------------------------------------------------
  InstanceStatusAlarmCRITICAL:
    Type: AWS::CloudWatch::Alarm
    DeletionPolicy: Retain
    Properties:
      AlarmDescription: !Sub "${instancename} - ${instanceid} - Instance Status Check Failed - please investigate. Troubleshooting: https://goo.gl/Ea27Gd"
      Namespace: AWS/EC2
      MetricName: StatusCheckFailed_Instance
      Statistic: Minimum
      Period: '60'
      EvaluationPeriods: '3'
      ComparisonOperator: GreaterThanThreshold
      Threshold: '0'
      AlarmActions:
        - !Ref criticalsnsarn
      OKActions:
        - !Ref criticalsnsarn
      Dimensions:
        - Name: InstanceId
          Value: !Ref instanceid
#------------------------------------------------------------------------------
CloudFormation template for Linux - Disk Space Alarms:
AWSTemplateFormatVersion: '2010-09-09'
Description: Linux CloudWatch Diskspace Alarms
#------------------------------------------------------------------------------
Parameters:
#------------------------------------------------------------------------------
  instanceid:
    Description: "Choose an instance id"
    Type: AWS::EC2::Instance::Id
  instancename:
    Description: "Please provide EC2 instance name"
    Type: "String"
    MinLength: '1'
    MaxLength: '50'
  criticalsnsarn:
    Description: "Please provide an ARN of SNS topic - CRITICAL Type"
    Type: "String"
  warningsnsarn:
    Description: "Please provide an ARN of SNS topic - WARNING Type"
    Type: "String"
  volume:
    Description: "Provide disk's/folder's name (ex.: xvda1)"
    Type: "String"
    Default: "xvda1"
  path:
    Description: "Provide path"
    Type: "String"
    Default: "/"
  fstype:
    Description: "Choose fstype - ext4 or xfs -> Ubuntu and AmazonLinux use ext4, CentOS use xfs"
    Type: String
    AllowedValues:
      - ext4
      - xfs
      - btrfs
    ConstraintDescription: You must specify ext4,xfs,or btrfs.
#------------------------------------------------------------------------------
Resources:
#------------------------------------------------------------------------------
  DiskSpaceAlarmWARNING:
    Type: AWS::CloudWatch::Alarm
    DeletionPolicy: Retain
    Properties:
      AlarmDescription: !Sub "${instancename} - ${instanceid} - over 90% of ${volume} volume space is in use"
      AlarmActions:
        - !Ref warningsnsarn
      OKActions:
        - !Ref warningsnsarn
      MetricName: "disk_used_percent"
      Namespace: CWAgent
      Statistic: Average
      Period: '300'
      EvaluationPeriods: '1'
      Threshold: '90'
      ComparisonOperator: GreaterThanOrEqualToThreshold
      Dimensions:
        - Name: InstanceId
          Value: !Ref instanceid
        - Name: device
          Value: !Ref volume
        - Name: path
          Value: !Ref path
        - Name: fstype
          Value: !Ref fstype
#------------------------------------------------------------------------------
  DiskSpaceAlarmCRITICAL:
    Type: AWS::CloudWatch::Alarm
    DeletionPolicy: Retain
    Properties:
      AlarmDescription: !Sub "${instancename} - ${instanceid} - over 95% of ${volume} volume space is in use"
      AlarmActions:
        - !Ref criticalsnsarn
      OKActions:
        - !Ref criticalsnsarn
      MetricName: "disk_used_percent"
      Namespace: CWAgent
      Statistic: Average
      Period: '300'
      EvaluationPeriods: '1'
      Threshold: '95'
      ComparisonOperator: GreaterThanOrEqualToThreshold
      Dimensions:
        - Name: InstanceId
          Value: !Ref instanceid
        - Name: device
          Value: !Ref volume
        - Name: path
          Value: !Ref path
        - Name: fstype
          Value: !Ref fstype
#------------------------------------------------------------------------------
CloudFormation template for Windows - CPU / Memory / Status Checks:
AWSTemplateFormatVersion: '2010-09-09'
Description: Windows CloudWatch Alarms - CPU Memory Instance and System Status
#------------------------------------------------------------------------------
Parameters:
#------------------------------------------------------------------------------
  instanceid:
    Description: "Choose an instance id"
    Type: AWS::EC2::Instance::Id
  instancename:
    Description: "Please provide EC2 instance name"
    Type: "String"
    MinLength: '1'
    MaxLength: '50'
  criticalsnsarn:
    Description: "Please provide an ARN of SNS topic - CRITICAL Type"
    Type: "String"
  warningsnsarn:
    Description: "Please provide an ARN of SNS topic - WARNING Type"
    Type: "String"
#------------------------------------------------------------------------------
Resources:
#------------------------------------------------------------------------------
  CPUAlarmWARNING:
    Type: AWS::CloudWatch::Alarm
    DeletionPolicy: Retain
    Properties:
      AlarmDescription: !Sub "${instancename} - ${instanceid} - High CPU Usage 90%"
      AlarmActions:
        - !Ref warningsnsarn
      OKActions:
        - !Ref warningsnsarn
      MetricName: CPUUtilization
      Namespace: AWS/EC2
      Statistic: Average
      Period: '900'
      EvaluationPeriods: '1'
      Threshold: '90'
      ComparisonOperator: GreaterThanOrEqualToThreshold
      Dimensions:
        - Name: InstanceId
          Value: !Ref instanceid
#------------------------------------------------------------------------------
  CPUAlarmCRITICAL:
    Type: AWS::CloudWatch::Alarm
    DeletionPolicy: Retain
    Properties:
      AlarmDescription: !Sub "${instancename} - ${instanceid} - High CPU Usage 95%"
      AlarmActions:
        - !Ref criticalsnsarn
      OKActions:
        - !Ref criticalsnsarn
      MetricName: CPUUtilization
      Namespace: AWS/EC2
      Statistic: Average
      Period: '900'
      EvaluationPeriods: '2'
      Threshold: '95'
      ComparisonOperator: GreaterThanOrEqualToThreshold
      Dimensions:
        - Name: InstanceId
          Value: !Ref instanceid
#------------------------------------------------------------------------------
  MemoryAlarmWARNING:
    Type: AWS::CloudWatch::Alarm
    DeletionPolicy: Retain
    Properties:
      AlarmDescription: !Sub "${instancename} - ${instanceid} - High Memory Usage 90%"
      AlarmActions:
        - !Ref warningsnsarn
      OKActions:
        - !Ref warningsnsarn
      MetricName: "Memory % Committed Bytes In Use"
      Namespace: CWAgent
      Statistic: Average
      Period: '900'
      EvaluationPeriods: '1'
      Threshold: '90'
      ComparisonOperator: GreaterThanOrEqualToThreshold
      Dimensions:
        - Name: InstanceId
          Value: !Ref instanceid
        - Name: objectname
          Value: Memory
#------------------------------------------------------------------------------
  MemoryAlarmCRITICAL:
    Type: AWS::CloudWatch::Alarm
    DeletionPolicy: Retain
    Properties:
      AlarmDescription: !Sub "${instancename} - ${instanceid} - High Memory Usage 95%"
      AlarmActions:
        - !Ref criticalsnsarn
      OKActions:
        - !Ref criticalsnsarn
      MetricName: "Memory % Committed Bytes In Use"
      Namespace: CWAgent
      Statistic: Average
      Period: '900'
      EvaluationPeriods: '2'
      Threshold: '95'
      ComparisonOperator: GreaterThanOrEqualToThreshold
      Dimensions:
        - Name: InstanceId
          Value: !Ref instanceid
        - Name: objectname
          Value: Memory
#------------------------------------------------------------------------------
  SystemStatusAlarmCRITICAL:
    Type: AWS::CloudWatch::Alarm
    DeletionPolicy: Retain
    Properties:
      AlarmDescription: !Sub "${instancename} - ${instanceid} - instance recovery process has been triggered because of failed System Status Check"
      Namespace: AWS/EC2
      MetricName: StatusCheckFailed_System
      Statistic: Minimum
      Period: '60'
      EvaluationPeriods: '2'
      ComparisonOperator: GreaterThanThreshold
      Threshold: '0'
      AlarmActions:
        - !Sub "arn:aws:automate:${AWS::Region}:ec2:recover"
        - !Ref warningsnsarn
      OKActions:
        - !Ref warningsnsarn
      Dimensions:
        - Name: InstanceId
          Value: !Ref instanceid
#------------------------------------------------------------------------------
  InstanceStatusAlarmCRITICAL:
    Type: AWS::CloudWatch::Alarm
    DeletionPolicy: Retain
    Properties:
      AlarmDescription: !Sub "${instancename} - ${instanceid} - Instance Status Check Failed - please investigate. Troubleshooting: https://goo.gl/Ea27Gd"
      Namespace: AWS/EC2
      MetricName: StatusCheckFailed_Instance
      Statistic: Minimum
      Period: '60'
      EvaluationPeriods: '3'
      ComparisonOperator: GreaterThanThreshold
      Threshold: '0'
      AlarmActions:
        - !Ref criticalsnsarn
      OKActions:
        - !Ref criticalsnsarn
      Dimensions:
        - Name: InstanceId
          Value: !Ref instanceid
#------------------------------------------------------------------------------
CloudFormation template for Windows - Disk Space Alarms:
AWSTemplateFormatVersion: '2010-09-09'
Description: Windows CloudWatch Diskspace Alarms
#------------------------------------------------------------------------------
Parameters:
#------------------------------------------------------------------------------
  instanceid:
    Description: "Choose an instance id"
    Type: AWS::EC2::Instance::Id
  instancename:
    Description: "Please provide EC2 instance name"
    Type: "String"
    MinLength: '1'
    MaxLength: '50'
  criticalsnsarn:
    Description: "Please provide an ARN of SNS topic - CRITICAL Type"
    Type: "String"
  warningsnsarn:
    Description: "Please provide an ARN of SNS topic - WARNING Type"
    Type: "String"
  volume:
    Description: "Provide Disk name (ex.: C:)"
    Type: "String"
    Default: "C:"
    MinLength: '1'
    MaxLength: '5'
#------------------------------------------------------------------------------
Resources:
#------------------------------------------------------------------------------
  DiskSpaceWARNING:
    Type: AWS::CloudWatch::Alarm
    DeletionPolicy: Retain
    Properties:
      AlarmDescription: !Sub "${instancename} - ${instanceid} - over 90% of ${volume} Drive space is in use"
      AlarmActions:
        - !Ref warningsnsarn
      OKActions:
        - !Ref warningsnsarn
      MetricName: "LogicalDisk % Free Space"
      Namespace: CWAgent
      Statistic: Average
      Period: '300'
      EvaluationPeriods: '1'
      Threshold: '10'
      ComparisonOperator: LessThanOrEqualToThreshold
      Dimensions:
        - Name: InstanceId
          Value: !Ref instanceid
        - Name: instance
          Value: !Ref volume
        - Name: objectname
          Value: LogicalDisk
#------------------------------------------------------------------------------
  DiskSpaceCRITICAL:
    Type: AWS::CloudWatch::Alarm
    DeletionPolicy: Retain
    Properties:
      AlarmDescription: !Sub "${instancename} - ${instanceid} - over 95% of ${volume} Drive space is in use"
      AlarmActions:
        - !Ref criticalsnsarn
      OKActions:
        - !Ref criticalsnsarn
      MetricName: "LogicalDisk % Free Space"
      Namespace: CWAgent
      Statistic: Average
      Period: '300'
      EvaluationPeriods: '1'
      Threshold: '5'
      ComparisonOperator: LessThanOrEqualToThreshold
      Dimensions:
        - Name: InstanceId
          Value: !Ref instanceid
        - Name: instance
          Value: !Ref volume
        - Name: objectname
          Value: LogicalDisk
#------------------------------------------------------------------------------
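Note that the Windows disk alarms watch free space rather than used space, so the thresholds and comparison are inverted relative to the Linux template: "over 90% in use" becomes "10% or less free". The small sketch below spells out that conversion; the function names are hypothetical and the code only illustrates the arithmetic, not anything CloudWatch runs.

```python
def free_space_threshold(used_percent_threshold: float) -> float:
    """Convert an 'X% used' threshold into the equivalent '% free' threshold."""
    if not 0 <= used_percent_threshold <= 100:
        raise ValueError("threshold must be between 0 and 100")
    return 100 - used_percent_threshold


def windows_disk_alarm_fires(free_percent: float, used_percent_threshold: float) -> bool:
    """Mirror LessThanOrEqualToThreshold on the free-space metric."""
    return free_percent <= free_space_threshold(used_percent_threshold)


if __name__ == "__main__":
    print(free_space_threshold(90))           # WARNING threshold on free space
    print(free_space_threshold(95))           # CRITICAL threshold on free space
    print(windows_disk_alarm_fires(8, 90))    # 8% free, i.e. 92% used
```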
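Summary
In my opinion, it is best to always have the status check CloudWatch alarms, because they let you monitor the health of your instances. In addition, the CloudFormation templates configure the system status check alarm so that when the instance's system status goes into the ALARM state, which usually means there is a problem at the AWS hypervisor level or below, the alarm triggers the EC2 recover action. This action recovers an unhealthy instance by running the EC2 stop and start commands. In most cases, when the instance starts again, it is migrated to a new underlying host.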