-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathCHANGELOG.txt
265 lines (206 loc) · 22.7 KB
/
CHANGELOG.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
v0.27.1:
- (Bug/Major) Fixed python exception in some Spot interruption situations.
- (Bug/Major) Do not launch a new instance on Spot interruption if it is in a fleet in undefined state.
- (Bug/Major) Fixed breaking change introduced in v0.26.0 leading to API "control/reschedulenow" not working anymore.
- (Improvment/Minor) Updated all Python dependencies to latest versions.
v0.27.0:
- (Major) Upgrade Lambda Python runtime to 3.11 (Note: only marginal performance improvments expected as CloneSquad is spending lots of time calling AWS APIs). Upgraded Python dependencies to latest versions.
v0.26.0:
- (Major) Use of Step Functions to periodically call the Main function instead of SQS Delayed message callback. Due to new AWS policy regarding Lambda recursive loops, the former method (SQS delayed message) to do this schedule is detected as an issue and requires to contact AWS support for each CloneSquad deployment of previous versions to make it work correctly. With Step Functions method, the Main function is called every 10 seconds and perform a validation if it is time to execute (in conformance to the 'app.run_period' setting). (See https://aws.amazon.com/blogs/compute/detecting-and-stopping-recursive-loops-in-aws-lambda-functions/ for more information about Lambda recursive loop detection).
- (Minor) Updated Python dependencies to latest versions.
v0.25.0:
- (Major) Upgraded Python Lambda runtime from Python 3.7 to Python 3.9.
- (Minor) Switched from Amazon Linux 2 to Amazon Linux 2023 for DevKit base image.
v0.24.1:
- (Bug/Minor) When a subfleet is in state 'undefined' (or you did not specified a state explicitly), a warning was generated about CloudWatch metrics not currectly set. This bug had no functionnal impact but was generating noise in CloudWatch logs.
- (Improvment/Minor) ec2.schedule.spot.min_stop_period_after_interruption default value set to 15 minutes (instead of 6 hours). Most of Spot interruptions do not last long so setting this value to a very high value by default is not relevant.
- (Improvment/Minor) Updated all Python dependencies to latest versions.
v0.23.0:
- (Improvment/Major) EC2 Spot interrupted instances are now NOT restartable via CloneSquad during 6 hours by default (delay controlled with new configuration key ec2.schedule.spot.min_stop_period_after_interruption). This prevents interrupted Spot instances to be stopped by EC2 Spot service and restarted too quickly by CS leading to possible oscillatory behaviors.
v0.22.0:
- (Improvment/Major) EC2 Spot rebalance_recommandation events are now ignored by default. They can be enabled again with new configuration key 'ec2.schedule.spot.recommended_event.disable'
v0.21.1:
- (Bug/Major) Fixed inverted logic in CPU Credit Cloudwatch metric polling
v0.21.0:
- (Improvment/Major) CPU Crediting is now disabled by default in the Main fleet (like in subfleets) (See 'ec2.schedule.burstable_instance.max_cpu_crediting_instances')
- (Improvment/Minor) Made public existing private setting 'ec2.instance.max_start_instance_at_once' and 'ec2.instance.max_stop_instance_at_once'
- (Improvment/Minor) Updated Python dependencies to latest versions.
v0.20.5:
- (Bug/Minor) Added ACL "bucket-owner-full-control" while doing S3 PutObject (enable Athena to query the "CMDB" Metadata in a multi-account environment)
v0.20.4:
- (Bug/Minor) Updated Python dependencies to latest versions (Fix GitHub Dependabot alert about urllib3 version <=1.26.5)
v0.20.3:
- (Bug/Minor) Event 'drain_instances' was using a non-intuitive 'DrainedInstanceIds' parameter. Now a more generic 'InstanceIds' is defined ('DrainedInstanceIds' still exists for backward compatibility and with the same value than 'InstanceIds')
v0.20.2:
- (Bug/Major) Events 'drain_instances' and 'instance_transitions' were not sent for subfleet instances (i.e. only send for main fleet ones).
v0.20.1:
- (Bug/Minor) Fixed semantic for PRIVATE APIGW deployment. It is now possible to propagate the VPCEndpoint to use through the Discovery Lambda by guessing it from VpcId parameter.
Note: Compared to v0.20.0, only the CloudFormation custom resource used at deployment time is updated; Main/Interact/Discovery Lambdas are identitical.
v0.20.0:
- (Improvment/Major) Added 'Metadata and Backup export to S3' feature (with CMDB-like feature).
- (Improvment/Major) COMPATIBILITY BREACH: `ec2.schedule.desired_instance_count` default value is now `undefined` (instead of `-1`) meaning all scaling activities are now disabled by default (Change implemented to make reliable CloneSquad reinstall use-case and align Main fleet with Subfleet management behaviors).
- (Improvment/Major) SSM Maintenance Window objects can now be tagged with 'clonesquad:config:disable-default-configuration' to inhibit the default behavior of launching all the fleet instances during a maintenance period. (All others characteritics of Maintenance Window remain the same. see docs/SSM.md).
- (Improvment/Minor) Sample Notification Lambda (see examples/sam-sample-lambda/) can now send Notification Acks to an Interact SQS queue in another region.
- (Improvment/Minor) API 'configuration' and 'scheduler' offer now a 'restore' mode: In 'restore' mode, the uploaded configuration is not merged but fully replace the existing DynamoDB content.
- (Bug/Major) Regression since v0.18.0: Some notifications may not be sent as expected.
- (Bug/Major) Give KMS rights to the Main Lambda to allow start of EC2 instances with encrypted volumes. Added 'EBSVolumeKMSKeys' parameters to restrict access to specific KMS keys.
- (Bug/Major) Fixed incorrect SQS Policy protecting the InteractSQSQueue (Was forbidding other AWS account principals to perform sqs:SendMessage). Added parameter `InteractSQSQueueIAMPolicy` to define precisely which AWS Principal can connect (see docs/DEPLOYMENT_REFERENCE.md).
- (Bug/Major) Fixed too restricted Main Lambda IAM policy forbidding start of instances with AMI controlled by AWS License Manager.
- (Bug/Minor) On rare circumstances, 'BlockNewConnectionToPorts' can receive an error message due to iptables lock. '-w 10' option is added to every iptables call to wait for lock release.
- (Bug/Minor) Reduced the maximum number of instance starts in a single Main Lambda execution (from 50 to 25 to be less aggressive with the EC2 API): Users with more than 25 instances per fleet will see a full 100% fleet start only after multiple Main Lambda executions.
- (Bug/Minor) Allow '%' character in scheduler entry name.
v0.19.0:
- (Improvment/Major) COMPATIBILITY BREACH: In order to reduce the size on screen and improve subfleet dashboard readability, the subfleet metrics are shortened ("Subfleet." prefix is removed).
- (Improvment/Major) Added optional SSM subsystem (disabled by default): Brings SSM Maintenance Window support and SSM Event notifications through SSM RunCommands (see https://github.com/jcjorel/clonesquad-ec2-pet-autoscaler/blob/master/docs/SSM.md)
- (Improvment/Major) Added User Notification event 'ssm_maintenance_window_event' sent when an instance is entering/exiting a maintenance window period.
- (Improvment/Major) Added User Notification event 'new_instances_marked_as_unuseable' sent when an instance is detected unuseable (Spot interrupted, Not responding on-time to SSM Event requests, failed ELB healthchecks, EC2 impaired status...)
- (Improvment/Minor) API GW /v1/discovery and CloneSquad-{GroupName}-Discovery Lambda function now return also the currently deployed CloneSquad version.
- (Bug/Major) Fixed a case where user notifications failed to be sent due to messages too big.
- (Bug/Minor) Some transient RDS status can rarely happen generating InvalidDBInstanceState exception on start/stop action. We now trap this exception as non-critical and so no more notify the user for a minor transient situation.
- (Bug/minor) Renamed the deployed CW-Role to contain the region name and so allow deployment of CloneSquad in multiple regions into the same account.
v0.18.0:
- (Improvment/Major) Added fine-grained IAM policy for Lambda functions => No more AdministratorAccess default policy! :-)
- (Bug/Major) Fixed improprer management of notification events when, for any reason, a large amount is pending for processing.
- (Bug/Major) Fixed improper state management of TransferFamily instance.
- (Improvment/Minor) URL specified in configuration with https:// protocol can now automatically authenticate with SigV4.
v0.17.0:
- (Improvment/Major) Allow to specify an IAM boundary on all roles created by the CloneSquad deployment template.
- (Improvment/Major) Make notification messages smaller and more robust.
- (Bug/Major) Fixed regression related to parameter set handling introduced in v0.15.0
- (Bug/Minor) Devkit Configuration Wizard tool: All CloneSquad deployments were not displayed in some circumstances. Added paginator to handle accounts with lots of CloudFormation stacks deployed.
v0.16.0
- (Improvment/Major) Added support for marking/unmarking instances as unstoppable and/or unstartable. (+Improved time execution improvments)
- (Improvment/Minor) CLONESQUAD_LOGLEVELS Lambda environment variable now supports wildcards. Ex: CLONESQUAD_LOGLEVELS='*=DEBUG' to activate maximum verbose level
v0.15.1:
- (Bug/Minor) Make sure that CW event rules are regenerated on DST status change.
v0.15.0:
- (Improvment/Major) Added support for 'localcron' keyword in scheduler. This keyword enables local zone cron entries. Please look at documentation for some limitations with wildcards.
- (Improvment/Major) Introduced special subfleet name '__all__' which can be used to override values in all defined subfleets at once. Ex: 'subfleet.__all__.state=running' will force all subfleets to be in 'running' state whatever their current individual configuration is.
- (Improvment/Major) A new key format is introduced to override configuration keys. Format: 'overrride:ec2.schedule.min_instance_count' will override 'ec2.schedule.min_instance_count' value. Combined with TTL specification, it is useful to set temporary configuration.
- (Improvment/Major) API GW /configuration/<key>?ttl=<duration> can now be used to set an expiration delay on a key. On expiration, the key deletes by itself from the DynamoDB table.
v0.14.0:
- (Improvment/Major) Added support for FIFO SQS queue as event receiver.
- (Improvment/Minor) Added support the SQS queue handling in sam-sample-lambda demo function.
v0.13.1:
- (Bug/Major) When state key 'cloudwatch.metrics.cache' doesn't exist, a fatal exception is thrown. Ensure that deserialisation of a state key never return None but default value instead.
- (Bug/Minor) Randomly, a DB start/Stop operation can fail due to a transient InvalidDBClusterStateFault exception. This was causing a needless user notification and a LongTerm debug event record in DynamodDB. The exception is now trapped and is no more considered as a critical error. The bug is minor as it doesn't altered the normal scheduling behavior of the overall SW.
- (Bug/Minor) Added 'ec2.schedule.disable' configuration to disable all scheduling and automation algorithms in the Main fleet. It it semantically similar to 'subfleet.<subfleet_name>.state'.
v0.13.0:
- (Improvment/Minor) New 'ec2.schedule.start.warmup_delay' parameter to allow modification of the default 2 minute warmup delay. Users with applications requiring long instance boot time SHOULD use this parameter to inform CloneSquad of this fact: It is always recommended to have this value set to the worst case boot time if no TargetHealth check is used (CloneSquad usage without TargetGroup).
- (Improvment/Major) On Spot recommendation, new instances are started automatically. No more wait for 'Interrupted' message to do that.
- (Improvment/Minor) Removed the grace period before a Spot instance can be restarted after a shutdown.
- (Improvment/Minor) No more need to manually enable support for RDS and TransferFamily. This is now automatic.
v0.12.0:
- (Improvment/Major) Vertical scaling is now supported in subfleets (by default, Spot instances are favored over On-Demand. Can be changed.)
- (Improvment/Major) Vertical scaling configuration is simplified and CloneSquad won't change any more instance type by itself (brigging too much complexity).
v0.11.0:
- (Improvment/Minor) More demonstration code in the Event Lambda example. Display Name of the instance of start_instances/stop_instances events.
- (Improvment/Major) Redesigned notification sending code to avoid message truncation due to 256kB limit for Lambda, SQS and SNS payload.
v0.10.0:
- (Improvment/Minor) cs-instance-watcher: Added support for clonesquad:excluded and associated tags.
- (Improvment/Minor) cs-instance-watcher: Set API polling period to 5s (instead of 10s).
- (Improvment/Minor) Removed unneeded parameters from Alarm specifications. (plus log message beatify in cloudwatch.py)
- (Improvment/Minor) Added ignore: protocol scheme to CloudWatch alarm configuration ignore.
v0.9.1:
- (Bug/Major) Fix CloudFormation template update regression. Add a 'Dummy' parameter to force customresource recreation on update.
v0.9.0:
- Added cs-instance-watcher tool. It has to be installed on CloneSquad EC2 managed instances to improve failover of non-AWS LoadBalancers of instance draining condition when no AWS TargetGroup is used. See docs/TOOLS.md.
v0.8.4:
- (Bug/Major) Cached data for EC2 status were never flushed from API GW cache.
v0.8.3:
- (Improvment/Minor) Added explicit warning when a subfleet is in 'undefined' state to notify user that no scaling action will take place.
- (Improvment/Minor) Added 'raw' argument to API GW (/configuration?raw=true) to dump in a clean YAML format ready for further import.
- (Bug/Minor) Dump subfleet configuration keys with API GW (/configuration).
v0.8.2:
- (Improvment/Minor) Default value for subfleet.<subfleetname>.ec2.schedule.burstable_instance.max_cpu_crediting_instances changed from 50% to 0% to ease understand of shutdown behavior of burstable instances. CPU Crediting needs now to be explicitly actived with this key with a value > 0%.
v0.8.1:
- (Bug/Major) Fixed possible mixup of instances between subfleets when some subfleet names can be found in others.
v0.8.0:
- (Improvment/Major) BACKWARD COMPATIBILITY BREACH: Suppress all references to 'staticfleet' and reworded documentation, CloudWatch metrics and configuration keys toward simpler 'subfleet' wording.
- (Improvment/Minor) By default, do not restart automatically burstable instances to preserve accrued credits (ec2.schedule.burstable_instance.preserve_accrued_cpu_credit with 0 value by default).
- (Improvment/Minor) Added subfleet.<subfleetname>.ec2.schedule.min_instance_count with same semantic than ec2.schedule.min_instance_count.
- (Improvment/Major) Added subfleet.<subfleetname>.ec2.schedule.burstable_instance.max_cpu_crediting_instances for fine grained control of CPU Crediting in subfleets.
- (Improvment/Minor) Reworked draining code to make CPU Crediting leaner with subfleets.
v0.7.1:
- (Bug/Major) Make sure to not start Subfleet stopped instance when expecting to start an Autoscaled instance...
- (Improvment/Minor) Enabled by default the detailed metric per subfleet (and so, Subfleet dashboard).
v0.7.0:
- (Improvment/Major) Added 'staticfleet.<subfleetname>.ec2.schedule.desired_instance_count' to control the number of 'running' instances when 'staticfleet.<subfleetname>.enable' is set to 'running'.
- (Improvment/Major) BACKWARD COMPATIBILITY BREAK: Renamed all dashboards from CloneSquad-xxx to CS-xxx to save space in the CloudWatch console.
- (Improvment/major) Added dashboad specific to the subfleets. Enable detailed metrics per subfleet with 'staticfleet.<subfleetname>.ec2.schedule.metrics.enable' to let the dashboard appear.
- (Improvment/Major) Autoscaling using only per instance metric as scaling criteria is now smoother.
- (Improvment/Minor) Reduce CloudWatch cost by polling CPU Credit balance only for 'drained' instances.
- (Bug/Minor) Fixed dependency issues preventing proper deletion of Log groups on stack deletion.
- (Improvment/Minor) Enforce a release process more robust (especially by downloading the latest amazon linux image.
v0.6.1:
- (Bug/Major) Fixed regression in subfleet instance start.
v0.6.0:
- (Improvment/Major) BACKWARD COMPATIBILITY BREAK: Reworked semantic of API Gateway methods /metadata, /fleet/status and /fleet/metadata for better coherency and ease of handling at client side. Added subfleet informations to /fleet/status.
- (Improvment/Minor) Provide CPU alarm templates for 25%/33%/50%/90% thresholds.
v0.5.0:
- (Improvment/Major) Optimization of autoscaler instance list to save processing time when a large amount of instances (>50) is undermanagement.
v0.4.0:
- (Improvment/Major) BACKWARD COMPATIBILITY BREAK: Renamed ec2.schedule.max_cpu_crediting_instances in ec2.schedule.burstable_instance.max_cpu_crediting_instances
- (Bug/Major) In some circumstances (ex: all LH instances in error), the whole Fleet was erroneously started.
- (Improvment/Minor) Warn when the user specifies a wellknown Cloudwatch limitation in cron format. (Day-of-the-Week and Day-of-the-Month mutually exclusive)
- (Improvment/Minor) Introduced ec2.schedule.burstable_instance.preserve_accrued_cpu_credit configuration key.
- (Improvment/Minor) Introduced ec2.schedule.burstable_instance.max_cpu_credit_unhealthy_instances to control the maximum amount of burstable instances w/ exhausted CPU Credit Balance that could be considered as unhealthy at a given time.
- (Improvment/Minor) Added comments to the sample CPUUtilization Metric definition src/resources/ec2.scaleup.alarm-cpu-gt-75pc.yaml
- (Bug/Minor) Do not throw exception when 'Value' column is missing in a DynamoDB Config record.
- (Improvment/Debug) Enable a capacity to inject CPU Credit Balance with State table key ("ec2.debug.instance.%s.cpu_credit_balance" % instance["InstanceId"])
v0.3.0:
- (Improvment/Major) 'template.yaml' is no more generated. 'template-latest.yaml' and 'template-<version>.yaml is the new standard.
v0.2.4:
- (Improvment/Minor) BACKWARD COMPATIBILITY BREAK: Renamed 'ConfigurationURL' template parameter to 'ConfigConfigurationURLs'.
- (Bug/Major) Fixed regression introduced in v0.2.3 about burstable instance management.
- (Bug/Minor) Fixed discovery API enumeration random and rare exception.
v0.2.3:
- - (Improvment/Minor) Makes the 'CPU Crediting mode' more intuitive. If ec2.schedule.max_cpu_crediting_instances is set to 0%, CloneSquad will no more wait for CloudWatch CPU credit balance metric and so enable quick shutdown of t3/t4 instances. Also, t2 instances are not more subject to ‘CPU Crediting mode’ as they can’t preserve CPU Credit over restart.
v0.2.2:
- (Bug/Major) Removed memory caching of http(s):// and s3:// located configuration files.
v0.2.1:
- (Improvment/Minor) Get more details in logs when a config file load fails.
- (Improvment/Major) Added optional 'DynamoDBConfiguration' template.yaml parameter to configure DynamoDB Tables with PROVISIONED capacity instead of default PAY_PER_REQUEST billing model.
v0.2.0:
- (Improvment/Major) Optimized TargetGroup lookup that reduces number of call to "ELBv2" APIs. This reduces execution time of Main Lambda function and make it linear with the number of targetgroups with CloneSquad tags and no more with the number of overall targetgroups into the account.
v0.1.4:
- (Bug/Major) Limit the number of ARNs supplied to elbv2.describe_tags() API call to avoid an exception.
v0.1.3:
- (Bug/Major) Some random rare transient states can lead to a Python exception just after an instance start.
v0.1.2:
- (Bug/major) Fix impact of version name in template.yaml name during extract-version.
v0.1.0:
Note: This a major release with large reworks of the API Gateway code especially.
- (Improvment/Major) Major rework of API Gateway code. Added INTERACTING.md documentation.
- (Improvment/Major) Added new 'Spot recommendation' event support and reworked Spot interruption handling
- (Improvment/Major) Added support for TransferFamily subfleet management. Enable it with 'tranferfamily.enable' set to '1'.
- (Improvment/Minor) Added API GW 'control/reschedulenow' command.
- (Improvment/Minor) Removed some needed less variable output for API Gateway 'discovery'.
- (Improvment/Major) Behavior compatibility break: 'rds.enable' to be set to '1' to activate RDS database support. This change is made to reduce execution duration of the Main Lambda function when RDS database are not the User use-case.
- (Improvment/Major) Behavior compatibility break: Published CloudFormation template filename now contains the CloneSquad version.
v0.0.60:
- (Improvment/Major) Added 'Spot Rebalance recommendation' support and reworked 'Spot interruption handling'.
v0.0.59:
- (Bug/Major) Make sure that we do not send more than 20 metrics per CloudWatch.PutMetricData API call.
- (Improvment/Major) After a set of experiments, it is proven that ensuring the wider scale between AlarmThreshold and
AlarmBaselineThreshold induces the best smoothness of the scalein/scaleout algorithm. Prior to this patch, the CPU Baseline
Threshold in src/resources/predefined.config.yaml, was set to 20%: Setting it to 0% make the scaleout operation significantly smoother.
v0.0.58:
- (NewFeature/Minor) Added configuration key 'ec2.instance.status.override_url' allowing override of EC2 instance status (for debug and/or external 'unhealthy' status) with an S3/HTTP server located YAML file.
- (Bug/Minor) Fixed buggy set-ec2-alarm.sh scripts. (This script manually triggers a CloneSquad fleet EC2 instance alarm to test autoscaler algorithm reaction.)
- (Improvment/Minor) Add cost estimate for demo-instance-fleet.
- (Bug/Major) Protect public API Gateway with IAM authorizer. Previously API was available without authentication.
v0.0.57:
- (Bug/Major) CloudFormation customization injection was incorrect and complex. Simplified the process with a single CloudFormation parameter 'CustomizationZipParameters'.
v0.0.56:
- (Improvment/Major) Added new setting `ec2.az.evict_instances_when_az_faulty` to force instance eviction from a faulty AZ. Prior to this patch, it was the duty of TargetGroup healthchecks to evict instances. When users use Clonesquad without TargetGroup, users had to scalein/scaleout manually the fleet to evict the instances from the faulty AZ.
v0.0.55:
- (Improvment/Major) Scalin-in algorithm is now smoother as it is taking into account the instance scale ratio relative to 0.0. Prior to this patch, the ScaleIn algorithm was reducing the fleet at full defined scalein rate as soon as it was below the scalein threshold.
- (Improvment/Minor) Added new config flag 'ec2.describe_instance_types.enabled' (disabled by default) to push 4 new CPU and Memory efficiency metrics:
FleetvCPUCount: Number of CPU serving in the fleet
FleetvCPUNeed: Number of CPU needed to sustain the load
FleetMemCount: Amount of Memory serving in the fleet
FleetMemNeed: Amount of Memory guessed as needed
v0.0.54:
- (Improvment/Minor) Renamed StaticFleet CloudWatch metrics related to EC2 instances to ease dashboard reading
- (Improvment/Minor) Make RDS start/stop better protected against boto3 API exceptions