Add Retry Mechanism to E2E EKS Terraform Deployment #634
Codecov Report
@@ Coverage Diff @@
## main #634 +/- ##
=============================================
- Coverage 85.71% 50.73% -34.99%
- Complexity 19 264 +245
=============================================
Files 3 39 +36
Lines 49 1301 +1252
Branches 5 141 +136
=============================================
+ Hits 42 660 +618
- Misses 3 609 +606
- Partials 4 32 +28
Some small questions but overall should be fine
# after installing App Signals. Attempts to connect will be made for up to 10 minutes
if [ $success -eq 0 ]; then
  echo "Installing app signals to the sample app"
  ../../../enable-app-signals.sh \
nit: Can we just pull the enablement script into the folder it would be useful in?
# Validation for app signals telemetry data
- name: Call endpoint and validate generated EMF logs
  id: log-validation
  if: steps.endpoint-check.outcome == 'success' && !cancelled()
  if: steps.deploy-sample-app.outcome == 'success' && !cancelled()
Do we need to add this to the other validation steps?
+1
Is there a sample run where we can see this change in action?
exit 1
fi
echo "Attempt $retry_counter"
success=0
success should be set to 0 after the terraform apply command has completed. That is the assumption in the following code.
A success value of 1 indicates that setting up App Signals on the sample app failed, while 0 indicates that everything ran successfully.
The logic is:
1. Set success to 1 (the initial value is 1 so that the while loop runs).
2. While success is 1 (meaning the Terraform deployment or the endpoint connection failed and we will try again):
2a: Set success to 0 (start from 0 and change it to 1 if anything fails).
2b: Run terraform apply (if the deployment fails, success changes to 1).
2c: If success is still 0, install App Signals and check the endpoint connection.
2d: If the endpoint connection fails, change success to 1.
2e: If success is 1 at this point, either the deployment or the connection failed, so run the while loop again. If it is still 0, the code ran successfully and the while loop exits.
If success were set to 0 after terraform apply, it would overwrite the result of the Terraform deployment. If success is 1 after the Terraform deployment, we want to skip the endpoint connection step and redeploy with Terraform.
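For readers following the thread, the steps listed above can be sketched roughly as follows. This is only an illustration, not the workflow's actual script: `deploy_terraform`, `check_endpoint`, and `max_retry` are hypothetical stand-ins, and the deploy stub is rigged to fail once so that the retry path is exercised.

```shell
# Hypothetical stand-in for "terraform apply" (assumption, not from the PR).
# It fails on the first call and succeeds afterwards to simulate a transient error.
deploy_calls=0
deploy_terraform() {
  deploy_calls=$((deploy_calls + 1))
  [ "$deploy_calls" -ge 2 ]
}

# Hypothetical stand-in for installing App Signals and waiting for the endpoint.
check_endpoint() {
  return 0
}

max_retry=3      # assumed attempt limit
retry_counter=0
success=1        # step 1: start at 1 so the loop body runs at least once

while [ "$success" -eq 1 ]; do
  if [ "$retry_counter" -ge "$max_retry" ]; then
    echo "Max retry reached, failed to deploy"
    exit 1
  fi
  retry_counter=$((retry_counter + 1))
  echo "Attempt $retry_counter"
  success=0                        # 2a: assume this attempt will succeed

  deploy_terraform || success=1    # 2b: a failed apply flips the flag to 1

  if [ "$success" -eq 0 ]; then    # 2c: continue only if the apply succeeded
    echo "Installing app signals to the sample app"
    check_endpoint || success=1    # 2d: a failed connection flips the flag to 1
  fi
done                               # 2e: loop again only while success is 1

echo "Succeeded after $retry_counter attempt(s)"
```

With these stubs the first apply fails, the endpoint step is skipped for that attempt, and the loop finishes on the second attempt.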
I think it's a little confusing because 1 usually means true, and in this case setting success to 1 means the previous step failed, i.e. success is not true.
Also, could we not just wrap the retry in a while loop that checks the retry counter instead of the success variable? Technically, success is specific to a single attempt and shouldn't be used outside the loop or as its condition. Instead, we should rely on the retry counter, i.e. if attempt # < attempt limit, retry.
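A minimal sketch of the counter-driven alternative suggested here, under the assumption that one whole attempt (apply, install, endpoint check) can be wrapped in a single command whose exit status reports the result. `run_attempt` and `max_attempts` are hypothetical names; the stub fails twice before succeeding so that the loop has something to retry.

```shell
# Hypothetical helper wrapping one full attempt (assumption, not from the PR).
# Returns 0 on success and non-zero on failure; here it fails twice, then succeeds.
attempt_calls=0
run_attempt() {
  attempt_calls=$((attempt_calls + 1))
  [ "$attempt_calls" -ge 3 ]
}

max_attempts=5   # assumed attempt limit
attempt=1

# The loop condition depends only on the attempt counter.
while [ "$attempt" -le "$max_attempts" ]; do
  echo "Attempt $attempt of $max_attempts"
  if run_attempt; then
    break        # this attempt succeeded; no flag is kept outside the loop
  fi
  attempt=$((attempt + 1))
done

# If every attempt failed, the counter has moved past the limit.
if [ "$attempt" -gt "$max_attempts" ]; then
  echo "Max attempts reached" >&2
  exit 1
fi

echo "Deployment succeeded on attempt $attempt"
```

The per-attempt result lives only in the exit status of `run_attempt`, so the usual shell convention (0 means success) applies and there is no inverted flag to reason about.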
* E2E Test: Ensure the use of IMDSv2 in EC2 instances (#621)
* Add e2e canary to public preview regions (#623)
* Fix trace validation error follow up fix (#626)
* Fix Terrform Destroy Error on EKS Canary (#628)
* fix-e2e-eks-terraform-destroy-error
* Add region as parameter to terraform destroy
* Bump nebula.release from 17.2.2 to 18.0.6 (#631)
Bumps nebula.release from 17.2.2 to 18.0.6.
---
updated-dependencies:
- dependency-name: nebula.release
  dependency-type: direct:production
  update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Bump actions/setup-java from 3 to 4 (#629)
Bumps [actions/setup-java](https://github.com/actions/setup-java) from 3 to 4.
- [Release notes](https://github.com/actions/setup-java/releases)
- [Commits](actions/setup-java@v3...v4)
---
updated-dependencies:
- dependency-name: actions/setup-java
  dependency-type: direct:production
  update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Bump hashicorp/setup-terraform from 2 to 3 (#586)
Bumps [hashicorp/setup-terraform](https://github.com/hashicorp/setup-terraform) from 2 to 3.
- [Release notes](https://github.com/hashicorp/setup-terraform/releases)
- [Changelog](https://github.com/hashicorp/setup-terraform/blob/main/CHANGELOG.md)
- [Commits](hashicorp/setup-terraform@v2...v3)
---
updated-dependencies:
- dependency-name: hashicorp/setup-terraform
  dependency-type: direct:production
  update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Bump rust from 1.73 to 1.74 (#611)
Bumps rust from 1.73 to 1.74.
---
updated-dependencies:
- dependency-name: rust
  dependency-type: direct:production
  update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Bump actions/setup-node from 3 to 4 (#574)
Bumps [actions/setup-node](https://github.com/actions/setup-node) from 3 to 4.
- [Release notes](https://github.com/actions/setup-node/releases)
- [Commits](actions/setup-node@v3...v4)
---
updated-dependencies:
- dependency-name: actions/setup-node
  dependency-type: direct:production
  update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Bump tempfile from 3.8.0 to 3.8.1 in /tools/cp-utility (#585)
Bumps [tempfile](https://github.com/Stebalien/tempfile) from 3.8.0 to 3.8.1.
- [Changelog](https://github.com/Stebalien/tempfile/blob/master/CHANGELOG.md)
- [Commits](https://github.com/Stebalien/tempfile/commits)
---
updated-dependencies:
- dependency-name: tempfile
  dependency-type: direct:production
  update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Provide aws-region for the e2e test in worklow (#643)
* Provide aws-region for the e2e test in worklow
* Update region to us-east-1 and add concurrency
* Revert "Provide aws-region for the e2e test in worklow (#643)" (#645)
This reverts commit 44b5b68.
* E2E Testing: Add concurrency tag to test in main build and nightly build (#646)
* Use aws-region in the workflow (#649)
* Add Retry Mechanism to E2E EKS Terraform Deployment (#634)
* Add Retry Mechanism to E2E EKS Terraform Deployment
* Add Extra Comments
* Call Test APIs First before Validation
* Add clean-app-signals to retry logic
* Change App Signal Download Directory and modify if statement for validation
* Modify while loop and refactor code
* Dynamic input RPM link by region setting (#647)
* Dynamic input RPM link by region setting
* Remove unneeded env variable
* Fix an issue in echo shell command
* Revert previous wrong 'fix' regarding variable call
* Add Retry Mechanism to E2E EC2 Terraform Deployment (#635)
* Add Retry Mechanism to E2E EC2 Terraform Deployment
* Add Extra Comments
* Refactor code
* Change App Signals Directory (#650)
* change dep config to compileOnly to fix high cardinality metrics (#651)
* E2E Testing: Fix EKS test candidate image override (#652)
This change checks if there is an adot image passed to the workflow and patches the App Signals deployment to update the image and restarts the cloudwatch pods.
---------
Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: Mahad Janjua <[email protected]>
Co-authored-by: Harry <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Vasi Vasireddy <[email protected]>
Co-authored-by: XinRan Zhang <[email protected]>
Co-authored-by: Mengyi Zhou (bjrara) <[email protected]>
Issue #, if available:
The EKS Canary occasionally fails due to transient issues. One of the recurring errors is "Max attempts reached" in the step "Wait for Endpoint to Come Online". This occurs because the endpoint sometimes takes longer than expected to become ready.
Description of changes:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.