Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

APPSRE-11527 dont fail on ocm 5xx #4866

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

fishi0x01
Copy link
Contributor

OCM stage/int is an unstable environment and can fail with 5xx. We should not fail the integration based on this.

In this PR, we identify OCM int/stage 5xx errors and ignore them for now. Note, that this is just a quick-fix. Current DTP error handling is rather messy and needs to be addressed properly in APPSRE-11527.

@fishi0x01 fishi0x01 force-pushed the APPSRE-11527_dtp-error-handling branch from b8fa113 to 506eaa0 Compare February 13, 2025 13:09
self._ocm_client = ocm_client
self._url = url
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this _url added only for error handling? Can you get url from HTTPError object? the error object should be able to get url by e.request.url, then no need to wrap OCMAPIError to all methods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah those are pretty nasty mypy-wise. Request | RequestIntent | None. Further, Id need to substract url - path in order to get host.
All in all a lot of logic in the catch block. Adding _url here felt easier and safer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's ok if once OCM env stable we revert all changes in this pr, if we want to make it as permanent change, I would not add new dependencies to OCMClient only for this error handling.
To make stage and integration endpoint check more strict instead of just string in check, we can use urlparse(url).hostname to do exactly match.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OCM env in stage/int will never be regarded as stable. However, DTPv3 will change a lot here, since currently the error handling in this integration is quite messy.

For this case, I believe it is simpler and safer to just add the extra attr, instead of none checks and additional function calls in catch block. But this will change overall in DTPv3 refactor 👍

Comment on lines +241 to +258
unexpected_errors_cnt = 0
for err, _ in unhandled_exceptions:
if (
isinstance(err, OCMAPIError)
and err.status_code >= 500
and (
"api.integration.openshift.com" in err.endpoint
or "api.stage.openshift.com" in err.endpoint
)
):
# OCMAPI in int/stage can fail. Lets for now ignore those
continue
unexpected_errors_cnt += 1

# Lets only fail integration on unexpected errors
error_messages = [msg for _, msg in unhandled_exceptions]
if unexpected_errors_cnt:
raise ReconcileErrorSummary(error_messages)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since only boolean check needed, can use any

unexpected_errors = any(not integration_or_stage_500(err) for err, _ in unhandled_exceptions)

or "api.stage.openshift.com" in err.endpoint
)
):
# OCMAPI in int/stage can fail. Lets for now ignore those
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what would it change if we ignored 5xx on prod as well? Do we really need to fail or skipping is fine as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. 5xx on prod would most likely indicate a syncset/manifest API issue on prod which is pretty serious - so I thought for now it would be fine to keep old behavior. However, since we will expose those as separate metrics anyways, we can have more fine-grained alerts and severities configured outside of this code and dont need to block.

Will change this to skip on all 5xx on all OCM envs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants