-
Notifications
You must be signed in to change notification settings - Fork 89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
APPSRE-11527 dont fail on ocm 5xx #4866
base: master
Are you sure you want to change the base?
APPSRE-11527 dont fail on ocm 5xx #4866
Conversation
b8fa113
to
506eaa0
Compare
self._ocm_client = ocm_client | ||
self._url = url |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this _url
added only for error handling? Can you get url from HTTPError
object? the error object should be able to get url by e.request.url
, then no need to wrap OCMAPIError
to all methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah those are pretty nasty mypy-wise. Request | RequestIntent | None
. Further, Id need to substract url
- path
in order to get host.
All in all a lot of logic in the catch block. Adding _url
here felt easier and safer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's ok if once OCM env stable we revert all changes in this pr, if we want to make it as permanent change, I would not add new dependencies to OCMClient
only for this error handling.
To make stage and integration endpoint check more strict instead of just string in
check, we can use urlparse(url).hostname
to do exactly match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OCM env in stage/int will never be regarded as stable. However, DTPv3 will change a lot here, since currently the error handling in this integration is quite messy.
For this case, I believe it is simpler and safer to just add the extra attr, instead of none checks and additional function calls in catch block. But this will change overall in DTPv3 refactor 👍
unexpected_errors_cnt = 0 | ||
for err, _ in unhandled_exceptions: | ||
if ( | ||
isinstance(err, OCMAPIError) | ||
and err.status_code >= 500 | ||
and ( | ||
"api.integration.openshift.com" in err.endpoint | ||
or "api.stage.openshift.com" in err.endpoint | ||
) | ||
): | ||
# OCMAPI in int/stage can fail. Lets for now ignore those | ||
continue | ||
unexpected_errors_cnt += 1 | ||
|
||
# Lets only fail integration on unexpected errors | ||
error_messages = [msg for _, msg in unhandled_exceptions] | ||
if unexpected_errors_cnt: | ||
raise ReconcileErrorSummary(error_messages) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since only boolean check needed, can use any
unexpected_errors = any(not integration_or_stage_500(err) for err, _ in unhandled_exceptions)
or "api.stage.openshift.com" in err.endpoint | ||
) | ||
): | ||
# OCMAPI in int/stage can fail. Lets for now ignore those |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what would it change if we ignored 5xx on prod as well? Do we really need to fail or skipping is fine as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. 5xx on prod would most likely indicate a syncset/manifest API issue on prod which is pretty serious - so I thought for now it would be fine to keep old behavior. However, since we will expose those as separate metrics anyways, we can have more fine-grained alerts and severities configured outside of this code and dont need to block.
Will change this to skip on all 5xx on all OCM envs.
OCM stage/int is an unstable environment and can fail with 5xx. We should not fail the integration based on this.
In this PR, we identify OCM int/stage 5xx errors and ignore them for now. Note, that this is just a quick-fix. Current DTP error handling is rather messy and needs to be addressed properly in APPSRE-11527.