fix: For cluster internal scopes also add variant without trailing dot #547

Draft
wants to merge 5 commits into
base: main

Conversation

sbernauer
Member

Description

Please add a description here. This will become the commit message of the merge request later.

Definition of Done Checklist

  • Not all of these items are applicable to all PRs, the author should update this template to only leave the boxes in that are relevant
  • Please make sure all these things are done and tick the boxes

Author


Reviewer


Acceptance


@sbernauer sbernauer changed the title from "fix: For cluster internal scopes, also add variant without trailing dot" to "fix: For cluster internal scopes also add variant without trailing dot" on Jan 21, 2025
Member

@nightkr nightkr left a comment

We should fix the places that depend on the old value here, not just blindly add both.

Comment on lines +188 to +191
let mut cluster_domains = vec![cluster_domain.to_string()];
if let Some(cluster_domain_without_trailing_dot) = cluster_domain.strip_suffix('.') {
    cluster_domains.push(cluster_domain_without_trailing_dot.to_owned());
}
Member

In my testing, Kerberos wants consistency and TLS doesn't really care. Either should be helped by doing both.

Member Author

Tried to respond in #547 (comment)

Comment on lines 175 to 179
[domain_realm]
cluster.local = {realm_name}
cluster.local. = {realm_name}
.cluster.local = {realm_name}
.cluster.local. = {realm_name}
Member

IME, this shouldn't be necessary at all (probably since we set the default realm before). But if we do keep it then we should read the actual cluster domain, not hard-code cluster.local specifically.
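
A minimal sketch of deriving these entries from the configured cluster domain instead of hard-coding cluster.local. The helper name `domain_realm_section` and its signature are assumptions made for illustration, not secret-operator's actual code. It accepts the domain with or without a trailing dot and emits the same four variants as the snippet above.

```rust
/// Hypothetical helper: renders a krb5.conf `[domain_realm]` section for an
/// arbitrary cluster domain, covering the variants with and without a
/// trailing dot and with and without a leading dot.
fn domain_realm_section(cluster_domain: &str, realm_name: &str) -> String {
    // Normalize to the form without the trailing dot, whatever was configured.
    let bare = cluster_domain.strip_suffix('.').unwrap_or(cluster_domain);

    let variants = [
        bare.to_string(),    // cluster.local
        format!("{bare}."),  // cluster.local.
        format!(".{bare}"),  // .cluster.local
        format!(".{bare}."), // .cluster.local.
    ];

    let mut section = String::from("[domain_realm]\n");
    for domain in variants {
        section.push_str(&format!("{domain} = {realm_name}\n"));
    }
    section
}
```

Calling it with either "cluster.local" or "cluster.local." yields the same section, so the output no longer depends on which spelling the cluster domain was configured with.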

Member Author

Good catch, this PR was mostly about trying out whether we can fix the TLS cert problems we have.
Fixed it in 5781e66

@sbernauer
Member Author

General comment on why we are thinking of adding both variants (with and without trailing dot) to scopes, both for TLS and Kerberos:

  1. We don't know what users will be entering. Performance sensitive users might choose a trailing dot, users with problems might not be able to add a dot
  2. The Zookeeper smoke test (./scripts/run-tests --test smoke_zookeeper-3.9.2_use-server-tls-true_use-client-auth-tls-true_openshift-false) failed with: [ERROR] Could not establish secure connection using client certificates!
    Everything was configured correctly with trailing dots. We could only get the zk client happy by adding a SAN without the dot
  3. Let's imagine a 24.11 user first updates secret-op. Until they are able to update all other product operators, everything can potentially break, because everything else talks cluster.local, but secret-op only hands out cluster.local. (with the trailing dot).
  4. Is there a downside of adding both? - At first glance this PR seems like a "let's be better safe than sorry"

That being said, this is a WIP. I would leave it entirely up to @dervoeti and @maltesander to decide how to proceed, as they looked at the issue in the first place. I just happened to bump op-rs and ran into a failing test.

@nightkr
Member

nightkr commented Jan 21, 2025

We don't know what users will be entering. Performance sensitive users might choose a trailing dot, users with problems might not be able to add a dot

Wouldn't secret-operator know just as well as any other operator? Are you planning on supporting mixed environments?

We could only get the zk client happy by adding a SAN without the dot

IME curl was also happy without the dot, so I guess this is one indication that TLS SANs should never have it.

Let's imagine a 24.11 user first updates secret-op. Until they are able to update all other product operators, everything can potentially break, because everything else talks cluster.local, but secret-op only hands out cluster.local. (with the trailing dot).

Migration is a fair concern, that's true. We should explicitly document those migration paths in the comments.

Is there a downside of adding both? - At first glance this PR seems like a "let's be better safe than sorry"

We should know what credentials we're asking for, and why. Whether they need to be included when provisioning manually, and so on.

It's fine to add things to that list, it just shouldn't be something we do blindly.

@dervoeti
Member

I agree that, in general, we should prefer not adding the hostname without the dot if it's not really necessary / we can work around it.

I'm not sure what exactly the scenario was (@sbernauer and/or @maltesander did the research on this) but I think one reason was that Zookeeper does a reverse DNS lookup on the client IP and complains that the client cert is not valid for the returned hostname (without the trailing dot).

That would be a reason to add the alternative hostname to the SANs. Other ways to solve this are trying to fix this in Zookeeper or maybe explicitly not supporting Zookeeper mTLS if you use a cluster domain with a trailing dot. I'm fine with either solution, adding the alternative hostname to the SAN was probably just the easiest way to make it work.
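
To illustrate the failure mode described above, here is a deliberately simplified sketch. It assumes the hostname check boils down to a case-insensitive exact comparison against the certificate's DNS SANs, with no wildcard handling. That is an assumption about the behavior, not Zookeeper's actual verification code, and the function name and host names are hypothetical.

```rust
/// Simplified stand-in for a TLS hostname check: the certificate counts as
/// valid for `hostname` only if one DNS SAN matches it exactly
/// (ignoring ASCII case). Real verifiers do more, e.g. wildcard matching.
fn cert_valid_for(hostname: &str, dns_sans: &[&str]) -> bool {
    dns_sans
        .iter()
        .any(|san| san.eq_ignore_ascii_case(hostname))
}
```

Under this assumption, a certificate whose only SAN is "zk-0.zk.default.svc.cluster.local." is rejected for the reverse-lookup result "zk-0.zk.default.svc.cluster.local", while a certificate that also carries the SAN without the dot is accepted.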

@dervoeti
Member

dervoeti commented Feb 6, 2025

So, we have to make a decision. I can't really comment on the Kerberos related changes, but as far as I understand it, they are not strictly necessary but would make migration easier. In that case I would be fine with not merging these changes if they are controversial. An easy migration path is nice to have, but I think it's okay if we don't have it.
Regarding the TLS related change, I can't see a practical negative effect on security if we add the non-FQDN DNS name to the SANs if a cert for an FQDN DNS name is requested. It would ease the migration path (at least in one direction) and, more importantly, it would fix the Zookeeper mTLS issue.
So I'd be in favor of merging at least the TLS related change.

But I'm also fine with not merging this at all and explicitly listing Zookeeper mTLS as "known not to work with FQDN cluster domains yet". In that case we probably still support many setups with FQDN cluster domains with 25.3, so it's better than before.

Opinions @nightkr @sbernauer @maltesander ?

@nightkr
Member

nightkr commented Feb 6, 2025

For TLS only the non-FQDN variant seems to matter at all, at least in my testing. We should only keep the non-FQDN variant there.

For Kerberos I'm not sure. I think the argument for having both makes sense, at least during the transitional period (though we should probably make sure we have both variants in both cases). Maybe an exception here would be if we can centralize this logic in listener-op, and have that be what decides the Flag Day™️.

@dervoeti
Member

dervoeti commented Feb 6, 2025

For TLS only the non-FQDN variant seems to matter at all, at least in my testing. We should only keep the non-FQDN variant there.

I'll also do some tests with this later, if it works I'm fine with that solution as well.

@maltesander
Member

For TLS only the non-FQDN variant seems to matter at all, at least in my testing. We should only keep the non-FQDN variant there.

Yeah, we definitely need the non-FQDN in there. That fixed most of the problems I had. IIRC zookeeper required the FQDN in the certificate.

I would punt on Kerberos as well. Main thing is to fix the certs?

@dervoeti
Member

dervoeti commented Feb 7, 2025

I did some tests with Zookeeper yesterday, including mTLS tests with and without FQDN cluster domains. Adding just the non-FQDN hostname to the SANs worked fine. Will do some more testing today with other products.
I only changed one line in secret-op: b5f3d51
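
For readers following along, a minimal sketch of what "only the non-FQDN variant in the SANs" could look like. This is not the change in b5f3d51. The function name `san_dns_names` and the example host names are assumptions made for illustration.

```rust
/// Hypothetical helper: maps requested DNS names to the names placed in the
/// certificate's SANs, keeping only the variant without a trailing dot and
/// dropping duplicates while preserving order.
fn san_dns_names(requested: &[&str]) -> Vec<String> {
    let mut sans: Vec<String> = Vec::new();
    for &name in requested {
        // "svc.cluster.local." becomes "svc.cluster.local"; names without a
        // trailing dot are kept unchanged.
        let bare = name.strip_suffix('.').unwrap_or(name);
        if !sans.iter().any(|existing| existing.as_str() == bare) {
            sans.push(bare.to_owned());
        }
    }
    sans
}
```

With this approach the same SAN list results whether the cluster domain is configured as cluster.local or cluster.local., which fits the observation above that TLS clients only seemed to care about the variant without the dot.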

@dervoeti
Member

@maltesander @nightkr @sbernauer I created a PR that only adds the non-FQDN variant to the SANs, works fine for me:
#564

4 participants