[internal-dns] fix for clickhouse zone services #7612

Merged 3 commits on Feb 25, 2025

Contributor

@karencfv karencfv commented Feb 24, 2025

This commit fixes an issue where the ClickhouseNative SRV record pointed to both the clickhouse and clickhouse_cluster IPs. As a result, Oximeter wrote indiscriminately to either the single-node or the cluster deployment when both were running side by side. Additionally, we add a record for ClickhouseSingleServerAdmin.
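
The effect of the fix on the SRV record set can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not the code changed in this PR: the `Service` and `ZoneKind` enums, `build_records`, and the `cluster_in_native` flag are hypothetical names, and the port used for a cluster server under ClickhouseNative is assumed. The flag only toggles the ClickhouseNative behavior; it does not model the absence of the ClickhouseSingleServerAdmin record before this change.

```rust
use std::collections::BTreeMap;

/// Service names mentioned in this PR (the enum itself is a hypothetical stand-in).
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum Service {
    ClickhouseNative,
    ClickhouseSingleServerAdmin,
    ClickhouseAdminServer,
}

impl Service {
    /// SRV label as it appears in the nslookup queries in this PR.
    fn dns_name(self) -> &'static str {
        match self {
            Service::ClickhouseNative => "_clickhouse-native._tcp",
            Service::ClickhouseSingleServerAdmin => "_clickhouse-admin-single-server._tcp",
            Service::ClickhouseAdminServer => "_clickhouse-admin-server._tcp",
        }
    }
}

/// Hypothetical classification of a ClickHouse zone.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ZoneKind {
    SingleNode,
    ClusterServer,
}

/// SRV records modeled as: service -> list of (zone id, port) targets.
type SrvRecords = BTreeMap<Service, Vec<(String, u16)>>;

/// Build the record set. `cluster_in_native == true` models the old behavior,
/// where cluster servers were also advertised under ClickhouseNative.
fn build_records(zones: &[(&str, ZoneKind)], cluster_in_native: bool) -> SrvRecords {
    let mut records = SrvRecords::new();
    for &(zone_id, kind) in zones {
        let mut add = |svc: Service, port: u16| {
            records.entry(svc).or_default().push((zone_id.to_string(), port));
        };
        match kind {
            ZoneKind::SingleNode => {
                add(Service::ClickhouseNative, 9000);
                add(Service::ClickhouseSingleServerAdmin, 8888);
            }
            ZoneKind::ClusterServer => {
                add(Service::ClickhouseAdminServer, 8888);
                if cluster_in_native {
                    // Assumed port; this is the record the fix removes.
                    add(Service::ClickhouseNative, 9000);
                }
            }
        }
    }
    records
}

fn main() {
    // Zone id prefixes taken from the manual test output in this PR.
    let zones = [
        ("ae7bf43d", ZoneKind::SingleNode),
        ("1ebb94b9", ZoneKind::ClusterServer),
        ("b817f829", ZoneKind::ClusterServer),
        ("bdc23f73", ZoneKind::ClusterServer),
    ];
    for (label, old_behavior) in [("before", true), ("after", false)] {
        let records = build_records(&zones, old_behavior);
        let native = &records[&Service::ClickhouseNative];
        println!(
            "{label}: {} has {} target(s)",
            Service::ClickhouseNative.dns_name(),
            native.len()
        );
    }
}
```

A client that picks any target of the ClickhouseNative record can only reach the single-node zone once cluster servers are no longer listed under it.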

Manual testing on a local omicron deployment

With only single node enabled

coatlicue@centzon:~/src/omicron$ pfexec zlogin oxz_clickhouse_ae7bf43d-8234-4b6f-ab3a-540685f5ba41
[Connected to zone 'oxz_clickhouse_ae7bf43d-8234-4b6f-ab3a-540685f5ba41' pts/3]
The illumos Project     helios-2.0.23078        December 2024
root@oxz_clickhouse_ae7bf43d:~# nslookup -type=SRV _clickhouse-native._tcp.control-plane.oxide.internal
;; Got recursion not available from fd00:1122:3344:1::1, trying next server
;; Got recursion not available from fd00:1122:3344:2::1, trying next server
Server:         fd00:1122:3344:3::1
Address:        fd00:1122:3344:3::1#53

Non-authoritative answer:
_clickhouse-native._tcp.control-plane.oxide.internal    service = 0 0 9000 ae7bf43d-8234-4b6f-ab3a-540685f5ba41.host.control-plane.oxide.internal.

Authoritative answers can be found from:
ae7bf43d-8234-4b6f-ab3a-540685f5ba41.host.control-plane.oxide.internal  has AAAA address fd00:1122:3344:101::e

root@oxz_clickhouse_ae7bf43d:~# nslookup -type=SRV _clickhouse-admin-single-server._tcp.control-plane.oxide.internal
;; Got recursion not available from fd00:1122:3344:1::1, trying next server
;; Got recursion not available from fd00:1122:3344:2::1, trying next server
Server:         fd00:1122:3344:3::1
Address:        fd00:1122:3344:3::1#53

Non-authoritative answer:
_clickhouse-admin-single-server._tcp.control-plane.oxide.internal       service = 0 0 8888 ae7bf43d-8234-4b6f-ab3a-540685f5ba41.host.control-plane.oxide.internal.

Authoritative answers can be found from:
ae7bf43d-8234-4b6f-ab3a-540685f5ba41.host.control-plane.oxide.internal  has AAAA address fd00:1122:3344:101::e

With both single node and cluster enabled

root@oxz_clickhouse_ae7bf43d:~# nslookup -type=SRV _clickhouse-admin-single-server._tcp.control-plane.oxide.internal
;; Got recursion not available from fd00:1122:3344:1::1, trying next server
;; Got recursion not available from fd00:1122:3344:2::1, trying next server
Server:         fd00:1122:3344:3::1
Address:        fd00:1122:3344:3::1#53

Non-authoritative answer:
_clickhouse-admin-single-server._tcp.control-plane.oxide.internal       service = 0 0 8888 ae7bf43d-8234-4b6f-ab3a-540685f5ba41.host.control-plane.oxide.internal.

Authoritative answers can be found from:
ae7bf43d-8234-4b6f-ab3a-540685f5ba41.host.control-plane.oxide.internal  has AAAA address fd00:1122:3344:101::e

root@oxz_clickhouse_ae7bf43d:~# nslookup -type=SRV _clickhouse-native._tcp.control-plane.oxide.internal
;; Got recursion not available from fd00:1122:3344:1::1, trying next server
;; Got recursion not available from fd00:1122:3344:2::1, trying next server
Server:         fd00:1122:3344:3::1
Address:        fd00:1122:3344:3::1#53

Non-authoritative answer:
_clickhouse-native._tcp.control-plane.oxide.internal    service = 0 0 9000 ae7bf43d-8234-4b6f-ab3a-540685f5ba41.host.control-plane.oxide.internal.

Authoritative answers can be found from:
ae7bf43d-8234-4b6f-ab3a-540685f5ba41.host.control-plane.oxide.internal  has AAAA address fd00:1122:3344:101::e

root@oxz_clickhouse_ae7bf43d:~# nslookup -type=SRV _clickhouse-admin-server._tcp.control-plane.oxide.internal
;; Got recursion not available from fd00:1122:3344:1::1, trying next server
;; Got recursion not available from fd00:1122:3344:2::1, trying next server
Server:         fd00:1122:3344:3::1
Address:        fd00:1122:3344:3::1#53

Non-authoritative answer:
_clickhouse-admin-server._tcp.control-plane.oxide.internal      service = 0 0 8888 1ebb94b9-9cc4-4c4a-8402-f758cc3b1173.host.control-plane.oxide.internal.
_clickhouse-admin-server._tcp.control-plane.oxide.internal      service = 0 0 8888 b817f829-383f-402a-be4a-b393c1afdff0.host.control-plane.oxide.internal.
_clickhouse-admin-server._tcp.control-plane.oxide.internal      service = 0 0 8888 bdc23f73-83f9-4029-a749-2375d8cb4033.host.control-plane.oxide.internal.

Authoritative answers can be found from:
1ebb94b9-9cc4-4c4a-8402-f758cc3b1173.host.control-plane.oxide.internal  has AAAA address fd00:1122:3344:101::26
b817f829-383f-402a-be4a-b393c1afdff0.host.control-plane.oxide.internal  has AAAA address fd00:1122:3344:101::27
bdc23f73-83f9-4029-a749-2375d8cb4033.host.control-plane.oxide.internal  has AAAA address fd00:1122:3344:101::28

single-node

root@oxz_clickhouse_ae7bf43d:~# /opt/oxide/clickhouse/clickhouse client --host fd00:1122:3344:101::e -q "select * from oximeter.fields_string limit 5"
ddm_router:originated_tunnel_endpoints  1927403117836933753     hostname        centzon
ddm_router:originated_tunnel_endpoints  13387177631086028603    hostname        oxz_switch
ddm_router:originated_underlay_prefixes 12164380667314673492    hostname        centzon
ddm_router:originated_underlay_prefixes 9657734132044246100     hostname        oxz_switch
ddm_session:advertisements_received     5533636839163709296     hostname        centzon

cluster

oximeter_cluster_1 :) select * from oximeter.fields_string limit 5

SELECT *
FROM oximeter.fields_string
LIMIT 5

Query id: c4771ce0-1b36-4f48-aca9-068ec826a67b

Ok.

0 rows in set. Elapsed: 0.002 sec. 

Closes: #7577

Collaborator

@bnaecker bnaecker left a comment

I think this is probably OK at this point, but this is starting to look like we want two different methods to handle the single-node and clustered cases. We're doing almost entirely non-overlapping work in here now, except for the call to host_zone(). We could do that the next time we modify it, but I also don't see a reason to avoid it now.

Contributor

@andrewjstone andrewjstone left a comment

Thanks for finding and fixing this @karencfv. Great job.

I agree with Ben that it's fine to merge this in, but it would probably be better to make two separate functions for dns config. One for single_node and one for cluster.
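
The split suggested in both reviews might look roughly like the following. It is a hedged sketch rather than the omicron DNS builder: `DnsSketch`, `service`, `single_node_zone`, and `cluster_server_zone` are hypothetical names, the host name format is simplified, and any cluster records not shown in this PR's test output are omitted. Only `host_zone()` is shared, as noted in the review.

```rust
use std::collections::BTreeMap;
use std::net::Ipv6Addr;

/// Hypothetical, simplified stand-in for an internal DNS config builder.
#[derive(Default)]
struct DnsSketch {
    /// AAAA records: host name -> address.
    hosts: BTreeMap<String, Ipv6Addr>,
    /// SRV records: service name -> (host name, port) targets.
    srv: BTreeMap<String, Vec<(String, u16)>>,
}

impl DnsSketch {
    /// Shared step: register the zone's AAAA record and return its host name.
    fn host_zone(&mut self, zone_id: &str, addr: Ipv6Addr) -> String {
        let host = format!("{zone_id}.host");
        self.hosts.insert(host.clone(), addr);
        host
    }

    /// Add one SRV target for a service.
    fn service(&mut self, name: &str, host: &str, port: u16) {
        self.srv
            .entry(name.to_string())
            .or_default()
            .push((host.to_string(), port));
    }

    /// Single-node path: the only place ClickhouseNative is registered.
    fn single_node_zone(&mut self, zone_id: &str, addr: Ipv6Addr) {
        let host = self.host_zone(zone_id, addr);
        self.service("_clickhouse-native._tcp", &host, 9000);
        self.service("_clickhouse-admin-single-server._tcp", &host, 8888);
    }

    /// Cluster path: never touches ClickhouseNative.
    fn cluster_server_zone(&mut self, zone_id: &str, addr: Ipv6Addr) {
        let host = self.host_zone(zone_id, addr);
        self.service("_clickhouse-admin-server._tcp", &host, 8888);
    }
}

fn main() {
    let mut dns = DnsSketch::default();
    // Addresses taken from the manual test output in this PR.
    dns.single_node_zone("ae7bf43d", "fd00:1122:3344:101::e".parse().unwrap());
    dns.cluster_server_zone("1ebb94b9", "fd00:1122:3344:101::26".parse().unwrap());
    dns.cluster_server_zone("b817f829", "fd00:1122:3344:101::27".parse().unwrap());
    for (name, targets) in &dns.srv {
        println!("{name}: {targets:?}");
    }
}
```

With separate entry points, the cluster path has no way to add a ClickhouseNative target by accident, which is the class of bug this PR fixes.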

@karencfv karencfv enabled auto-merge (squash) February 24, 2025 22:57
@karencfv karencfv merged commit 6e1021a into oxidecomputer:main Feb 25, 2025
16 checks passed
@karencfv karencfv deleted the dns-clickhouse-bugfix branch February 25, 2025 01:32
Successfully merging this pull request may close these issues.

ClickhouseNative SRV records should only point to single-node
3 participants