Configuration-driven tool for managing DataHub RBAC (domains, groups, policies, ownership) as version-controlled YAML. Works with both DataHub Cloud and OSS, with PAT authentication.
pip install -r requirements.txt
# Set environment variables
export DATAHUB_SERVER=https://your-instance.datahub.cloud/gms
export DATAHUB_TOKEN=your-personal-access-token
# PowerShell: $env:DATAHUB_SERVER = "https://your-instance.datahub.cloud/gms"
# PowerShell: $env:DATAHUB_TOKEN = "your-personal-access-token"
# Preview changes (no mutations sent)
python -m src --dry-run
# Apply to DataHub
python -m src

config/
  hierarchy.yaml     # Domain -> sub-domain hierarchy with optional owners
  privileges.yaml    # Privilege sets per persona (from RBAC design)
  settings.yaml      # Connection, feature flags, naming conventions
src/
  models.py          # Pydantic config validation and YAML loading
  apply.py           # Reads configs, resolves domains, applies groups/owners/policies
tests/
  test_models.py     # Config validation and loading edge cases
  test_apply.py      # Domain resolution, group/policy/owner management (all mocked)
- Identity comes from an external IDP. Users and their group memberships are synced into DataHub from an external identity provider (e.g., Okta, Azure AD via SCIM/SSO). This tool creates the DataHub groups and policies; the IDP populates who belongs to each group. The `owner` field in `hierarchy.yaml` is a convenience that adds a single user to the admin group; it does not replace IDP-based group sync.
- Policies, not roles. DataHub's built-in roles (Admin, Editor, Reader) are intentionally not used. All access is granted exclusively through custom policies targeting groups. Users should have no DataHub role assignments, only group memberships that map to policies managed by this tool.
- Domains are the security boundary. Domain Admin and Sub-Domain Admin policies are scoped by a `DOMAIN` resource filter. A domain defines who can edit what. By default, domains and sub-domains must already exist in DataHub and their names must match `hierarchy.yaml` exactly (case-sensitive). Set `create_domains: true` to auto-create missing domains instead of erroring.
- No visibility restrictions. All users can discover and view all metadata assets. Search Access Controls (`VIEW_ENTITY` scoping) are not implemented (note: this is a DataHub Cloud-only feature).
On each run the tool:
- Resolves domains: looks up existing domains/sub-domains in DataHub by exact name match and maps them to URNs (optionally creates missing ones when `create_domains: true`)
- Creates groups: creates admin groups per domain/sub-domain, plus global Platform Admins and All Users groups (skips any group that already exists)
- Assigns owners: sets the configured user as Business Owner on the domain entity and adds them to the domain's admin group
- Creates/updates policies: matches policies by name; creates new ones or updates existing ones with the correct privileges, actors, and domain scope
All steps are idempotent and the full run produces a summary of actions taken.
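The domain-resolution step above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: `resolve_domains`, `DomainNotFoundError`, and the `existing` name-to-URN mapping are hypothetical stand-ins for whatever lookup the tool performs against DataHub, and the URN generated for a created domain is a placeholder.

```python
from typing import Dict, Iterable


class DomainNotFoundError(Exception):
    """Raised when a configured domain is missing and auto-creation is off."""


def resolve_domains(
    names: Iterable[str],
    existing: Dict[str, str],
    create_domains: bool = False,
) -> Dict[str, str]:
    """Map configured domain names to URNs by exact, case-sensitive match.

    `existing` stands in for the name -> URN listing fetched from DataHub.
    """
    resolved: Dict[str, str] = {}
    for name in names:
        if name in existing:
            resolved[name] = existing[name]
        elif create_domains:
            # Placeholder URN: a real run would create the domain in DataHub
            # and use the URN that DataHub returns.
            resolved[name] = "urn:li:domain:" + name.lower().replace(" ", "-")
        else:
            raise DomainNotFoundError(f"Domain not found in DataHub: {name!r}")
    return resolved


existing = {"Analytics": "urn:li:domain:analytics"}
print(resolve_domains(["Analytics"], existing))
```

Because the match is case-sensitive, a config entry named `analytics` would not resolve against an existing `Analytics` domain in this sketch unless `create_domains` is enabled.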
References existing domains and sub-domains by exact name. By default, domains must already exist in DataHub (unless `create_domains: true` is set); the tool resolves names to URNs at runtime. Each domain/sub-domain can optionally specify an owner (corpuser ID or full URN) who will be set as Business Owner and added to the admin group.
domains:
  - name: Analytics
    description: Analytics domain
    owner: jane_doe  # optional: set as Business Owner + add to admin group
    sub_domains:
      - name: Reporting
        description: Reporting sub-domain
        owner: urn:li:corpuser:bob  # full URN also accepted
      - name: Data Science
        description: Data science sub-domain
  - name: Engineering
    description: Engineering domain
    sub_domains:
      - name: Platform
        description: Platform engineering sub-domain

Connection settings (supports `${ENV_VAR}` interpolation), feature flags, and naming patterns for groups/policies.
datahub:
  server: "${DATAHUB_SERVER}"
  token: "${DATAHUB_TOKEN}"
  disable_ssl_verification: false  # set true for self-signed certs
options:
  dry_run: false
  include_cloud_privileges: true  # set false for OSS-only deployments
  create_domains: false  # set true to auto-create missing domains
naming:
  domain_admin_group: "{domain_name} - Admins"
  sub_domain_admin_group: "{domain_name}/{sub_domain_name} - Admins"
  platform_admin_group: "Platform Admins"
  # ... (see file for full list)

Static privilege definitions per persona, derived from the RBAC implementation design. Rarely modified. Privileges suffixed with `_cloud` are only included when `options.include_cloud_privileges: true` (default). Set to `false` for OSS-only deployments.
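The `_cloud` filtering described above might look something like the sketch below. It assumes, purely for illustration, that a persona's entry is a mapping from a privilege-set name to a list of privilege strings, and that Cloud-only sets are the ones whose key ends in `_cloud`; the real layout of `privileges.yaml` may differ. `select_privileges` and the privilege names are hypothetical.

```python
from typing import Dict, List


def select_privileges(
    privilege_sets: Dict[str, List[str]],
    include_cloud_privileges: bool = True,
) -> List[str]:
    """Flatten a persona's privilege sets, skipping `_cloud` sets when disabled."""
    selected: List[str] = []
    for set_name, privileges in privilege_sets.items():
        if set_name.endswith("_cloud") and not include_cloud_privileges:
            continue  # OSS-only deployment: drop Cloud-only privileges
        selected.extend(privileges)
    return selected


# Placeholder privilege names, not real DataHub privileges.
persona = {
    "metadata": ["PRIVILEGE_A", "PRIVILEGE_B"],
    "metadata_cloud": ["CLOUD_ONLY_PRIVILEGE"],
}
print(select_privileges(persona, include_cloud_privileges=False))
```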
| Persona | Scope | Group | Policies |
|---|---|---|---|
| Platform Admin | Global | Platform Admins | 1 platform + 1 metadata |
| Domain Admin | Per domain | {Domain} - Admins | 1 metadata per domain |
| Sub-Domain Admin | Per sub-domain | {Domain}/{SubDomain} - Admins | 1 metadata per sub-domain |
| Business User | Global | All Users | 1 metadata (all users) |
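As a rough illustration of how the group names in the table follow from the naming patterns in `settings.yaml`, the sketch below expands the two admin-group patterns with `str.format`. The `admin_group_names` helper is hypothetical; only the pattern strings come from the example config.

```python
from typing import List

# Patterns copied from the `naming` section of the example settings.yaml.
DOMAIN_ADMIN_GROUP = "{domain_name} - Admins"
SUB_DOMAIN_ADMIN_GROUP = "{domain_name}/{sub_domain_name} - Admins"


def admin_group_names(domain_name: str, sub_domain_names: List[str]) -> List[str]:
    """Return the admin group name for a domain, then one per sub-domain."""
    names = [DOMAIN_ADMIN_GROUP.format(domain_name=domain_name)]
    for sub_domain_name in sub_domain_names:
        names.append(
            SUB_DOMAIN_ADMIN_GROUP.format(
                domain_name=domain_name, sub_domain_name=sub_domain_name
            )
        )
    return names


for group in admin_group_names("Analytics", ["Reporting", "Data Science"]):
    print(group)
```

For the Analytics domain from the example `hierarchy.yaml`, this yields `Analytics - Admins`, `Analytics/Reporting - Admins`, and `Analytics/Data Science - Admins`.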
- Domains: Resolved by exact name match; optionally created when `create_domains: true`
- Groups: Created if missing, skipped if already exists
- Policies: Matched by name — existing policies are updated, new ones are created
- Owners: Ownership and group membership are set via additive mutations (safe to re-run)
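The idempotency rules above can be illustrated with a small in-memory sketch. The `groups` set and `policies` dict stand in for state held in DataHub, and `ensure_group` and `upsert_policy` are hypothetical helpers, not the tool's real functions.

```python
from typing import Dict, Set


def ensure_group(groups: Set[str], name: str) -> str:
    """Create the group only if it is missing; report what happened."""
    if name in groups:
        return "skipped"
    groups.add(name)
    return "created"


def upsert_policy(policies: Dict[str, dict], name: str, policy: dict) -> str:
    """Match a policy by name: update it if present, otherwise create it."""
    action = "updated" if name in policies else "created"
    policies[name] = dict(policy)
    return action


groups: Set[str] = set()
policies: Dict[str, dict] = {}

# Running the same plan twice leaves the state unchanged after the first run.
for run in (1, 2):
    group_action = ensure_group(groups, "Platform Admins")
    policy_action = upsert_policy(
        policies, "Platform Admins - Metadata", {"actors": ["Platform Admins"]}
    )
    print(run, group_action, policy_action)
```

The first pass reports `created` for both objects; the second reports `skipped` and `updated`, with the resulting state identical to the first run.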
python -m src [--dry-run] [--config-dir CONFIG_DIR]
| Flag | Default | Description |
|---|---|---|
| `--dry-run` | `false` | Preview all planned actions without connecting to DataHub |
| `--config-dir` | `config/` | Path to the directory containing the YAML config files |
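A command line with these two flags could be declared with `argparse` roughly as shown below. This is a sketch under the assumption that the tool uses `argparse`; the `build_parser` function name and the help strings are illustrative, while the flag names and defaults come from the table.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the --dry-run and --config-dir flags."""
    parser = argparse.ArgumentParser(prog="python -m src")
    parser.add_argument(
        "--dry-run",
        action="store_true",  # defaults to False when the flag is absent
        help="Preview all planned actions without connecting to DataHub",
    )
    parser.add_argument(
        "--config-dir",
        default="config/",
        help="Path to the directory containing the YAML config files",
    )
    return parser


args = build_parser().parse_args(["--dry-run"])
print(args.dry_run, args.config_dir)
```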
The included workflow (.github/workflows/apply-rbac.yaml) runs on push to main when config files change, and supports manual dispatch. Set DATAHUB_SERVER and DATAHUB_TOKEN as repository secrets.
pytest tests/

All tests mock the DataHubGraph client, so no live DataHub instance is needed.
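A mocked test in this style might look like the sketch below, which uses `unittest.mock.MagicMock` in place of a real client. The `ensure_group` helper and the `exists` / `create_group` method names on the mock are assumptions made for illustration; they are not taken from the repository's test suite.

```python
from unittest.mock import MagicMock


def ensure_group(graph, group_urn: str) -> str:
    """Hypothetical helper: create the group only if the client reports it missing."""
    if graph.exists(group_urn):
        return "skipped"
    graph.create_group(group_urn)  # illustrative method name on the mock
    return "created"


def test_skips_existing_group():
    graph = MagicMock()
    graph.exists.return_value = True
    assert ensure_group(graph, "urn:li:corpGroup:platform-admins") == "skipped"
    graph.create_group.assert_not_called()


def test_creates_missing_group():
    graph = MagicMock()
    graph.exists.return_value = False
    assert ensure_group(graph, "urn:li:corpGroup:platform-admins") == "created"
    graph.create_group.assert_called_once_with("urn:li:corpGroup:platform-admins")
```

Because the client is a mock, both tests run without network access or credentials.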