
max-datahub/datahub-rbac-config

DataHub RBAC-as-Code

Configuration-driven tool for managing DataHub RBAC (domains, groups, policies, ownership) as version-controlled YAML. Works with both DataHub Cloud and OSS, and authenticates with a personal access token (PAT).

Quick Start

pip install -r requirements.txt

# Set environment variables
export DATAHUB_SERVER=https://your-instance.datahub.cloud/gms
export DATAHUB_TOKEN=your-personal-access-token
# PowerShell: $env:DATAHUB_SERVER = "https://your-instance.datahub.cloud/gms"
# PowerShell: $env:DATAHUB_TOKEN = "your-personal-access-token"

# Preview changes (no mutations sent)
python -m src --dry-run

# Apply to DataHub
python -m src

Project Structure

config/
  hierarchy.yaml    # Domain -> sub-domain hierarchy with optional owners
  privileges.yaml   # Privilege sets per persona (from RBAC design)
  settings.yaml     # Connection, feature flags, naming conventions
src/
  models.py         # Pydantic config validation and YAML loading
  apply.py          # Reads configs, resolves domains, applies groups/owners/policies
tests/
  test_models.py    # Config validation and loading edge cases
  test_apply.py     # Domain resolution, group/policy/owner management (all mocked)

Assumptions

  • Identity comes from an external IDP. Users and their group memberships are synced into DataHub from an external identity provider (e.g., Okta, Azure AD via SCIM/SSO). This tool creates the DataHub groups and policies; the IDP populates who belongs to each group. The owner field in hierarchy.yaml is a convenience that adds a single user to the admin group — it does not replace IDP-based group sync.
  • Policies, not roles. DataHub's built-in roles (Admin, Editor, Reader) are intentionally not used. All access is granted exclusively through custom policies targeting groups. Users should have no DataHub role assignments — only group memberships that map to policies managed by this tool.
  • Domains are the security boundary. Domain Admin and Sub-Domain Admin policies are scoped by a DOMAIN resource filter. A domain defines who can edit what. By default, domains and sub-domains must already exist in DataHub and their names must match hierarchy.yaml exactly (case-sensitive). Set create_domains: true to auto-create missing domains instead of erroring.
  • No visibility restrictions. All users can discover and view all metadata assets. Search Access Controls (VIEW_ENTITY scoping), a DataHub Cloud-only feature, are not implemented.

What It Does

On each run the tool:

  1. Resolves domains — looks up existing domains/sub-domains in DataHub by exact name match and maps them to URNs (optionally creates missing ones when create_domains: true)
  2. Creates groups — creates admin groups per domain/sub-domain, plus global Platform Admins and All Users groups (skips if already exists)
  3. Assigns owners — sets the configured user as Business Owner on the domain entity and adds them to the domain's admin group
  4. Creates/updates policies — matches policies by name; creates new ones or updates existing ones with the correct privileges, actors, and domain scope

All steps are idempotent and the full run produces a summary of actions taken.
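The domain-resolution step can be sketched as follows. This is a minimal illustration, not the code in src/apply.py: the function name `resolve_domains`, the in-memory `existing` mapping (standing in for a DataHub lookup), and the example URN are all assumptions.

```python
def resolve_domains(wanted, existing, create_domains=False):
    """Map configured domain names to URNs by exact, case-sensitive match.

    wanted:   domain names from hierarchy.yaml
    existing: dict of name -> URN, standing in for what DataHub returns
    Returns (resolved, to_create). Raises LookupError for a missing
    domain unless create_domains is True.
    """
    resolved, to_create = {}, []
    for name in wanted:
        if name in existing:          # exact match, so "analytics" != "Analytics"
            resolved[name] = existing[name]
        elif create_domains:
            to_create.append(name)    # would be created before policies are applied
        else:
            raise LookupError(f"domain {name!r} not found in DataHub")
    return resolved, to_create


# Hypothetical URN, for illustration only.
existing = {"Analytics": "urn:li:domain:analytics"}
resolved, to_create = resolve_domains(["Analytics", "Engineering"], existing, create_domains=True)
```

With `create_domains=False`, the same call raises for `Engineering` instead of queuing it for creation, which matches the default behaviour described in the Assumptions section.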

Configuration

config/hierarchy.yaml

References existing domains and sub-domains by exact name. By default, domains must already exist in DataHub (set options.create_domains: true in settings.yaml to create missing ones); the tool resolves names to URNs at runtime. Each domain or sub-domain can optionally specify an owner (corpuser ID or full URN), who is set as Business Owner and added to the admin group.

domains:
  - name: Analytics
    description: Analytics domain
    owner: jane_doe                       # optional: set as Business Owner + add to admin group
    sub_domains:
      - name: Reporting
        description: Reporting sub-domain
        owner: urn:li:corpuser:bob        # full URN also accepted
      - name: Data Science
        description: Data science sub-domain

  - name: Engineering
    description: Engineering domain
    sub_domains:
      - name: Platform
        description: Platform engineering sub-domain
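Because `owner` accepts either a bare corpuser ID or a full URN, the value has to be normalised before use. A minimal sketch of that normalisation, assuming a hypothetical helper named `to_corpuser_urn` (the tool's actual handling may differ):

```python
CORPUSER_PREFIX = "urn:li:corpuser:"


def to_corpuser_urn(owner: str) -> str:
    """Return a corpuser URN, leaving values that are already URNs untouched."""
    if owner.startswith(CORPUSER_PREFIX):
        return owner
    return CORPUSER_PREFIX + owner


# Both forms used in the example above resolve to a full URN.
jane = to_corpuser_urn("jane_doe")
bob = to_corpuser_urn("urn:li:corpuser:bob")
```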

config/settings.yaml

Connection settings (supports ${ENV_VAR} interpolation), feature flags, and naming patterns for groups/policies.

datahub:
  server: "${DATAHUB_SERVER}"
  token: "${DATAHUB_TOKEN}"
  disable_ssl_verification: false   # set true for self-signed certs

options:
  dry_run: false
  include_cloud_privileges: true    # set false for OSS-only deployments
  create_domains: false             # set true to auto-create missing domains

naming:
  domain_admin_group: "{domain_name} - Admins"
  sub_domain_admin_group: "{domain_name}/{sub_domain_name} - Admins"
  platform_admin_group: "Platform Admins"
  # ... (see file for full list)
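The `${ENV_VAR}` interpolation used for `server` and `token` can be approximated with a small regular-expression substitution. This is a sketch under assumptions: the function name `interpolate_env` is hypothetical, and raising on an unset variable is one possible policy rather than the tool's documented behaviour.

```python
import os
import re

# Matches ${NAME} where NAME is a conventional environment variable name.
_ENV_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def interpolate_env(value: str, env=None) -> str:
    """Replace each ${NAME} in value with env[NAME]; raise KeyError if unset."""
    env = os.environ if env is None else env

    def _substitute(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _ENV_PATTERN.sub(_substitute, value)


# An explicit mapping is passed here so the example does not depend on the shell.
server = interpolate_env("${DATAHUB_SERVER}", {"DATAHUB_SERVER": "https://your-instance.datahub.cloud/gms"})
```

Failing fast on a missing variable keeps an empty token from being sent to the server, at the cost of requiring both variables even for a dry run.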

config/privileges.yaml

Static privilege definitions per persona, derived from the RBAC implementation design. Rarely modified. Privileges suffixed with _cloud are only included when options.include_cloud_privileges: true (default). Set to false for OSS-only deployments.
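One way to picture the `include_cloud_privileges` switch is as a filter over a persona's privilege list. The sketch below assumes the `_cloud` suffix sits on individual entries and that such entries are simply dropped for OSS; the function name and the privilege names are placeholders, not values from privileges.yaml.

```python
def select_privileges(privileges, include_cloud_privileges=True):
    """Keep every entry, or drop those ending in _cloud for OSS-only deployments."""
    if include_cloud_privileges:
        return list(privileges)
    return [p for p in privileges if not p.endswith("_cloud")]


# Placeholder names for illustration only.
persona = ["EXAMPLE_EDIT", "EXAMPLE_PROPOSE_cloud"]
oss_only = select_privileges(persona, include_cloud_privileges=False)
```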

Personas

| Persona | Scope | Group | Policies |
| Platform Admin | Global | Platform Admins | 1 platform + 1 metadata |
| Domain Admin | Per domain | {Domain} - Admins | 1 metadata per domain |
| Sub-Domain Admin | Per sub-domain | {Domain}/{SubDomain} - Admins | 1 metadata per sub-domain |
| Business User | Global | All Users | 1 metadata (all users) |
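The group names listed for each persona follow the `naming` patterns in settings.yaml. Assuming the patterns are filled in with ordinary Python string formatting (the helper name `render_group_name` is hypothetical), the expansion looks like this:

```python
def render_group_name(pattern: str, **fields: str) -> str:
    """Fill a naming pattern such as "{domain_name} - Admins" with concrete values."""
    return pattern.format(**fields)


domain_group = render_group_name("{domain_name} - Admins", domain_name="Analytics")
sub_domain_group = render_group_name(
    "{domain_name}/{sub_domain_name} - Admins",
    domain_name="Analytics",
    sub_domain_name="Reporting",
)
# domain_group     -> "Analytics - Admins"
# sub_domain_group -> "Analytics/Reporting - Admins"
```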

Idempotency

  • Domains: Resolved by exact name match; optionally created when create_domains: true
  • Groups: Created if missing, skipped if already exists
  • Policies: Matched by name — existing policies are updated, new ones are created
  • Owners: Ownership and group membership are set via additive mutations (safe to re-run)
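Matching policies by name is what keeps re-runs from creating duplicates. A minimal sketch of that upsert, using a plain dict as a stand-in for DataHub's policy store (the function name, policy shape, and privilege name are assumptions for illustration):

```python
def upsert_policies(desired, store):
    """Create or update each desired policy, keyed by its name.

    store: dict of policy name -> policy, standing in for DataHub.
    Returns the list of (action, name) pairs taken.
    """
    actions = []
    for policy in desired:
        action = "update" if policy["name"] in store else "create"
        store[policy["name"]] = dict(policy)   # same key, so no duplicate entry
        actions.append((action, policy["name"]))
    return actions


store = {}
desired = [{"name": "Analytics - Domain Admin", "privileges": ["EXAMPLE_PRIVILEGE"]}]
first_run = upsert_policies(desired, store)    # [("create", "Analytics - Domain Admin")]
second_run = upsert_policies(desired, store)   # [("update", "Analytics - Domain Admin")]
```

The second run reports an update rather than a create, and the store still holds a single policy. Renaming a policy in the config would break this match and produce a new policy, since the name is the only key.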

CLI Options

python -m src [--dry-run] [--config-dir CONFIG_DIR]

| Flag | Default | Description |
| --dry-run | false | Preview all planned actions without connecting to DataHub |
| --config-dir | config/ | Path to the directory containing the YAML config files |
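For readers who want to see how these two flags map onto Python's `argparse`, here is a sketch. It mirrors the flags and defaults documented above but is not copied from the tool's source; `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the two documented flags and their defaults."""
    parser = argparse.ArgumentParser(prog="python -m src")
    parser.add_argument(
        "--dry-run",
        action="store_true",          # absent -> False, present -> True
        help="Preview all planned actions",
    )
    parser.add_argument(
        "--config-dir",
        default="config/",
        help="Path to the directory containing the YAML config files",
    )
    return parser


args = build_parser().parse_args(["--dry-run"])
# args.dry_run is True, args.config_dir is "config/"
```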

GitHub Actions

The included workflow (.github/workflows/apply-rbac.yaml) runs on push to main when config files change, and supports manual dispatch. Set DATAHUB_SERVER and DATAHUB_TOKEN as repository secrets.

Tests

pytest tests/

All tests mock the DataHubGraph client — no live DataHub instance needed.
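A mocked test in this style might look like the sketch below. It uses `unittest.mock.MagicMock` in place of the client, so nothing from the DataHub SDK is imported. The helper `ensure_group`, the group URN, and the dict passed to `emit` are placeholders for illustration, not the tool's real code or a valid DataHub payload.

```python
from unittest.mock import MagicMock


def ensure_group(graph, group_urn):
    """Hypothetical helper: create the group only when it does not exist yet."""
    if graph.exists(group_urn):
        return "skipped"
    graph.emit({"urn": group_urn})    # placeholder payload, not a real aspect
    return "created"


def test_existing_group_is_skipped():
    graph = MagicMock()
    graph.exists.return_value = True
    assert ensure_group(graph, "urn:li:corpGroup:analytics-admins") == "skipped"
    graph.emit.assert_not_called()


def test_missing_group_is_created():
    graph = MagicMock()
    graph.exists.return_value = False
    assert ensure_group(graph, "urn:li:corpGroup:analytics-admins") == "created"
    graph.emit.assert_called_once()
```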
