The Modern Data Architecture Accelerator (MDAA) is designed to accelerate the implementation of a secure, compliant, and fully capable Modern Data Architecture on AWS, allowing organizations of all sizes and levels of sophistication to quickly focus on driving business outcomes from their data while maintaining high assurance of security compliance. Specifically, it enables organizations to rapidly solve data-driven problems using both traditional analytics and contemporary capabilities such as generative AI.
MDAA provides rapid deployment of all major elements of a Modern Data Architecture, such as Ingest, Persistence, Governance, DataOps, Consumption, Visual Analytics, Data Science, and AI/ML. Additionally, MDAA has been designed to accelerate compliance with the AWS Solutions, NIST 800-53 Rev5 (US), HIPAA, and PCI-DSS CDK Nag rulesets, as well as ITSG-33 (Canada) security control requirements. Terraform modules are compliant with standard Checkov security policies. This combination of integral compliance and broad, configuration-driven capability allows for rapid design and deployment of simple to complex data analytics environments (including Lake House and Data Mesh architectures) while minimizing security compliance risk.
- Any organization looking to rapidly deploy a secure Modern Data Architecture in support of data-driven business/mission requirements, such as Analytics, Business Intelligence, AI/ML, and Generative AI.
- Large organizations looking to design and deploy complex Modern Data Architectures such as Lake House or Data Mesh.
- Small to Medium organizations looking for code-free, configuration-driven deployment of a Data Analytics platform.
- Builder organizations building custom, code-driven data analytics architectures through the use of reusable, compliant constructs across multiple languages.
- Any organization with elevated compliance/regulatory requirements.
Getting started with MDAA requires the following steps:
- Architecture and Design - A physical platform architecture is defined, either from scratch or derived from an AWS/MDAA reference design.
- Configuration - One or more MDAA configuration files are authored, along with individual configuration files for each MDAA module.
- (Optional) Customization - Resources and stacks can optionally be customized through code-based escape hatches before deployment (see the sketch after this list).
- Predeployment Preparation - In this step, the MDAA NPM packages are built and published to a private NPM repo.
- Deployment - Each MDAA configuration file is deployed either manually or automatically (via CI/CD).
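For the optional customization step, CDK-based modules can be adjusted through the standard CDK escape hatches. The snippet below is a minimal, hypothetical sketch (the construct ID and the overridden property are illustrative, not actual MDAA names) of how a synthesized resource could be tweaked before deployment:

```typescript
import { Stack } from 'aws-cdk-lib';
import * as s3 from 'aws-cdk-lib/aws-s3';

// Hypothetical customization: reach through a deployed bucket construct to its
// underlying CloudFormation resource and override a property not exposed as a prop.
function overrideBucketLifecycle(stack: Stack, bucketId: string): void {
  const bucket = stack.node.findChild(bucketId) as s3.Bucket;
  const cfnBucket = bucket.node.defaultChild as s3.CfnBucket;
  cfnBucket.addPropertyOverride(
    'LifecycleConfiguration.Rules.0.ExpirationInDays',
    365 // illustrative value
  );
}
```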
Alternatively, you can jump directly into a set of sample architectures and configurations. Note that these sample configurations can be used as a starting point for much more sophisticated architectures.
- Basic DataLake with Glue - A basic S3 Data Lake with Glue database and crawler
- Basic Terraform DataLake - A basic S3 Data Lake built with the MDAA Terraform module
- Fine-grained Access Control DataLake - An S3 Data Lake with fine-grained access control using LakeFormation
- Data Warehouse - A standalone Redshift Data Warehouse
- Lakehouse - A full Lakehouse implementation, with a Data Lake, Data Ops layers (using NYC taxi data), and a Redshift data warehouse
- AI Development Platform - A standalone SageMaker AI Studio Data Science Platform
- GenAI Platform - A standalone GAIA GenAI Platform
Additionally, once your Modern Data Architecture is deployed, you can use these sample Data Operations blueprints, including MDAA configs and DataOps code, to start solving your data-driven problems.
- Basic Crawler - A basic crawler blueprint
- Event-Driven CSV to Parquet Lambda - A blueprint for transforming small-to-medium CSV files into Parquet as they are uploaded into a data lake.
- Schedule-Driven CSV to Parquet Glue - A blueprint for transforming larger CSV files into Parquet on a scheduled basis using Glue ETL.
MDAA is designed as a set of logical architectural layers, each composed of a set of functional 'modules'. Each module configures and deploys a set of resources which together make up the data analytics environment. Modules may have logical dependencies on each other, and may also leverage non-MDAA resources deployed within the environment, such as those deployed via Landing Zone Accelerator.
While MDAA can be used to implement a comprehensive, end-to-end data analytics platform, it does not result in a closed system. MDAA may be freely integrated with non-MDAA deployed platform elements and analytics capabilities. Any individual layer or module of MDAA can be replaced by a non-MDAA component, and the remaining layers/modules will continue to function (assuming basic functional parity with the replaced MDAA module/layer).
MDAA is conceptually, architecturally, and technically similar to the Landing Zone Accelerator (LZA), providing similar functionality for analytics platform configuration and deployment as LZA does for general cloud platform configuration and deployment. The logical layers of MDAA are specifically designed to be deployed on top of a general-purpose, secure cloud platform such as that deployed by LZA.
See MDAA Security
- Leverages Infrastructure as Code (CDK/CloudFormation, Terraform) as the single agent of deployment and change within the target AWS accounts
- Optional governed, secure self-service deployments via Service Catalog
- Consistent but customizable naming convention across all deployed resources
- Consistent tagging of all generated resources
- Flexible, YAML configuration-driven deployments (CDK Apps) with implicit application of security controls in code
- Ability to orchestrate architectures with both Terraform and CDK-based modules
- Optional publishing of Service Catalog products for end-user self-service of compliant infrastructure
- Reusable CDK L2 and L3 Constructs, and Terraform Modules for consistent application of security controls across modules
- Extensibility through multi-language support using the same approach as CDK itself (via JSII)
- TypeScript/Node.js
- Python 3.x
- Java
- .Net
MDAA is implemented as a set of compliant modules which can be deployed via a unified Deployment/Orchestration layer.
- MDAA CDK Modules - A set of configuration-driven CDK Apps which leverage the MDAA CDK Constructs to define and deploy compliant data analytics environment components as CloudFormation stacks. These apps can be executed directly and independently using the CDK CLI, or composed and orchestrated via the MDAA CLI.
- MDAA Terraform Modules (Preview) - A set of standardized Terraform modules which adhere to security control requirements. These modules can be executed directly and independently using the Terraform CLI, or composed and orchestrated via the MDAA CLI. Note that Terraform integration is currently in preview, and not all MDAA functionality is available.
- MDAA CDK L2 and L3 Constructs - A set of reusable CDK constructs which are leveraged by the rest of the MDAA codebase, but can also be reused to build additional compliant CDK constructs, stacks, or apps. These constructs are each designed for compliance with the AWS Solutions, HIPAA, PCI-DSS, and NIST 800-53 R5 CDK Nag rulesets. Like the CDK codebase MDAA is built on, MDAA constructs are available with bindings for multiple languages, currently including TypeScript/Node.js and Python 3.
- MDAA CLI (Deployment/Orchestration) App - A configuration-driven CLI application which allows for composition and orchestration of multiple MDAA Modules (CDK and Terraform) in order to deploy a compliant, end-to-end data analytics environment. It also ensures that each MDAA Module is deployed with the specified configuration into the specified accounts, while accounting for dependencies between modules.
- (Preview) SageMaker Catalog - Allows SageMaker Catalog domains to be deployed.
- (Preview) DataZone - Allows DataZone domains and environment blueprints to be deployed.
- (Preview) Macie Session - Allows Macie sessions to be deployed at the account level.
- LakeFormation Data Lake Settings - Allows LF Settings to be administered using IaC.
- LakeFormation Access Controls - Allows LF Access Controls to be administered using IaC
- Glue Catalog - Configures the Encryption at Rest settings for Glue Catalog at the account level. Additionally, configures Glue catalogs for cross-account access required by a Data Mesh architecture.
- IAM Roles and Policies - Generates IAM roles for use within the Data Environment
- Audit - Generates Audit resources to use as target for audit data and for querying audit data via Athena
- Audit Trail - Generates CloudTrail to capture S3 Data Events into Audit Bucket
- Service Catalog - Allows Service Catalog Portfolios to be deployed and access granted to principals
- Datalake KMS and Buckets - Generates a set of encrypted data lake buckets and bucket policies. Bucket policies are suitable for direct access via IAM and/or federated roles, as well as indirect access via LakeFormation/Athena.
- Athena Workgroup - Generates Athena Workgroups for use on the Data Lake
- Data Ops Project - Generates shared secure resources for use in Data Ops pipelines, such as Glue Databases, LakeFormation grants, and DataZone Projects/Environments/DataSources
- Data Ops Crawlers - Generates Glue crawlers for use in Data Ops pipelines
- Data Ops Jobs - Generates Glue jobs for use in Data Ops pipelines
- Data Ops Workflows - Generates Glue workflows for orchestrating Data Ops pipelines
- Data Ops Step Functions - Generates Step Functions for orchestrating Data Ops pipelines
- Data Ops Lambda - Deploys Lambda functions for reacting to data events and performing smaller scale data processing
- Data Ops DataBrew - Generates Glue DataBrew resources (Jobs, Recipes) for performing data profiling and cleansing
- (Preview) Data Ops Nifi - Generates Apache Nifi clusters for building event-driven data flows
- (Preview) Data Ops Database Migration Service (DMS) - Generates DMS Replication Instances, Endpoints, and Tasks
- Redshift Data Warehouse - Deploys secure Redshift Data Warehouse clusters
- OpenSearch Domain - Deploys secure OpenSearch Domains and OpenSearch Dashboards
- QuickSight Account - Deploys resources required to provision a QuickSight account
- QuickSight Namespace - Deploys QuickSight namespaces into an account, allowing QuickSight multi-tenancy within the same QuickSight/AWS account
- QuickSight Project - Deploys QuickSight Shared Folders and permissions
- SageMaker Studio Domain - Deploys a secured SageMaker Studio Domain
- SageMaker Notebooks - Deploys secured SageMaker Notebooks
- Data Science Team/Project - Deploys resources to support a team's Data Science activities
- Generative AI Accelerator - Deploys resources for an authenticated GenAI-powered ChatBot
- EC2 - Generates secure EC2 instances and security groups
- SFTP Transfer Family Server - Deploys SFTP Transfer Family service for loading data into the Data Lake
- SFTP Transfer Family User Administrator - Allows SFTP Transfer Family users to be administered in IaC
- DataSync - Deploys DataSync resources for moving data between on-premises storage systems and cloud-based storage services
- EventBridge - Deploys EventBridge resources such as EventBuses
These constructs are specifically designed to be compliant with the AWS Solutions, HIPAA, PCI-DSS, and NIST 800-53 R5 CDK Nag rulesets and are used throughout the MDAA codebase. Additionally, these compliant constructs can be directly leveraged to build new constructs outside of the MDAA codebase (a sketch of this pattern follows the list below).
- Athena Workgroup Constructs
- EC2 Constructs
- (Preview) ECS Constructs
- (Preview) EKS Constructs
- Glue Crawlers, Jobs, and Security Configuration Constructs
- Glue DataBrew Job and Recipe Constructs
- IAM Role Construct
- KMS CMK Construct
- Lambda Role and Function Constructs
- Redshift Cluster Construct
- S3 Bucket Construct
- SageMaker Constructs (Studio and Notebooks)
- OpenSearch Constructs
- SQS Queue Construct
- SNS Topic Construct
- SFTP Transfer Family Server Construct
- (Preview) RDS Aurora Constructs
- (Preview) DynamoDB Construct
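As an illustration of how such constructs can seed new compliant infrastructure code, the sketch below shows the general pattern of a compliant-by-default L2 wrapper. The class name and interface are hypothetical, not the actual MDAA construct API; only the underlying CDK properties are standard:

```typescript
import { Construct } from 'constructs';
import * as kms from 'aws-cdk-lib/aws-kms';
import * as s3 from 'aws-cdk-lib/aws-s3';

// Hypothetical compliant-by-default bucket in the style of the MDAA S3 Bucket
// Construct: security-relevant properties are fixed rather than caller-supplied.
export class CompliantBucket extends s3.Bucket {
  constructor(scope: Construct, id: string, encryptionKey: kms.IKey) {
    super(scope, id, {
      encryption: s3.BucketEncryption.KMS, // CMK encryption at rest
      encryptionKey,
      enforceSSL: true, // deny non-TLS access via bucket policy
      blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
      versioned: true,
    });
  }
}
```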
These modules are specifically designed to be compliant with standard Checkov rules. Each Terraform module will have Checkov applied at plan/deploy time. Note that these modules are managed in a separate MDAA Terraform Git Repo.
- Athena Workgroups
- S3 Datalake
- Data Science Team
- Glue Catalog Settings
- DataOps Glue Crawlers
- DataOps Glue Jobs
- DataOps Glue Workflow
- DataOps Projects
MDAA can be used and extended in the following ways:
- Configuration-driven, compliant, end-to-end Analytics Environments can be configured and deployed using MDAA config files and the MDAA CLI
  - Suited to organizations with minimal IaC development and support capability or bandwidth
  - Accessible by all roles
  - No-code, YAML configurations
  - Simple to complex configurations and deployments
  - High end-to-end compliance assurance
- Custom, code-driven, end-to-end Analytics Environments can be authored and deployed using MDAA reusable constructs
  - Suited to organizations with IaC development and support capability
  - Accessible by developers and builders
  - Multi-language support
  - High compliance assurance for resources deployed via MDAA constructs
- Custom-developed and deployed data-driven applications/workloads can be configured to leverage MDAA-deployed resources via the standard set of SSM params published by all MDAA modules (as sketched below)
  - Independently developed in Terraform, CDK, or CloudFormation
  - Loosely coupled with MDAA via SSM params
  - Workload/application compliance independently validated
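For example, a workload stack developed outside MDAA might resolve an MDAA-published SSM parameter to locate a data lake bucket. This is a minimal sketch: the parameter path below is hypothetical, as actual paths depend on your MDAA naming configuration.

```typescript
import { Stack, StackProps } from 'aws-cdk-lib';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as ssm from 'aws-cdk-lib/aws-ssm';
import { Construct } from 'constructs';

// Hypothetical parameter path published by an MDAA data lake module.
const RAW_BUCKET_PARAM = '/sample-org/datalake/raw-bucket-name';

export class WorkloadStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);
    // Resolve the bucket name at deploy time, keeping the coupling loose.
    const bucketName = ssm.StringParameter.valueForStringParameter(this, RAW_BUCKET_PARAM);
    const rawBucket = s3.Bucket.fromBucketName(this, 'RawBucket', bucketName);
    // ...wire rawBucket into application resources (e.g., grant read to a Lambda).
  }
}
```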
This solution collects anonymous operational metrics to help AWS improve the quality and features of the solution. For more information, including how to disable this capability, please see the [implementation guide](https://docs.aws.amazon.com/cdk/latest/guide/cli.html#version_reporting).
MDAA includes comprehensive testing for both TypeScript/CDK code and Python Lambda/Glue functions:
- TypeScript Testing: CDK unit tests using the CDK Assertions framework (an illustrative test sketch follows the commands below)
- Python Testing: Modern `uv`-based testing with pytest for Lambda functions and Glue jobs
- CI/CD Integration: Automated testing in build pipelines
```bash
# Run all tests
./scripts/test.sh                   # Both TypeScript and Python tests

# Run specific test types
lerna run test --stream             # TypeScript tests only
npm run test:python:all             # Python tests only

# Development workflow
lerna run build && lerna run test   # Build and test TypeScript
uv run pytest                       # Run Python tests (from python-tests/ dir)
```
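For reference, a TypeScript unit test in this style uses the CDK Assertions framework to check synthesized templates. The test below is illustrative only (a generic stack and bucket, not an actual MDAA module) and assumes Jest as the test runner:

```typescript
import { App, Stack } from 'aws-cdk-lib';
import { Template } from 'aws-cdk-lib/assertions';
import * as s3 from 'aws-cdk-lib/aws-s3';

test('bucket blocks public access and enforces SSL', () => {
  const stack = new Stack(new App(), 'TestStack');
  new s3.Bucket(stack, 'Bucket', {
    enforceSSL: true,
    blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
  });

  // Assert against the synthesized CloudFormation template.
  const template = Template.fromStack(stack);
  template.hasResourceProperties('AWS::S3::Bucket', {
    PublicAccessBlockConfiguration: { BlockPublicAcls: true },
  });
});
```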
For detailed development and testing information, see:
- DEVELOPMENT.md - Development setup and testing guide
- PYTHON_TESTING.md - Comprehensive Python testing documentation
- CONTRIBUTING.md - Contribution guidelines
See CONTRIBUTING for more information.
This project is licensed under the Apache-2.0 License.