DataWeave is a synthetic dataset generator that uses Apache Airflow and open-source models served through Groq to generate data and load it into MongoDB collections.
- Intelligent Data Generation: Leveraging AI for synthetic dataset creation
- Workflow Automation: Streamlined data pipeline using Apache Airflow
- Scalable Architecture: Flexible data generation and loading mechanisms
- Open-Source Transparency: Modular design for easy customization
Apache Airflow is an open-source platform for orchestrating complex computational workflows and data processing pipelines. Key benefits:
- Programmatic workflow definition
- Dependency management
- Monitoring and retry mechanisms
- Scalable task execution
# Using pip
pip install apache-airflow
# Using Astro CLI (recommended)
brew install astro # macOS
# Windows/Linux: Download from Astronomer website
- Python 3.8+
- Apache Airflow 2.x
- Groq API access
- MongoDB
- Astro CLI (optional)
# Clone repository
git clone https://github.com/avd1729/DataWeave.git
cd DataWeave
# Create virtual environment
virtualenv venv
source venv/bin/activate # Linux/macOS
venv\Scripts\Activate # Windows
# Install dependencies
pip install -r requirements.txt
- Set Groq API credentials
- Configure MongoDB connection
- Define Airflow DAG parameters
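One way to handle the first two configuration steps is to read them from environment variables. The sketch below is illustrative only: the variable names `GROQ_API_KEY`, `MONGODB_URI`, and `MONGODB_DATABASE`, the default values, and the `Settings` and `load_settings` names are assumptions, not something this repository defines.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    groq_api_key: str
    mongodb_uri: str
    mongodb_database: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read connection settings from environment variables (names are assumed)."""
    env = os.environ if env is None else env
    api_key = env.get("GROQ_API_KEY", "")
    if not api_key:
        # Fail early instead of letting a task fail midway through a DAG run.
        raise ValueError("GROQ_API_KEY is not set")
    return Settings(
        groq_api_key=api_key,
        # Assumed defaults for a local MongoDB instance.
        mongodb_uri=env.get("MONGODB_URI", "mongodb://localhost:27017"),
        mongodb_database=env.get("MONGODB_DATABASE", "dataweave"),
    )
```

Passing a plain dictionary instead of `os.environ` makes the function easy to exercise without touching the real environment.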
dataweave/
│
├── dags/ # Airflow workflow definitions
├── utils/ # Data processing utilities
│ ├── data_extraction.py
│ ├── data_loading.py
│ └── data_transformation.py
└── requirements.txt
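The module names in `utils/` suggest an extract, transform, load flow. The following is a minimal sketch of how such stages could fit together, not the repository's actual code: `fake_model` is a hypothetical stand-in for a Groq call, and a plain list stands in for a MongoDB collection so the example runs without external services.

```python
import json
from typing import Callable, Dict, List

Record = Dict[str, object]


def extract(generate: Callable[[str], str], prompt: str) -> List[Record]:
    """Ask the model for data and parse its reply as a JSON array of objects."""
    parsed = json.loads(generate(prompt))
    if not isinstance(parsed, list):
        raise ValueError("expected a JSON array of records")
    return [item for item in parsed if isinstance(item, dict)]


def transform(records: List[Record]) -> List[Record]:
    """Normalize keys to lowercase and drop records that are empty."""
    cleaned = [{str(k).strip().lower(): v for k, v in r.items()} for r in records]
    return [r for r in cleaned if r]


def load(records: List[Record], sink: List[Record]) -> int:
    """Append records to the sink and return how many were written."""
    sink.extend(records)
    return len(records)


def fake_model(prompt: str) -> str:
    # Hypothetical stand-in for a Groq chat completion call.
    return '[{"Name": "Ada", "Age": 36}, {}]'


collection: List[Record] = []  # stand-in for a MongoDB collection
written = load(transform(extract(fake_model, "Generate users as JSON")), collection)
```

In an Airflow DAG, each of these functions would typically become its own task so that a failure in one stage can be retried independently.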
- Use Groq AI models for intelligent data generation
- Parameterized generation based on predefined schemas
- Support for generating data across multiple domains
To generate custom datasets, modify the Prompt class in data_extraction.py:
class Prompt:
    def __init__(self):
        self.prompt = """
        [Your Custom Prompt Here]
        Ensure the prompt includes:
        - Structured data requirements
        - Specific domain context
        - Output format specifications
        """
- Specificity: Clearly define data structure
- Context: Provide domain-specific details
- Format: Specify JSON or desired output format
- Complexity: Include nuanced generation requirements
- Healthcare: Patient record generation
- Financial: Transaction simulation
- IoT: Sensor data creation
- Urban Planning: Population movement modeling
- Use clear, descriptive language
- Specify exact data fields
- Define constraints and uniqueness rules
- Include contextual parameters
- Break down complex requirements
- Use JSON schema as a reference
- Provide example output structures
- Specify randomization or pattern requirements
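Constraints and uniqueness rules stated in a prompt are only requests, so it can help to check the model's reply before loading it. The sketch below uses only the standard library. The field spec `EXPECTED_FIELDS`, the `validate_records` function, and the sample field names are hypothetical, chosen for illustration.

```python
import json
from typing import Dict, List, Tuple

# Hypothetical field spec: field name -> accepted Python types after JSON parsing.
EXPECTED_FIELDS: Dict[str, Tuple[type, ...]] = {
    "user_id": (str,),
    "amount": (int, float),
    "currency": (str,),
}


def validate_records(raw: str, unique_key: str = "user_id") -> Tuple[List[dict], List[str]]:
    """Return (valid_records, error_messages) for a JSON array emitted by the model."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [], [f"invalid JSON: {exc.msg}"]
    if not isinstance(parsed, list):
        return [], ["top-level value is not a JSON array"]

    valid: List[dict] = []
    errors: List[str] = []
    seen = set()
    for index, record in enumerate(parsed):
        if not isinstance(record, dict):
            errors.append(f"record {index}: not an object")
            continue
        # Collect fields that are missing or have an unexpected type.
        problems = [
            name for name, expected in EXPECTED_FIELDS.items()
            if not isinstance(record.get(name), expected)
        ]
        if problems:
            errors.append(f"record {index}: bad or missing fields {problems}")
        elif record[unique_key] in seen:
            errors.append(f"record {index}: duplicate {unique_key}")
        else:
            seen.add(record[unique_key])
            valid.append(record)
    return valid, errors
```

Returning errors alongside the valid records, rather than raising on the first problem, lets a pipeline keep the usable rows and log the rest.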
Extend the Prompt class with:
- Dynamic prompt generation methods
- Conditional data creation logic
- Domain-specific validation rules
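A minimal sketch of such an extension follows, assuming the `Prompt` class keeps the single `prompt` attribute shown earlier. `DomainPrompt`, its parameters, and the wording of the generated text are hypothetical and only illustrate dynamic generation with one conditional rule.

```python
import json
from typing import Dict, Optional


class Prompt:
    """Stand-in with the same shape as the class in data_extraction.py."""

    def __init__(self):
        self.prompt = ""


class DomainPrompt(Prompt):
    """Hypothetical extension that builds the prompt text from parameters."""

    def __init__(self, domain: str, fields: Dict[str, str], rows: int = 10,
                 unique_field: Optional[str] = None):
        super().__init__()
        # Basic validation before any text is generated.
        if rows < 1:
            raise ValueError("rows must be at least 1")
        if unique_field is not None and unique_field not in fields:
            raise ValueError(f"unknown unique field: {unique_field}")
        self.prompt = self._build(domain, fields, rows, unique_field)

    @staticmethod
    def _build(domain: str, fields: Dict[str, str], rows: int,
               unique_field: Optional[str]) -> str:
        lines = [
            f"Generate {rows} synthetic {domain} records.",
            "Return only a JSON array of objects with these fields:",
            json.dumps(fields, indent=2),
        ]
        # Conditional logic: add the uniqueness rule only when requested.
        if unique_field:
            lines.append(f"Every value of '{unique_field}' must be unique.")
        return "\n".join(lines)
```

For example, `DomainPrompt("healthcare", {"patient_id": "string", "age": "integer"}, rows=5, unique_field="patient_id").prompt` yields a prompt that names the domain, lists the fields as JSON, and ends with the uniqueness rule.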
# Custom Prompt for E-commerce User Behavior
prompt = Prompt()  # instance of the Prompt class from data_extraction.py
prompt.prompt = """
Generate synthetic user interaction data for an e-commerce platform...
"""
# Start Airflow development environment
astro dev start