Commit 9ceb30d

Author: Zoon
Commit message: Update dataflow.md
1 parent 62944ae commit 9ceb30d

1 file changed: +32 -12 lines changed
docs/dataflow.md

Lines changed: 32 additions & 12 deletions
@@ -1,21 +1,41 @@
 # Data Flow Documentation
 ## Data Flow Overview
-The data flow for the Ratchada Whisper Hugger project involves several stages from data ingestion to output. The following steps outline how data flows through the system:
+The data flow for the `ratchada_utils` project involves processing audio files for speech-to-text transcription using the Whisper model. The flow begins with data input, goes through preprocessing and transcription, and ends with storing the transcribed data.
 
+## Inputs
 ### Data Ingestion
 
-- **Component**: ratchada_utils.processor module
-- **Description**: Data is ingested from various sources such as raw text files, web scraping, or APIs. This step involves gathering data, often including text, metadata (date, author), and other relevant information.
-- **Differences Between Dev and Prod**: In development, the data source may be simulated or mocked to allow for rapid testing. In production, data is gathered from live, real-world sources.
+- **Input**: Audio files (WAV format)
+- **Component**: Google Cloud Storage
+- **Scheduling**: On-demand or scheduled batch processing
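Note on the ingestion hunk: the new text names WAV files as the input but does not say how they are selected or validated. The following is a minimal sketch using only the Python standard library, assuming object names have already been listed from storage. Both helper names (`select_wav_objects`, `describe_wav`) are hypothetical and are not part of `ratchada_utils`.

```python
import io
import wave


def select_wav_objects(object_names):
    """Keep only names that look like WAV files, sorted for repeatable batch runs.

    Hypothetical helper: the commit does not describe how inputs are chosen.
    """
    return sorted(name for name in object_names if name.lower().endswith(".wav"))


def describe_wav(data: bytes) -> dict:
    """Read the WAV header from raw bytes and return basic properties.

    Raises wave.Error if the payload is not a valid WAV file.
    """
    with wave.open(io.BytesIO(data), "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_rate": wav.getframerate(),
            "frames": wav.getnframes(),
        }
```

Checking the header before preprocessing lets a batch job reject a malformed upload early instead of failing inside the model.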
 
-### Data Preprocessing
+### Preprocessing
 
-- **Component**: process.py and basic.py in ratchada_utils.processor
-- **Description**: The ingested data is cleaned and preprocessed. This involves tokenization, normalization, and transformation to prepare the data for evaluation or further processing. Preprocessing scripts handle tasks such as splitting text into tokens, removing unnecessary characters, and formatting data.
-- **Differences Between Dev and Prod**: Preprocessing steps might be adjusted in development to test different scenarios, whereas production uses optimized configurations for efficiency and accuracy.
+- **Input**: Audio files
+- **Component**: ratchada_utils.processor
+- **Steps**: Audio normalization, silence removal
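Note on the preprocessing hunk: "audio normalization" and "silence removal" are named but not defined. One possible reading is peak normalization followed by trimming of leading and trailing low-amplitude samples. The sketch below illustrates that reading on a plain list of float samples. The function names and the `threshold` default are assumptions, and this is not the `ratchada_utils.processor` implementation.

```python
from typing import List, Sequence


def trim_silence(samples: Sequence[float], threshold: float = 0.01) -> List[float]:
    """Drop leading and trailing samples whose magnitude is below `threshold`.

    Assumption: only the edges are trimmed; silence inside the clip is kept.
    """
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return []  # the whole clip is below the threshold
    return list(samples[loud[0]: loud[-1] + 1])


def peak_normalize(samples: Sequence[float], target_peak: float = 1.0) -> List[float]:
    """Scale samples so the largest magnitude equals `target_peak`."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # nothing to scale in an empty or all-zero clip
    scale = target_peak / peak
    return [s * scale for s in samples]
```

Trimming before normalizing keeps the scale factor tied to the speech portion of the clip rather than to any noise at the edges.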
 
-### Data Evaluation
+### Transcription
 
-- **Component**: evaluate_utils.py in ratchada_utils.evaluator
-- **Description**: Evaluates the processed data to ensure it meets the required standards and performance metrics. This step includes comparing processed data against benchmarks or performing quality checks.
-- **Differences Between Dev and Prod**: Evaluation criteria may be stricter in production, with thresholds set based on real-world performance requirements. Development evaluation may use looser criteria for easier debugging.
+- **Input**: Preprocessed audio files
+- **Component**: Whisper model (WhisperForConditionalGeneration and WhisperProcessor)
+- **Steps**: Transcribing audio to text
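Note on the transcription hunk: the two classes named here come from the Hugging Face `transformers` library. A typical call sequence with them is sketched below. The wrapper functions `transcribe` and `load_whisper`, the 16 kHz sampling rate, and the `openai/whisper-small` checkpoint are assumptions for illustration; the commit does not say which checkpoint or settings the project uses.

```python
from typing import Sequence


def transcribe(samples: Sequence[float], processor, model,
               sampling_rate: int = 16000) -> str:
    """Transcribe one mono waveform with a Whisper processor/model pair."""
    # Convert the raw waveform into the log-mel features Whisper expects.
    inputs = processor(samples, sampling_rate=sampling_rate, return_tensors="pt")
    # Generate token ids from the features.
    predicted_ids = model.generate(inputs.input_features)
    # Decode token ids to text, dropping special tokens.
    texts = processor.batch_decode(predicted_ids, skip_special_tokens=True)
    return texts[0].strip()


def load_whisper(checkpoint: str = "openai/whisper-small"):
    """Load a processor/model pair. The default checkpoint is an assumption."""
    # Imported lazily so the rest of this sketch works without transformers.
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    processor = WhisperProcessor.from_pretrained(checkpoint)
    model = WhisperForConditionalGeneration.from_pretrained(checkpoint)
    return processor, model
```

Passing the processor and model in as arguments means they are loaded once per batch job rather than once per file, and the function can be exercised with stand-in objects in tests.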
+
+### Post-processing
+
+- **Input**: Transcribed text
+- **Component**: ratchada_utils.processor
+- **Steps**: Text cleaning, formatting
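Note on the post-processing hunk: "text cleaning, formatting" is left unspecified. As one hedged illustration, cleaning could mean collapsing whitespace and removing stray spaces before punctuation. The function name `clean_transcript` and both rules are assumptions, not the behavior of `ratchada_utils.processor`.

```python
import re


def clean_transcript(text: str) -> str:
    """Apply two simple cleanup rules to raw model output (illustrative only)."""
    # Collapse runs of whitespace, including newlines, into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    # Remove whitespace that ended up before common punctuation marks.
    text = re.sub(r"\s+([,.!?])", r"\1", text)
    return text
```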
+
+## Outputs
+### Data Storage
+
+- **Output**: Transcribed and processed text
+- **Component**: Google Cloud Storage / Elasticsearch
+- **Location**: Dev and prod-specific storage buckets or indices
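Note on the storage hunk: the text says dev and prod write to separate buckets or indices but gives no names. A small lookup keyed by environment is one way to keep that separation explicit. Every bucket and index name below is a placeholder, and `OutputTarget` and `resolve_target` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutputTarget:
    bucket: str  # Google Cloud Storage bucket name
    index: str   # Elasticsearch index name


# Placeholder names for illustration; the real ones are not given in the commit.
TARGETS = {
    "dev": OutputTarget(bucket="example-transcripts-dev", index="transcripts-dev"),
    "prod": OutputTarget(bucket="example-transcripts-prod", index="transcripts-prod"),
}


def resolve_target(env: str) -> OutputTarget:
    """Return the storage target for `env`, failing loudly on unknown values."""
    try:
        return TARGETS[env]
    except KeyError:
        raise ValueError(f"unknown environment: {env!r}") from None
```

Raising on an unknown environment avoids silently writing dev output to a prod location through a default.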
+
+### Data Utilization
+
+- **Output**: Processed data for downstream tasks (e.g., NLP, analytics)
+- **Component**: Various applications consuming the processed data
+- **Scheduling**: Continuous or on-demand access
