|
1 |
| -# Project Architecture and Access to Production |
| 1 | +# Project Architecture |
2 | 2 |
|
3 |
| -``` @TODO: Summary of Architecture and steps to access each component ``` |
| 3 | +## Overview |
4 | 4 |
|
5 |
| -The `github-starter` project is meant as a base repository template; it should be a basis for other projects. |
6 |
| - |
7 |
| -`github-starter` is hosted on Github, and is available as a [Template in Backstage]([url](https://catalog.tm8.dev/create?filters%5Bkind%5D=template&filters%5Buser%5D=all)****) |
| 5 | +Ratchada_Utils is a Python library of text processing utilities for tasks related to the Ratchada Whisper model. It is used primarily to tokenize and evaluate speech-to-text outputs. |
8 | 6 |
|
9 | 7 | ## Architecture Details
|
10 |
| -```Note: Structure of this document assumes Dev and Prod are in different Cloud Platform projects. You can reduce the sections for architecture if redundant. Just note the datasets, vms, buckets, etc. being used in Dev vs Prod ``` |
11 |
| -- Provider: ``` @TODO: GCP / AWS / Azure / etc ``` |
12 |
| -- Dev Environment: ``` @TODO: Link to dev env ``` |
13 |
| -- Prod Environment: ``` @TODO: Link to prod env ``` |
14 |
| -- Technology: ``` @TODO: Python / Airflow / Dagster / Postgres / Snowflake / etc ``` |
15 |
| - |
16 |
| -### Implementation Notes |
17 |
| -``` @TODO: Note down known limitations, possible issues, or known quirks with the project. The following questions might help: ``` <br> |
18 |
| -``` 1. Which component needs most attention? ie. Usually the first thing that needs tweaks ``` <br> |
19 |
| -``` 2. Are the parts of the project that might break in the future? (eg. Filling up of disk space, memory issues if input data size increases, a web scraper, etc.)``` <br> |
20 |
| -``` 3. What are some known limitations for the project? (eg. Input data schema has to remain the same, etc.)``` |
21 |
| - |
22 |
| -## Dev Architecture |
23 |
| -``` @TODO: Dev architecture diagram and description. Please include any OAuth or similar Applications.``` |
24 |
| -``` @TODO: List out main components being used and their descriptions.``` |
25 |
| - |
26 |
| -### Virtual Machines |
27 |
| -``` @TODO: List VMs used and what they host``` |
28 |
| -### Datasets |
29 |
| -``` @TODO: List datasets given following format``` |
30 |
| -#### Dataset A |
31 |
| -- Description: PSGC Data |
32 |
| -- File Location: GCS Uri / GDrive link / etc |
33 |
| -- Retention Policy: 3 months |
34 |
| - |
35 |
| -### Tokens and Accounts |
36 |
| -``` @TODO: Please fill out all Tokens and Accounts being used in the project given the format below. Include tokens from client used in the project.``` |
37 |
| - |
38 |
| -**Dev Github Service Account Token** |
39 |
| - |
40 |
| -- Location: Bitwarden Github Collection |
41 |
| -- Person in charge: Client Name ( [email protected]) |
42 |
| -- Validity: 30 Days |
43 |
| -- Description: This token is used to call the Github API using the account `` [email protected]` |
44 |
| -- How to Refresh: |
45 |
| - 1. Go to github.com |
46 |
| - 2. Click refresh |
47 |
| - |
48 |
| -## Production Architecture |
49 |
| -``` @TODO: Prod architecture diagram and description. Please include any OAuth or similar Applications.``` |
50 |
| -``` @TODO: List out main components being used and their descriptions.``` |
51 |
| - |
52 |
| -### Virtual Machines |
53 |
| -``` @TODO: List VMs used and what they host``` |
54 |
| -### Datasets |
55 |
| -``` @TODO: List datasets given following format``` |
56 |
| -#### Dataset A |
57 |
| -- Description: PSGC Data |
58 |
| -- File Location: GCS Uri / GDrive link / etc |
59 |
| -- Retention Policy: 3 months |
60 |
| - |
61 |
| -### Tokens and Accounts |
62 |
| -``` @TODO: Please fill out all Tokens and Accounts being used in the project given the format below. Include tokens from client used in the project.``` |
63 |
| - |
64 |
| -**Prod Github Service Account Token** |
65 |
| - |
66 |
| -- Location: Bitwarden Github Collection |
67 |
| -- Person in charge: Client Name ( [email protected]) |
68 |
| -- Validity: 30 Days |
69 |
| -- Description: This token is used to call the Github API using the account `` [email protected]` |
70 |
| -- How to Refresh: |
71 |
| - 1. Go to github.com |
72 |
| - 2. Click refresh |
73 |
| - |
74 |
| -## Accessing Cloud Platform Environments |
75 |
| -```@TODO: Describe the steps to access the prod VMs/platform``` |
76 |
| - |
77 |
| -**Get access to Client AWS Platform** |
78 |
| -- Person in charge: Client Name/Dev Name |
79 |
| -- Bitwarden Credentials: |
80 |
| -1. Install AWS CLI |
81 |
| -2. Run `aws configure` - ID and Secret from Bitwarden |
82 |
| - |
83 |
| -**Accessing Prod VM** |
84 |
| -1. Update your ssh config to have the following: |
85 |
| -``` |
86 |
| -Host project-vpn |
87 |
| - Hostname xx.xxx.xxx.xxx |
88 |
| - User ubuntu |
89 |
| -
|
90 |
| -# Use the Private IP for the rest |
91 |
| -Host dev-project-app |
92 |
| - Hostname xxx.xx.xx.xx |
93 |
| - User ubuntu |
94 |
| - ProxyJump project-vpn |
95 |
| -``` |
96 |
| -2. Run `ssh dev-project-app` |
97 |
| - |
98 |
| -**Access Prod App in UI** |
99 |
| -1. Install `sshuttle` |
100 |
| -2. Run `sshuttle -r dev-project-app xxx.xx.0.0/16` |
101 |
| -3. Open web browser using the Private IP found in you SSH config (http:xxx.xx.xx.xx:3000) |
| 8 | + |
| 9 | +- Language: Python (3.10+) |
| 10 | +- Package Manager: pip |
| 11 | +- Documentation: MkDocs |
| 12 | +- Testing: pytest |
| 13 | +- Code Style: Black, Flake8 |
| 14 | +- Continuous Integration: GitHub Actions |
| 15 | + |
| 16 | +## Main Components |
| 17 | + |
| 18 | +1. Text Processor |
| 19 | + - Location: `ratchada_utils/processor/` |
| 20 | + - Key Function: `tokenize_text()` |
| 21 | + - Description: Handles text tokenization for both prediction and reference texts. |
| 22 | + |
| 23 | +2. Evaluator |
| 24 | + - Location: `ratchada_utils/evaluator/` |
| 25 | + - Key Function: `simple_evaluation()` |
| 26 | + - Description: Provides metrics for comparing predicted text against reference text. |
| 27 | + |
| 28 | +## Data Flow |
| 29 | + |
| 30 | +1. Input: Raw text (string) |
| 31 | +2. Processing: Tokenization |
| 32 | +3. Evaluation: Comparison of tokenized prediction and reference texts |
| 33 | +4. Output: Evaluation metrics (pandas DataFrame) |
| 34 | + |
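The four steps above can be sketched end to end as follows. This is a minimal illustration, not the library's code: `tokenize`, `word_error_rate`, and `evaluate` are hypothetical names, whitespace splitting stands in for the real tokenizer, and word error rate is an assumed metric (the document does not say which metrics `simple_evaluation()` reports). The result is returned as a plain dict to keep the sketch dependency-free, whereas the library is described as returning a pandas DataFrame.

```python
def tokenize(text: str) -> list[str]:
    """Hypothetical stand-in tokenizer: lowercase, then split on whitespace."""
    return text.lower().split()


def word_error_rate(reference: list[str], prediction: list[str]) -> float:
    """Token-level Levenshtein distance divided by the reference length."""
    if not reference:
        return 0.0 if not prediction else 1.0
    # previous[j] = edit distance between the reference prefix seen so far
    # and the first j prediction tokens.
    previous = list(range(len(prediction) + 1))
    for i, ref_token in enumerate(reference, start=1):
        current = [i]
        for j, pred_token in enumerate(prediction, start=1):
            substitution = previous[j - 1] + (ref_token != pred_token)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1] / len(reference)


def evaluate(prediction: str, reference: str) -> dict:
    """Steps 1-4: raw text in, tokenization, comparison, metrics out."""
    pred_tokens = tokenize(prediction)
    ref_tokens = tokenize(reference)
    return {
        "reference_tokens": len(ref_tokens),
        "prediction_tokens": len(pred_tokens),
        "wer": word_error_rate(ref_tokens, pred_tokens),
    }


if __name__ == "__main__":
    print(evaluate("the cat sat", "the cat sat on"))
```

Keeping tokenization separate from scoring mirrors the split between the Text Processor and Evaluator components described above.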
| 35 | +## Dependencies |
| 36 | + |
| 37 | +Major dependencies include: |
| 38 | +- pandas: For data manipulation and analysis |
| 39 | +- numpy: For numerical operations |
| 40 | +- concurrent.futures (Python standard library): For parallel processing |
| 41 | + |
| 42 | +The full list of dependencies is in `requirements.txt`. |
| 43 | + |
| 44 | +## Development Environment |
| 45 | + |
| 46 | +The project is developed using a virtual environment to isolate dependencies. See the [Development Guide](development.md) for setup instructions. |
| 47 | + |
| 48 | +## Deployment |
| 49 | + |
| 50 | +The package is deployed to PyPI for easy installation by users. Deployment is handled through GitHub Actions. See the [Deployment Procedure](deployment.md) for details. |
| 51 | + |
| 52 | +## Security Considerations |
| 53 | + |
| 54 | +- The library doesn't handle sensitive data directly, but users should be cautious about the content they process. |
| 55 | +- No external API calls are made by the library. |
| 56 | + |
| 57 | +## Scalability |
| 58 | + |
| 59 | +The `simple_evaluation` function uses `concurrent.futures.ProcessPoolExecutor` for parallel processing, allowing it to scale with available CPU cores. |
| 60 | + |
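As a rough sketch of this pattern, the snippet below fans a list of prediction/reference pairs out to worker processes with `ProcessPoolExecutor.map`. It does not reproduce the internals of `simple_evaluation`: `score_pair` and `evaluate_parallel` are hypothetical names, and the positional-match score is an assumed placeholder metric.

```python
from concurrent.futures import ProcessPoolExecutor


def score_pair(pair: tuple[str, str]) -> float:
    """Hypothetical per-pair metric: share of reference tokens matched in position."""
    prediction, reference = pair
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not ref_tokens:
        return 0.0
    matches = sum(p == r for p, r in zip(pred_tokens, ref_tokens))
    return matches / len(ref_tokens)


def evaluate_parallel(pairs, max_workers=None):
    """Score each pair in a separate worker process; results keep input order."""
    # max_workers=None lets the executor choose a default based on the machine's CPUs.
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(score_pair, pairs))


if __name__ == "__main__":
    # The guard matters: worker processes may re-import this module on startup.
    sample_pairs = [("a b c", "a b c"), ("a x c d", "a b c d")]
    print(evaluate_parallel(sample_pairs))
```

The worker function is defined at module level because process pools send work to workers by pickling, which requires functions to be importable by name. For small inputs, process start-up and pickling overhead can outweigh the gain from extra cores.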
| 61 | +## Limitations |
| 62 | + |
| 63 | +- The library is designed for text processing and may not handle other data types effectively. |
| 64 | +- Performance may degrade with extremely large input sizes due to memory constraints. |
| 65 | + |
| 66 | +## Future Improvements |
| 67 | + |
| 68 | +1. Implement more advanced tokenization methods |
| 69 | +2. Add support for additional evaluation metrics |
| 70 | +3. Optimize memory usage for large inputs |
| 71 | + |
| 72 | +For any questions or issues regarding the architecture, please open an issue on the GitHub repository. |