Commit ea4200a

Author: Zoon
Commit message: update docs
1 parent 92291f3 commit ea4200a

File tree: 10 files changed, +310 -192 lines


docs/README.md

Lines changed: 58 additions & 8 deletions
# Ratchada_Utils

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![PyPI version](https://badge.fury.io/py/ratchada-utils.svg)](https://badge.fury.io/py/ratchada-utils)
[![Python Versions](https://img.shields.io/pypi/pyversions/ratchada-utils.svg)](https://pypi.org/project/ratchada-utils/)

## Project Brief

Ratchada_Utils is a Python library that provides text processing utilities for tasks related to the Ratchada Whisper model. It offers tools for tokenizing and evaluating speech-to-text output.

## Features

- Text tokenization
- Simple evaluation of speech-to-text outputs
- Parallel processing for improved performance

## Installation

You can install `ratchada_utils` using pip:

```bash
pip install ratchada_utils
```

For the latest development version, install directly from the GitHub repository:

```bash
pip install git+https://github.com/yourusername/ratchada_utils.git
```

## Usage

### Tokenizing Text

```python
from ratchada_utils.processor import tokenize_text

text = "Your input text here."
tokenized_text = tokenize_text(text, pred=True)
print("Tokenized Text:", tokenized_text)
# Output: Tokenized Text: ['your', 'input', 'text', 'here']
```

### Evaluation

```python
import pandas as pd
from ratchada_utils.evaluator import simple_evaluation

result = pd.read_csv("./output/result-whisper-ratchada.csv")
summary = simple_evaluation(result["pred_text"], result["true_text"])
print(summary)
```

## Requirements

- Python 3.10 or higher
- Dependencies are listed in `requirements.txt`

## Documentation

For full documentation, please visit our documentation page.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is licensed under the MIT License; see the LICENSE file for details.

## Contact

For questions or issues, please open an issue on the GitHub repository.

docs/architecture.md

Lines changed: 68 additions & 97 deletions
# Project Architecture

## Overview

Ratchada_Utils is a Python library that provides text processing utilities for tasks related to the Ratchada Whisper model. It is primarily used for tokenizing and evaluating speech-to-text output.

## Architecture Details

- Language: Python (3.10+)
- Package Manager: pip
- Documentation: MkDocs
- Testing: pytest
- Code Style: Black, Flake8
- Continuous Integration: GitHub Actions

## Main Components

1. Text Processor
   - Location: `ratchada_utils/processor/`
   - Key Function: `tokenize_text()`
   - Description: Handles text tokenization for both prediction and reference texts.

2. Evaluator
   - Location: `ratchada_utils/evaluator/`
   - Key Function: `simple_evaluation()`
   - Description: Provides metrics for comparing predicted text against reference text.

## Data Flow

1. Input: Raw text (string)
2. Processing: Tokenization
3. Evaluation: Comparison of tokenized prediction and reference texts
4. Output: Evaluation metrics (pandas DataFrame)
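The four steps above can be sketched end to end. `tokenize_text` and `simple_evaluation` are the library's real entry points (see Main Components), but their internals are not shown here, so the bodies below are illustrative stand-ins only: a whitespace/punctuation tokenizer and a token-recall score, assumed for the sketch.

```python
import string

def tokenize_text(text, pred=False):
    """Stand-in tokenizer: lowercase, strip punctuation, split on whitespace.
    The real library's `pred` flag triggers extra preprocessing; it is ignored here."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

def simple_evaluation(pred_texts, true_texts):
    """Stand-in metric: per pair, the fraction of reference tokens recovered.
    (The real function returns a richer pandas DataFrame of metrics.)"""
    scores = []
    for pred_text, true_text in zip(pred_texts, true_texts):
        pred_tokens = set(tokenize_text(pred_text, pred=True))   # step 2: tokenize prediction
        true_tokens = set(tokenize_text(true_text))              # step 2: tokenize reference
        # step 3: compare token sets
        scores.append(len(pred_tokens & true_tokens) / len(true_tokens) if true_tokens else 0.0)
    return scores  # step 4: evaluation output

print(simple_evaluation(["your input text here"], ["Your input text here."]))  # [1.0]
```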
## Dependencies

Major dependencies include:

- pandas: for data manipulation and analysis
- numpy: for numerical operations
- concurrent.futures (standard library): for parallel processing

The full list of dependencies can be found in `requirements.txt`.

## Development Environment

The project is developed in a virtual environment to isolate dependencies. See the [Development Guide](development.md) for setup instructions.

## Deployment

The package is deployed to PyPI for easy installation by users. Deployment is handled through GitHub Actions. See the [Deployment Procedure](deployment.md) for details.

## Security Considerations

- The library does not handle sensitive data directly, but users should be cautious about the content they process.
- The library makes no external API calls.

## Scalability

The `simple_evaluation` function uses `concurrent.futures.ProcessPoolExecutor` for parallel processing, allowing it to scale with the available CPU cores.
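A minimal sketch of this pattern, not the library's actual implementation: a hypothetical per-pair scoring function is fanned out across worker processes with `ProcessPoolExecutor.map`, much as `simple_evaluation` parallelizes its row-wise comparisons.

```python
from concurrent.futures import ProcessPoolExecutor

def score_pair(pair):
    """Hypothetical per-row metric: token overlap ratio of one (pred, ref) pair."""
    pred, ref = pair
    ref_set = set(ref)
    if not ref_set:
        return 0.0
    return len(set(pred) & ref_set) / len(ref_set)

def parallel_scores(pairs, max_workers=None):
    # map() distributes the pairs across worker processes (one per CPU core
    # by default), so throughput scales with the available cores.
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(score_pair, pairs))

if __name__ == "__main__":
    pairs = [(["a", "b"], ["a", "b"]), (["a"], ["a", "b"])]
    print(parallel_scores(pairs, max_workers=2))  # [1.0, 0.5]
```

Note that functions submitted to a process pool must be defined at module top level so they can be pickled for the workers.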
## Limitations

- The library is designed for text processing and may not handle other data types effectively.
- Performance may degrade with extremely large inputs due to memory constraints.

## Future Improvements

1. Implement more advanced tokenization methods
2. Add support for additional evaluation metrics
3. Optimize memory usage for large inputs

For any questions or issues regarding the architecture, please open an issue on the GitHub repository.

docs/deployment.md

Lines changed: 60 additions & 6 deletions
# Deployment Procedure

## PyPI Deployment

### Pre-requisites

- PyPI account
- `setuptools`, `wheel`, and `twine` installed
- `.pypirc` file configured with your PyPI credentials

### How-to-Guide

1. Update the version number in `setup.py`.
2. Create the source distribution and wheel:

```zsh
python setup.py sdist bdist_wheel
```

3. Upload to PyPI:

```zsh
twine upload dist/*
```
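For step 1, the relevant field is `version=` in `setup.py`. The project's actual `setup.py` is not shown here, so the fragment below is a hypothetical minimal sketch with placeholder metadata:

```python
# setup.py -- hypothetical minimal sketch; name and version are placeholders,
# not the project's actual metadata.
from setuptools import setup, find_packages

setup(
    name="ratchada_utils",
    version="0.1.1",  # step 1: bump this before building and uploading
    packages=find_packages(),
    python_requires=">=3.10",
)
```

PyPI rejects re-uploads of an existing version, which is why the bump must happen before `twine upload`.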
## Documentation Deployment

### Pre-requisites

- MkDocs installed
- GitHub Pages configured for your repository

### How-to-Guide

1. Build the documentation:

```zsh
mkdocs build
```

2. Deploy to GitHub Pages:

```zsh
mkdocs gh-deploy
```

3. `docs/features/example.md`: replace this file with the actual features of your project. Here's an example:

```markdown
# Feature: Text Tokenization

## Description

The text tokenization feature splits input text into individual tokens or words.

## How It Works

1. The input text is passed to the `tokenize_text` function.
2. The function removes punctuation and splits the text on whitespace.
3. If `pred=True`, additional preprocessing steps are applied for prediction tasks.
4. The function returns a list of tokens.

## Gotchas / Specific Behaviors

- The tokenizer treats hyphenated words as a single token.
- Numbers are treated as separate tokens.
- The tokenizer is case-sensitive by default.
```

4. `docs/README.md`: update this file to reflect your project. Here's a suggested update:

```markdown
# Ratchada Utils Documentation

## Project Brief

`ratchada_utils` is a Python library for text processing and utilities related to the Ratchada Whisper model. It provides tools for tokenization and evaluation of speech-to-text models.

Status: **In Development**

## How To Use

Refer to the [Installation](../README.md#installation) and [Usage](../README.md#usage) sections in the main README for basic usage instructions. For more detailed information on each feature, check the individual documentation pages.
```
