Commit ea4200a: update docs

Zoon committed Jul 26, 2024
1 parent 92291f3

Showing 10 changed files with 310 additions and 192 deletions.
66 changes: 58 additions & 8 deletions docs/README.md
# Ratchada_Utils

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![PyPI version](https://badge.fury.io/py/ratchada-utils.svg)](https://badge.fury.io/py/ratchada-utils)
[![Python Versions](https://img.shields.io/pypi/pyversions/ratchada-utils.svg)](https://pypi.org/project/ratchada-utils/)

## Project Brief

Ratchada_Utils is a Python library designed to provide text processing utilities, particularly for tasks related to the Ratchada Whisper model. It offers tools for tokenization and evaluation of speech-to-text outputs.

## Features

- Text tokenization
- Simple evaluation of speech-to-text outputs
- Parallel processing for improved performance

## Installation

You can install `ratchada_utils` using pip:

```bash
pip install ratchada_utils
```

For the latest development version, you can install directly from the GitHub repository:
```bash
pip install git+https://github.com/yourusername/ratchada_utils.git
```

## Usage
### Tokenizing Text

```python
from ratchada_utils.processor import tokenize_text

text = "Your input text here."
tokenized_text = tokenize_text(text, pred=True)
print("Tokenized Text:", tokenized_text)
# Output: Tokenized Text: ['your', 'input', 'text', 'here']
```
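
The architecture notes say `tokenize_text` handles both prediction and reference texts. A sketch for the reference side, under the assumption that `pred=False` skips the prediction-specific preprocessing (check the function's docstring to confirm):

```python
from ratchada_utils.processor import tokenize_text

# Assumption: pred=False skips the prediction-side preprocessing;
# this flag behavior is inferred, not confirmed by the library docs.
reference = "Your reference text here."
reference_tokens = tokenize_text(reference, pred=False)
print("Reference Tokens:", reference_tokens)
```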

### Evaluating Outputs

```python
import pandas as pd
from ratchada_utils.evaluator import simple_evaluation

result = pd.read_csv("./output/result-whisper-ratchada.csv")
summary = simple_evaluation(result["pred_text"], result["true_text"])
print(summary)
```
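
The evaluator also works without an intermediate CSV. This sketch assumes `simple_evaluation` accepts any pair of equal-length string Series, mirroring the column-based call above:

```python
import pandas as pd
from ratchada_utils.evaluator import simple_evaluation

# Assumption: any equal-length pair of string Series is accepted,
# just like the DataFrame columns in the CSV example above.
pred = pd.Series(["hello word", "good morning everyone"])
true = pd.Series(["hello world", "good morning everyone"])
summary = simple_evaluation(pred, true)
print(summary)
```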

## Requirements

1. Python 3.10 or higher
2. Dependencies are listed in `requirements.txt`

## Documentation

For full documentation, please visit our documentation page.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Contact

For any questions or issues, please open an issue on the GitHub repository.
165 changes: 68 additions & 97 deletions docs/architecture.md
# Project Architecture

## Overview

Ratchada_Utils is a Python library designed to provide text processing utilities, particularly for tasks related to the Ratchada Whisper model. It's primarily used for tokenization and evaluation of speech-to-text outputs.

## Technology Stack

- Language: Python (3.10+)
- Package Manager: pip
- Documentation: MkDocs
- Testing: pytest
- Code Style: Black, Flake8
- Continuous Integration: GitHub Actions

## Main Components

1. Text Processor
- Location: `ratchada_utils/processor/`
- Key Function: `tokenize_text()`
- Description: Handles text tokenization for both prediction and reference texts.

2. Evaluator
- Location: `ratchada_utils/evaluator/`
- Key Function: `simple_evaluation()`
- Description: Provides metrics for comparing predicted text against reference text.

## Data Flow

1. Input: Raw text (string)
2. Processing: Tokenization
3. Evaluation: Comparison of tokenized prediction and reference texts
4. Output: Evaluation metrics (pandas DataFrame)
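
Put together, a minimal sketch of this flow, assuming (as the README usage suggests) that `simple_evaluation` tokenizes its inputs internally:

```python
import pandas as pd

from ratchada_utils.evaluator import simple_evaluation
from ratchada_utils.processor import tokenize_text

# 1. Input: raw prediction and reference strings.
pred_text = "the quick brown fox"
true_text = "the quick brown box"

# 2. Processing: tokenization of the prediction (pred=True).
tokens = tokenize_text(pred_text, pred=True)
print("Tokens:", tokens)

# 3./4. Evaluation: compare prediction against reference; the result
# is a pandas DataFrame of metrics, per the data flow above.
summary = simple_evaluation(pd.Series([pred_text]), pd.Series([true_text]))
print(summary)
```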

## Dependencies

Major dependencies include:
- pandas: For data manipulation and analysis
- numpy: For numerical operations
- concurrent.futures (Python standard library): For parallel processing

The full list of dependencies can be found in `requirements.txt`.

## Development Environment

The project is developed using a virtual environment to isolate dependencies. See the [Development Guide](development.md) for setup instructions.

## Deployment

The package is deployed to PyPI for easy installation by users. Deployment is handled through GitHub Actions. See the [Deployment Procedure](deployment.md) for details.

## Security Considerations

- The library doesn't handle sensitive data directly, but users should be cautious about the content they process.
- No external API calls are made by the library.

## Scalability

The `simple_evaluation` function uses `concurrent.futures.ProcessPoolExecutor` for parallel processing, allowing it to scale with available CPU cores.
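
The library's internal layout isn't shown here, but the pattern looks roughly like the following sketch; `score_pair` is a hypothetical per-row scoring helper for illustration, not part of the library's API:

```python
from concurrent.futures import ProcessPoolExecutor

def score_pair(pair: tuple[str, str]) -> float:
    """Hypothetical per-row scorer: fraction of position-wise token matches."""
    pred, true = pair
    pred_tokens, true_tokens = pred.split(), true.split()
    matches = sum(p == t for p, t in zip(pred_tokens, true_tokens))
    return matches / max(len(true_tokens), 1)

def parallel_scores(preds: list[str], trues: list[str]) -> list[float]:
    # ProcessPoolExecutor fans the rows out across CPU cores; with the
    # default max_workers it scales with the cores available.
    with ProcessPoolExecutor() as executor:
        return list(executor.map(score_pair, zip(preds, trues)))

if __name__ == "__main__":
    print(parallel_scores(["the quick brown fox"], ["the quick brown box"]))
```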

## Limitations

- The library is designed for text processing and may not handle other data types effectively.
- Performance may degrade with extremely large input sizes due to memory constraints.

## Future Improvements

1. Implement more advanced tokenization methods
2. Add support for additional evaluation metrics
3. Optimize memory usage for large inputs

For any questions or issues regarding the architecture, please open an issue on the GitHub repository.
66 changes: 60 additions & 6 deletions docs/deployment.md
# Deployment Procedure

## PyPI Deployment

### Pre-requisites
- PyPI account
- `setuptools` and `wheel` installed
- `twine` installed
- `.pypirc` file configured with your PyPI credentials

### How-to-Guide
1. Update the version number in `setup.py`.
2. Create source distribution and wheel:

```zsh
python setup.py sdist bdist_wheel
```

3. Upload to PyPI:
```zsh
twine upload dist/*
```

## Documentation Deployment

### Pre-requisites
- MkDocs installed
- GitHub Pages configured for your repository

### How-to-Guide
1. Build the documentation:
```zsh
mkdocs build
```
2. Deploy to GitHub Pages:
```zsh
mkdocs gh-deploy
```
3. Update `docs/features/example.md`: replace the placeholder with the actual features of your project. Here's an example:
```markdown
# Feature: Text Tokenization

## Description
The text tokenization feature allows users to split input text into individual tokens or words.

## How It Works
1. The input text is passed to the `tokenize_text` function.
2. The function removes punctuation and splits the text on whitespace.
3. If `pred=True`, additional preprocessing steps are applied for prediction tasks.
4. The function returns a list of tokens.

## Gotchas / Specific Behaviors
- The tokenizer treats hyphenated words as a single token.
- Numbers are treated as separate tokens.
- The tokenizer is case-sensitive by default.
```

4. Update `docs/README.md` to reflect your project. Here's a suggested update:
```markdown
# Ratchada Utils Documentation

## Project Brief

`ratchada_utils` is a Python library for text processing and utilities related to the Ratchada Whisper model. It provides tools for tokenization and evaluation of speech-to-text models.

Status: **In Development**

## How To Use

Refer to the [Installation](../README.md#installation) and [Usage](../README.md#usage) sections in the main README for basic usage instructions. For more detailed information on each feature, check the individual documentation pages.
```
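
To make the gotchas in the feature example concrete, here is a hedged illustration; the expected token list is inferred from the documented rules, not verified library output:

```python
from ratchada_utils.processor import tokenize_text

# Expected output is inferred from the documented gotchas
# (hyphenated words kept as one token, numbers as separate tokens),
# not verified against the library.
tokens = tokenize_text("state-of-the-art models scored 95", pred=True)
print(tokens)
# Expected per the rules above: ['state-of-the-art', 'models', 'scored', '95']
```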