Commit ea4200a: update docs

Zoon committed Jul 26, 2024
1 parent 92291f3

Showing 10 changed files with 310 additions and 192 deletions.
66 changes: 58 additions & 8 deletions docs/README.md
# Ratchada_Utils

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![PyPI version](https://badge.fury.io/py/ratchada-utils.svg)](https://badge.fury.io/py/ratchada-utils)
[![Python Versions](https://img.shields.io/pypi/pyversions/ratchada-utils.svg)](https://pypi.org/project/ratchada-utils/)

## Project Brief

Ratchada_Utils is a Python library designed to provide text processing utilities, particularly for tasks related to the Ratchada Whisper model. It offers tools for tokenization and evaluation of speech-to-text outputs.

## Features

- Text tokenization
- Simple evaluation of speech-to-text outputs
- Parallel processing for improved performance

## Installation

You can install `ratchada_utils` using pip:

```bash
pip install ratchada_utils
```

For the latest development version, you can install directly from the GitHub repository:
```bash
pip install git+https://github.com/yourusername/ratchada_utils.git
```

## Usage
### Tokenizing Text

```python
from ratchada_utils.processor import tokenize_text

text = "Your input text here."
tokenized_text = tokenize_text(text, pred=True)
print("Tokenized Text:", tokenized_text)
# Output: Tokenized Text: ['your', 'input', 'text', 'here']
```
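
The architecture notes say `tokenize_text` handles both prediction and reference texts. A sketch for the reference side, under the assumption that `pred=False` skips the prediction-specific preprocessing (check the function's docstring to confirm):

```python
from ratchada_utils.processor import tokenize_text

# Assumption: pred=False skips the prediction-side preprocessing;
# this flag behavior is inferred, not confirmed by the library docs.
reference = "Your reference text here."
reference_tokens = tokenize_text(reference, pred=False)
print("Reference Tokens:", reference_tokens)
```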

### Evaluating Outputs

```python
import pandas as pd
from ratchada_utils.evaluator import simple_evaluation

result = pd.read_csv("./output/result-whisper-ratchada.csv")
summary = simple_evaluation(result["pred_text"], result["true_text"])
print(summary)
```
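
The evaluator also works without an intermediate CSV. This sketch assumes `simple_evaluation` accepts any pair of equal-length string Series, mirroring the column-based call above:

```python
import pandas as pd
from ratchada_utils.evaluator import simple_evaluation

# Assumption: any equal-length pair of string Series is accepted,
# just like the DataFrame columns in the CSV example above.
pred = pd.Series(["hello word", "good morning everyone"])
true = pd.Series(["hello world", "good morning everyone"])
summary = simple_evaluation(pred, true)
print(summary)
```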

## Requirements

1. Python 3.10 or higher
2. Dependencies are listed in `requirements.txt`

## Documentation

For full documentation, please visit our documentation page.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Contact

For any questions or issues, please open an issue on the GitHub repository.
165 changes: 68 additions & 97 deletions docs/architecture.md
# Project Architecture

## Overview

Ratchada_Utils is a Python library designed to provide text processing utilities, particularly for tasks related to the Ratchada Whisper model. It's primarily used for tokenization and evaluation of speech-to-text outputs.

## Technology Stack

- Language: Python (3.10+)
- Package Manager: pip
- Documentation: MkDocs
- Testing: pytest
- Code Style: Black, Flake8
- Continuous Integration: GitHub Actions

## Main Components

1. Text Processor
- Location: `ratchada_utils/processor/`
- Key Function: `tokenize_text()`
- Description: Handles text tokenization for both prediction and reference texts.

2. Evaluator
- Location: `ratchada_utils/evaluator/`
- Key Function: `simple_evaluation()`
- Description: Provides metrics for comparing predicted text against reference text.

## Data Flow

1. Input: Raw text (string)
2. Processing: Tokenization
3. Evaluation: Comparison of tokenized prediction and reference texts
4. Output: Evaluation metrics (pandas DataFrame)
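
Put together, a minimal sketch of this flow, assuming (as the README usage suggests) that `simple_evaluation` tokenizes its inputs internally:

```python
import pandas as pd

from ratchada_utils.evaluator import simple_evaluation
from ratchada_utils.processor import tokenize_text

# 1. Input: raw prediction and reference strings.
pred_text = "the quick brown fox"
true_text = "the quick brown box"

# 2. Processing: tokenization of the prediction (pred=True).
tokens = tokenize_text(pred_text, pred=True)
print("Tokens:", tokens)

# 3./4. Evaluation: compare prediction against reference; the result
# is a pandas DataFrame of metrics, per the data flow above.
summary = simple_evaluation(pd.Series([pred_text]), pd.Series([true_text]))
print(summary)
```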

## Dependencies

Major dependencies include:
- pandas: For data manipulation and analysis
- numpy: For numerical operations
- concurrent.futures (Python standard library): For parallel processing

The full list of dependencies can be found in `requirements.txt`.

## Development Environment

The project is developed using a virtual environment to isolate dependencies. See the [Development Guide](development.md) for setup instructions.

## Deployment

The package is deployed to PyPI for easy installation by users. Deployment is handled through GitHub Actions. See the [Deployment Procedure](deployment.md) for details.

## Security Considerations

- The library doesn't handle sensitive data directly, but users should be cautious about the content they process.
- No external API calls are made by the library.

## Scalability

The `simple_evaluation` function uses `concurrent.futures.ProcessPoolExecutor` for parallel processing, allowing it to scale with available CPU cores.
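
The library's internal layout isn't shown here, but the pattern looks roughly like the following sketch; `score_pair` is a hypothetical per-row scoring helper for illustration, not part of the library's API:

```python
from concurrent.futures import ProcessPoolExecutor

def score_pair(pair: tuple[str, str]) -> float:
    """Hypothetical per-row scorer: fraction of position-wise token matches."""
    pred, true = pair
    pred_tokens, true_tokens = pred.split(), true.split()
    matches = sum(p == t for p, t in zip(pred_tokens, true_tokens))
    return matches / max(len(true_tokens), 1)

def parallel_scores(preds: list[str], trues: list[str]) -> list[float]:
    # ProcessPoolExecutor fans the rows out across CPU cores; with the
    # default max_workers it scales with the cores available.
    with ProcessPoolExecutor() as executor:
        return list(executor.map(score_pair, zip(preds, trues)))

if __name__ == "__main__":
    print(parallel_scores(["the quick brown fox"], ["the quick brown box"]))
```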

## Limitations

- The library is designed for text processing and may not handle other data types effectively.
- Performance may degrade with extremely large input sizes due to memory constraints.

## Future Improvements

1. Implement more advanced tokenization methods
2. Add support for additional evaluation metrics
3. Optimize memory usage for large inputs

For any questions or issues regarding the architecture, please open an issue on the GitHub repository.
66 changes: 60 additions & 6 deletions docs/deployment.md
# Deployment Procedure

## PyPI Deployment

### Pre-requisites
- PyPI account
- `setuptools` and `wheel` installed
- `twine` installed
- `.pypirc` file configured with your PyPI credentials

### How-to-Guide
1. Update the version number in `setup.py`.
2. Create source distribution and wheel:

```zsh
python setup.py sdist bdist_wheel
```

3. Upload to PyPI:
```zsh
twine upload dist/*
```

## Documentation Deployment

### Pre-requisites
- MkDocs installed
- GitHub Pages configured for your repository

### How-to-Guide
1. Build the documentation:
```zsh
mkdocs build
```
2. Deploy to GitHub Pages:
```zsh
mkdocs gh-deploy
```
3. Update `docs/features/example.md`: replace the placeholder with the actual features of your project. Here's an example:
```markdown
# Feature: Text Tokenization

## Description
The text tokenization feature allows users to split input text into individual tokens or words.

## How It Works
1. The input text is passed to the `tokenize_text` function.
2. The function removes punctuation and splits the text on whitespace.
3. If `pred=True`, additional preprocessing steps are applied for prediction tasks.
4. The function returns a list of tokens.

## Gotchas / Specific Behaviors
- The tokenizer treats hyphenated words as a single token.
- Numbers are treated as separate tokens.
- The tokenizer is case-sensitive by default.
```

4. Update `docs/README.md` to reflect your project. Here's a suggested update:
```markdown
# Ratchada Utils Documentation

## Project Brief

`ratchada_utils` is a Python library for text processing and utilities related to the Ratchada Whisper model. It provides tools for tokenization and evaluation of speech-to-text models.

Status: **In Development**

## How To Use

Refer to the [Installation](../README.md#installation) and [Usage](../README.md#usage) sections in the main README for basic usage instructions. For more detailed information on each feature, check the individual documentation pages.
```
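
To make the gotchas in the feature example concrete, here is a hedged illustration; the expected token list is inferred from the documented rules, not verified library output:

```python
from ratchada_utils.processor import tokenize_text

# Expected output is inferred from the documented gotchas
# (hyphenated words kept as one token, numbers as separate tokens),
# not verified against the library.
tokens = tokenize_text("state-of-the-art models scored 95", pred=True)
print(tokens)
# Expected per the rules above: ['state-of-the-art', 'models', 'scored', '95']
```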