From ab8729556a4e8505780585d9b43a662a73a92dcb Mon Sep 17 00:00:00 2001 From: marianbasti Date: Tue, 2 Apr 2024 15:39:56 -0300 Subject: [PATCH 1/5] [MOD] README --- README.md | 186 +++++++++++++++++++++++------------------------------- 1 file changed, 78 insertions(+), 108 deletions(-) diff --git a/README.md b/README.md index 0fb0f20..32a26e2 100644 --- a/README.md +++ b/README.md @@ -16,27 +16,11 @@ readme_image

- -## Mission -Our mission is to democratize access to legal information in Argentina, fostering a more informed society. While legal information is abundant, it remains challenging for citizens outside the legal domain to efficiently search and comprehend our legal system. Until recently, there existed an upper limit on how informed a society could be, consequently constraining its development. This upper boundary, however, is not static - it is defined by the current technology available to that society. As groundbreaking new technologies emerge, they effectively raise this ceiling, allowing for previously unattainable levels of progress. Throughout history, we have witnessed two pivotal revolutions in information technology – the invention of the printing press and the advent of the internet – each propelling society forward by expanding the boundaries of its potential development. Now, large language models (LLMs) present us with a new pivotal moment, one where we can access all the information we need in an effective and efficient way. - -At the core of our vision is ALI (el Asistente Legal Inteligente), an intelligent legal assistant that acts as a bridge between the complexities of the legal system and the needs of everyday citizens. Through natural language interaction, ALI will demystify intricate legal concepts, providing clear explanations of rights, regulations, and the consequences of emerging legislation. While not a substitute for professional legal counsel, ALI will equip citizens with a deeper understanding of the laws that shape their lives. - -We recognize that building a tool as transformative as ALI requires a collaborative and transparent approach. This initiative must be driven by the needs and perspectives of the community it aims to serve. By fostering an open dialogue and actively involving diverse stakeholders – from legal experts to civic organizations and engaged citizens – we can ensure that ALI accurately reflects the multifaceted realities of Argentina's legal landscape. Transparency will be paramount, as we strive to build trust and accountability through every stage of development and deployment. - -Ultimately, our goal is to create a tool that belongs to the people, one that empowers individuals with the knowledge and understanding to navigate the legal system with confidence and agency. By embracing a community-driven, open approach, we can collectively unlock the full potential of this technology, shaping a future where access to legal information is truly democratized. - -You can try ALI at [https://chat.omnibus.com.ar/](https://chat.omnibus.com.ar/). - -## Who are We -We are just a group of Argentinean developers, named `sandbox.ai`. As of now, this is our temporary [placeholder website](https://sandbox-ai.github.io/). Our full website is currently under construction and will be available soon. Same thing with our social media, it's there but it's not fully fledged yet: [Twitter](https://twitter.com/sandboxaiorg), [LinkedIn](https://www.linkedin.com/in/sandboxai-org-b0a7842b4/). -You can also contact us directly via email at `sandboxai.org@proton.me`. - +This repo contains the code and instructions to set up ALI, a web UI and back-end pipeline for Retrieval Augmented Generation around legal documents. +You can try ALI at [https://chat.omnibus.com.ar/](https://chat.omnibus.com.ar/) . ## Table of Contents -- [Mission](#mission) -- [Who are We](#who-are-we) -- [Table of Contents](#table-of-contents) +- [Installation](#installation) - [Description](#description) - [Overview](#overview) - [Technical information](#technical-information) @@ -56,45 +40,86 @@ You can also contact us directly via email at `sandboxai.org@proton.me`. - [Sponsoring the project](#sponsoring-the-project) - [License](#license) +### Installation + +Clone repo, create new environment, install backend and frontend requirements +```bash +git clone https://github.com/sandbox-ai/ali.git +cd ali +conda create --name ali +conda activate ali +cd backend +pip install -r requirements.txt +cd ../frontend +npm install +``` + +**Usage** + +Launch the backend in one terminal +```bash +conda activate ali +export OPENAI_API_KEY= +# or if you have your own LLM backend +export OPENAI_API_BASE= +cd backend +python api.py +``` + +Open a new terminal and launch the frontend: +```bash +cd frontend +ng serve +``` + +Also see `backend/README.md` and `frontend/README.md` + ## Description ### Overview -The approach we took with ALI is a Retrieval Augmented Generation ([RAG](https://arxiv.org/abs/2005.11401)) system. That is: given an user query, the system searches for the relevant information and hands this information to a Large Language Model (LLM), whom will use that information to answer the query. -It's important to note that RAG is some sort of trick or cheat. Ideally, we would like our LLM (or Mixture of LLMs) to "know" everything about the Argentinean law. This means that we would have to find a way to compress/encode all the legal information into the parameters of the model (often called "parametric memory"), so that when the LLM is prompted to answer the query, this legal information gets decompressed/decoded to generate an answer. These methods exist (e.g., [fine-tuning]() and [LoRA](https://arxiv.org/abs/2106.09685)), but as of now we these methods are not robust enough to rely solely on them (plus they can be expensive), thus we rely on RAG (often called "non-parametric memory"). If we could reliably teach our models the information that we need them to know (that is, every piece of useful information goes into parametric memory), then we wouldn't need things like RAG. +The complexity of legal documents and legalese terminology presents a barrier to most of the population, who wont't able to interact and understand their own legal system unless aided by a professional in the field. -One can think of storing information in model parameters as reading a book and remembering it, whereas RAG is more akin to having a book that you can check its pages whenever you need some of the information. +To help close this gap in the Spnish speaking world, we built the Asistente Legal Inteligente (or ALI). It uses Retrieval Augmented Generation ([RAG](https://arxiv.org/abs/2005.11401)), i.e. searching a vectorstore given an user query and formulating an answer with a Large Language Model (LLM), and a custom dataset that lays a RAG optimized structure. +The result is a grounded and comprehensive assistant that can answer questions about general legislation and specific legal situations. ### Technical information #### General -As of now, the RAG implemented in ALI is as simple as it can be: -We embed the user queries using a custom [embedding model](https://huggingface.co/dariolopez/roberta-base-bne-finetuned-msmarco-qa-es-mnrl-mn) (which we host in AWS), we search for the best matching legal documents using cosine-similarity, and then ask the LLM to answer the user query using the best matching documents. -This is far from something advanced and performant. To see our roadmap on RAG improvements, see #Improvements. -The RAG system is written from scratch in python, see `src/rag_session.py` +The user query is embedded using a custom Spanish [embedding model](https://huggingface.co/dariolopez/roberta-base-bne-finetuned-msmarco-qa-es-mnrl-mn), then used to search for the best matching legal documents with cosine-similarity. -#### On LLM frameworks -We tried using both [llama_index](https://github.com/run-llama/llama_index) and [langchain](https://github.com/langchain-ai/langchain), but we found them too restrictive and in the end more cumbersome than writing everything from scratch, which allowed us to have finer control and find where the system was failing. +To formulate the answer, an LLM osted with a [OpenAI compatible API endpoint](https://platform.openai.com/docs/api-reference) is queried by passing a custom prompt and the best matching documents. -Looking ahead, we are more keen on writing the whole pipeline with [DSPy](https://github.com/stanfordnlp/dspy). +This technique has ample room for improvements. See our roadmap on [RAG improvements](#improvements-over-baseline-rag). +You can check out the RAG system written from scratch in `src/rag_session.py` -#### Main problems we encountered -The main problem we encountered with a RAG over argentinean legal data was the embedding of the information. This problem has two parts: +#### On LLM frameworks +We've tested both [llama_index](https://github.com/run-llama/llama_index) and [langchain](https://github.com/langchain-ai/langchain), but found them too restrictive and in the end more cumbersome than developing our own pipeline over [transformers](https://huggingface.co/docs/transformers/en/index), enabling finer control and suprevision. -1. Embedding model - We tried using OpenAI's embedding model (i.e., ada-002), but it wasn't great at distancing different legal topics and clustering similar ones (and it wasn't great in general for Spanish text). Thus we tried the custom embedding [model](https://huggingface.co/dariolopez/roberta-base-bne-finetuned-msmarco-qa-es-mnrl-mn) described earlier. - Notes: OpenAI has release new [embbeding models](https://openai.com/blog/new-embedding-models-and-api-updates?ref=blog.salesforceairesearch.com) like `text-embedding-3` that we haven't tried, and Cohere seems to be releasing great embedding models (both performant and 100x cheapear than OpenAI's) that are on the top of the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard). See [cohere-embed-multilingual-v3.0](https://huggingface.co/Cohere/Cohere-embed-multilingual-v3.0). +#### Main challenges encountered +The main problem we encountered with a RAG pipeline over argentinian legal data was the embedding of the information. This problem has two parts: -2. Legal Document Chunking/Parsing - - We found that just chunking legal documents with a regular document chunker wasn't great, it was hard to compute a good embedding vector with a chunk that is missing a lot of contextual information (e.g., a chunk can say something like: ""). Thus we decided to define each legal article as our atomic unit and embbed it as a whole. - - When we tried to embedd whole articles as they were, distancing and clustering of vectors wasn't great. The trick was to preppend all the legal metadata to each legal article BEFORE embedding it. That is, each article to be embbeded looks like: "Decreto de Necesidad y Urgencia N° DNU-2023-70-APN-PTE. Fecha 20-12-2023. Titulo II: DESREGULACIÓN ECONÓMICA. Capitulo II: Tarjetas de crédito (Ley N° 25.065). Articulo 15: La entidad emisora deberá obligatoriamente dar a conocer el público la [...]" instead of just "Articulo 15: La entidad emisora deberá obligatoriamente dar a conocer el público la [...]". +1. Embedding model: + We tested OpenAI's embedding model (ada-002), and found that it wasn't great at distancing different legal topics and clustering similar ones (and it wasn't great in general for Spanish). Thus we opted for the custom embedding [model](https://huggingface.co/dariolopez/roberta-base-bne-finetuned-msmarco-qa-es-mnrl-mn) described earlier. -Note that all the results that we found in relation to the embedding models were a direct conclusion of plotting the resulting embedding vectors with the dimensionality reduction technique [t-SNE: t-distributed Stochastic Neighbor Embedding](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html). Note that there are alternatives such as [UMAP: Universal Manifold Approximation & Projection](https://github.com/lmcinnes/umap) + **Notes:** Newer models that are topping the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) are worth testing. New OpenAI's [embbeding models](https://openai.com/blog/new-embedding-models-and-api-updates?ref=blog.salesforceairesearch.com) have been released and should be tested. +2. Legal Document Chunking/Parsing + + We found that naively chunking legal documents with a regular document chunker was sub-par, trimming and leaving out vital contextual information. We decided to define each legal article as our atomic unit and embbed it as a whole. +When we tried to embed full articles as they were, distancing and clustering of vectors wasn't great. The trick that did it was to prepend all the contextual metadata to each article BEFORE embedding. So, each article to be embbeded would look like: +``` +"Decreto de Necesidad y Urgencia N° DNU-2023-70-APN-PTE. Fecha 20-12-2023. Titulo II: DESREGULACIÓN ECONÓMICA. Capitulo II: Tarjetas de crédito (Ley N° 25.065). Articulo 15: La entidad emisora deberá obligatoriamente dar a conocer el público la [...]" +``` +instead of just +``` +"Articulo 15: La entidad emisora deberá obligatoriamente dar a conocer el público la [...]". +``` +All the results that we found in relation to the embedding models were a direct conclusion of plotting the resulting embedding vectors with the dimensionality reduction technique [t-SNE: t-distributed Stochastic Neighbor Embedding](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html). Note that there are alternatives such as [UMAP: Universal Manifold Approximation & Projection](https://github.com/lmcinnes/umap) #### Improvements over baseline RAG. There are many improvements to be made over this baseline RAG. The following is a non-exhaustive list: @@ -102,84 +127,24 @@ There are many improvements to be made over this baseline RAG. The following is - Query rewriting (hard without proper context of argentinean law, either parametric or non parametric), potentially connected with a Fusion retriever. - Document reranking - Hybrid Search (keyword-based search algorithms combined with vector search) -- Document reranking - Linear adapter for the embedding model -- Fine-tuning over each legal domain (think MoE over each legal subject, such as civil, penal, etc) +- Fine-tuning over each legal domain - Adding an agentic layer capable of handling complex questions through multiple hops of reasoning and retrieval. -#### Data prep -We have scraped the whole Boletín Oficial. You can find the code to do so [here](https://github.com/sandbox-ai/Boletin-Oficial-Argentina) -We need to parse all that data in the way we explained earlier (preppending metadata). This needs to be done with an LLM, since using a hard coded python function seems unfeasible (lots of edge cases that are hard to handle and predict). - -#### Hosting of custom LLMs -Since fine-tuning/QLoRAing models will likely improve the performance of the RAG system, the question of where to host those model arises (since hosting them in something like AWS is prohibitively, at least without lots of optimizations). The provider with the best latency and cheaper price for the time being seems to be [Together.AI](https://www.together.ai/), but we haven't tried it yet. - -#### Running ALI locally - -**Installation** - -Create new environment, install backend and frontend requirements -```bash - -conda create --name -conda activate -cd backend -pip install -r requirements.txt -cd ../frontend -npm install -``` - -**Usage** - -Launch the backend in one terminal -```bash -conda activate -export OPENAI_API_KEY= -cd backend -python api.py -``` - -Open a new terminal and launch the frontend: -```bash -cd frontend -ng serve -``` - -Also see `backend/README.md` and `frontend/README.md` - - -#### Using Local LLMs instead of cloud-served ones (openai, anthropic, etc) -We haven't developed this yet, but we believe the best frameworks to do this are: -- [Ollama](https://github.com/ollama/ollama) -- [LocalAI](https://github.com/mudler/LocalAI) +#### Dataset +For testing purposes, we include in the current repo a processed version of the [DNU 70/2023](https://www.argentina.gob.ar/normativa/nacional/decreto-70-2023-395521/texto), an extensive executive order that modifies a wide array of laws. +In order to create a vector database of the whole Argentinian law, we can look into the Boletín Oficial, where everything about legislation is published. -#### Running ALI on a smartphone!? -This space is VERY green yet, but we are excited about this possibility and is part of our long-term vision. -A few repos that are doing this and we've been keeping an eye on are: +We have built [a tool to scrap](https://github.com/sandbox-ai/Boletin-Oficial-Argentina) the Boletín Oficial into a dataset dataset. You can also find the result uploaded and up-to-date on [Huggingface](https://huggingface.co/datasets/marianbasti/boletin-oficial-argentina). -- [MLC LLM](https://github.com/mlc-ai/mlc-llm) -- [mllm](https://github.com/UbiquitousLearning/mLLM) -- [maid](https://github.com/Mobile-Artificial-Intelligence/maid) -- [Jan](https://github.com/janhq/jan/issues/728) -- [koboldcpp](https://github.com/LostRuins/koboldcpp?tab=readme-ov-file#compiling-on-android-termux-installation) +This raw dataset must be parsed into the format described earlier (prepending contextual metadata). Given the inconsistent and unpredictable formatting of the documents and texts, there is no simple programmatic parsing to automate the process. We found that various NLP techniques are useful in automating this task (prompting LLMs, sentence-transformers, NER). #### Testing -We plan to test ALI with [ragas](https://github.com/explodinggradients/ragas), but this must be done with a local LLM (we tested this with a relatively small document and gpt4 and spent 40 U$$ in half an hour) +[Ragas](https://github.com/explodinggradients/ragas) is a framework to evaluate RAG pipelines that could be used to test ALI. It is important to acknowledge the costs using a paid API (we tested this with a relatively small document and GPT-4 and spent 40 USD in half an hour) -## Long-term vision -We are hoping to be able to run ALI locally smartphones. -For one reason, that seems to be the most efficient way (why use cloud when lot's of smartphone compute is idle?), but most importantly, that will give a lot of agency to each citizen. We develop it, we all own it. We are all free. -Plus this saves a lot on compute costs (what is the cheapest way to serve 47 million potential users? Locally of course!) -This space is still green, there's a lot of work going on. - -One of the best shots might be to use RNNs instead of Transformers. -One example of this is the [RWKV Language Model](https://wiki.rwkv.com/), which reduces compute requirements by a potential 100x and scales linearly with context length, among other things. -The other one are the [Mamba](https://arxiv.org/abs/2312.00752) style LLMs, which are based in a state-space architecture, such as [Jamba](https://huggingface.co/ai21labs/Jamba-v0.1). - - -## Thank you notes +## Acknowledgements Huge thanks to the [Justicio](https://github.com/bukosabino/justicio) team from Spain, who gave us a lot of tips and shared their embedding model with us Definetly go check their project and talk to the creators! @@ -192,9 +157,14 @@ Definetly go check their project and talk to the creators! Please refer to CONTRIBUTING.md for a detailed explanation of branch/commit naming conventions +## Who we are +We are a group of Argentinian developers named [`sandbox.ai`](https://sandbox-ai.github.io/). +You can find us on [Twitter](https://twitter.com/sandboxaiorg) and [LinkedIn](https://www.linkedin.com/in/sandboxai-org-b0a7842b4/). +You can also contact us directly at `sandboxai.org@proton.me`. + ## Sponsoring the project -As of now, ALI is running on AWS using the OpenAI api key. Both things are expensive and as of now these costs are being covered by the sandbox.ai devs. -If you would like to sponsor this project (either by covering the AWS costs or the OpenAI api key costs), please contact us at `sandboxai.org@proton.me`. +As of now, ALI is running on AWS using our OpenAI API key. Both things are expensive and as of now these costs are being covered by the sandbox.ai devs. +If you find this project useful and would like to sponsor us (either by covering the AWS costs or the OpenAI api key costs), please contact us at `sandboxai.org@proton.me`. ## License MIT From 2547518c42b1d3b7ca8addef5f500491d7b2b924 Mon Sep 17 00:00:00 2001 From: marianbasti Date: Tue, 2 Apr 2024 15:43:13 -0300 Subject: [PATCH 2/5] [MOD] README --- README.md | 11 +++-------- 1 file changed, 3 insertions(+), 8 deletions(-) diff --git a/README.md b/README.md index 32a26e2..c8b7b7c 100644 --- a/README.md +++ b/README.md @@ -26,16 +26,11 @@ You can try ALI at [https://chat.omnibus.com.ar/](https://chat.omnibus.com.ar/) - [Technical information](#technical-information) - [General](#general) - [On LLM frameworks](#on-llm-frameworks) - - [Main problems we encountered](#main-problems-we-encountered) + - [Main challenges encountered](#main-challenges-encountered) - [Improvements over baseline RAG.](#improvements-over-baseline-rag) - - [Data prep](#data-prep) - - [Hosting of custom LLMs](#hosting-of-custom-llms) - - [Running ALI locally](#running-ali-locally) - - [Using Local LLMs instead of cloud-served ones (openai, anthropic, etc)](#using-local-llms-instead-of-cloud-served-ones-openai-anthropic-etc) - - [Running ALI on a smartphone!?](#running-ali-on-a-smartphone) + - [Dataset](#dataset) - [Testing](#testing) -- [Long-term vision](#long-term-vision) -- [Thank you notes](#thank-you-notes) +- [Acknowledgementes](#acknowledgements) - [Contributing](#contributing) - [Sponsoring the project](#sponsoring-the-project) - [License](#license) From 5d585cb7988b670d1b1757da9dbaf09b254f20b6 Mon Sep 17 00:00:00 2001 From: Tatako Felici Date: Tue, 2 Apr 2024 15:46:59 -0300 Subject: [PATCH 3/5] fix typo --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index c8b7b7c..43678a6 100644 --- a/README.md +++ b/README.md @@ -75,7 +75,7 @@ Also see `backend/README.md` and `frontend/README.md` The complexity of legal documents and legalese terminology presents a barrier to most of the population, who wont't able to interact and understand their own legal system unless aided by a professional in the field. -To help close this gap in the Spnish speaking world, we built the Asistente Legal Inteligente (or ALI). It uses Retrieval Augmented Generation ([RAG](https://arxiv.org/abs/2005.11401)), i.e. searching a vectorstore given an user query and formulating an answer with a Large Language Model (LLM), and a custom dataset that lays a RAG optimized structure. +To help close this gap in the Spanish speaking world, we built the Asistente Legal Inteligente (or ALI). It uses Retrieval Augmented Generation ([RAG](https://arxiv.org/abs/2005.11401)), i.e. searching a vectorstore given an user query and formulating an answer with a Large Language Model (LLM), and a custom dataset that lays a RAG optimized structure. The result is a grounded and comprehensive assistant that can answer questions about general legislation and specific legal situations. From ef10c25f7acdce00042280c6cd06a93758eb459e Mon Sep 17 00:00:00 2001 From: marianbasti Date: Tue, 2 Apr 2024 16:15:17 -0300 Subject: [PATCH 4/5] [MOD] Change license to GPL --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 43678a6..9db17b9 100644 --- a/README.md +++ b/README.md @@ -162,7 +162,7 @@ As of now, ALI is running on AWS using our OpenAI API key. Both things are expen If you find this project useful and would like to sponsor us (either by covering the AWS costs or the OpenAI api key costs), please contact us at `sandboxai.org@proton.me`. ## License -MIT +GPL From 972898ff107c02cf1c16598ac15d886b56060d7c Mon Sep 17 00:00:00 2001 From: Marian Basti <31198560+marianbasti@users.noreply.github.com> Date: Tue, 2 Apr 2024 16:21:14 -0300 Subject: [PATCH 5/5] [MOD] typo --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 9db17b9..ea09ee6 100644 --- a/README.md +++ b/README.md @@ -132,7 +132,7 @@ For testing purposes, we include in the current repo a processed version of the In order to create a vector database of the whole Argentinian law, we can look into the Boletín Oficial, where everything about legislation is published. -We have built [a tool to scrap](https://github.com/sandbox-ai/Boletin-Oficial-Argentina) the Boletín Oficial into a dataset dataset. You can also find the result uploaded and up-to-date on [Huggingface](https://huggingface.co/datasets/marianbasti/boletin-oficial-argentina). +We have built [a tool to scrap](https://github.com/sandbox-ai/Boletin-Oficial-Argentina) the whole Boletín Oficial into a dataset. You can also find the result uploaded and up-to-date on [Huggingface](https://huggingface.co/datasets/marianbasti/boletin-oficial-argentina). This raw dataset must be parsed into the format described earlier (prepending contextual metadata). Given the inconsistent and unpredictable formatting of the documents and texts, there is no simple programmatic parsing to automate the process. We found that various NLP techniques are useful in automating this task (prompting LLMs, sentence-transformers, NER).