diff --git a/docs/api/benchmarks.rst b/docs/api/benchmarks.rst
deleted file mode 100644
index 3197092..0000000
--- a/docs/api/benchmarks.rst
+++ /dev/null
@@ -1,44 +0,0 @@
-Benchmarking
-=============
-
-When comparing LLMs, there is a constant tradeoff to make between quality, cost and latency. Stronger models are (in general) slower and more expensive - and sometimes overkill for the task at hand. Complicating matters further, new models are released weekly, each claiming to be state-of-the-art.
-
-Benchmarking on your data lets you see how each of the different models perform on your task.
-
-.. image:: ../images/benchmarks.png
-   :align: center
-   :width: 800
-   :alt: Benchmarks Image.
-
-You can compare how quality relates to cost and latency, with live stats pulled from our `runtime benchmarks `_.
-
-When new models come out, simply re-run the benchmark to see how they perform on your task.
-
-
-Preparing your dataset
------------------------
-First create a dataset which is representative of the task you want to evaluate.
-You will need a list of prompts, optionally including a reference, *gold-standard* answer. Datasets containing reference answers tend to get more accurate benchmarks.
-
-The file itself should be in JSONL format, with one entry per line, as in the example below.
-
-.. code-block::
-
-   {"prompt": "This is the first prompt", "ref_answer": "This is the first reference answer"}
-   {"prompt": "This is the second prompt", "ref_answer": "This is the second reference answer"}
-
-Use at least 50 prompts to get the most accurate results. Currently there is an maximum limit of 500 prompts, for most tasks we don’t tend to see much extra detail past ~250.
-
-Benchmarking your dataset
--------------------------
-In `your dashboard `_, clicking :code:`Select benchmark` and then :code:`Benchmark your prompts` opens the interface to upload a dataset.
-
-When the benchmark finishes, you'll receive an email, and the graph will be displayed in your `dashboard `_.
-
-The x-axis can be set to represent :code:`cost`, :code:`time-to-first-token`, or :code:`inter-token latency`, and on either a linear or log scale.
-
-How does it work?
-^^^^^^^^^^^^^^^^^^
-Currently, we use gpt4o-as-a-judge (cf. https://arxiv.org/abs/2306.05685), to evaluate the quality of each model’s responses.
-
-
diff --git a/docs/concepts/benchmarks.rst b/docs/concepts/benchmarks.rst
index 50c4ed5..3197092 100644
--- a/docs/concepts/benchmarks.rst
+++ b/docs/concepts/benchmarks.rst
@@ -1,156 +1,44 @@
-Benchmarks
-==========
+Benchmarking
+=============
-In this section, we explain our process for benchmarking LLM endpoints. We discuss quality and runtime benchmarks separately.
+When comparing LLMs, there is a constant tradeoff to make between quality, cost and latency. Stronger models are (in general) slower and more expensive - and sometimes overkill for the task at hand. Complicating matters further, new models are released weekly, each claiming to be state-of-the-art.
-Quality Benchmarks
-------------------
+Benchmarking on your data lets you see how each of the different models performs on your task.
-Finding the best LLM(s) for a given application can be challenging. The performance of a model can vary significantly depending on the task, dataset, and evaluation metrics used. Existing benchmarks attempt to compare models based on standardized approaches, but biases inevitably creep in as models learn to do well on these targeted assessments.
-
-Practically, the LLM community still heavily relies on testing models manually to build an intuition around their expected behavior for a given use-case. While this generally works better, hand-crafted testing isn't sustainable as one's needs evolve and new LLMs emerge at a rapid pace.
-Our LLM assessment pipeline is based on the method outlined below.
-
-Design Principles
-^^^^^^^^^^^^^^^^^
-
-Our quality benchmarks are based on a set of guiding principles. Specifically, we strive to make our pipeline:
-
-- **Systematized:** A rigorous benchmarking pipeline should be standardized across assessments, repeatable, and scalable. We make sure to benchmark all LLMs identically to with a well-defined approach we outline in the next passage.
-
-- **Task-centric:** Models perform differently on various tasks. Some might do better at coding, others are well suited for summarizing content, etc. These broad task categories can also be refined into specific subtasks. For e.g summarizing technical content to generate product documentation is radically different from summarizing news. This should be reflected in assessments. For this reason, we allow you to upload your custom prompt dataset, that you believe reflects the intended task, to use as a reference for running benchmarks.
-
-- **Customizable:** Assessments should reflect the unique needs of the assessor. Depending on your application requirements, you may need to strictly include / exclude some models from the benchmarks. We try to strike a balance between standardization and modularity such that you can run the benchmarks that are relevant to your needs.
-
-Methodology
-^^^^^^^^^^^
-
-Overview
-********
-We benchmark models using the LLM-as-a-judge approach. This relies on using a powerful language model to generate assessments on the outputs of other models, using a standard reviewing procedure. LLM-as-a-judge is sometimes used to run experiments at scale when generating human assessments isn't an option or to avoid introducing human biases.
-
-Given a dataset of user prompts, each prompt is sent to all endpoints to generate an output. Then, we ask GPT-4 to review each output and give a final assessment based on how helpful and accurate the response is relative to either (a) the user prompt, in the case of unlabelled datasets, or (b) the prompt and the reference answer, in the case of labelled datasets.
-
-Scoring
-*******
-
-The assessor LLM reviews the output of an endpoint which it categorizes as :code:`irrelevant`, :code:`bad`, :code:`satisfactory`, :code:`very good`, or :code:`excellent`. Each of these labels is then mapped to a numeric score ranging from 0.0 to 1.0. We repeat the same proces for all prompts in the dataset to get the endpoint's performance score on each prompt. The overall endpoint's score is then the average of these prompt-specific scores.
-
-Visualizing Results
-*******************
-
-In addition to the list of model scores, we also compute runtime performance for the endpoint (as explained in the section below). Doing so allows us to plot the quality performance versus runtime to assess the quality-to-performance of the endpoints, instead of relying on the quality scores alone.
-
-.. image:: ../images/console_dashboard.png
+.. image:: ../images/benchmarks.png
    :align: center
-   :width: 650
-   :alt: Console Dashboard.
-
-.. note::
-    Because quality scores are model-specific, they are the same across the different endpoints exposed for a given model.
-    As a result, all the endpoints for a model will plot horizontally at the same quality level, with only the runtime metric setting them apart.
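As a rough illustration of the scoring step described above, the sketch below maps the judge's labels to evenly spaced scores and averages them per endpoint; the exact numeric values behind each label are an assumption, not the production mapping.

.. code-block:: python

    # Hypothetical label-to-score mapping; the real values may differ.
    LABEL_SCORES = {
        "irrelevant": 0.0,
        "bad": 0.25,
        "satisfactory": 0.5,
        "very good": 0.75,
        "excellent": 1.0,
    }

    def endpoint_score(judge_labels: list[str]) -> float:
        """Average the per-prompt scores to obtain the endpoint's overall score."""
        return sum(LABEL_SCORES[label] for label in judge_labels) / len(judge_labels)

    print(endpoint_score(["excellent", "satisfactory", "very good", "bad"]))  # 0.625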
-
-Considerations and Limitations
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Despite having a well-defined benchmarking approach, it also inevitably comes with its own issues. Using an LLM to judge outputs may introduce a different kind of bias through the data used to train the assessor model. We are currently looking at ways to mitigate this with more diversified and / or customized judge LLM selection.
-
-Runtime Benchmarks
-------------------
-
-Finding the best model(s) for a task is just the first step to optimize LLM pipelines. Given the plethora of endpoint providers offering the same models, true optimization requires considering performance discrepancies across endpoints and time.
-
-Because this is a complex decision, it needs to be made based on data. For this data to be reliable, it should also result from transparent and objective measurements, which we outline in this below.
-
-.. note::
-    Our benchmarking code is openly available in `this repository `_.
-
-Design Principles
-^^^^^^^^^^^^^^^^^
-
-Our runtime benchmarks are based on a set of guiding principles. Specifically, we believe benchmarks should be:
-
-- **Community-driven:** We invite everyone to audit or improve the logic and the code. We are building these benchmarks for the community, so contributions and discussions around them are more than welcome!
+   :width: 800
+   :alt: Benchmarks Image.
-- **User-centric:** External factors (e.g. how different providers set up their infrastructure) may impact measurements. Nevertheless, our benchmarks are not designed to gauge performance in controlled environments. Rather, we aime to measure performance as experienced by the end-user who, ultimately, is subject to the same distortions.
+You can compare how quality relates to cost and latency, with live stats pulled from our `runtime benchmarks `_.
-- **Model and Provider-agnostic:** While some metrics are more relevant to certain scenarios (e.g. cold start time in model endpoints that scale to zero), we try to make as few assumptions as possible on the providers or technologies being benchmarked. We only assume that endpoints take a string as the input and return a streaming response.
+When new models come out, simply re-run the benchmark to see how they perform on your task.
-Methodology
-^^^^^^^^^^^
+Preparing your dataset
+-----------------------
+First create a dataset which is representative of the task you want to evaluate.
+You will need a list of prompts, optionally including a reference, *gold-standard* answer. Datasets containing reference answers tend to get more accurate benchmarks.
-Tokenizer
-*********
-
-To avoid biases towards any model-specific tokenizer, we calculate all metrics using the same tokenizer across different models. We have chosen the `cl100k_base` tokenizer from OpenAI's `tiktoken `_ library for this since it’s MIT licensed and already widely adopted by the community.
-
-Inputs and Outputs
-******************
-
-To fairly assess optimizations such as speculative decoding, we use real text as the input and avoid using randomly generated data. The length of the input affects prefill time and therefore can affect the responsiveness of the system. To account for this, we run the benchmark with two input regimes.
-
-- Short inputs: Using sentences with an average length of 200 tokens and a standard deviation of 20.
-- Long inputs: Using sentences with an average length of 1000 tokens and a standard deviation of 100.
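A minimal sketch of how token counts and target input lengths could be handled under the scheme above, assuming the standard ``tiktoken`` API and a normal distribution for each regime; it is illustrative only, not the benchmarking code itself.

.. code-block:: python

    import random

    import tiktoken  # pip install tiktoken

    # The same tokenizer is used for every model, to avoid model-specific bias.
    encoding = tiktoken.get_encoding("cl100k_base")

    def token_count(text: str) -> int:
        return len(encoding.encode(text))

    def sample_target_length(regime: str) -> int:
        """Draw a target input length from the chosen regime's distribution."""
        mean, std = (200, 20) if regime == "short" else (1000, 100)
        return max(1, round(random.gauss(mean, std)))

    print(token_count("Repeat the following lines 3 times"), sample_target_length("long"))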
-
-To build these clusters, we programmatically select sentences from `BookCorpus `_ and create two subsets of it. For instruct/chat models to answer appropriately and ensure a long enough response, we preface each prompt with :code:`Repeat the following lines <#> times without generating the EOS token earlier than that`, where :code:`<#>` is randomly sampled.
-
-For the outputs, we use randomized discrete values from the same distributions (i.e. N(200, 20) for short inputs and N(1000, 100) for long ones) to cap the number of tokens in the output. This ensures variable output length, which is necessary to consider algorithms such as Paged Attention or Dynamic Batching.
-
-When running one benchmark across different endpoints, we seed each runner with the same initial value, so that the inputs are the same for all endpoints.
-
-Computation
-***********
-
-To execute the benchmarks, we run three processes periodically from three different regions: **Hong Kong, Belgium and Iowa**. Each one of these processes is triggered every three hours and benchmarks every available endpoint.
-
-Accounting for the different input policies, we run a total of 4 benchmarks for each endpoint every time a region benchmark is triggered.
-
-
-Metrics
-*******
-
-Several key metrics are captured and calculated during the benchmarking process:
-
-- **Time to First Token (TTFT):** Time between request initiation and the arrival of the first streaming response packet. TTFT directly reflects the prompt processing speed, offering insights into the efficiency of the model's initial response. A lower TTFT signifies quicker engagement, which is crucial for applications that require dynamic interactions or real-time feedback.
-
-- **End to End Latency:** Time between request initiation and the arrival of the final packet in the streaming response. This metric provides a holistic view of the response time, including processing and transmission.
-
-- **Inter Token Latency (ITL):** Average time between consecutive tokens in the response. We compute this as :code:`(End to End Latency) / (Output Tokens - 1)`. ITL provides valuable information about the pacing of token generation and the overall temporal dynamics within the model's output. As expected, a lower ITL signifies a more cohesive and fluid generation of tokens, which contributes to a more seamless and human-like interaction with the model.
-
-- **Number of Output Tokens per Second:** Relation between the number of tokens generated and the time taken. We don't consider the TTFT here, so this is equivalent to :code:`1 / ITL`. In this case, a higher Number of Output Tokens per Second means a faster and more productive model output. It's important to note that this is **not** a measurement of the throughput of the inference server since it doesn't account for batched inputs.
-
-- **Cold Start:** Time taken for a server to boot up in environments where the number of active instances can get to zero. We consider a threshold of 15 seconds. What this means is that we do an initial "dumb" request to the endpoint and record its TTFT. If this TTFT is greater than 15 seconds, we measure the time it takes to get the second token. If the ratio between the TTFT and first ITL measurements is at least 10:1, we consider the TTFT to be Cold Start time. Once this process has finished. We start the benchmark process in the warmed-up instance.
-  This metric reflects the time it takes for the system to be ready for processing requests, rendering it essential for users relying on prompt and consistent model responses, allowing you to account for any potential initialization delays in the responses and ensuring a more accurate expectation of the model's responsiveness.
-
-- **Cost**: Last but not least, we present information about the cost of querying the model. This is usually different for the input tokens and the response tokens, so it can be beneficial to choose different models depending on the end task. As an example, to summarize a document, a provider with lower price in the input tokens would be better, even if it comes with a slightly higher price in the output. On the other hand, if you want to generate long-format content, a provider with a lower price per generated token will be the most appropriate option.
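The sketch below shows how these metrics could be derived from the timestamps of a single streamed response, following the formulas above; the timing structure and the cold-start check are simplified illustrations under stated assumptions, not the benchmarking implementation.

.. code-block:: python

    def compute_metrics(request_time: float, token_times: list[float]) -> dict:
        """Derive runtime metrics for one streamed response.

        ``request_time`` is when the request was sent; ``token_times`` holds the
        arrival time of each streamed token (a hypothetical data structure).
        """
        ttft = token_times[0] - request_time
        e2e_latency = token_times[-1] - request_time
        itl = e2e_latency / (len(token_times) - 1)
        output_tokens_per_second = 1 / itl
        # Cold-start heuristic: a TTFT above 15 s that is at least 10x the first
        # inter-token gap is treated as cold-start time.
        first_gap = token_times[1] - token_times[0]
        cold_start = ttft if ttft > 15 and ttft / first_gap >= 10 else 0.0
        return {
            "ttft": ttft,
            "e2e_latency": e2e_latency,
            "itl": itl,
            "output_tokens_per_second": output_tokens_per_second,
            "cold_start": cold_start,
        }

    print(compute_metrics(0.0, [0.4, 0.45, 0.5, 0.56, 0.61]))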
-
-Data Presentation
-*****************
-
-When aggregating metrics, particularly in benchmark regimes with multiple concurrent requests, we calculate and present the P90 (90th percentile) value from the set of measurements. We choose the P90 to reduce the influence of extreme values and provide a reliable snapshot of the model's performance.
-
-When applicable, aggregated data is shown both in the plots and the benchmark tables.
-
-.. image:: ../images/benchmarks_model_page.png
-   :align: center
-   :width: 650
-   :alt: Benchmarks Model Page.
+The file itself should be in JSONL format, with one entry per line, as in the example below.
-Additionally, we also include a MA5 view (Moving Average of the last 5 measurements) in the graphs. This smoothing technique helps mitigate short-term fluctuations and should provide a clearer trend representation over time.
+.. code-block::
-.. note::
-    In some cases, you will find :code:`Not computed` instead of a value, or even a :code:`No metrics are available yet` message instead of the benchmark data. This is typically due to an internal issue or a rate limit, which we'll be quickly fixing.
+   {"prompt": "This is the first prompt", "ref_answer": "This is the first reference answer"}
+   {"prompt": "This is the second prompt", "ref_answer": "This is the second reference answer"}
+Use at least 50 prompts to get the most accurate results. There is currently a maximum limit of 500 prompts; for most tasks, we don't see much extra detail beyond ~250.
-Considerations and Limitations
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Benchmarking your dataset
+-------------------------
+In `your dashboard `_, clicking :code:`Select benchmark` and then :code:`Benchmark your prompts` opens the interface to upload a dataset.
-We try to tackle some of the more significant limitations of benchmarking inference endpoints. For example, network latency, by running the benchmarks in different regions; or unreliable point-measurements, by continuously benchmarking the endpoints and plotting their trends over time.
+When the benchmark finishes, you'll receive an email, and the graph will be displayed in your `dashboard `_.
-However, there are still some relevant considerations to have in mind. Our methodology at the moment is solely focused on performance, which means that we don't look at the output of the models.
+The x-axis can be set to represent :code:`cost`, :code:`time-to-first-token`, or :code:`inter-token latency`, and on either a linear or log scale.
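A minimal sketch of producing a dataset in the JSONL format above using Python's standard library; the prompts and reference answers are placeholders to replace with examples from your own task.

.. code-block:: python

    import json

    examples = [
        {"prompt": "This is the first prompt", "ref_answer": "This is the first reference answer"},
        {"prompt": "This is the second prompt", "ref_answer": "This is the second reference answer"},
    ]

    # One JSON object per line, ready to upload through the dashboard.
    with open("benchmark_dataset.jsonl", "w") as f:
        for example in examples:
            f.write(json.dumps(example) + "\n")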
-Nonetheless, even accounting for the public-facing nature of these endpoints (no gibberish allowed!), there might be some implementation differences that affect the output quality, such as quantization/compression of the models, different context window sizes, or different speculative decoding models, among others. We are working towards mitigating this as well, so stay tuned!
+How does it work?
+^^^^^^^^^^^^^^^^^^
+Currently, we use GPT-4o as a judge (cf. https://arxiv.org/abs/2306.05685) to evaluate the quality of each model’s responses.
-Round Up
---------
-You are now familiar with how we run our benchmarks. Next, you can explore how to `use the benchmarks, or run your own `_ through the benchmarks interface!
diff --git a/docs/api/deploy_router.rst b/docs/concepts/deploy_router.rst
similarity index 100%
rename from docs/api/deploy_router.rst
rename to docs/concepts/deploy_router.rst
diff --git a/docs/concepts/endpoints.rst b/docs/concepts/endpoints.rst
deleted file mode 100644
index 6f9f167..0000000
--- a/docs/concepts/endpoints.rst
+++ /dev/null
@@ -1,33 +0,0 @@
-Model Endpoints
-===============
-
-Unify lets you query model endpoints across providers. In this section, we explain what an endpoint is and how it relates to the concepts of models and providers.
-
-What is a Model Endpoint?
--------------------------
-
-A model endpoint is a model that you can interact with through an API, usually hosted by a provider. Model endpoints, particularly LLM endpoints, play a critical role when building and deploying AI applications at scale.
-
-A model can be offered by different providers through one or multiple endpoints. There's loads of ways to categorize providers, and the boundaries can sometimes be blurry as services overlap; but you can think of a provider as an end-to-end deployment stack that comes with unique sets of features, performance, pricing, and so on. While positive, this diversity also makes it difficult to find the most suitable endpoint for a specific use case.
-
-.. note::
-    Check out our blog post on `cloud serving `_ if you'd like to learn more about providers.
-
-Unify exposes a common HTTP endpoint for all providers, allowing you to query any of them using a **consistent request format, and the same API key**. This lets you use the same model across multiple endpoints, and optimize the performance metrics you care about.
-
-Available Endpoints
--------------------
-
-We strive to integrate the latest LLMs into our platform, across as many providers exposing endpoints for said models.
-
-You can explore our list of supported models through the `benchmarks interface `_ where you can simply search for a model you are interested in to visualise benchmarks and all sorts of relevant information on available endpoints for the model.
-
-..
-   If you prefer programmatic access, you can also use the
-   `List Models Endpoint `_, we discussed how different models perform better at different tasks, and how appropriate performance benchmarks can help steer and inform model selection for a given use-case.
-
-Given the diversity of prompts you can send to an LLM, it can quickly become tedious to manually swap between models for every single prompt, even when they pertain to the same broad category of tasks.
-
-Motivated by this, LLM routing aims to make optimal model selection automatic. With a router, each prompt is assessed individually and sent to the best model, without having to tweak the LLM pipeline.
-With routing, you can focus on prompting and ensure that the best model is always on the receiving end!
-
-Quality routing
----------------
-
-By routing to the best LLM on every prompt, the objective is to consistently achieve better outputs than using a single, all-purpose, powerful mode, at a fraction of the cost. The idea is that smaller models can be leveraged for some simpler tasks, only using larger models to handle complex queries.
-
-Using several datasets to benchmark the router (star-shaped datapoints) reveals that it can perform better than individual endpoints on average, without compromising on other metrics like runtime performance for e.g, as illustrated below.
-
-.. image:: ../images/console_dashboard.png
-   :align: center
-   :width: 650
-   :alt: Console Dashboard.
-
-You may notice that there are more than one star-shaped datapoints on the plot. This is because the *Router* can actually take all sorts of configurations, depending on the specified constraints in terms which endpoints can be routed to, the minimum acceptable performance level for a given metric, etc. As a result, a virtually infinite number of routers can be constructed by changing these parameters, allowing you to customize the routing depending on your requirements!
-
-Runtime routing
----------------
-
-When querying endpoints, other metrics beyond quality can be critical depending on the use-case. For e.g, cost may be important when prototyping an application, latency when building a bot where responsiveness is key, or output tokens per second if we want to generate responses as fast as possible.
-
-However, endpoint providers are inherently transient (You can read more about this `here `_), which means they are affected by factors like traffic, available devices, changes in the software or hardware stack, and so on.
-
-Ultimately, this results in a landscape where it's usually not possible to conclude that one provider is *the best*. Let's take a look at this graph from our benchmarks.
-
-.. image:: ../images/mixtral-providers.png
-   :align: center
-   :width: 650
-   :alt: Mixtral providers.
-
-In this image we can see the :code:`output tokens per second` of different providers hosting a :code:`Mixtral-8x7b` public endpoint. We can see how depending on the time of the day, the *best* provider changes.
-
-With runtime routing, your requests are automatically redirected to the provider outperforming the other services at that very moment. This ensures the best possible value for a given metric across endpoints.
-
-.. image:: ../images/mixtral-router.png
-   :align: center
-   :width: 650
-   :alt: Mixtral performance routing.
-
-Round Up
---------
-
-You are now familiar with routing. Next, you can `learn to use the router `_, or `build your custom router `_.
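To make the runtime-routing idea concrete, here is a toy sketch that picks whichever provider currently reports the best value for a metric; the provider names and numbers are made up, and the real router works on live benchmark data rather than a hard-coded snapshot.

.. code-block:: python

    # Hypothetical snapshot of output tokens per second for one model's providers.
    latest_tokens_per_second = {
        "provider-a": 92.4,
        "provider-b": 110.7,
        "provider-c": 87.1,
    }

    def pick_provider(metrics: dict[str, float]) -> str:
        """Route the next request to the provider with the best current value."""
        return max(metrics, key=metrics.get)

    print(pick_provider(latest_tokens_per_second))  # provider-b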
diff --git a/docs/api/unify_api.rst b/docs/concepts/unify_api.rst
similarity index 100%
rename from docs/api/unify_api.rst
rename to docs/concepts/unify_api.rst
diff --git a/docs/interfaces/building_router.rst b/docs/console/building_router.rst
similarity index 100%
rename from docs/interfaces/building_router.rst
rename to docs/console/building_router.rst
diff --git a/docs/interfaces/connecting_stack.rst b/docs/console/connecting_stack.rst
similarity index 100%
rename from docs/interfaces/connecting_stack.rst
rename to docs/console/connecting_stack.rst
diff --git a/docs/interfaces/running_benchmarks.rst b/docs/console/running_benchmarks.rst
similarity index 100%
rename from docs/interfaces/running_benchmarks.rst
rename to docs/console/running_benchmarks.rst
diff --git a/docs/demos/demos b/docs/demos/demos
deleted file mode 160000
index edd6e05..0000000
--- a/docs/demos/demos
+++ /dev/null
@@ -1 +0,0 @@
-Subproject commit edd6e0506891c288182331b3d9ea9b792276db88
diff --git a/docs/demos/langchain.rst b/docs/demos/langchain.rst
deleted file mode 100644
index ce52428..0000000
--- a/docs/demos/langchain.rst
+++ /dev/null
@@ -1,17 +0,0 @@
-LangChain Examples
-==================
-
-.. grid:: 1 1 3 3
-    :gutter: 4
-
-    .. grid-item-card:: Langchain RAG Playground
-        :link: ./demos/LangChain/RAG_playground/README.md
-
-        Retrieval Augmented Generation with Langchain & Unify.
-
-.. toctree::
-    :hidden:
-    :maxdepth: -1
-    :caption: LangChain Examples
-
-    ./demos/LangChain/RAG_playground/README.md
diff --git a/docs/demos/llamaindex.rst b/docs/demos/llamaindex.rst
deleted file mode 100644
index 33e7577..0000000
--- a/docs/demos/llamaindex.rst
+++ /dev/null
@@ -1,24 +0,0 @@
-LlamaIndex Examples
-==================
-
-.. grid:: 1 1 3 3
-    :gutter: 4
-
-    .. grid-item-card:: LlamaIndex Basic Usage
-        :link: ./demos/LlamaIndex/BasicUsage/unify.ipynb
-
-        Learn how to use the LlamaIndex-Unify Integration.
-
-    .. grid-item-card:: LlamaIndex RAG Playground
-        :link: ./demos/LlamaIndex/RAGPlayground/README.md
-
-        Retrieval Augmented Generation Playground built with LlamaIndex.
-
-
-.. toctree::
-    :hidden:
-    :maxdepth: -1
-    :caption: LlamaIndex Examples
-
-    ./demos/LlamaIndex/RAGPlayground/README.md
-    ./demos/LlamaIndex/BasicUsage/unify.ipynb
diff --git a/docs/demos/unify.rst b/docs/demos/unify.rst
deleted file mode 100644
index e809117..0000000
--- a/docs/demos/unify.rst
+++ /dev/null
@@ -1,48 +0,0 @@
-Python Package Examples
-==================
-
-.. grid:: 1 1 3 3
-    :gutter: 4
-
-    .. grid-item-card:: Building a ChatBot
-        :link: ./demos/Unify/ChatBot/ChatBot.ipynb
-
-        An interactive chatbot application.
-
-    .. grid-item-card:: Synchronous vs Asynchronous Clients
-        :link: ./demos/Unify/AsyncVsSync/AsyncVsSync.ipynb
-
-        Exploring Sync vs Async Clients: Usage and Differences.
-
-    .. grid-item-card:: LLM Wars
-        :link: ./demos/Unify/LLM-Wars/README.md
-
-        LLMs face off in a Streamlit app, asking each other tough questions.
-
-    .. grid-item-card:: Semantic Router
-        :link: ./demos/Unify/SemanticRouter/README.md
-
-        LLM Routing based on semantic similarity.
-
-    .. grid-item-card:: ChatBot Arena
-        :link: ./demos/Unify/Chatbot_Arena/README.md
-
-        Ask any question to two anonymous LLMs and vote for the better one!
-
-    .. grid-item-card:: LLM Debate App
-        :link: ./demos/Unify/LLM_Debate/README.md
-
-        Provide a topic and watch two LLMs debate on it.
-
-
-.. toctree::
-    :hidden:
-    :maxdepth: -1
-    :caption: Python Package Examples
-
-    ./demos/Unify/ChatBot/ChatBot.ipynb
-    ./demos/Unify/AsyncVsSync/AsyncVsSync.ipynb
-    ./demos/Unify/LLM-Wars/README.md
-    ./demos/Unify/SemanticRouter/README.md
-    ./demos/Unify/Chatbot_Arena/README.md
-    ./demos/Unify/LLM_Debate/README.md
\ No newline at end of file
diff --git a/docs/index.rst b/docs/index.rst
index e9a33d8..1ab0744 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -12,11 +12,11 @@
 .. toctree::
    :hidden:
    :maxdepth: -1
-   :caption: API
+   :caption: Concepts
 
-   api/unify_api.rst
-   api/benchmarks.rst
-   api/router.rst
+   concepts/unify_api.rst
+   concepts/benchmarks.rst
+   concepts/router.rst
 
 .. reference/images.rst
 
@@ -25,27 +25,18 @@
    :template: top_level_toc_recursive.rst
    :recursive:
    :hide-table:
-   :caption: Python Client Docs
+   :caption: API
 
    unify
 
-.. toctree::
-   :hidden:
-   :maxdepth: 4
-   :caption: Demos
-
-   demos/unify.rst
-   demos/langchain.rst
-   demos/llamaindex.rst
 
 .. toctree::
    :hidden:
    :maxdepth: -1
-   :caption: Interfaces
+   :caption: Console
 
-   interfaces/connecting_stack.rst
-   interfaces/running_benchmarks.rst
-   interfaces/building_router.rst
+   console/connecting_stack.rst
+   console/running_benchmarks.rst
+   console/building_router.rst
 
 ..
    .. toctree::
@@ -56,16 +47,6 @@
    tools/openapi.rst
    tools/python_library.rst
 
-.. toctree::
-   :hidden:
-   :maxdepth: -1
-   :caption: Concepts
-
-   concepts/endpoints.rst
-   concepts/benchmarks.rst
-   concepts/routing.rst
-.. concepts/on_prem_images.rst
-
 .. toctree::
    :hidden:
    :maxdepth: -1
 
    on_prem/on_prem_access
    on_prem/sso.rst
-
-