
Releases: llm-d/llm-d-benchmark

v0.6.0

04 May 20:24
41b6d9a


What's Changed

  • Full conversion to Python, with a new CLI and a new declarative specification language for experiment description
    • A plugin architecture makes adding new stages to the lifecycle fluent and scalable for future features.
    • User experience was enhanced with much more meaningful logging and message display.
    • Extensive health checking during and at the end of the deployment.
  • New standup method available: "Fast Model Actuator" (FMA)
    • Fast Model Actuation (FMA) is a Kubernetes-native system for efficiently managing LLM inference servers that reduces model startup latency from minutes to seconds. FMA uses two techniques: vLLM sleep/wake, where model instances move tensors from GPU to CPU memory, freeing accelerator resources while keeping the process alive for rapid wake-up; and model swapping, where a persistent launcher process handles initialization upfront so instances can be swapped without full cold starts (see the sketch after this list).
  • Significant improvements to performance data collection, including related changes to the benchmark report
    • "Time-series" metrics in version 0.2 of the benchmark reports now include both a statistics summary and a link to the raw collected data in CSV format.
  • Tighter integration with the Workload Variant Autoscaler (WVA), including the ability to deploy multiple models in the same namespace as defined within a scenario. In the same vein, one or more stacks in the scenario can be deployed and torn down based on user preference.
  • Ability to provide different parameters to the vLLM process on different pods (by using the LeaderWorkerSet (LWS) Kubernetes API).
    • Allow filling in stack details from a YAML file from the harness pod.
    • Assorted corrections and robustness improvements.
  • The "capacity planner" and "configuration explorer" are now part of a new project: https://github.com/llm-d-incubation/llm-d-planner
  • Strongly enhanced development constructs, including pre-commit hooks and CI/CD checks that safeguard existing library patterns and functionality.
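
A minimal sketch of the vLLM sleep/wake technique that FMA builds on, using vLLM's offline LLM API. The model name and timing are illustrative; FMA's actual orchestration is Kubernetes-native and considerably more involved:

```python
# Minimal sketch of vLLM sleep/wake (assumes a vLLM version with sleep-mode
# support and an available GPU; the model name is illustrative).
from vllm import LLM, SamplingParams

# enable_sleep_mode is required for LLM.sleep()/wake_up() to work.
llm = LLM(model="facebook/opt-125m", enable_sleep_mode=True)

params = SamplingParams(max_tokens=32)
print(llm.generate(["Hello, "], params)[0].outputs[0].text)

# Level-1 sleep offloads weights to CPU memory and discards the KV cache,
# freeing accelerator memory while the process stays alive.
llm.sleep(level=1)

# ... the GPU is available to another model instance here ...

llm.wake_up()  # reload weights from CPU memory, much faster than a cold start
print(llm.generate(["Hello again, "], params)[0].outputs[0].text)
```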

Regular Contributors to this release

New Contributors


v0.5.0

07 Mar 00:36
135f61a


What's Changed

  • All well-lit paths deployable via standup.sh
    • This includes wide-ep-lws and workload-autoscaling
  • Operational improvements
    • Add a "benchmark runner" (run_only.sh) for pre-existing stacks (a key piece in integrating with well-lit paths on the main repository)
    • Additional improvements to the automatic setup of pods (via the "preprocess" utility set_llmdbench_environment.py) in clusters where RoCE/CDR/InfiniBand (and LWS) are used.
    • Expanded early detection of crashed pods to include both "gaie-epp" and "inference-gateway" pods.
    • A new standup step, "ensure admin prerequisites" (02), concentrates the installation of all cluster-wide prerequisites (e.g., istio, gateway, lws)
    • priorityClassName support for the standalone and modelservice standup methods
    • Fixes to allow spaces in directory paths
    • Significant improvements to the benchmark report
      • Version 0.2 of the benchmark reports now supports "time-series" metrics.
      • Allow filling in stack details from a YAML file from the harness pod.
      • Assorted corrections and robustness improvements.
  • Configuration Explorer:
    • GPU recommender (using llm-optimizer's roofline analysis)
    • Use Hugging Face's latest API to get safetensors metadata
    • Add architecture-aware activation memory estimation to capacity planner
    • Allow calculation of the maximum max-model-len given memory constraints (sketched below)
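
A hypothetical sketch of the arithmetic behind such a max-model-len estimate: the KV cache for one token costs 2 (K and V) x layers x KV heads x head dimension x dtype size, and the largest context is whatever fits in the memory left after weights and runtime overhead. All numbers below are illustrative; the actual Configuration Explorer implementation may differ:

```python
# Hypothetical sketch; the actual Configuration Explorer logic may differ.

def max_model_len(gpu_mem_gib: float, weight_gib: float, overhead_gib: float,
                  num_layers: int, num_kv_heads: int, head_dim: int,
                  kv_dtype_bytes: int = 2) -> int:
    """Largest context length whose KV cache fits in the memory left
    after model weights and a fixed runtime overhead."""
    # Per-token KV cache: 2 (K and V) * layers * kv_heads * head_dim * dtype size
    kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * kv_dtype_bytes
    free_bytes = (gpu_mem_gib - weight_gib - overhead_gib) * 1024**3
    return max(0, int(free_bytes // kv_bytes_per_token))

# Illustrative numbers for an 8B-class model (32 layers, 8 KV heads,
# head_dim 128, fp16 KV cache) on an 80 GiB GPU with ~16 GiB of weights
# and ~6 GiB of runtime overhead:
print(max_model_len(80, 16, 6, num_layers=32, num_kv_heads=8, head_dim=128))
```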

Regular Contributors to this release

New Contributors

Full Changelog: v0.4.0...v0.5.0

v0.4.0

19 Dec 19:07
33f1108


What's Changed

  • Support for all well-lit paths directly in standup
    • Each well-lit path now has both a scenario and an experiment (the only exception is the experiment for wide-ep-lws)
  • Example scenarios for gpu, aiu (Spyre), cpu-only, and simulated
    • Deployable as both “standalone” (i.e., vanilla vLLM) and llm-d
  • Ability to deploy multiple load-generating pods
    • Controlled via environment variable or command-line parameter
  • Tight integration with WVA (Workload Variant Autoscaler)
    • When enabled and deployed via standup, WVA configuration is done automatically
  • Operational improvements
    • Significant improvements to the automatic setup of pods in clusters where RoCE/CDR/InfiniBand (and LWS) are used.
    • All PVCs are now read-only
      • Better support for large model caching and sharing in a single cluster
    • Early detection of crashed pods
      • Whenever a deployment is attempted with incorrect parameters, crashing pods are immediately detected and the standup stops
    • Full support for (newer) istio and kgateway
      • We no longer have to rely on an alpha version of istio; the latest istio release can be (and is) used with the benchmark
  • Configuration Explorer:
    • Quantization is now considered in the capacity planner (see the sketch after this list); various bugs fixed
    • UI to parse available results according to specific SLOs/criteria fully integrated (InferenceMAX)
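
A hypothetical sketch of how quantization can feed into a capacity check: weight footprint is roughly parameter count times bits per parameter, compared against available GPU memory with some headroom. The actual capacity planner logic may differ:

```python
# Hypothetical sketch; the actual capacity planner logic may differ.

BITS = {"fp16": 16, "bf16": 16, "fp8": 8, "int8": 8, "int4": 4}

def weight_gib(num_params_b: float, dtype: str) -> float:
    """Approximate weight footprint in GiB for a model with
    num_params_b billion parameters stored at the given precision."""
    return num_params_b * 1e9 * BITS[dtype] / 8 / 1024**3

def fits(num_params_b: float, dtype: str, gpu_mem_gib: float,
         headroom: float = 0.9) -> bool:
    """Rough check: do the weights alone fit within a memory headroom?"""
    return weight_gib(num_params_b, dtype) <= gpu_mem_gib * headroom

# A 70B model does not fit on a single 80 GiB GPU in fp16, but int4 does:
print(fits(70, "fp16", 80))  # False (~130 GiB of weights)
print(fits(70, "int4", 80))  # True  (~33 GiB of weights)
```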

Regular Contributors to this release

New Contributors

Full Changelog: v0.3.0...v0.4.0

v0.3.0

10 Oct 16:17
0fd73c0


What's Changed

  • Full support for "experiments" (design of experiments)
    • Each "well-lit" path now has both an "experiment" file (accessible via execution of e2e.sh) and a scenario (accessible via execution of both e2e.sh and standup.sh/teardown.sh).
    • All scenarios tested, and an initial experimental dataset collected and made available. The exception at this point is the "wide-ep-lws", slated for the next release
  • Code conversion (Bash to Python)
  • Better support for executing the benchmark load-generating phase (run.sh) against pre-deployed stacks.
    • Automatically detect the current namespace, llm-d stack URL, and served model name (see the sketch after this list).
    • Do not require a Hugging Face token when generating load.
    • Generate the standardized benchmark report taking into account that the stack was pre-deployed and not all deployment parameters are available.
  • Benchmark report generation and data analysis
  • Documentation overhaul
  • Publicly available experimental data.
  • Configuration Explorer
    • The number of parameters required to successfully deploy a model served by an llm-d stack, while making efficient use of scarce resources such as GPUs, pointed to the need for a mechanism to help users avoid obvious "dead ends" (i.e., standup scenarios bound to fail due to lack of resources)
    • The Configuration Explorer is a standalone tool which provides two main functionalities:
      • "capacity planner": given certain input parameters, will the llm-d stack be even capable of serving a model?
      • "configuration sweeper": given certain input parameters and workload parameters, what is the maximum/average recorded performance?
    • The "capacity planner" is presently available as an stand-alone UI and also as library fully integrated on the benchmark lifecycle (e.g., standup.sh).
  • Initial support for multiple models with modelservice
  • More extensive CI/CD
    • Run full tests, exercising all standup methods, whenever a PR is opened
    • Test every single standup method and harness nightly.
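
A hypothetical sketch of the kind of auto-detection run.sh performs against a pre-deployed stack: an in-cluster pod can read its namespace from the service account mount, and an OpenAI-compatible endpoint lists its served models under /v1/models. Paths and URLs follow Kubernetes and OpenAI-API conventions; the actual implementation may differ:

```python
# Hypothetical sketch; the actual run.sh detection logic may differ.
import json
import urllib.request

NS_FILE = "/var/run/secrets/kubernetes.io/serviceaccount/namespace"

def current_namespace(default: str = "default") -> str:
    """In-cluster pods can read their namespace from the service account mount."""
    try:
        with open(NS_FILE) as f:
            return f.read().strip()
    except OSError:
        return default  # not running inside a cluster

def served_model_name(stack_url: str) -> str:
    """An OpenAI-compatible endpoint lists its served models under /v1/models."""
    with urllib.request.urlopen(f"{stack_url}/v1/models") as resp:
        return json.load(resp)["data"][0]["id"]

print(current_namespace())
# e.g., served_model_name("http://llm-d-gateway.example.svc:80")  # URL illustrative
```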

Regular Contributors to this release

New Contributors

Full Changelog: v0.2.9...v0.3.0

v0.2.0

23 Jul 12:56
10938a1


What's Changed

  • [Run] feat: add guidellm as a new harness. by @maugustosilva
  • [Setup] feat: add support for llm-d-modelservice by @kalantar
  • [Setup/Run] feat: add "parameter sweep" example by @namasl

New Contributors

Full Changelog: v0.1.10...v0.2.0