Releases · llm-d/llm-d-benchmark
v0.6.0
What's Changed
- Full conversion to `python`, with a new CLI and a new declarative specification language for experiment description.
- Plugin architecture makes adding new stages to the lifecycle fluent and scalable for future features.
- User experience was enhanced with much more meaningful logging and message display.
- Extensive health checking during and at the end of the deployment.
- New standup method available: "Fast Model Actuator" (FMA)
  - Fast Model Actuation (FMA) is a Kubernetes-native system for efficiently managing LLM inference servers, reducing model startup latency from minutes to seconds. FMA uses two techniques (a minimal sleep/wake sketch follows this list):
    - vLLM sleep/wake: model instances move tensors from GPU to CPU memory, freeing accelerator resources while keeping the process alive for rapid wake-up.
    - Model swapping: a persistent launcher process handles initialization upfront, so instances can be swapped without full cold starts.
- Significant improvements to performance data collection, including relevant changes to the benchmark report.
- "Time-series" metrics on version 0.2 of the benchmark reports now include both statics summarization and link to raw collected data on csv format.
- Tighter integration with the Workload Variant Autoscaler (WVA), including the ability to deploy multiple models in the same namespace, as defined within a scenario. In the same vein, one or more stacks in the scenario can be deployed and torn down based on user preference.
- Ability to provide different parameters for the `vllm` process on different `pods` (by using the `LeaderWorkerSet` (LWS) Kubernetes API).
- Allow filling in stack details from a YAML file from the harness pod.
- Assorted corrections and robustness improvements.
- The "capacity planner" and "configuration explorer" are now part of a new project: https://github.com/llm-d-incubation/llm-d-planner
- Strongly enhanced development constructs, including pre-commit and CI/CD checks that safeguard existing library patterns and functionality.
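
The vLLM sleep/wake technique behind FMA can be illustrated with vLLM's offline `LLM` API, which exposes `sleep()`/`wake_up()` when sleep mode is enabled. This is a minimal sketch, assuming a recent vLLM and an available GPU; the model and sleep level are illustrative choices, not taken from llm-d-benchmark itself.

```python
# Minimal sketch of vLLM sleep/wake (the first FMA technique above).
# Assumes a recent vLLM with sleep mode; model and level are illustrative.
from vllm import LLM

llm = LLM(model="facebook/opt-125m", enable_sleep_mode=True)
print(llm.generate(["Hello, my name is"])[0].outputs[0].text)

# Level-1 sleep offloads weights to CPU memory and frees GPU memory,
# while the process stays alive for a rapid wake-up.
llm.sleep(level=1)

# ... the accelerator can be handed to another instance here ...

llm.wake_up()  # restore weights to the GPU and resume serving
print(llm.generate(["The capital of France is"])[0].outputs[0].text)
```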
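For the time-series data mentioned above, here is a short sketch of consuming the linked raw CSV with pandas to reproduce the statistics summarization; the file name is an illustrative assumption about the export layout, not the report's actual path.

```python
# Hypothetical sketch: summarize the raw time-series CSV linked from a
# v0.2 benchmark report. The file name is an illustrative assumption.
import pandas as pd

df = pd.read_csv("results/timeseries_metrics.csv")
print(df.describe(percentiles=[0.5, 0.9, 0.99]))  # per-metric statistics
```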
Regular Contributors to this release
- @namasl
- @kalantar
- @Vezio
- @jgchn
- @mengmeiye
- @deanlorenz
- @dmitripikus
- @manoelmarques
- @achandrasekar
- @jjk-g
- @maugustosilva
New Contributors
- @DolevAdas made their first contribution in #742
- @michael-desmond made their first contribution in #853
- @jia-gao made their first contribution in #859
- @ruocco made their first contribution in #867
- @adinilfeld made their first contribution in #874
- @Luka-D made their first contribution in #917
- @forfreedomforrich-eng made their first contribution in #899
- @Copilot made their first contribution in #951
- @aavarghese made their first contribution in #995
What's Changed
- 🌱 Remove per-repo gh-aw typo/link/upstream workflows by @clubanderson in #778
- ⬆️ Bump yq from v4.45.4 to v4.45.5 by @github-actions[bot] in #748
- Fix logs for new vllm on nop harness by @manoelmarques in #781
- Add memory and cache metrics #2 by @DolevAdas in #742
- [Experimental] Add a new production trace replay for real-world multi-turn chat workflow by @achandrasekar in #761
- Update GAIE InferencePool v1.3.0 to v1.3.1 by @diegocastanibm in #830
- Fix partial metrics by @mengmeiye in #834
- update istio by @diegocastanibm in #840
- update vllm by @diegocastanibm in #837
- update yq by @diegocastanibm in #836
- update inferecemax by @diegocastanibm in #835
- update kgateway by @diegocastanibm in #839
- update helmfile to v1.4.1 by @diegocastanibm in #832
- update wva by @diegocastanibm in #838
- update inference-perf by @diegocastanibm in #833
- v0.5.3 tagged release by @diegocastanibm in #831
- [Standup] Add the ability to use initContainers. by @maugustosilva in #851
- [Standup] Additional fixes (accelerator automatic selection) by @maugustosilva in #852
- 🌱 Add missing governance files per CNCF audit by @clubanderson in #783
- Feat/small cluster config by @michael-desmond in #853
- [Standup] Consolidate all sim scenarios (with small gateway pod) by @maugustosilva in #856
- Fix metrics scrape by @mengmeiye in #854
- Fix standalone preprocess env. variable by @manoelmarques in #860
- Epp log scrape by @mengmeiye in #855
- [Run] Add --repeat flag to repeat experiments N times with aggregation by @jia-gao in #859
- remove accessLogging for helm chart schema validation error by @mengmeiye in #861
- workload, inference-perf: increase tokens in sanity check. by @ruocco in #867
- Stack discovery tool by @namasl in #762
- AI generated scenarios POC by @kalantar in #674
- [Standup] Fix for GKE with new `v0.5.1` `llm-d-cuda` image by @maugustosilva in #868
- [Run] Add pre/post workload hooks to run_only.sh by @jia-gao in #873
- Declarative Python Package by @Vezio in #848
- Use --serviceaccount value when creating model verification pod by @adinilfeld in #874
- [Docs] Add Note for Previous Library by @Vezio in #875
- feat: introduce harness namespace step in run sequence by @adinilfeld in #877
- fix pd-disaggregation by @mengmeiye in #878
- feat: Fix secret for monitoring epp by @Vezio in #879
- fix: Extract IP values through standard status.addresses object lookups by @adinilfeld in #883
- fix template for pd-disaggregation by @mengmeiye in #884
- fix: Configuration file concatenation bugs and Crane resolution fallback by @adinilfeld in #888
- docs: Add inline comments providing recommended storage classes by @adinilfeld in #887
- Fix: Re-Enable CICD via Kind Deployment for PRs by @Vezio in #885
- Remove unneeded config explorer components, consolidate analysis notebook by @namasl in #886
- feat: Split CI benchmark into parallel standalone and modelservice jobs by @Vezio in #889
- Fix harness metadata loss from subshell variable scoping by @Vezio in #880
- [Run] Updated trivy scanner version by @maugustosilva in #890
- Remove redundant metrics by @mengmeiye in #891
- Add GCS results metadata injection skill and standardize skills directory by @adinilfeld in #895
- Remove public IP address from Gemini skill by @adinilfeld in #896
- Auto-provision RBAC and enable pod-native auth for run-only mode by @adinilfeld in #894
- add metrics stat to benchmark report v0.2 by @mengmeiye in #897
- Enhance Smoketest Cleanup and Document ModelService Protocols by @Vezio in #901
- feature: Allows ModelService K8 Manifests to be Rendered BEFORE being Applied by @Vezio in #902
- fix: Removes Redundant Version References by @Vezio in #903
- fix: Render K8 Manifests for MS Early in Plan Phase by @Vezio in #907
- fix(standup): skip accelerator validators for CPU-only scenarios by @Vezio in #911
- fix the detection of whether a cluster is an OpenShift one by @mengmeiye in #913
- fix: Update CPU Scenario by @Vezio in #912
- fix: Standalone Rendering by @Vezio in #916
- fix: Brings Back Pre Commit and Updates Getting Started by @Vezio in https://github.co...
v0.5.0
What's Changed
- All well-lit paths deployable via `standup.sh`
  - This includes `wide-ep-lws` and `workload-autoscaling`
- Operational improvements
- Add a "benchmark runner" (run_only.sh), for pre-existing stacks (key piece in integrating with well-lit paths on the main repository
  - Additional improvements to the automatic setup of `pods` (via the "preprocess" utility `set_llmdbench_environment.py`) in clusters where RoCE/CDR/Infiniband (and LWS) are used.
  - Expanded early detection of crashed `pods` to include both "gaie-epp" and "inference-gateway" ones.
  - A new standup step, "ensure admin prerequisites" (02), concentrates the installation of all cluster-wide prerequisites (e.g., `istio`, `gateway`, `lws`).
  - `priorityClassName` support on `standalone` and `modelservice` methods for standup.
  - Fixes to allow spaces in directory paths.
- Significant improvements to the benchmark report
  - Version 0.2 of the benchmark reports now supports "time-series" metrics.
- Allow filling in stack details from a YAML file from the harness pod.
- Assorted corrections and robustness improvements.
- Configuration Explorer:
  - GPU recommender (using llm-optimizer's roofline analysis)
  - Use Hugging Face's latest API to get safetensors metadata
  - Add architecture-aware activation memory estimation to the capacity planner
  - Allow calculation of the maximum `max-model-len` given memory constraints (see the sketch after this list)
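
To make the last point concrete, here is a hedged back-of-the-envelope sketch of how a maximum `max-model-len` can be derived from a memory budget. It uses the standard KV-cache estimate (bytes per token = 2 × layers × KV heads × head dim × dtype size); the concrete shapes and function name are illustrative assumptions, not the capacity planner's actual code.

```python
# Hypothetical sketch: largest max-model-len that fits a GPU memory
# budget, via the standard KV-cache formula. Shapes below are
# Llama-3-8B-like and illustrative, not taken from the planner.
def max_model_len_for_budget(
    free_gpu_bytes: int,
    num_layers: int = 32,
    num_kv_heads: int = 8,
    head_dim: int = 128,
    dtype_bytes: int = 2,  # fp16/bf16
) -> int:
    # Each token stores one key and one value vector per layer.
    kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return free_gpu_bytes // kv_bytes_per_token

# E.g., with 20 GiB left over after weights and activations:
print(max_model_len_for_budget(20 * 1024**3))  # ~163k tokens for these shapes
```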
Regular Contributors to this release
- @namasl
- @kalantar
- @Vezio
- @jgchn
- @mengmeiye
- @deanlorenz
- @dmitripikus
- @manoelmarques
- @achandrasekar
- @jjk-g
- @maugustosilva
New Contributors
- @rshavitt made their first contribution in #584
- @huaxig made their first contribution in #635
- @gushob21 made their first contribution in #640
- @vknaik made their first contribution in #642
- @dependabot[bot] made their first contribution in #656
- @shashwatj07 made their first contribution in #687
Full Changelog: v0.4.0...v0.5.0
v0.4.0
What's Changed
- Support for all well-lit paths directly in standup
  - Each well-lit path now has both a scenario and an experiment (the only exception is an experiment for `wide-ep-lws`)
- Example scenarios for gpu, aiu (Spyre), cpu-only and simulated
- Deployable as both “standalone” (i.e., vanilla vLLM) and llm-d
- Ability to deploy multiple load-generating pods
  - Controlled via environment variable or command-line parameter
- Tight integration with WVA (Workload Variant Autoscaler)
  - When enabled and deployed via standup, WVA configuration is done automatically
- Operational improvements
  - Significant improvements to the automatic setup of pods in clusters where RoCE/CDR/Infiniband (and LWS) are used.
  - All PVCs are now read-only
    - Better support for large model caching and sharing in a single cluster
  - Early detection of crashed pods (see the sketch after this list)
    - Whenever a deployment is attempted with wrong parameters, crashing pods are immediately detected and the standup stops
  - Full support for (newer) istio and kgateway
    - We no longer have to rely on an alpha version of istio; the latest istio can be, and is, used with the benchmark
- Configuration Explorer:
  - Quantization considered in the capacity planner; fixed various bugs
  - UI to parse available results according to specific SLOs/criteria fully integrated
  - InferenceMAX
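
As a hedged illustration of the early crash detection idea above, the sketch below polls pod container statuses with the official Kubernetes Python client and fails fast on crash-style waiting reasons; the function name and namespace are illustrative assumptions, not llm-d-benchmark's actual implementation.

```python
# Hypothetical sketch of early crash detection, assuming the official
# kubernetes Python client; not llm-d-benchmark's actual code.
from kubernetes import client, config

def find_crashed_pods(namespace: str) -> list[str]:
    config.load_kube_config()  # or load_incluster_config() inside a pod
    v1 = client.CoreV1Api()
    crashed = []
    for pod in v1.list_namespaced_pod(namespace).items:
        for cs in pod.status.container_statuses or []:
            waiting = cs.state.waiting
            if waiting and waiting.reason in ("CrashLoopBackOff", "ImagePullBackOff"):
                crashed.append(f"{pod.metadata.name}/{cs.name}: {waiting.reason}")
    return crashed

if __name__ == "__main__":
    bad = find_crashed_pods("llm-d-benchmark")  # namespace is illustrative
    if bad:
        raise SystemExit("Standup aborted, crashed pods:\n" + "\n".join(bad))
```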
Regular Contributors to this release
- @namasl
- @kalantar
- @Vezio
- @jgchn
- @mengmeiye
- @deanlorenz
- @dmitripikus
- @manoelmarques
- @achandrasekar
- @maugustosilva
New Contributors
- @sjmonson made their first contribution in #479
- @jjk-g made their first contribution in #510
- @effi-ofer made their first contribution in #523
- @NaomiEisen made their first contribution in #546
- @galmasi made their first contribution in #559
- @sagearc made their first contribution in #557
Full Changelog: v0.3.0...v0.4.0
v0.3.0
What's Changed
- Full support for "experiments" (design of experiments)
- Each "well-lit" path now has both an "experiment" file (accessible via execution of
e2e.sh) and a scenario (accessible via execution of bothe2e.shandstandup.sh/teardown.sh). - All scenarios tested, and an initial experimental dataset collected and made available. The exception at this point is the "wide-ep-lws", slated for the next release
- Each "well-lit" path now has both an "experiment" file (accessible via execution of
- Code conversion (`bash` to `python`)
  - Individual standup [steps](https://github.com/llm-d/llm-d-benchmark/tree/main/setup/steps) `0,1,2,3,4,6,7,8` converted from `bash` to `python`
- Individual standup
- Better support for the execution of the benchmark load-generating phase (`run.sh`) against pre-deployed stacks (see the sketch after this list).
  - Automatically detect the current `namespace`, the `llm-d` stack URL, and the served model name.
  - Do not require a Hugging Face token when generating load.
  - Generate the standardized benchmark report taking into account that the stack was pre-deployed and that not all deployment parameters are available.
- Benchmark report generation and data analysis
  - The standardized benchmark report had its format refined and updated to accommodate all the different harnesses
  - For each "well-lit" path, a Jupyter analysis notebook (e.g., `analysis_pd`) was created.
- Documentation overhaul
  - Main documentation significantly expanded.
  - Individual components (e.g., the Benchmarking Report and the Configuration Explorer) have their own docs, indexed from the main documentation.
- Publicly available experimental data.
  - Experimental runs were performed for each "well-lit" path, and the data is publicly available on the project's Google Drive
- Configuration Explorer
  - The number of parameters required to successfully deploy a model served by an `llm-d` stack, while making efficient use of scarce resources such as GPUs, pointed to the need for a mechanism to help users avoid obvious "dead ends" (i.e., standup scenarios bound to fail due to lack of resources)
  - The Configuration Explorer is a standalone tool which provides two main functionalities:
    - "capacity planner": given certain input parameters, will the llm-d stack even be capable of serving a model? (a feasibility sketch follows this list)
    - "configuration sweeper": given certain input parameters and workload parameters, what is the maximum/average recorded performance?
  - The "capacity planner" is presently available as a stand-alone UI and also as a library fully integrated into the benchmark lifecycle (e.g., `standup.sh`).
- Initial support for multiple models with modelservice
  - A single stack has multiple models, and each model can be individually accessed via different URLs
  - This capability relies on the llm-d-modelservice standup method
- More extensive CI/CD
  - Run full tests, testing all standup methods, whenever a PR is opened
  - Test every single standup method and harness nightly.
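
As a hedged sketch of the auto-detection described under `run.sh` above: inside a pod, the current namespace can be read from the service-account mount, and an OpenAI-compatible llm-d endpoint reports its served model names via `/v1/models`. The gateway URL and the `"default"` fallback are illustrative assumptions, not llm-d-benchmark's actual defaults.

```python
# Hypothetical sketch of the run.sh auto-detection above; the gateway
# URL below is illustrative, not llm-d-benchmark's actual default.
import json
import pathlib
import urllib.request

NS_FILE = "/var/run/secrets/kubernetes.io/serviceaccount/namespace"

def detect_namespace() -> str:
    path = pathlib.Path(NS_FILE)
    return path.read_text().strip() if path.exists() else "default"

def detect_served_models(stack_url: str) -> list[str]:
    # Any OpenAI-compatible endpoint (as exposed by an llm-d stack)
    # lists its served models under /v1/models.
    with urllib.request.urlopen(f"{stack_url}/v1/models") as resp:
        return [m["id"] for m in json.load(resp)["data"]]

print(detect_namespace())
print(detect_served_models("http://infra-inference-gateway:80"))  # illustrative URL
```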
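And a minimal sketch of the "capacity planner" feasibility question (can the stack serve the model at all?): compare the weight footprint, estimated from the parameter count and dtype width plus headroom for KV cache and activations, against the available accelerator memory. The numbers and the headroom factor are illustrative assumptions, not the planner's real model.

```python
# Hypothetical sketch of the "capacity planner" feasibility check; the
# 20% headroom for KV cache/activations is an illustrative assumption.
def can_serve(num_params: float, dtype_bytes: int, gpu_mem_bytes: int,
              num_gpus: int, headroom: float = 1.2) -> bool:
    weight_bytes = num_params * dtype_bytes
    return weight_bytes * headroom <= gpu_mem_bytes * num_gpus

# An 8B-parameter model in bf16 on a single 80 GiB accelerator:
print(can_serve(8e9, 2, 80 * 1024**3, num_gpus=1))  # True
```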
Regular Contributors to this release
- @namasl
- @kalantar
- @Vezio
- @jgchn
- @deanlorenz
- @manoelmarques
- @achandrasekar
- @yossiovadia
- @pancak3
- @maugustosilva
New Contributors
- @petecheslock made their first contribution in #314
- @mengmeiye made their first contribution in #388
- @Edwinhr716 made their first contribution in #42
Full Changelog: v0.2.9...v0.3.0
v0.2.0
What's Changed
- [Run] feat: add guidellm as a new harness. by @maugustosilva
- [Setup] feat: add support for llm-d-modelservice by @kalantar
- [Setup/Run] feat: add "parameter sweep" example by @namasl
New Contributors
Full Changelog: v0.1.10...v0.2.0