Releases · llm-d/llm-d-benchmark
v0.6.0
What's Changed
- Full conversion to `python`, with a new CLI and a new declarative specification language for experiment description.
- Plugin architecture makes adding new stages to the lifecycle fluent and scalable for future features.
- User experience was enhanced with much more meaningful logging and message display.
- Extensive health checking during and at the end of the deployment.
- New standup method available: "Fast Model Actuator" (FMA)
  - Fast Model Actuation (FMA) is a Kubernetes-native system for efficiently managing LLM inference servers, reducing model startup latency from minutes to seconds. FMA uses two techniques (a minimal sleep/wake sketch follows this list):
    - vLLM sleep/wake: model instances move tensors from GPU to CPU memory, freeing accelerator resources while keeping the process alive for rapid wake-up.
    - Model swapping: a persistent launcher process handles initialization upfront, so instances can be swapped without full cold starts.
- Significant improvements to performance data collection, including relevant changes to the benchmark report.
- "Time-series" metrics on version 0.2 of the benchmark reports now include both statics summarization and link to raw collected data on csv format.
- Tighter integration with the Workload Variant Autoscaler (WVA), including the ability to deploy multiple models in the same namespace, as defined within a scenario. In the same vein, one or more stacks in the scenario can be deployed and torn down based on user preference.
- Ability to provide different parameters for the `vllm` process on different `pods` (by using the `LeaderWorkerSet` (LWS) Kubernetes API).
- Allow filling in stack details from a YAML file from the harness pod.
- Assorted corrections and robustness improvements.
- The "capacity planner" and "configuration explorer" are now part of a new project: https://github.com/llm-d-incubation/llm-d-planner
- Strongly enhanced development constructs, including pre-commit and CI/CD checks that safeguard existing library patterns and functionality.
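
The vLLM sleep/wake technique behind FMA can be illustrated with vLLM's offline `LLM` API, which exposes `sleep()`/`wake_up()` when sleep mode is enabled. This is a minimal sketch, assuming a recent vLLM and an available GPU; the model and sleep level are illustrative choices, not taken from llm-d-benchmark itself.

```python
# Minimal sketch of vLLM sleep/wake (the first FMA technique above).
# Assumes a recent vLLM with sleep mode; model and level are illustrative.
from vllm import LLM

llm = LLM(model="facebook/opt-125m", enable_sleep_mode=True)
print(llm.generate(["Hello, my name is"])[0].outputs[0].text)

# Level-1 sleep offloads weights to CPU memory and frees GPU memory,
# while the process stays alive for a rapid wake-up.
llm.sleep(level=1)

# ... the accelerator can be handed to another instance here ...

llm.wake_up()  # restore weights to the GPU and resume serving
print(llm.generate(["The capital of France is"])[0].outputs[0].text)
```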
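For the time-series data mentioned above, here is a short sketch of consuming the linked raw CSV with pandas to reproduce the statistics summarization; the file name is an illustrative assumption about the export layout, not the report's actual path.

```python
# Hypothetical sketch: summarize the raw time-series CSV linked from a
# v0.2 benchmark report. The file name is an illustrative assumption.
import pandas as pd

df = pd.read_csv("results/timeseries_metrics.csv")
print(df.describe(percentiles=[0.5, 0.9, 0.99]))  # per-metric statistics
```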
Regular Contributors to this release
- @namasl
- @kalantar
- @Vezio
- @jgchn
- @mengmeiye
- @deanlorenz
- @dmitripikus
- @manoelmarques
- @achandrasekar
- @jjk-g
- @maugustosilva
New Contributors
- @DolevAdas made their first contribution in #742
- @michael-desmond made their first contribution in #853
- @jia-gao made their first contribution in #859
- @ruocco made their first contribution in #867
- @adinilfeld made their first contribution in #874
- @Luka-D made their first contribution in #917
- @forfreedomforrich-eng made their first contribution in #899
- @Copilot made their first contribution in #951
- @aavarghese made their first contribution in #995
What's Changed
- 🌱 Remove per-repo gh-aw typo/link/upstream workflows by @clubanderson in #778
- ⬆️ Bump yq from v4.45.4 to v4.45.5 by @github-actions[bot] in #748
- Fix logs for new vllm on nop harness by @manoelmarques in #781
- Add memory and cache metrics #2 by @DolevAdas in #742
- [Experimental] Add a new production trace replay for real-world multi-turn chat workflow by @achandrasekar in #761
- Update GAIE InferencePool v1.3.0 to v1.3.1 by @diegocastanibm in #830
- Fix partial metrics by @mengmeiye in #834
- update istio by @diegocastanibm in #840
- update vllm by @diegocastanibm in #837
- update yq by @diegocastanibm in #836
- update inferecemax by @diegocastanibm in #835
- update kgateway by @diegocastanibm in #839
- update helmfile to v1.4.1 by @diegocastanibm in #832
- update wva by @diegocastanibm in #838
- update inference-perf by @diegocastanibm in #833
- v0.5.3 tagged release by @diegocastanibm in #831
- [Standup] Add the ability to use initContainers. by @maugustosilva in #851
- [Standup] Additional fixes (accelerator automatic selection) by @maugustosilva in #852
- 🌱 Add missing governance files per CNCF audit by @clubanderson in #783
- Feat/small cluster config by @michael-desmond in #853
- [Standup] Consolidate all sim scenarios (with small gateway pod) by @maugustosilva in #856
- Fix metrics scrape by @mengmeiye in #854
- Fix standalone preprocess env. variable by @manoelmarques in #860
- Epp log scrape by @mengmeiye in #855
- [Run] Add --repeat flag to repeat experiments N times with aggregation by @jia-gao in #859
- remove accessLogging for helm chart schema validation error by @mengmeiye in #861
- workload, inference-perf: increase tokens in sanity check. by @ruocco in #867
- Stack discovery tool by @namasl in #762
- AI generated scenarios POC by @kalantar in #674
- [Standup] Fix for GKE with new `v0.5.1` `llm-d-cuda` image by @maugustosilva in #868
- [Run] Add pre/post workload hooks to run_only.sh by @jia-gao in #873
- Declarative Python Package by @Vezio in #848
- Use --serviceaccount value when creating model verification pod by @adinilfeld in #874
- [Docs] Add Note for Previous Library by @Vezio in #875
- feat: introduce harness namespace step in run sequence by @adinilfeld in #877
- fix pd-disaggregation by @mengmeiye in #878
- feat: Fix secret for monitoring epp by @Vezio in #879
- fix: Extract IP values through standard status.addresses object lookups by @adinilfeld in #883
- fix template for pd-disaggregation by @mengmeiye in #884
- fix: Configuration file concatenation bugs and Crane resolution fallback by @adinilfeld in #888
- docs: Add inline comments providing recommended storage classes by @adinilfeld in #887
- Fix: Re-Enable CICD via Kind Deployment for PRs by @Vezio in #885
- Remove unneeded config explorer components, consolidate analysis notebook by @namasl in #886
- feat: Split CI benchmark into parallel standalone and modelservice jobs by @Vezio in #889
- Fix harness metadata loss from subshell variable scoping by @Vezio in #880
- [Run] Updated trivy scanner version by @maugustosilva in #890
- Remove redundant metrics by @mengmeiye in #891
- Add GCS results metadata injection skill and standardize skills directory by @adinilfeld in #895
- Remove public IP address from Gemini skill by @adinilfeld in #896
- Auto-provision RBAC and enable pod-native auth for run-only mode by @adinilfeld in #894
- add metrics stat to benchmark report v0.2 by @mengmeiye in #897
- Enhance Smoketest Cleanup and Document ModelService Protocols by @Vezio in #901
- feature: Allows ModelService K8 Manifests to be Rendered BEFORE being Applied by @Vezio in #902
- fix: Removes Redundant Version References by @Vezio in #903
- fix: Render K8 Manifests for MS Early in Plan Phase by @Vezio in #907
- fix(standup): skip accelerator validators for CPU-only scenarios by @Vezio in #911
- fix the detection of whether a cluster is an OpenShift one by @mengmeiye in #913
- fix: Update CPU Scenario by @Vezio in #912
- fix: Standalone Rendering by @Vezio in #916
- fix: Brings Back Pre Commit and Updates Getting Started by @Vezio in https://github.co...
v0.5.0
What's Changed
- All well-lit paths deployable via `standup.sh`
  - This includes `wide-ep-lws` and `workload-autoscaling`
- Operational improvements
- Add a "benchmark runner" (run_only.sh), for pre-existing stacks (key piece in integrating with well-lit paths on the main repository
  - Additional improvements to the automatic setup of `pods` (via the "preprocess" utility `set_llmdbench_environment.py`) in clusters where RoCE/CDR/Infiniband (and LWS) are used.
  - Expanded early detection of crashed `pods` to include both "gaie-epp" and "inference-gateway" ones.
  - A new standup step, "ensure admin prerequisites" (02), concentrates the installation of all cluster-wide prerequisites (e.g., `istio`, `gateway`, `lws`).
  - `priorityClassName` support on `standalone` and `modelservice` methods for standup.
  - Fixes to allow spaces in directory paths.
- Significant improvements to the benchmark report
  - Version 0.2 of the benchmark reports now supports "time-series" metrics.
- Allow filling in stack details from a YAML file from the harness pod.
- Assorted corrections and robustness improvements.
- Configuration Explorer:
  - GPU recommender (using llm-optimizer's roofline analysis)
  - Use Hugging Face's latest API to get safetensors metadata
  - Add architecture-aware activation memory estimation to the capacity planner
  - Allow calculation of the maximum `max-model-len` given memory constraints (see the sketch after this list)
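
To make the last point concrete, here is a hedged back-of-the-envelope sketch of how a maximum `max-model-len` can be derived from a memory budget. It uses the standard KV-cache estimate (bytes per token = 2 × layers × KV heads × head dim × dtype size); the concrete shapes and function name are illustrative assumptions, not the capacity planner's actual code.

```python
# Hypothetical sketch: largest max-model-len that fits a GPU memory
# budget, via the standard KV-cache formula. Shapes below are
# Llama-3-8B-like and illustrative, not taken from the planner.
def max_model_len_for_budget(
    free_gpu_bytes: int,
    num_layers: int = 32,
    num_kv_heads: int = 8,
    head_dim: int = 128,
    dtype_bytes: int = 2,  # fp16/bf16
) -> int:
    # Each token stores one key and one value vector per layer.
    kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return free_gpu_bytes // kv_bytes_per_token

# E.g., with 20 GiB left over after weights and activations:
print(max_model_len_for_budget(20 * 1024**3))  # ~163k tokens for these shapes
```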
Regular Contributors to this release
- @namasl
- @kalantar
- @Vezio
- @jgchn
- @mengmeiye
- @deanlorenz
- @dmitripikus
- @manoelmarques
- @achandrasekar
- @jjk-g
- @maugustosilva
New Contributors
- @rshavitt made their first contribution in #584
- @huaxig made their first contribution in #635
- @gushob21 made their first contribution in #640
- @vknaik made their first contribution in #642
- @dependabot[bot] made their first contribution in #656
- @shashwatj07 made their first contribution in #687
Full Changelog: v0.4.0...v0.5.0
v0.4.0
What's Changed
- Support for all well-lit paths directly in standup
  - Each well-lit path now has both a scenario and an experiment (the only exception is an experiment for `wide-ep-lws`)
- Example scenarios for gpu, aiu (Spyre), cpu-only and simulated
- Deployable as both “standalone” (i.e., vanilla vLLM) and llm-d
- Ability to deploy multiple load-generating pods
  - Controlled via environment variable or command-line parameter
- Tight integration with WVA (Workload Variant Autoscaler)
  - When enabled and deployed via standup, WVA configuration is done automatically
- Operational improvements
  - Significant improvements to the automatic setup of pods in clusters where RoCE/CDR/Infiniband (and LWS) are used.
  - All PVCs are now read-only
    - Better support for large model caching and sharing in a single cluster
  - Early detection of crashed pods (see the sketch after this list)
    - Whenever a deployment is attempted with wrong parameters, crashing pods are immediately detected and the standup stops
  - Full support for (newer) istio and kgateway
    - We no longer have to rely on an alpha version of istio; the latest istio can be, and is, used with the benchmark
- Configuration Explorer:
  - Quantization considered in the capacity planner; fixed various bugs
  - UI to parse available results according to specific SLOs/criteria fully integrated
  - InferenceMAX
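
As a hedged illustration of the early crash detection idea above, the sketch below polls pod container statuses with the official Kubernetes Python client and fails fast on crash-style waiting reasons; the function name and namespace are illustrative assumptions, not llm-d-benchmark's actual implementation.

```python
# Hypothetical sketch of early crash detection, assuming the official
# kubernetes Python client; not llm-d-benchmark's actual code.
from kubernetes import client, config

def find_crashed_pods(namespace: str) -> list[str]:
    config.load_kube_config()  # or load_incluster_config() inside a pod
    v1 = client.CoreV1Api()
    crashed = []
    for pod in v1.list_namespaced_pod(namespace).items:
        for cs in pod.status.container_statuses or []:
            waiting = cs.state.waiting
            if waiting and waiting.reason in ("CrashLoopBackOff", "ImagePullBackOff"):
                crashed.append(f"{pod.metadata.name}/{cs.name}: {waiting.reason}")
    return crashed

if __name__ == "__main__":
    bad = find_crashed_pods("llm-d-benchmark")  # namespace is illustrative
    if bad:
        raise SystemExit("Standup aborted, crashed pods:\n" + "\n".join(bad))
```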
Regular Contributors to this release
- @namasl
- @kalantar
- @Vezio
- @jgchn
- @mengmeiye
- @deanlorenz
- @dmitripikus
- @manoelmarques
- @achandrasekar
- @maugustosilva
New Contributors
- @sjmonson made their first contribution in #479
- @jjk-g made their first contribution in #510
- @effi-ofer made their first contribution in #523
- @NaomiEisen made their first contribution in #546
- @galmasi made their first contribution in #559
- @sagearc made their first contribution in #557
Full Changelog: v0.3.0...v0.4.0
v0.3.0
What's Changed
- Full support for "experiments" (design of experiments)
- Each "well-lit" path now has both an "experiment" file (accessible via execution of
e2e.sh) and a scenario (accessible via execution of bothe2e.shandstandup.sh/teardown.sh). - All scenarios tested, and an initial experimental dataset collected and made available. The exception at this point is the "wide-ep-lws", slated for the next release
- Each "well-lit" path now has both an "experiment" file (accessible via execution of
- Code conversion (`bash` to `python`)
  - Individual standup [steps](https://github.com/llm-d/llm-d-benchmark/tree/main/setup/steps) `0,1,2,3,4,6,7,8` converted from `bash` to `python`
- Individual standup
- Better support for the execution of the benchmark load-generating phase (`run.sh`) against pre-deployed stacks (see the sketch after this list).
  - Automatically detect the current `namespace`, the `llm-d` stack URL, and the served model name.
  - Do not require a Hugging Face token when generating load.
  - Generate the standardized benchmark report taking into account that the stack was pre-deployed and that not all deployment parameters are available.
- Benchmark report generation and data analysis
  - The standardized benchmark report had its format refined and updated to accommodate all the different harnesses
  - For each "well-lit" path, a Jupyter analysis notebook (e.g., `analysis_pd`) was created.
- Documentation overhaul
  - Main documentation significantly expanded.
  - Individual components (e.g., the Benchmarking Report and the Configuration Explorer) have their own docs, indexed from the main documentation.
- Publicly available experimental data.
  - Experimental runs were performed for each "well-lit" path, and the data is publicly available on the project's Google Drive
- Configuration Explorer
  - The number of parameters required to successfully deploy a model served by an `llm-d` stack, while making efficient use of scarce resources such as GPUs, pointed to the need for a mechanism to help users avoid obvious "dead ends" (i.e., standup scenarios bound to fail due to lack of resources)
  - The Configuration Explorer is a standalone tool which provides two main functionalities:
    - "capacity planner": given certain input parameters, will the llm-d stack even be capable of serving a model? (a feasibility sketch follows this list)
    - "configuration sweeper": given certain input parameters and workload parameters, what is the maximum/average recorded performance?
  - The "capacity planner" is presently available as a stand-alone UI and also as a library fully integrated into the benchmark lifecycle (e.g., `standup.sh`).
- Initial support for multiple models with modelservice
  - A single stack has multiple models, and each model can be individually accessed via different URLs
  - This capability relies on the llm-d-modelservice standup method
- More extensive CI/CD
  - Run full tests, testing all standup methods, whenever a PR is opened
  - Test every single standup method and harness nightly.
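
As a hedged sketch of the auto-detection described under `run.sh` above: inside a pod, the current namespace can be read from the service-account mount, and an OpenAI-compatible llm-d endpoint reports its served model names via `/v1/models`. The gateway URL and the `"default"` fallback are illustrative assumptions, not llm-d-benchmark's actual defaults.

```python
# Hypothetical sketch of the run.sh auto-detection above; the gateway
# URL below is illustrative, not llm-d-benchmark's actual default.
import json
import pathlib
import urllib.request

NS_FILE = "/var/run/secrets/kubernetes.io/serviceaccount/namespace"

def detect_namespace() -> str:
    path = pathlib.Path(NS_FILE)
    return path.read_text().strip() if path.exists() else "default"

def detect_served_models(stack_url: str) -> list[str]:
    # Any OpenAI-compatible endpoint (as exposed by an llm-d stack)
    # lists its served models under /v1/models.
    with urllib.request.urlopen(f"{stack_url}/v1/models") as resp:
        return [m["id"] for m in json.load(resp)["data"]]

print(detect_namespace())
print(detect_served_models("http://infra-inference-gateway:80"))  # illustrative URL
```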
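And a minimal sketch of the "capacity planner" feasibility question (can the stack serve the model at all?): compare the weight footprint, estimated from the parameter count and dtype width plus headroom for KV cache and activations, against the available accelerator memory. The numbers and the headroom factor are illustrative assumptions, not the planner's real model.

```python
# Hypothetical sketch of the "capacity planner" feasibility check; the
# 20% headroom for KV cache/activations is an illustrative assumption.
def can_serve(num_params: float, dtype_bytes: int, gpu_mem_bytes: int,
              num_gpus: int, headroom: float = 1.2) -> bool:
    weight_bytes = num_params * dtype_bytes
    return weight_bytes * headroom <= gpu_mem_bytes * num_gpus

# An 8B-parameter model in bf16 on a single 80 GiB accelerator:
print(can_serve(8e9, 2, 80 * 1024**3, num_gpus=1))  # True
```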
Regular Contributors to this release
- @namasl
- @kalantar
- @Vezio
- @jgchn
- @deanlorenz
- @manoelmarques
- @achandrasekar
- @yossiovadia
- @pancak3
- @maugustosilva
New Contributors
- @petecheslock made their first contribution in #314
- @mengmeiye made their first contribution in #388
- @Edwinhr716 made their first contribution in #42
Full Changelog: v0.2.9...v0.3.0
v0.2.0
What's Changed
- [Run] feat: add guidellm as a new harness. by @maugustosilva
- [Setup] feat: add support for llm-d-modelservice by @kalantar
- [Setup/Run] feat: add "parameter sweep" example by @namasl
New Contributors
Full Changelog: v0.1.10...v0.2.0