Major features:
- Support the dynamic MIG feature; please refer to this document
- Reinstalling HAMi will NOT crash GPU tasks
- All configurations now live in a ConfigMap; you can customize the HAMi installation by modifying its content: see details
Major bug fixes:
- Fix an issue where hami-core gets stuck on tasks using 'cuMallocAsync'
- Fix hami-core getting stuck on images with a high glibc version, such as 'tf-serving:latest'
What's Changed
⬆️ Dependencies
- Bump aquasecurity/trivy-action from 0.28.0 to 0.29.0 by @dependabot in #631
- Bump nvidia/cuda from 12.4.1-base-ubuntu22.04 to 12.6.3-base-ubuntu22.04 in /docker by @dependabot in #676
- Bump actions/upload-artifact from 4.4.3 to 4.5.0 by @dependabot in #717
- Bump docker/build-push-action from 6.9.0 to 6.10.0 by @dependabot in #644
- Bump docker/build-push-action from 6.10.0 to 6.11.0 by @dependabot in #792
🔨 Other Changes
- Fix Kubernetes version string handling by stripping metadata by @Nimbus318 in #623
- Update vGPUmonitor to add dynamic adjustment on core and memory limit by @archlitchi in #624
- feat: support device plugin daemonset update strategy by @devenami in #628
- add ut about schedule policy by @yt-huang in #638
- Fix: Refactor the license based on the approaches used in OpenSearch and ElasticSearch. by @haitwang-cloud in #626
- add ut for the scheduler by @shijinye in #645
- docs(issue-tmpl): add FAQ link to issue templates by @Nimbus318 in #647
- fix: filter device registry to node by @lengrongfu in #639
- Add self-hosted runner by @archlitchi in #659
- fix-example-yaml by @WQL782795 in #667
- update docs by @yangshiqi in #668
- add ut for ascend by @shijinye in #664
- optimization map init in test by @lengrongfu in #678
- Optimize monitor by @for800000 in #683
- fix code lint faild by @lengrongfu in #685
- fix(helm): Add NODE_NAME env var to the vgpu-monitor container from spec.nodeName by @Nimbus318 in #687
- fix vGPUmonitor deviceidx is always 0 by @lengrongfu in #684
- add ut for pkg/scheduler/event.go by @Penguin-zlh in #688
- add ut for nodes by @shijinye in #695
- add license for pkg/scheduler/event_test.go by @Penguin-zlh in #706
- fix: exception happen when creating multiple ascend-gpu pods concurrently by @lijm87 in #575
- add ut for device/nvidia by @shijinye in #657
- add ut for pkg/monitor/nvidia/v0/spec.go by @yt-huang in #670
- Enable Dynamic-mig feature for HAMi by @archlitchi in #708
- Fix chart can not be deployed properly by @archlitchi in #711
- Fix NodeLock issue by @archlitchi in #714
- fix example yaml by @lixd in #709
- add ut for device/cambricon by @shijinye in #712
- Update dynamic mig documents and examples by @archlitchi in #718
- random time may be zero by @shijinye in #697
- fix grafana dashboard and clarify dashboard usage more clearly. by @jiangsanyin in #543
- doc(README): add examples for GPU sharing and update-examples by @xiaoyao in #665
- add ut for github.com/Project-HAMi/HAMi/pkg/scheduler/pod.go by @yt-huang in #673
- Add design document to 'dynamic-mig' feature by @archlitchi in #725
- fix(doc): fix a typo and resolve markdown warnings in the tasklist by @elrondwong in #724
- add ut for pkg/util/nodelock/nodelock.go by @learner0810 in #719
- test: add ut for pkg/version/version.go by @Penguin-zlh in #677
- Update on mig mode by @archlitchi in #726
- Update documents for config & config_cn by @archlitchi in #729
- set PASS_DEVICE_SPECS ENV to device-plugin by @jingzhe6414 in #690
- fix device-plugin-version by @learner0810 in #743
- feat: Return the nodes that failed to be scheduled back to the scheduler by @chaunceyjiang in #746
- fix(log): fix missing log output in nvidiadeviceplugin server by @elrondwong in #735
- support configuration resources limits and requests by @flpanbin in #739
- feat(test): add TestMarshalNodeDevices scenarios by @elrondwong in #747
- print flags for device-plugin and scheduler by @flpanbin in #756
- Fix typos, add more contributors and maintainers. by @yangshiqi in #765
- Add a mind map(Chinese and English) to help understand this project by @oceanweave in #764
- [Docs] update config pages by @windsonsea in #760
- add ut for device-map by @KubeKyrie in #762
- refactor(ci): use go.mod file for Go version in workflows by @yxxhero in #766
- support set log level for device plugin by @flpanbin in #771
- feat: Restart/Upgrade device-plugin will not affect services. by @chaunceyjiang in #767
- add ut nvml devices by @KubeKyrie in #773
- add ut for device-map by @KubeKyrie in #772
- Optimize the time format layout by @learner0810 in #741
- fix: nvidia-device-plugin no version info by @chaunceyjiang in #779
- HAMi supports e2e by @Rei1010 in #775
- Proposal: enable E2E test by @Rei1010 in #633
- add ut for device/iluvatar by @shijinye in #795
- add ut for device/hygon by @shijinye in #787
- add ut for pkg/monitor/nvidia/v1 by @shijinye in #780
- refactor(logging): enhance log messages for device resource counting by @haitwang-cloud in #778
- Enrich pod health check by @Rei1010 in #801
- docs: fix broken link by @lixd in #802
- Optimize the E2E execution logic by @Rei1010 in #803
- optimize MetricsBindAddress to MetricsBindPort by @phoenixwu0229 in #796
- fix: handle the node nil issue & E2E test failure by @haitwang-cloud in #804
- add ut for device/mthreads by @shijinye in #808
- fix: Resolve formatting issue in ConfigMap causing display anomalies by @lixd in #814
- [docs] Update ascend910b-support.md by @windsonsea in #816
- Refine metrics logs by @haitwang-cloud in #817
- Update mig-related logics and refine logs by @archlitchi in #833
- Add 910B4 config to device-configmap for ascend by @lijm87 in #828
- [docs] fix: glibc version requirement in README by @chinaran in #826
- Update HAMi-core for v2.5.0 by @archlitchi in #834
- FIx multi-process device memory count issue by @archlitchi in #835
- bump version to v2.5.0 by @wawa0210 in #836
- Fix CI by @archlitchi in #838
- Fix CI release by @archlitchi in #840
- Fix release ci by @archlitchi in #841
- Fix Dockerfile to make CI pass by @archlitchi in #846
- Fix E2E failure with pod status check by @Rei1010 in #847
- Fix scheduler crash if a 'mig' task running accidentally on a 'hami-core' GPU by @archlitchi in #848
New Contributors
- @yt-huang made their first contribution in #638
- @shijinye made their first contribution in #645
- @WQL782795 made their first contribution in #667
- @yangshiqi made their first contribution in #668
- @for800000 made their first contribution in #683
- @Penguin-zlh made their first contribution in #688
- @lixd made their first contribution in #709
- @jiangsanyin made their first contribution in #543
- @xiaoyao made their first contribution in #665
- @elrondwong made their first contribution in #724
- @learner0810 made their first contribution in #719
- @jingzhe6414 made their first contribution in #690
- @flpanbin made their first contribution in #739
- @oceanweave made their first contribution in #764
- @windsonsea made their first contribution in #760
- @KubeKyrie made their first contribution in #762
- @yxxhero made their first contribution in #766
- @Rei1010 made their first contribution in #775
- @phoenixwu0229 made their first contribution in #796
- @chinaran made their first contribution in #826
Full Changelog: v2.4.1...v2.5.0