Commit ade923f
Alternate, Simplified Async Pull (#137)
* simplified RunCompletionChecker() loop exit
* updated logging
* updated chart to use new flag
* removed previous async files
* changed to sync from async for this test, which may match original intent
* added async puller startup log messages
* fixed omission of setting imageSvc
* updated ephemeral volume yaml to always pull
* added final steps to fully flush the ephemeral volume and verify async pull logic functions properly via fully integrated kind test
* resolved session state bug, reintroduced metrics, updated metrics test
* fixed elapsed calc
* added removal of pull time info after 1 minute expiration
* extracted changes for dev experience PR for issue #143
* operation error metric cleanup
* introduced image size metric in similar pattern to image pull time
* added pull time and size logs, added error handling to image size check (after panic seen locally in kind), added size metrics, added size error count metric
* added integration test for async pull feature
* replaced completionChan and loop with completionFunc
* removed isComplete
* updated timeout to be more verbose
* fixed bug where one call to StartAsyncPuller() in node_server.go was not updated to remove completionChan depth and the build targets still succeeded
* captured comment updates
* implemented change for exposing Image() on PullSession
* added configurability of asyncPullTimeout to helm deployment yaml
* fixed bug resulting from uniform extraction of image string from docker.Named struct
* addressed two comments, replaced use of docker.Named.String() with Puller getter method Puller.ImageWithTag()
1 parent 3d36010 commit ade923f
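
The node_server.go diff below replaces the pullexecutor/mountexecutor indirection with direct calls into the remoteimage and remoteimageasync packages. As a quick orientation, here is a minimal sketch of the resulting call pattern, assuming the signatures that appear in that diff (StartPull, WaitForPull, Pull); the package and function names are illustrative only, not part of the commit:

package sketch

import (
	"context"
	"time"

	"github.com/warm-metal/container-image-csi-driver/pkg/remoteimage"
	"github.com/warm-metal/container-image-csi-driver/pkg/remoteimageasync"
)

// pullImage is a hypothetical helper illustrating the sync/async split:
// when an AsyncPuller was started (async-pull-timeout >= 30s), register the
// pull and wait on the returned session; otherwise pull directly under the
// caller's context, as the diff below does in NodePublishVolume.
func pullImage(ctx context.Context, image string, puller remoteimage.Puller,
	async remoteimageasync.AsyncPuller, timeout time.Duration) error {
	if async != nil {
		// Async mode: the pull is tracked by a session and may outlive this
		// caller's deadline (e.g. kubelet's ~2m), which is the point of the feature.
		session, err := async.StartPull(image, puller, timeout)
		if err != nil {
			return err
		}
		return async.WaitForPull(session, ctx)
	}
	// Sync mode: the pull is bounded by the caller's context.
	return puller.Pull(ctx)
}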

File tree

23 files changed: +746 −572 lines

New GitHub Actions workflow (containerd-async-11mins)

Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+name: containerd-async-11mins
+on:
+  push:
+    branches: [main]
+  pull_request:
+    branches: [main]
+  workflow_dispatch:
+jobs:
+  integration:
+    runs-on: ubuntu-latest
+    steps:
+    - uses: actions/checkout@v4
+    - name: Start a kind cluster with containerd
+      uses: helm/[email protected]
+      with:
+        cluster_name: kind-${{ github.run_id }}
+        kubectl_version: "v1.25.2"
+        config: ./hack/ci/containerd-cluster-conf.yaml
+    - name: Install private registry
+      run: ./hack/ci/setup_private_registry.sh
+    - name: Build image
+      run: ./hack/ci/build.sh
+    - name: Set image version
+      run: |
+        echo "VALUE_FILE=charts/warm-metal-csi-driver/values.yaml" >> "$GITHUB_ENV"
+        echo "IMAGE_TAG=$(git rev-parse --short HEAD)" >> "$GITHUB_ENV"
+        echo "HELM_NAME=wm-csi-integration-tests" >> "$GITHUB_ENV"
+    - name: Install the CSI Driver
+      run: |
+        helm install ${HELM_NAME} charts/warm-metal-csi-driver -n kube-system \
+          -f ${VALUE_FILE} \
+          --set csiPlugin.image.tag=${IMAGE_TAG} \
+          --set enableAsyncPull=true \
+          --wait \
+          --debug
+    - name: Run integration Tests
+      run: ./hack/ci/test.sh
+    - name: Uninstall the CSI Driver
+      run: helm uninstall -n kube-system ${HELM_NAME} --wait

Makefile

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-VERSION ?= v1.1.0
+VERSION ?= v1.2.0
 
 IMAGE_BUILDER ?= docker
 IMAGE_BUILD_CMD ?= buildx

charts/warm-metal-csi-driver/Chart.yaml

Lines changed: 2 additions & 2 deletions
@@ -15,9 +15,9 @@ type: application
 # This is the chart version. This version number should be incremented each time you make changes
 # to the chart and its templates, including the app version.
 # Versions are expected to follow Semantic Versioning (https://semver.org/)
-version: 1.1.0
+version: 1.2.0
 
 # This is the version number of the application being deployed. This version number should be
 # incremented each time you make changes to the application. Versions are not expected to
 # follow Semantic Versioning. They should reflect the version the application is using.
-appVersion: v1.1.0
+appVersion: v1.2.0

charts/warm-metal-csi-driver/templates/nodeplugin.yaml

Lines changed: 2 additions & 2 deletions
@@ -63,8 +63,8 @@ spec:
         {{- if .Values.enableDaemonImageCredentialCache }}
         - --enable-daemon-image-credential-cache
         {{- end }}
-        {{- if .Values.enableAsyncPullMount }}
-        - --async-pull-mount=true
+        {{- if .Values.enableAsyncPull }}
+        - --async-pull-timeout={{ .Values.asyncPullTimeout }}
         {{- end }}
         - "-v={{ .Values.logLevel }}"
         - "--mode=node"

charts/warm-metal-csi-driver/values.yaml

Lines changed: 2 additions & 1 deletion
@@ -5,7 +5,8 @@ kubeletRoot: /var/lib/kubelet
 snapshotRoot: /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs
 logLevel: 4
 enableDaemonImageCredentialCache:
-enableAsyncPullMount: false
+enableAsyncPull: false
+asyncPullTimeout: "10m"
 pullImageSecretForDaemonset:
 
 csiPlugin:

cmd/plugin/main.go

Lines changed: 3 additions & 3 deletions
@@ -51,8 +51,8 @@ var (
 	enableCache = flag.Bool("enable-daemon-image-credential-cache", true,
 		"Whether to save contents of imagepullsecrets of the daemon ServiceAccount in memory. "+
 			"If set to false, secrets will be fetched from the API server on every image pull.")
-	asyncImagePullMount = flag.Bool("async-pull-mount", false,
-		"Whether to pull images asynchronously (helps prevent timeout for larger images)")
+	asyncImagePullTimeout = flag.Duration("async-pull-timeout", 0,
+		"If positive, specifies duration allotted for async image pulls as measured from pull start time. If zero, negative, less than 30s, or omitted, the caller's timeout (usually kubelet: 2m) is used instead of this value. (additional time helps prevent timeout for larger images or slower image pull conditions)")
 	watcherResyncPeriod = flag.Duration("watcher-resync-period", 30*time.Minute, "The resync period of the pvc watcher.")
 	mode = flag.String("mode", "", "The mode of the driver. Valid values are: node, controller")
 	nodePluginSA = flag.String("node-plugin-sa", "csi-image-warm-metal", "The name of the ServiceAccount used by the node plugin.")
@@ -129,7 +129,7 @@ func main() {
 		server.Start(*endpoint,
 			NewIdentityServer(driverVersion),
 			nil,
-			NewNodeServer(driver, mounter, criClient, secretStore, *asyncImagePullMount))
+			NewNodeServer(driver, mounter, criClient, secretStore, *asyncImagePullTimeout))
 	case controllerMode:
 		watcher, err := watcher.New(context.Background(), *watcherResyncPeriod)
 		if err != nil {
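
Note that --async-pull-timeout is registered with flag.Duration, so both the flag and the chart's asyncPullTimeout value use Go duration syntax. A tiny, self-contained illustration (the "10m" literal mirrors the chart default added above; values under 30s fall back to sync mode per the flag's help text):

package main

import (
	"fmt"
	"time"
)

func main() {
	// flag.Duration parses values with time.ParseDuration syntax, e.g. "10m", "900s", "1h30m".
	d, err := time.ParseDuration("10m")
	if err != nil {
		panic(err)
	}
	// The node server only enables async mode for durations of at least 30 seconds.
	fmt.Printf("parsed %v, async mode enabled: %v\n", d, d >= 30*time.Second)
}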

cmd/plugin/node_server.go

Lines changed: 67 additions & 71 deletions
@@ -4,15 +4,14 @@ import (
 	"context"
 	"os"
 	"strings"
+	"time"
 
 	"github.com/container-storage-interface/spec/lib/go/csi"
 	"github.com/containerd/containerd/reference/docker"
-	"github.com/google/uuid"
 	"github.com/warm-metal/container-image-csi-driver/pkg/backend"
 	"github.com/warm-metal/container-image-csi-driver/pkg/metrics"
-	"github.com/warm-metal/container-image-csi-driver/pkg/mountexecutor"
-	"github.com/warm-metal/container-image-csi-driver/pkg/mountstatus"
-	"github.com/warm-metal/container-image-csi-driver/pkg/pullexecutor"
+	"github.com/warm-metal/container-image-csi-driver/pkg/remoteimage"
+	"github.com/warm-metal/container-image-csi-driver/pkg/remoteimageasync"
 	"github.com/warm-metal/container-image-csi-driver/pkg/secret"
 	csicommon "github.com/warm-metal/csi-drivers/pkg/csi-common"
 	"google.golang.org/grpc/codes"
@@ -31,36 +30,36 @@ const (
 
 type ImagePullStatus int
 
-func NewNodeServer(driver *csicommon.CSIDriver, mounter backend.Mounter, imageSvc cri.ImageServiceClient, secretStore secret.Store, asyncImagePullMount bool) *NodeServer {
-	return &NodeServer{
-		DefaultNodeServer:   csicommon.NewDefaultNodeServer(driver),
-		mounter:             mounter,
-		secretStore:         secretStore,
-		asyncImagePullMount: asyncImagePullMount,
-		mountExecutor: mountexecutor.NewMountExecutor(&mountexecutor.MountExecutorOptions{
-			AsyncMount: asyncImagePullMount,
-			Mounter:    mounter,
-		}),
-		pullExecutor: pullexecutor.NewPullExecutor(&pullexecutor.PullExecutorOptions{
-			AsyncPull:          asyncImagePullMount,
-			ImageServiceClient: imageSvc,
-			SecretStore:        secretStore,
-			Mounter:            mounter,
-		}),
-	}
+func NewNodeServer(driver *csicommon.CSIDriver, mounter backend.Mounter, imageSvc cri.ImageServiceClient, secretStore secret.Store, asyncImagePullTimeout time.Duration) *NodeServer {
+	ns := NodeServer{
+		DefaultNodeServer:     csicommon.NewDefaultNodeServer(driver),
+		mounter:               mounter,
+		imageSvc:              imageSvc,
+		secretStore:           secretStore,
+		asyncImagePullTimeout: asyncImagePullTimeout,
+		asyncImagePuller:      nil,
+	}
+	if asyncImagePullTimeout >= time.Duration(30*time.Second) {
+		klog.Infof("Starting node server in Async mode with %v timeout", asyncImagePullTimeout)
+		ns.asyncImagePuller = remoteimageasync.StartAsyncPuller(context.TODO(), 100)
+	} else {
+		klog.Info("Starting node server in Sync mode")
+		ns.asyncImagePullTimeout = 0 // set to default value
+	}
+	return &ns
 }
 
 type NodeServer struct {
 	*csicommon.DefaultNodeServer
-	mounter             backend.Mounter
-	secretStore         secret.Store
-	asyncImagePullMount bool
-	mountExecutor       *mountexecutor.MountExecutor
-	pullExecutor        *pullexecutor.PullExecutor
+	mounter               backend.Mounter
+	imageSvc              cri.ImageServiceClient
+	secretStore           secret.Store
+	asyncImagePullTimeout time.Duration
+	asyncImagePuller      remoteimageasync.AsyncPuller
 }
 
 func (n NodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublishVolumeRequest) (resp *csi.NodePublishVolumeResponse, err error) {
-	valuesLogger := klog.LoggerWithValues(klog.NewKlogr(), "pod-name", req.VolumeContext["pod-name"], "namespace", req.VolumeContext["namespace"], "uid", req.VolumeContext["uid"], "request-id", uuid.NewString())
+	valuesLogger := klog.LoggerWithValues(klog.NewKlogr(), "pod-name", req.VolumeContext["pod-name"], "namespace", req.VolumeContext["namespace"], "uid", req.VolumeContext["uid"])
 	valuesLogger.Info("Incoming NodePublishVolume request", "request string", req.String())
 	if len(req.VolumeId) == 0 {
 		err = status.Error(codes.InvalidArgument, "VolumeId is missing")
@@ -122,56 +121,59 @@ func (n NodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublishV
 		image = req.VolumeContext[ctxKeyImage]
 	}
 
-	namedRef, err := docker.ParseDockerRef(image)
-	if err != nil {
-		klog.Errorf("unable to normalize image %q: %s", image, err)
-		return
-	}
-
 	pullAlways := strings.ToLower(req.VolumeContext[ctxKeyPullAlways]) == "true"
 
-	po := &pullexecutor.PullOptions{
-		Context:     ctx,
-		NamedRef:    namedRef,
-		PullAlways:  pullAlways,
-		Image:       image,
-		PullSecrets: req.Secrets,
-		Logger:      valuesLogger,
-	}
-
-	if e := n.pullExecutor.StartPulling(po); e != nil {
-		err = status.Errorf(codes.Aborted, "unable to pull image %q: %s", image, e)
+	keyring, err := n.secretStore.GetDockerKeyring(ctx, req.Secrets)
+	if err != nil {
+		err = status.Errorf(codes.Aborted, "unable to fetch keyring: %s", err)
 		return
 	}
 
-	if e := n.pullExecutor.WaitForPull(po); e != nil {
-		err = status.Errorf(codes.DeadlineExceeded, e.Error())
+	namedRef, err := docker.ParseDockerRef(image)
+	if err != nil {
+		klog.Errorf("unable to normalize image %q: %s", image, err)
 		return
 	}
 
-	if mountstatus.Get(req.VolumeId) == mountstatus.Mounted {
-		return &csi.NodePublishVolumeResponse{}, nil
-	}
-
-	o := &mountexecutor.MountOptions{
-		Context:          ctx,
-		NamedRef:         namedRef,
-		VolumeId:         req.VolumeId,
-		TargetPath:       req.TargetPath,
-		VolumeCapability: req.VolumeCapability,
-		ReadOnly:         req.Readonly,
-		Logger:           valuesLogger,
+	//NOTE: we are relying on n.mounter.ImageExists() to return false when
+	//      a first-time pull is in progress, else this logic may not be
+	//      correct. should test this.
+	if pullAlways || !n.mounter.ImageExists(ctx, namedRef) {
+		klog.Errorf("pull image %q", image)
+		puller := remoteimage.NewPuller(n.imageSvc, namedRef, keyring)
+
+		if n.asyncImagePuller != nil {
+			var session *remoteimageasync.PullSession
+			session, err = n.asyncImagePuller.StartPull(image, puller, n.asyncImagePullTimeout)
+			if err != nil {
+				err = status.Errorf(codes.Aborted, "unable to pull image %q: %s", image, err)
+				metrics.OperationErrorsCount.WithLabelValues("pull-async-start").Inc()
+				return
+			}
+			if err = n.asyncImagePuller.WaitForPull(session, ctx); err != nil {
+				err = status.Errorf(codes.Aborted, "unable to pull image %q: %s", image, err)
+				metrics.OperationErrorsCount.WithLabelValues("pull-async-wait").Inc()
				return
+			}
+		} else {
+			if err = puller.Pull(ctx); err != nil {
+				err = status.Errorf(codes.Aborted, "unable to pull image %q: %s", image, err)
+				metrics.OperationErrorsCount.WithLabelValues("pull-sync-call").Inc()
+				return
+			}
+		}
 	}
 
-	if e := n.mountExecutor.StartMounting(o); e != nil {
-		err = status.Error(codes.Internal, e.Error())
+	ro := req.Readonly ||
+		req.VolumeCapability.AccessMode.Mode == csi.VolumeCapability_AccessMode_SINGLE_NODE_READER_ONLY ||
+		req.VolumeCapability.AccessMode.Mode == csi.VolumeCapability_AccessMode_MULTI_NODE_READER_ONLY
+	if err = n.mounter.Mount(ctx, req.VolumeId, backend.MountTarget(req.TargetPath), namedRef, ro); err != nil {
+		err = status.Error(codes.Internal, err.Error())
+		metrics.OperationErrorsCount.WithLabelValues("mount").Inc()
 		return
 	}
 
-	if e := n.mountExecutor.WaitForMount(o); e != nil {
-		err = status.Errorf(codes.DeadlineExceeded, e.Error())
-		return
-	}
+	valuesLogger.Info("Successfully completed NodePublishVolume request", "request string", req.String())
 
 	return &csi.NodePublishVolumeResponse{}, nil
 }
@@ -194,17 +196,11 @@ func (n NodeServer) NodeUnpublishVolume(ctx context.Context, req *csi.NodeUnpubl
 	}
 
 	if err = n.mounter.Unmount(ctx, req.VolumeId, backend.MountTarget(req.TargetPath)); err != nil {
-		// TODO(vadasambar): move this to mountexecutor once mountexecutor has `StartUnmounting` function
-		metrics.OperationErrorsCount.WithLabelValues("StartUnmounting").Inc()
+		metrics.OperationErrorsCount.WithLabelValues("unmount").Inc()
 		err = status.Error(codes.Internal, err.Error())
 		return
 	}
 
-	// Clear the mountstatus since the volume has been unmounted
-	// Not doing this will make mount not work properly if the same volume is
-	// attempted to mount twice
-	mountstatus.Delete(req.VolumeId)
-
 	return &csi.NodeUnpublishVolumeResponse{}, nil
 }

cmd/plugin/node_server_test.go

Lines changed: 4 additions & 4 deletions
@@ -41,7 +41,7 @@ func TestNodePublishVolumeAsync(t *testing.T) {
 	driver := csicommon.NewCSIDriver(driverName, driverVersion, "fake-node")
 	assert.NotNil(t, driver)
 
-	asyncImagePulls := true
+	asyncImagePulls := 15 * time.Minute //TODO: determine intended value for this in the context of this test
 	ns := NewNodeServer(driver, mounter, criClient, &testSecretStore{}, asyncImagePulls)
 
 	// based on kubelet's csi mounter plugin code
@@ -168,7 +168,7 @@ func TestNodePublishVolumeSync(t *testing.T) {
 	driver := csicommon.NewCSIDriver(driverName, driverVersion, "fake-node")
 	assert.NotNil(t, driver)
 
-	asyncImagePulls := false
+	asyncImagePulls := 0 * time.Minute //TODO: determine intended value for this in the context of this test
 	ns := NewNodeServer(driver, mounter, criClient, &testSecretStore{}, asyncImagePulls)
 
 	// based on kubelet's csi mounter plugin code
@@ -300,7 +300,7 @@ func TestMetrics(t *testing.T) {
 	driver := csicommon.NewCSIDriver(driverName, driverVersion, "fake-node")
 	assert.NotNil(t, driver)
 
-	asyncImagePulls := true
+	asyncImagePulls := 15 * time.Minute //TODO: determine intended value for this in the context of this test
 	ns := NewNodeServer(driver, mounter, criClient, &testSecretStore{}, asyncImagePulls)
 
 	// based on kubelet's csi mounter plugin code
@@ -437,7 +437,7 @@ func TestMetrics(t *testing.T) {
 	assert.NoError(t, err)
 	respBody := string(b1)
 	assert.Contains(t, respBody, metrics.ImagePullTimeKey)
-	assert.Contains(t, respBody, metrics.ImageMountTimeKey)
+	assert.Contains(t, respBody, metrics.ImagePullTimeHist)
 	assert.Contains(t, respBody, metrics.OperationErrorsCountKey)
 
 	// give some time before stopping the server

go.mod

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ require (
 	github.com/pkg/errors v0.9.1
 	github.com/prometheus/client_golang v1.12.1
 	github.com/spf13/pflag v1.0.5
-	github.com/stretchr/testify v1.8.0
+	github.com/stretchr/testify v1.8.4
 	github.com/warm-metal/csi-drivers v0.5.0-alpha.0.0.20210404173852-9ec9cb097dd2
 	golang.org/x/net v0.0.0-20221004154528-8021a29435af
 	google.golang.org/grpc v1.50.0

go.sum

Lines changed: 2 additions & 1 deletion
@@ -826,8 +826,9 @@ github.com/stretchr/testify v1.5.1/go.mod h1:5W2xD1RspED5o8YsWQXVCued0rvSQ+mT+I5
 github.com/stretchr/testify v1.6.1/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg=
 github.com/stretchr/testify v1.7.0/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg=
 github.com/stretchr/testify v1.7.1/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg=
-github.com/stretchr/testify v1.8.0 h1:pSgiaMZlXftHpm5L7V1+rVB+AZJydKsMxsQBIJw4PKk=
 github.com/stretchr/testify v1.8.0/go.mod h1:yNjHg4UonilssWZ8iaSj1OCr/vHnekPRkoO+kdMU+MU=
+github.com/stretchr/testify v1.8.4 h1:CcVxjf3Q8PM0mHUKJCdn+eZZtm5yQwehR5yeSVQQcUk=
+github.com/stretchr/testify v1.8.4/go.mod h1:sz/lmYIOXD/1dqDmKjjqLyZ2RngseejIcXlSw2iwfAo=
 github.com/subosito/gotenv v1.2.0/go.mod h1:N0PQaV/YGNqwC0u51sEeR/aUtSLEXKX9iv69rRypqCw=
 github.com/syndtr/gocapability v0.0.0-20170704070218-db04d3cc01c8/go.mod h1:hkRG7XYTFWNJGYcbNJQlaLq0fg1yr4J4t/NcTQtrfww=
 github.com/syndtr/gocapability v0.0.0-20180916011248-d98352740cb2/go.mod h1:hkRG7XYTFWNJGYcbNJQlaLq0fg1yr4J4t/NcTQtrfww=
