feat(lmeval): Add support for S3 offline assets #399

Merged

ruivieira
Member

This PR adds support for using offline assets (i.e. models, datasets) in S3-compatible storage from LMEval.

An example of the new CRD is:

apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-sample
spec:
  allowOnline: false
  model: hf
  modelArgs:
    - name: pretrained
      value: /opt/app-root/src/hf_home/flan
  taskList:
    taskNames:
      - arc_easy
  logSamples: true
  offline:
    storage:
      s3:
        accessKeyId:
          name: s3-secret
          key: AWS_ACCESS_KEY_ID
        secretAccessKey:
          name: s3-secret
          key: AWS_SECRET_ACCESS_KEY
        bucket:
          name: s3-secret
          key: AWS_S3_BUCKET
        endpoint:
          name: s3-secret
          key: AWS_S3_ENDPOINT
        region:
          name: s3-secret
          key: AWS_DEFAULT_REGION
        path: "myassets"

We assume the following in this example:

  • A Secret named s3-secret exists in the same namespace and contains the keys AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_S3_BUCKET, AWS_S3_ENDPOINT and AWS_DEFAULT_REGION.
    • Note: the keys can be named whatever the user wants, as long as each one holds what the CR field expects (e.g. the Secret key referenced by spec.offline.storage.s3.endpoint can be called anything, as long as its value is the endpoint).
  • An S3 bucket named as the value of AWS_S3_BUCKET exists (as an example a bucket called models)
  • If the data sits under a prefix (e.g. in s3://models/myassets/model1 and s3://models/myassets/dataset2), the prefix can be specified in path (e.g. myassets)
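
The first assumption above (a Secret holding the five keys) could be prepared with something like the following. This is only a sketch: the helper name `build_s3_secret` and every value are placeholders for illustration, not part of this PR, and the key names simply mirror the ones used in the example CR.

```python
import base64
import json


def build_s3_secret(name: str, values: dict) -> dict:
    """Return a Kubernetes Secret manifest whose data values are base64-encoded."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name},
        "type": "Opaque",
        "data": {
            key: base64.b64encode(value.encode("utf-8")).decode("ascii")
            for key, value in values.items()
        },
    }


# All values below are placeholders for illustration only.
secret = build_s3_secret(
    "s3-secret",
    {
        "AWS_ACCESS_KEY_ID": "example-access-key",
        "AWS_SECRET_ACCESS_KEY": "example-secret-key",
        "AWS_S3_BUCKET": "models",
        "AWS_S3_ENDPOINT": "https://s3.example.com",
        "AWS_DEFAULT_REGION": "us-east-1",
    },
)

# kubectl accepts JSON manifests as well as YAML.
print(json.dumps(secret, indent=2))
```

The printed JSON can be piped to `kubectl apply -f -` in the namespace where the LMEvalJob will run.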

This will allow the LMEval Job to download all assets from s3://models/myassets locally and use them for offline evaluation.

This PR has a corresponding PR on https://github.com/opendatahub-io/lm-evaluation-harness with the script which downloads the assets given the information on the CR.

From LMEval's point of view, asset paths are relative to the local storage directory /opt/app-root/src/hf_home, i.e.:

  • s3://models/myassets/model1 will be available at /opt/app-root/src/hf_home/model1
  • s3://models/myassets/another/model2 will be available at /opt/app-root/src/hf_home/another/model2
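
The mapping above can be sketched in a few lines. This is not the actual downloader script from the companion PR; `local_path_for` and `download_prefix` are hypothetical names, and `client` is assumed to be a boto3 S3 client (only its `get_paginator("list_objects_v2")` and `download_file` methods are used).

```python
import os
import posixpath

HF_HOME = "/opt/app-root/src/hf_home"  # local root named in the PR description


def local_path_for(key: str, prefix: str, root: str = HF_HOME) -> str:
    """Map an S3 object key to a local path by stripping the configured prefix."""
    prefix = prefix.strip("/")
    key = key.lstrip("/")
    if prefix:
        # Require a whole path segment match so "myassets2/x" is not accepted.
        if key != prefix and not key.startswith(prefix + "/"):
            raise ValueError(f"key {key!r} is not under prefix {prefix!r}")
        relative = key[len(prefix):].lstrip("/")
    else:
        relative = key
    return posixpath.join(root, relative) if relative else root


def download_prefix(client, bucket: str, prefix: str, root: str = HF_HOME) -> list:
    """Download every object under `prefix` and return the local paths written."""
    s3_prefix = prefix.strip("/")
    s3_prefix = s3_prefix + "/" if s3_prefix else ""
    written = []
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=s3_prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):  # skip "directory" placeholder objects
                continue
            target = local_path_for(key, prefix, root)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            client.download_file(bucket, key, target)
            written.append(target)
    return written
```

With the example CR, `local_path_for("myassets/model1", "myassets")` resolves to the first path listed above.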

openshift-ci bot commented Feb 4, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ruivieira ruivieira requested a review from yhwang February 4, 2025 10:30
@ruivieira ruivieira self-assigned this Feb 4, 2025
@ruivieira ruivieira added the kind/enhancement New feature or request label Feb 4, 2025
Collaborator

@yhwang yhwang left a comment

/LGTM

github-actions bot commented Feb 4, 2025

PR image build and manifest generation completed successfully!

📦 PR image: quay.io/trustyai/trustyai-service-operator-ci:5b83f36b5dce91f2e071decdaad80199c949d7f0

📦 LMES driver image: quay.io/trustyai/ta-lmes-driver:5b83f36b5dce91f2e071decdaad80199c949d7f0

📦 LMES job image: quay.io/trustyai/ta-lmes-job:5b83f36b5dce91f2e071decdaad80199c949d7f0

🗂️ CI manifests

devFlags:
  manifests:
    - contextDir: config
      sourcePath: ''
      uri: https://api.github.com/repos/trustyai-explainability/trustyai-service-operator-ci/tarball/operator-5b83f36b5dce91f2e071decdaad80199c949d7f0

@yhwang
Collaborator

yhwang commented Feb 5, 2025

@ruivieira one question about the s3_downloader.py script: I don't see it in the downstream lm-evaluation-harness repo, but I assume you will make it available. Is it possible to embed the S3 download function directly into the driver?

@openshift-ci openshift-ci bot removed the lgtm label Feb 5, 2025
@ruivieira
Member Author

@yhwang I'll do the PR for lm-evaluation-harness shortly.

> @ruivieira one question about the s3_downloader.py script. I don't see that in the downstream lm-evaluation-harness repo. But I assume you will make it available. is it possible to directly embed the s3 download function into the driver?

Good point, it is possible IMO. Apart from the obvious download methods, I think the above PR would have to be changed so that:

  • AWS/S3 credentials and connection info are available as env vars in the driver container (instead of the job's)
  • The driver can access /opt/app-root/src/hf_home at runtime

However, if you agree, we could keep this method for now (decoupled from the driver) and investigate moving it into the driver in a future iteration. That would be a non-breaking change: the download mechanism is purely internal, so the CR API would stay the same.
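
Whichever container ends up doing the download, it would need to read the S3 settings from its environment. The sketch below assumes the env vars carry the same names as the Secret keys in the example CR; that naming, and the `S3Settings` / `load_s3_settings` helpers, are hypothetical and not confirmed by this PR.

```python
import os
from dataclasses import dataclass

# Assumed env var names; they mirror the Secret keys used in the example CR.
REQUIRED_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_S3_BUCKET",
    "AWS_S3_ENDPOINT",
    "AWS_DEFAULT_REGION",
)


@dataclass(frozen=True)
class S3Settings:
    access_key_id: str
    secret_access_key: str
    bucket: str
    endpoint: str
    region: str


def load_s3_settings(environ=None) -> S3Settings:
    """Read S3 settings from the environment, failing early if any are missing."""
    env = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing S3 environment variables: " + ", ".join(missing))
    return S3Settings(
        access_key_id=env["AWS_ACCESS_KEY_ID"],
        secret_access_key=env["AWS_SECRET_ACCESS_KEY"],
        bucket=env["AWS_S3_BUCKET"],
        endpoint=env["AWS_S3_ENDPOINT"],
        region=env["AWS_DEFAULT_REGION"],
    )
```

Failing early with the full list of missing variables keeps a misconfigured Secret from surfacing later as an opaque download error.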

@ruivieira ruivieira marked this pull request as ready for review February 5, 2025 10:16
@yhwang
Collaborator

yhwang commented Feb 5, 2025

@ruivieira totally agree! I only wanted to know the possibility when I asked the question.

If the SSL Verify field was omitted, the controller would crash trying to convert it to a boolean.
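
The general shape of such a fix is to treat an omitted value as "use the default" before attempting any conversion. The controller itself is written in Go; the Python sketch below only illustrates the idea, and both the helper name `parse_optional_bool` and the default of `True` are assumptions rather than the actual change.

```python
from typing import Optional

_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off"}


def parse_optional_bool(raw: Optional[str], default: bool = True) -> bool:
    """Convert an optional string to a bool, returning `default` when it is omitted."""
    if raw is None or raw.strip() == "":
        return default  # omitted field: do not attempt a conversion
    value = raw.strip().lower()
    if value in _TRUE:
        return True
    if value in _FALSE:
        return False
    raise ValueError(f"cannot interpret {raw!r} as a boolean")
```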
@ruivieira ruivieira linked an issue Feb 11, 2025 that may be closed by this pull request
@ruivieira
Member Author

/retest

@ruivieira
Member Author

/test images

openshift-ci bot commented Feb 12, 2025

@ruivieira: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/trustyai-service-operator-e2e | f4790f1 | link | true | /test trustyai-service-operator-e2e |
| ci/prow/images | f4790f1 | link | true | /test images |

…upport

# Conflicts:
#	cmd/lmes_driver/main.go
#	controllers/lmes/driver/driver.go
@openshift-ci openshift-ci bot removed the lgtm label Feb 12, 2025
@openshift-ci openshift-ci bot added the lgtm label Feb 12, 2025
@openshift-ci openshift-ci bot removed the lgtm label Feb 12, 2025
@openshift-ci openshift-ci bot added the lgtm label Feb 12, 2025
openshift-ci bot commented Feb 12, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: RobGeada, yhwang

@ruivieira ruivieira merged commit ffe9cf5 into trustyai-explainability:main Feb 12, 2025
5 of 7 checks passed
@ruivieira ruivieira linked an issue Feb 17, 2025 that may be closed by this pull request
ruivieira pushed a commit to ruivieira/trustyai-service-operator that referenced this pull request Feb 18, 2025
…ices/konflux/component-updates/component-update-ta-lmes-driver-v2-18

chore(deps): update ta-lmes-driver-v2-18 to 0caea4f
Successfully merging this pull request may close these issues.

  • Add LMES controller support for s3
  • Support for S3 model storage
3 participants