
Commit 0390d86

Merge pull request #65 from christinaexyou/add-lmeval-lls-custom-data-tutorial
Add LMEval LLS with custom task tutorial
2 parents 736ab9d + d771b75 commit 0390d86

File tree

4 files changed: +175 -2 lines changed

docs/modules/ROOT/nav.adoc

Lines changed: 2 additions & 1 deletion
@@ -17,8 +17,9 @@
 ** xref:gorch-tutorial.adoc[]
 *** xref:hf-serving-runtime-tutorial.adoc[Using Hugging Face models with GuardrailsOrchestrator]
 ** xref:tutorials-llama-stack-section.adoc[]
-*** xref:lmeval-lls-tutorial.adoc[]
+*** xref:lmeval-lls-tutorial.adoc[Getting Started with LM-Eval on Llama-Stack]
 *** xref:trustyai-fms-lls-tutorial.adoc[Getting started with trustyai_fms and llama-stack]
+*** xref:lmeval-lls-tutorial-custom-data.adoc[Running Custom Evaluations with LMEval Llama Stack External Eval Provider]
 * Components
 ** xref:trustyai-service.adoc[]
 ** xref:trustyai-operator.adoc[]

docs/modules/ROOT/pages/lmeval-lls-tutorial-custom-data.adoc

Lines changed: 170 additions & 0 deletions
@@ -0,0 +1,170 @@
= Running Custom Evaluations with LMEval Llama Stack External Eval Provider
:description: Learn how to evaluate your language model using the LMEval Llama Stack External Eval Provider with a custom dataset.
:keywords: LMEval, Llama Stack, model evaluation

== Prerequisites

* Admin access to an OpenShift cluster
* The TrustyAI operator installed in your OpenShift cluster
* KServe set to Raw Deployment mode
* A language model deployed on vLLM Serving Runtime in your OpenShift cluster


== Overview

This tutorial demonstrates how to evaluate a language model using the https://github.com/trustyai-explainability/llama-stack-provider-lmeval[LMEval Llama Stack External Eval Provider] on a custom dataset. While Eleuther's https://github.com/EleutherAI/lm-evaluation-harness[lm-evaluation-harness] comes with 100+ out-of-the-box tasks, you might want to create a custom task to better evaluate the knowledge and behavior of your model. To run evaluations over a custom task, you need to **1) upload the task dataset to your OpenShift cluster** and **2) register it as a benchmark with Llama Stack**.

In this tutorial, you will learn how to:

* Register a custom benchmark dataset
* Run a benchmark evaluation job on a language model


== Usage

This tutorial extends xref:lmeval-lls-tutorial.adoc[Getting Started with LMEval Llama Stack External Provider]; see the **Usage** and **Configuring the Llama Stack Server** sections there to install the eval provider and start your Llama Stack server.

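If the server is not already running, the following is a minimal sketch of starting it. The `run.yaml` file name and the `--image-type venv` flag are assumptions carried over from the getting-started setup, and the exported values are placeholders for your own model route and namespace:

[source,bash]
----
# Placeholder values: point these at your own model route and namespace
export VLLM_URL=https://<MODEL_ROUTE>/v1/completions
export TRUSTYAI_LM_EVAL_NAMESPACE=<MODEL_NAMESPACE>

# Start the Llama Stack server using the run.yaml from the getting-started tutorial
llama stack run run.yaml --image-type venv
----
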

== Upload Your Custom Task Dataset to OpenShift

With the Llama Stack server running, create a Python script or Jupyter notebook to interact with the server and run an evaluation.

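A minimal sketch of connecting a client to the server is shown below; the `base_url` assumes the server is reachable locally on Llama Stack's default port (8321), so adjust it to match your own deployment:

[source,python]
----
from llama_stack_client import LlamaStackClient

# Assumes the Llama Stack server started earlier is reachable on localhost:8321
client = LlamaStackClient(base_url="http://localhost:8321")
----
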
Create a PersistentVolumeClaim (PVC) object named `my-pvc` to store your task dataset on your OpenShift cluster:

[source,bash]
----
oc apply -n <MODEL_NAMESPACE> -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
EOF
----

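Optionally, confirm that the PVC exists before moving on:

[source,bash]
----
oc get pvc my-pvc -n <MODEL_NAMESPACE>
----
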
Create a Pod object named `dataset-storage-pod` to copy the task dataset into the PVC:

[source,bash]
----
oc apply -n <MODEL_NAMESPACE> -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: dataset-storage-pod
spec:
  containers:
    - name: dataset-container
      image: 'quay.io/prometheus/busybox:latest'
      command: ["/bin/sh", "-c", "sleep 3600"]
      volumeMounts:
        - mountPath: "/data/upload_files"
          name: dataset-storage
  volumes:
    - name: dataset-storage
      persistentVolumeClaim:
        claimName: my-pvc
EOF
----

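Wait for the pod to reach the `Ready` state before copying files into it, for example:

[source,bash]
----
oc wait --for=condition=Ready pod/dataset-storage-pod -n <MODEL_NAMESPACE> --timeout=60s
----
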
Copy your locally stored task dataset to the Pod. In this example, the dataset is named `example-dk-bench-input-bmo.jsonl` and it is copied to the `dataset-storage-pod` under the path `/data/upload_files/`:

[source,bash]
----
oc cp example-dk-bench-input-bmo.jsonl dataset-storage-pod:/data/upload_files/example-dk-bench-input-bmo.jsonl -n <MODEL_NAMESPACE>
----

[NOTE]
Replace `<MODEL_NAMESPACE>` with the namespace where the language model you wish to evaluate is deployed.

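To verify that the dataset reached the PVC-backed volume, you can list the upload directory inside the pod:

[source,bash]
----
oc exec dataset-storage-pod -n <MODEL_NAMESPACE> -- ls -lh /data/upload_files/
----
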

== Register the Custom Dataset as a Benchmark

Once the dataset is uploaded to the PVC, we can register it as a benchmark for evaluations. At a minimum, we need to provide the following metadata:

* The https://github.com/trustyai-explainability/lm-eval-tasks[TrustyAI LM-Eval Tasks] GitHub URL, branch, commit SHA, and path of the custom task
* The location of the custom task file in our PVC

[source,python]
----
client.benchmarks.register(
    benchmark_id="trustyai_lmeval::dk-bench",
    dataset_id="trustyai_lmeval::dk-bench",
    scoring_functions=["string"],
    provider_benchmark_id="string",
    provider_id="trustyai_lmeval",
    metadata={
        "custom_task": {
            "git": {
                "url": "https://github.com/trustyai-explainability/lm-eval-tasks.git",
                "branch": "main",
                "commit": "8220e2d73c187471acbe71659c98bccecfe77958",
                "path": "tasks/",
            }
        },
        "env": {
            # Path of the dataset inside the PVC
            "DK_BENCH_DATASET_PATH": "/opt/app-root/src/hf_home/example-dk-bench-input-bmo.jsonl",
            "JUDGE_MODEL_URL": "http://phi-3-predictor:8080/v1/chat/completions",
            # For simplicity, we use the same model as the one being evaluated
            "JUDGE_MODEL_NAME": "phi-3",
            "JUDGE_API_KEY": "",
        },
        "tokenized_requests": False,
        "tokenizer": "google/flan-t5-small",
        "input": {"storage": {"pvc": "my-pvc"}},
    },
)
----

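As a quick sanity check, you can list the benchmarks known to the server and confirm that `trustyai_lmeval::dk-bench` appears among them:

[source,python]
----
# The newly registered benchmark should be included in this list
print(client.benchmarks.list())
----
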
Run a benchmark evaluation on your model:

[source,python]
----
job = client.eval.run_eval(
    benchmark_id="trustyai_lmeval::dk-bench",
    benchmark_config={
        "eval_candidate": {
            "type": "model",
            "model": "phi-3",
            "provider_id": "trustyai_lmeval",
            "sampling_params": {
                "temperature": 0.7,
                "top_p": 0.9,
                "max_tokens": 256,
            },
        },
        "num_examples": 1000,
    },
)

print(f"Starting job '{job.job_id}'")
----

Monitor the status of the evaluation job. The job runs asynchronously, so you can check its status periodically:

[source,python]
----
import time

def get_job_status(job_id, benchmark_id):
    return client.eval.jobs.status(job_id=job_id, benchmark_id=benchmark_id)

while True:
    job = get_job_status(job_id=job.job_id, benchmark_id="trustyai_lmeval::dk-bench")
    print(job)

    if job.status in ['failed', 'completed']:
        print(f"Job ended with status: {job.status}")
        break

    time.sleep(20)
----

Get the job's results:

[source,python]
----
import pprint

pprint.pprint(client.eval.jobs.retrieve(job_id=job.job_id, benchmark_id="trustyai_lmeval::dk-bench").scores)
----

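For a more compact summary, the returned `scores` object can be iterated; this sketch assumes it behaves like a mapping from metric name to a scoring result, and it falls back to printing the raw result if an `aggregated_results` attribute is not present:

[source,python]
----
results = client.eval.jobs.retrieve(
    job_id=job.job_id, benchmark_id="trustyai_lmeval::dk-bench"
).scores

# Print one aggregated entry per metric (falls back to the raw result object)
for metric, result in results.items():
    print(metric, getattr(result, "aggregated_results", result))
----
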

== See Also

* xref:lmeval-lls-tutorial.adoc[Getting Started with LM-Eval on Llama Stack]
* https://github.com/trustyai-explainability/lm-eval-tasks[TrustyAI LM-Eval Tasks]

docs/modules/ROOT/pages/lmeval-lls-tutorial.adoc

Lines changed: 1 addition & 1 deletion
@@ -33,7 +33,7 @@ Install the LMEval Llama Stack External Eval Provider from PyPi:
 pip install llama-stack-provider-lmeval
 ----

-== Configuing the Llama Stack Server
+== Configuring the Llama Stack Server
 Set the `VLLM_URL` and `TRUSTYAI_LM_EVAL_NAMESPACE` environment variables in your terminal. The `VLLM_URL` value should be the `v1/completions` endpoint of your model route and the `TRUSTYAI_LM_EVAL_NAMESPACE` should be the namespace where your model is deployed. For example:

 [source,bash]

docs/modules/ROOT/pages/tutorials-llama-stack-section.adoc

Lines changed: 2 additions & 0 deletions
@@ -5,3 +5,5 @@ This section contains tutorials for working with Llama Stack in TrustyAI. These
 == Available Tutorials

 * xref:lmeval-lls-tutorial.adoc[Getting Started with LMEval Llama Stack External Eval Provider] - Learn how to evaluate your language model using the LMEval Llama Stack External Eval Provider
+
+* xref:lmeval-lls-tutorial-custom-data.adoc[Running Custom Evaluations with LMEval Llama Stack External Eval Provider] - Learn how to evaluate your language model using the LMEval Llama Stack External Eval Provider over a custom task
