From 3212e99cc8655d471a5a9dd1cffe204cb33233fa Mon Sep 17 00:00:00 2001
From: tje
Date: Wed, 28 Aug 2024 11:38:21 +0100
Subject: [PATCH] datasets examples in shell/python

---
 benchmarking/datasets.mdx | 72 +++++++++++++++++++++------------------
 1 file changed, 38 insertions(+), 34 deletions(-)

diff --git a/benchmarking/datasets.mdx b/benchmarking/datasets.mdx
index 583d2e202..7247f9dcd 100644
--- a/benchmarking/datasets.mdx
+++ b/benchmarking/datasets.mdx
@@ -52,10 +52,12 @@ This data is especially important, as it represents the *true distribution* obse
 before deployment.
 
 It's easy to extract any prompt queries previously made to the API,
-via the [X]() endpoint, as explained [here]().
-For example, the last 100 prompts for subject `Y` can be extracted as follows:
+via the [`prompt_history`](benchmarks/get_prompt_history) endpoint, as explained [here]().
+For example, the last 100 prompts with the tag `physics` can be extracted as follows:
 
-CODE
+```python
+physics_prompts = client.prompt_history(tag="physics", limit=100)
+```
 
 We can then add this to the local `.jsonl` file as follows:
 
@@ -64,19 +66,19 @@ CODE
 
 ## Uploading Datasets
 
 As shown above, the representation for prompt datasets is `.jsonl`,
-which is a effectively a list of json structures (or in Python, a list of dicts).
+which is a file format where each line is a json object (so, in Python, the file as a whole maps to a list of dicts).
 
 Lets upload our `english_language.jsonl` dataset.
 We can do this via the REST API as follows:
 
-```
-import requests
-url = "https://api.unify.ai/v0/dataset"
-headers = {"Authorization": "Bearer $UNIFY_API_KEY",}
-data = {"name": "english_language"}
-files = {"file": open('/path/to/english_language.jsonl' ,'rb')}
-response = requests.post(url, data=data, files=files, headers=headers)
+```shell
+curl --request POST \
+  --url 'https://api.unify.ai/v0/dataset' \
+  --header 'Authorization: Bearer $UNIFY_KEY' \
+  --header 'Content-Type: multipart/form-data' \
+  --form 'file=@english_language.jsonl' \
+  --form 'name=english_language'
 ```
 
 Or we can create a `Dataset` instance in Python,
@@ -90,31 +92,32 @@ We can delete the dataset just as easily as we created it.
 First, using the REST API:
 
-```
-import requests
-url = "https://api.unify.ai/v0/dataset"
-headers = {"Authorization": "Bearer $UNIFY_API_KEY"}
-data = {"name": "english_language"}
-response = requests.delete(url, params=data, headers=headers)
+```shell
+curl --request DELETE \
+  --url 'https://api.unify.ai/v0/dataset?name=english_language' \
+  --header 'Authorization: Bearer $UNIFY_KEY'
+
 ```
 
 Or via Python:
 
-CODE
-
+```python
+client.datasets.delete(name="english_language")
+```
 
 ## Listing Datasets
 
 We can retrieve a list of our uploaded datasets using the `/dataset/list` endpoint.
 
+```shell
+curl --request GET \
+  --url 'https://api.unify.ai/v0/dataset/list' \
+  --header 'Authorization: Bearer $UNIFY_KEY'
 ```
-import requests
-url = "https://api.unify.ai/v0/dataset/list"
-headers = {"Authorization": "Bearer $UNIFY_API_KEY"}
-response = requests.get(url, headers=headers)
-print(response.text)
-```
-
+```python
+datasets = client.datasets.list()
+print(datasets)
+```
 
 ## Renaming Datasets
 
@@ -126,23 +129,24 @@
 and `english language`.
 
 We can easily rename the dataset without deleting and re-uploading, via the following REST API command:
 
-```
-import requests
-url = "https://api.unify.ai/v0/dataset/rename"
-headers = {"Authorization": "Bearer $UNIFY_API_KEY"}
-data = {"name": "english", "new_name": "english_literature"}
-response = requests.post(url, params=data, headers=headers)
+```shell
+curl --request POST \
+  --url 'https://api.unify.ai/v0/dataset/rename?name=english&new_name=english_literature' \
+  --header 'Authorization: Bearer $UNIFY_KEY'
+
 ```
 
 Or via Python:
 
-CODE
+```python
+client.datasets.rename(name="english", new_name="english_literature")
+```
 
 ## Appending to Datasets
 
 As explained above, we might want to add to an existing dataset, either because we have [generated some synthetic examples](), or perhaps because we have some relevant
-[production traffic]().
+[production traffic](datasets#production-data).
 
 In the examples above, we simply appended to these datasets locally, before then uploading the full `.jsonl` file. However,