Commit d776668

updates
1 parent 4030ff9 commit d776668

2 files changed: 24 additions, 3 deletions

inference/README.md

Lines changed: 12 additions & 2 deletions
@@ -42,8 +42,18 @@ When you have users that send queries in real time - this is Online Inference. E
4242
When you have a file with prompts that you need to run inference on - this is Offline Inference. Examples: benchmark evaluation, synthetic data generation. In this case the inference server is often not needed and the inference is run directly in the same program that sends the query (client and server in one application).
4343

4444

45+
### Tasks
4546

47+
#### Input-grounded tasks
4648

49+
Input-grounded tasks are those where the generated response is derived mainly from the prompt, i.e. the main source of knowledge is contained in the prompt. These include:
50+
51+
- Translation
52+
- Summarization
53+
- Document QA
54+
- Multi-turn chat
55+
- Code editing
56+
- Speech recognition (audio transcription)
4757

4858

4959
### Batching
@@ -222,13 +232,13 @@ When there is a partial mismatch we can go back to the draft model and feed it a
222232

223233
The draft model ideally should be trained on the same data (or at least on data from a similar distribution), and its tokenizer has to be the same as the large model's.
224234

225-
Speculative decoding gives the highest return on input-grounded tasks, such as translation and summarization, because in those tasks the range of possible outputs is much smaller and the draft model is much more likely to match the big model.
235+
Speculative decoding gives the highest return on [input-grounded tasks](#input-grounded-tasks), such as translation, summarization, document QA, and multi-turn chat, because in those tasks the range of possible outputs is much smaller and the draft model is much more likely to match the big model.
226236

227237
For the same reason it works best when used with [greedy decoding](#greedy-decoding), as there is the least amount of possible variation during generation. If not using greedy decoding, you will want to have the value of [temperature](#temperature) close to 0.
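
To make the acceptance step concrete, here is a minimal sketch of verification under greedy decoding. It assumes the large model has already scored all drafted positions in a single forward pass and that tokens are plain integers; the function name `accept_draft_tokens` and its arguments are illustrative and not taken from any library.

```python
def accept_draft_tokens(draft_tokens, target_tokens):
    """Return the tokens kept after one greedy verification step.

    draft_tokens:  tokens proposed by the draft model.
    target_tokens: the large model's greedy choice at each drafted position,
                   optionally followed by one extra token for the position
                   right after the last drafted token.
    """
    accepted = []
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:
            # First mismatch: keep the large model's token and discard the rest
            # of the draft, since it was conditioned on a wrong prefix.
            accepted.append(target)
            return accepted
        accepted.append(draft)
    # Every drafted token matched, so the extra large-model token is also valid.
    if len(target_tokens) > len(draft_tokens):
        accepted.append(target_tokens[len(draft_tokens)])
    return accepted
```

The more drafted tokens match, the more tokens are produced per large-model forward pass, which is why tasks with a narrow range of plausible outputs benefit the most.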
228238

229239
Here is a good in-depth dive into this subject: [Assisted Generation: a new direction toward low-latency text generation](https://huggingface.co/blog/assisted-generation).
230240

231-
241+
Another, much simpler solution for [input-grounded tasks](#input-grounded-tasks) is to use [ngram prompt lookup decoding](https://github.com/apoorvumang/prompt-lookup-decoding). In this approach there is no need for a draft model; instead, the prompt is searched for matching strings to generate candidates. In some situations it's said to speed decoding up by 2x+.
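
The candidate search can be sketched in a few lines. This is a simplified illustration of the idea rather than the linked repository's implementation: it works on a flat list of token ids (prompt plus anything generated so far), and the name `find_candidate_tokens` and the defaults for `ngram_size` and `num_candidates` are assumptions made for the example.

```python
def find_candidate_tokens(token_ids, ngram_size=3, num_candidates=10):
    """Propose draft tokens by finding the trailing n-gram earlier in the sequence."""
    if len(token_ids) <= ngram_size:
        return []
    tail = token_ids[-ngram_size:]
    # Scan from the most recent position to the oldest, skipping the tail itself.
    for start in range(len(token_ids) - ngram_size - 1, -1, -1):
        if token_ids[start:start + ngram_size] == tail:
            follow = start + ngram_size
            # The tokens that followed the earlier occurrence become the draft.
            return token_ids[follow:follow + num_candidates]
    return []
```

The returned candidates are then verified by the large model in the same way as tokens from a draft model. When the output copies long spans from the prompt, as in summarization or code editing, many of them are likely to be accepted.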
232242

233243

234244

orchestration/slurm/users.md

Lines changed: 12 additions & 1 deletion
@@ -369,7 +369,7 @@ See the table at the top of this document for which partition is which.
369369
- drng: the node is running a job, but will not be available after the job completes, due to an administrative reason
370370

371371

372-
### node state codes
372+
### Node state codes
373373

374374
The node state can be followed by a single character which has a special meaning. It is one of:
375375

@@ -383,6 +383,17 @@ The node state could be followed by a single character which has a special meani
383383
- `^`: The node reboot was issued.
384384
- `-`: The node is planned by the backfill scheduler for a higher priority job.
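
When post-processing node states in a script, it can help to separate the base state from the trailing flag. The sketch below is illustrative only: the helper name `split_node_state` is hypothetical, and the table deliberately contains just the two suffixes shown above, so it would need to be extended with the remaining ones from the full list.

```python
# Partial table: only the suffixes quoted above; extend it with the rest of the list.
NODE_STATE_SUFFIXES = {
    "^": "The node reboot was issued.",
    "-": "The node is planned by the backfill scheduler for a higher priority job.",
}

def split_node_state(state):
    """Split a node state such as 'idle^' into (base_state, suffix_meaning)."""
    if state and state[-1] in NODE_STATE_SUFFIXES:
        return state[:-1], NODE_STATE_SUFFIXES[state[-1]]
    # No known suffix: return the state unchanged.
    return state, None
```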
385385

386+
### Job state codes
387+
388+
- `CD` | Completed: The job has completed successfully.
389+
- `CG` | Completing: The job is finishing but some processes are still active.
390+
- `F` | Failed: The job terminated with a non-zero exit code or another failure condition.
391+
- `PD` | Pending: The job is waiting for resource allocation. It will eventually run.
392+
- `PR` | Preempted: The job was terminated because of preemption by another job.
393+
- `R` | Running: The job is currently allocated to a node and is running.
394+
- `S` | Suspended: A running job has been stopped with its cores released to other jobs.
395+
- `ST` | Stopped: A running job has been stopped with its cores retained.
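
As a small usage sketch, the codes above can be turned into a per-state job count. The example assumes the input text has one compact state code per line, as something like `squeue -h -o "%t"` would print; the function name `summarize_job_states` is hypothetical, and the mapping covers only the codes listed here.

```python
from collections import Counter

# Only the codes listed above; any other code is reported unchanged.
JOB_STATE_CODES = {
    "CD": "Completed",
    "CG": "Completing",
    "F": "Failed",
    "PD": "Pending",
    "PR": "Preempted",
    "R": "Running",
    "S": "Suspended",
    "ST": "Stopped",
}

def summarize_job_states(squeue_output):
    """Count jobs per state from text containing one compact state code per line."""
    counts = Counter()
    for line in squeue_output.splitlines():
        code = line.strip()
        if code:
            counts[JOB_STATE_CODES.get(code, code)] += 1
    return dict(counts)
```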
396+
386397

387398
### drained nodes
388399
