updates

stas00 · stas00 · commit d776668f5753 · 2024-09-16T11:05:51.000-07:00
diff --git a/inference/README.md b/inference/README.md
@@ -42,8 +42,18 @@ When you have users that send queries in real time - this is Online Inference. E
 When you have a file with prompts that you need to run inference on - this is Offline Inference. Examples: benchmark evaluation, synthetic data generation. In this case the inference server is often not needed and the inference is run directly in the same program that sends the query (client and server in one application).
 
 
+### Tasks
 
+#### Input-grounded tasks
 
+Input-grounded tasks are those where the generated response is derived mainly from the prompt, i.e. the main source of knowledge is contained in the prompt. These include:
+
+- Translation
+- Summarization
+- Document QA
+- Multi-turn chat
+- Code editing
+- Speech recognition (audio transcription)
 
 
 ### Batching
@@ -222,13 +232,13 @@ When there is a partial mismatch we can go back to the draft model and feed it a
 
 The draft model ideally should be trained on the same data (or least data from a similar distribution) and its tokenizer has to be the same as the large model.
 
-Speculative decoding gives the highest return on input-grounded tasks, such as translation and summarization, because in those tasks the range of possible outputs is much smaller and the draft model is much more likely to match the big model.
+Speculative decoding gives the highest return on [input-grounded tasks](#input-grounded-tasks), such as translation, summarization, document QA, and multi-turn chat, because in those tasks the range of possible outputs is much smaller and the draft model is much more likely to match the big model.
 
 For the same reason it works best in when used in [greedy decoding](#greedy-decoding), as there is the least amount of possible variations during generation. If not using greedy decoding, you will want to have the value of  [temperature](#temperature) close to 0.
 
 Here is a good indepth dive into this subject: [Assisted Generation: a new direction toward low-latency text generation](https://huggingface.co/blog/assisted-generation).
 
-
+Another, much simpler solution for [input-grounded tasks](#input-grounded-tasks) is to use [ngram prompt lookup decoding](https://github.com/apoorvumang/prompt-lookup-decoding). This approach needs no draft model; instead, the prompt is searched for matching strings to generate candidates. In some situations it's said to speed decoding up by 2x or more.
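The candidate search described above can be sketched in a few lines. This is a minimal illustration over plain token id lists, not the code from the linked repository: the function name `find_candidate_tokens`, its parameters `max_ngram_size` and `num_pred_tokens`, and the choice to prefer the most recent earlier match are all assumptions made for this example.

```python
from typing import List, Sequence


def find_candidate_tokens(
    token_ids: Sequence[int],
    max_ngram_size: int = 3,    # hypothetical knob: longest trailing n-gram to try
    num_pred_tokens: int = 10,  # hypothetical knob: how many tokens to propose
) -> List[int]:
    """Propose draft tokens by copying what followed an earlier occurrence
    of the sequence's trailing n-gram. Returns [] if nothing matches."""
    seq_len = len(token_ids)
    # Try longer n-grams first: a longer match is a stronger hint that the
    # continuation will repeat as well.
    for ngram_size in range(min(max_ngram_size, seq_len - 1), 0, -1):
        ngram = list(token_ids[seq_len - ngram_size:])
        # Scan right to left so the most recent earlier match wins. The start
        # bound excludes the trailing n-gram itself.
        for start in range(seq_len - ngram_size - 1, -1, -1):
            if list(token_ids[start:start + ngram_size]) == ngram:
                follow = start + ngram_size
                return list(token_ids[follow:follow + num_pred_tokens])
    return []


# The trailing n-gram [2, 3, 4] appeared earlier, followed by 5, 6, 7.
tokens = [1, 2, 3, 4, 5, 6, 7, 2, 3, 4]
candidates = find_candidate_tokens(tokens, max_ngram_size=3, num_pred_tokens=3)
```

The proposed tokens play the same role as the draft model's output in speculative decoding: the large model checks them and keeps only the prefix it agrees with, so a bad guess costs some wasted verification but does not change the generated text.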
 
 
 
diff --git a/orchestration/slurm/users.md b/orchestration/slurm/users.md
@@ -369,7 +369,7 @@ See the table at the top of this document for which partition is which.
 - drng: the node is running a job, but will after completion not be available due to an administrative reason
 
 
-### node state codes
+### Node state codes
 
 The node state could be followed by a single character which has a special meaning. It is one of:
 
@@ -383,6 +383,17 @@ The node state could be followed by a single character which has a special meani
 - `^`: The node reboot was issued.
 - `-`: The node is planned by the backfill scheduler for a higher priority job.
 
+### Job state codes
+
+- `CD` | Completed: The job has completed successfully.
+- `CG` | Completing: The job is finishing but some processes are still active.
+- `F` | Failed: The job terminated with a non-zero exit code and failed to execute.
+- `PD` | Pending: The job is waiting for resource allocation. It will eventually run.
+- `PR` | Preempted: The job was terminated because of preemption by another job.
+- `R` | Running: The job is currently allocated to a node and is running.
+- `S` | Suspended: A running job has been stopped with its cores released to other jobs.
+- `ST` | Stopped: A running job has been stopped with its cores retained.
+
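As a small illustration of how the codes above might be used, the following sketch counts jobs per state from text that contains one compact state code per line. The helper name `summarize_job_states` and the `JOB_STATES` table are hypothetical and cover only the codes listed above; any other code is passed through unchanged.

```python
from collections import Counter
from typing import Dict

# Compact job state codes, limited to the ones listed above.
JOB_STATES: Dict[str, str] = {
    "CD": "Completed",
    "CG": "Completing",
    "F": "Failed",
    "PD": "Pending",
    "PR": "Preempted",
    "R": "Running",
    "S": "Suspended",
    "ST": "Stopped",
}


def summarize_job_states(squeue_output: str) -> Dict[str, int]:
    """Count jobs per state name, given one compact state code per line."""
    counts: Counter = Counter()
    for line in squeue_output.splitlines():
        code = line.strip()
        if not code:
            continue  # skip blank lines
        # Codes missing from the table are kept verbatim rather than dropped.
        counts[JOB_STATES.get(code, code)] += 1
    return dict(counts)


sample = "R\nR\nPD\n\nCG\n"
summary = summarize_job_states(sample)
```

Input in this shape can be produced with `squeue -h -u $USER -o "%t"`, where `-h` drops the header and `%t` prints the compact state code.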
 
 ### drained nodes