Performance was collected from `demo/demo.py` and accuracy from `tests/test_llama_accuracy.py`. You can generate these tables by running both tests with the `lt` tool (tell it to run `accuracy,demo`) and pressing `m` whilst in the results section to export to markdown. Speed is reported in tokens per second per user (t/s/u).

Note that `test_llama_accuracy.py` parses the tables below to determine expected values.

This configuration uses bfp4 MLP FF1+FF3 for all models.

Model | Device | Top-1 (%) | Top-5 (%) | Speed (t/s/u) |
---|---|---|---|---|
1b | N150 | 79 | 98 | 90.5 |
1b | N300 | 81 | 98 | 101.7 |
1b | T3K | 81 | 98 | 97.5 |
3b | N150 | 85 | 96 | 49.0 |
3b | N300 | 88 | 97 | 56.9 |
3b | T3K | 88 | 97 | 54.5 |
8b | N150 | 86 | 98 | 28.4 |
8b | N300 | 84 | 98 | 38.6 |
8b | T3K | 84 | 98 | 52.6 |
11b | N300 | 86 | 97 | 38.6 |
11b | T3K | 84 | 98 | 52.6 |
70b | T3K | 95 | 100 | 14.3 |

This configuration uses bfp4 MLP FF1+FF3 only for the 3.1-70B model.

Model | Device | Top-1 (%) | Top-5 (%) | Speed (t/s/u) |
---|---|---|---|---|
1b | N150 | 77 | 96 | 85.8 |
1b | N300 | 80 | 98 | 98.6 |
1b | T3K | 78 | 98 | 97.2 |
3b | N150 | 88 | 98 | 44.1 |
3b | N300 | 88 | 98 | 53.9 |
3b | T3K | 88 | 98 | 54.8 |
8b | N150 | 89 | 98 | 23.5 |
8b | N300 | 90 | 98 | 34.1 |
8b | T3K | 88 | 97 | 49.9 |
11b | N300 | 90 | 97 | 33.8 |
11b | T3K | 88 | 97 | 52.6 |
70b | T3K | 95 | 100 | 14.5 |
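
Because `test_llama_accuracy.py` reads its expected values from the tables above, it can help to see roughly how such tables might be turned into a lookup. The following is a minimal sketch only, not the parser the test actually uses: the function name `parse_perf_tables` and the returned structure are hypothetical, and it assumes every table keeps the five-column layout shown above.

```python
def parse_perf_tables(markdown_text):
    """Hypothetical helper: return one dict per table found in the text.

    Each dict maps (model, device) -> (top1, top5, speed).
    Assumes the five-column layout used in this file.
    """
    tables = []
    current = None
    for line in markdown_text.splitlines():
        # Tolerate rows written with or without leading/trailing pipes.
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 5:
            current = None  # prose or blank line closes any open table
            continue
        if cells[0] == "Model":
            current = {}  # header row starts a new table
            tables.append(current)
            continue
        if set(cells[0]) <= set("-: "):
            continue  # separator row such as ---|---|---|---|---|
        if current is None:
            continue  # five-cell line outside a table; ignore it
        model, device, top1, top5, speed = cells
        current[(model, device)] = (float(top1), float(top5), float(speed))
    return tables


# Two rows copied from the first table above.
sample = """\
Model | Device | Top-1 (%) | Top-5 (%) | Speed (t/s/u) |
---|---|---|---|---|
1b | N150 | 79 | 98 | 90.5 |
70b | T3K | 95 | 100 | 14.3 |
"""

expected = parse_perf_tables(sample)[0]
print(expected[("70b", "T3K")])
```

Keying on `(model, device)` keeps a lookup unambiguous when the same model appears on several devices. A parser along these lines depends on the column order, so reordering or renaming columns by hand would likely change what it reads.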