diff --git a/docs/benchmarks/image_classification/get-resnet50-data.md b/docs/benchmarks/image_classification/get-resnet50-data.md index d1f83ae9a..900379d5a 100644 --- a/docs/benchmarks/image_classification/get-resnet50-data.md +++ b/docs/benchmarks/image_classification/get-resnet50-data.md @@ -1,3 +1,8 @@ +--- +hide: + - toc +--- + # Image Classification using ResNet50 ## Dataset diff --git a/docs/benchmarks/image_classification/mobilenets.md b/docs/benchmarks/image_classification/mobilenets.md index d676300df..f276008ef 100644 --- a/docs/benchmarks/image_classification/mobilenets.md +++ b/docs/benchmarks/image_classification/mobilenets.md @@ -1,3 +1,8 @@ +--- +hide: + - toc +--- + # Image Classification using Mobilenet models Mobilenet models are not official MLPerf models and so cannot be used for a Closed division MLPerf inference submission. But since they can be run with Imagenet dataset, we are allowed to use them for Open division submission. Only CPU runs are supported now. diff --git a/docs/benchmarks/image_classification/resnet50.md b/docs/benchmarks/image_classification/resnet50.md index 26875258d..62b966e0d 100644 --- a/docs/benchmarks/image_classification/resnet50.md +++ b/docs/benchmarks/image_classification/resnet50.md @@ -1,3 +1,8 @@ +--- +hide: + - toc +--- + # Image Classification using ResNet50 === "MLCommons-Python" diff --git a/docs/benchmarks/index.md b/docs/benchmarks/index.md deleted file mode 100644 index 39e3e72ac..000000000 --- a/docs/benchmarks/index.md +++ /dev/null @@ -1,27 +0,0 @@ -# MLPerf Inference Benchmarks - -Please visit the individual benchmark links to see the run commands using the unified CM interface. - -1. [Image Classification](image_classification/resnet50.md) using ResNet50-v1.5 model and Imagenet-2012 (224x224) validation dataset. Dataset size is 50,000 and QSL size is 1024. Reference model accuracy is 76.46%. Server scenario latency constraint is 15ms. - -2. [Text to Image](text_to_image/sdxl.md) using Stable Diffusion model and subset of Coco2014 dataset. Dataset size is 5000 amd QSL size is the same. Required accuracy for closed division is (23.01085758 <= FID <= 23.95007626, 32.68631873 <= CLIP <= 31.81331801). - -3. [Object Detection](object_detection/retinanet.md) using Retinanet model and OpenImages dataset.Dataset size is 24781 and QSL size is 64. Reference model accuracy is 0.3755 mAP. Server scenario latency constraint is 100ms. - -4. [Medical Image Segmentation](medical_imaging/3d-unet.md) using 3d-unet model and KiTS2019 dataset. Dataset size is 42 and QSL size is the same. Reference model accuracy is 0.86330 mean DIXE score. Server scenario is not applicable. - -5. [Question Answering](language/bert.md) using Bert-Large model and Squad v1.1 dataset with 384 sequence length. Dataset size is 10833 and QSL size is the same. Reference model accuracy is f1 score = 90.874%. Server scenario latency constraint is 130ms. - -6. [Text Summarization](language/gpt-j.md) using GPT-J model and CNN Daily Mail v3.0.0 dataset. Dataset size is 13368 amd QSL size is the same. Reference model accuracy is (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881, gen_len=4016878). Server scenario latency sconstraint is 20s. - -7. [Question Answering](language/llama2-70b.md) using LLAMA2-70b model and OpenORCA (GPT-4 split, max_seq_len=1024) dataset. Dataset size is 24576 and QSL size is the same. Reference model accuracy is (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162, tokens_per_sample=294.45). Server scenario latency constraint is TTFT=2000ms, TPOT=200ms. 
- -8. [Question Answering, Math and Code Generation](language/mixtral-8x7b.md) using Mixtral-8x7B model and OpenORCA (5k samples of GPT-4 split, max_seq_len=2048), GSM8K (5k samples of the validation split, max_seq_len=2048), MBXP (5k samples of the validation split, max_seq_len=2048) datasets. Dataset size is 15000 and QSL size is the same. Reference model accuracy is (rouge1=45.4911, rouge2=23.2829, rougeL=30.3615, gsm8k accuracy = 73.78, mbxp accuracy = 60.12, tokens_per_sample=294.45). Server scenario latency constraint is TTFT=2000ms, TPOT=200ms. - -9. [Recommendation](recommendation/dlrm-v2.md) using DLRMv2 model and Synthetic Multihot Criteo dataset. Dataset size is 204800 and QSL size is the same. Reference model accuracy is AUC=80.31%. Server scenario latency constraint is 60 ms. - -All the nine benchmarks can participate in the datacenter category. -All the nine benchmarks except DLRMv2, LLAMA2 and Mixtral-8x7B and can participate in the edge category. - -`bert`, `llama2-70b`, `dlrm_v2` and `3d-unet` has a high accuracy (99.9%) variant, where the benchmark run must achieve a higher accuracy of at least `99.9%` of the FP32 reference model -in comparison with the `99%` default accuracy requirement. diff --git a/docs/benchmarks/language/bert.md b/docs/benchmarks/language/bert.md index 782340b3d..57b1cf224 100644 --- a/docs/benchmarks/language/bert.md +++ b/docs/benchmarks/language/bert.md @@ -1,36 +1,34 @@ +--- +hide: + - toc +--- + # Question Answering using Bert-Large === "MLCommons-Python" ## MLPerf Reference Implementation in Python - BERT-99 {{ mlperf_inference_implementation_readme (4, "bert-99", "reference") }} - BERT-99.9 {{ mlperf_inference_implementation_readme (4, "bert-99.9", "reference") }} === "Nvidia" ## Nvidia MLPerf Implementation - BERT-99 {{ mlperf_inference_implementation_readme (4, "bert-99", "nvidia") }} - BERT-99.9 {{ mlperf_inference_implementation_readme (4, "bert-99.9", "nvidia") }} === "Intel" ## Intel MLPerf Implementation - BERT-99 + {{ mlperf_inference_implementation_readme (4, "bert-99", "intel") }} - BERT-99.9 {{ mlperf_inference_implementation_readme (4, "bert-99.9", "intel") }} === "Qualcomm" ## Qualcomm AI100 MLPerf Implementation - BERT-99 {{ mlperf_inference_implementation_readme (4, "bert-99", "qualcomm") }} - BERT-99.9 {{ mlperf_inference_implementation_readme (4, "bert-99.9", "qualcomm") }} diff --git a/docs/benchmarks/language/get-bert-data.md b/docs/benchmarks/language/get-bert-data.md index f5462b181..fed637572 100644 --- a/docs/benchmarks/language/get-bert-data.md +++ b/docs/benchmarks/language/get-bert-data.md @@ -1,3 +1,8 @@ +--- +hide: + - toc +--- + # Question Answering using Bert-Large ## Dataset diff --git a/docs/benchmarks/language/get-gptj-data.md b/docs/benchmarks/language/get-gptj-data.md index 9ea31feb4..90591fb76 100644 --- a/docs/benchmarks/language/get-gptj-data.md +++ b/docs/benchmarks/language/get-gptj-data.md @@ -1,3 +1,8 @@ +--- +hide: + - toc +--- + # Text Summarization using GPT-J ## Dataset diff --git a/docs/benchmarks/language/get-llama2-70b-data.md b/docs/benchmarks/language/get-llama2-70b-data.md index 4b04f7068..0214d95a5 100644 --- a/docs/benchmarks/language/get-llama2-70b-data.md +++ b/docs/benchmarks/language/get-llama2-70b-data.md @@ -1,3 +1,8 @@ +--- +hide: + - toc +--- + # Text Summarization using LLAMA2-70b ## Dataset @@ -23,4 +28,8 @@ Get the Official MLPerf LLAMA2-70b Model ``` cm run script --tags=get,ml-model,llama2-70b,_pytorch -j ``` + +!!! 
tip + + Downloading llama2-70B model from Hugging Face will prompt you to enter the Hugging Face username and password. Please note that the password required is the [**access token**](https://huggingface.co/settings/tokens) generated for your account. Additionally, ensure that your account has access to the [llama2-70B](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) model. diff --git a/docs/benchmarks/language/get-mixtral-8x7b-data.md b/docs/benchmarks/language/get-mixtral-8x7b-data.md index 601e5c1dd..1b2df1b9e 100644 --- a/docs/benchmarks/language/get-mixtral-8x7b-data.md +++ b/docs/benchmarks/language/get-mixtral-8x7b-data.md @@ -1,3 +1,8 @@ +--- +hide: + - toc +--- + ## Dataset The benchmark implementation run command will automatically download the preprocessed validation and calibration datasets. In case you want to download only the datasets, you can use the below commands. diff --git a/docs/benchmarks/language/gpt-j.md b/docs/benchmarks/language/gpt-j.md index 2eefbbc79..4dcb3d70e 100644 --- a/docs/benchmarks/language/gpt-j.md +++ b/docs/benchmarks/language/gpt-j.md @@ -1,31 +1,31 @@ +--- +hide: + - toc +--- + # Text Summarization using GPT-J === "MLCommons-Python" ## MLPerf Reference Implementation in Python - - GPT-J-99 + {{ mlperf_inference_implementation_readme (4, "gptj-99", "reference") }} - GPTJ-99.9 {{ mlperf_inference_implementation_readme (4, "gptj-99.9", "reference") }} === "Nvidia" ## Nvidia MLPerf Implementation - GPTJ-99 {{ mlperf_inference_implementation_readme (4, "gptj-99", "nvidia") }} - GPTJ-99.9 {{ mlperf_inference_implementation_readme (4, "gptj-99.9", "nvidia") }} === "Intel" ## Intel MLPerf Implementation - GPTJ-99 {{ mlperf_inference_implementation_readme (4, "gptj-99", "intel") }} @@ -33,7 +33,6 @@ === "Qualcomm" ## Qualcomm AI100 MLPerf Implementation - GPTJ-99 {{ mlperf_inference_implementation_readme (4, "gptj-99", "qualcomm") }} diff --git a/docs/benchmarks/language/llama2-70b.md b/docs/benchmarks/language/llama2-70b.md index f1785dcb3..0d9a0504d 100644 --- a/docs/benchmarks/language/llama2-70b.md +++ b/docs/benchmarks/language/llama2-70b.md @@ -1,28 +1,28 @@ +--- +hide: + - toc +--- + # Text Summarization using LLAMA2-70b === "MLCommons-Python" ## MLPerf Reference Implementation in Python - LLAMA2-70b-99 {{ mlperf_inference_implementation_readme (4, "llama2-70b-99", "reference") }} - LLAMA2-70b-99.9 {{ mlperf_inference_implementation_readme (4, "llama2-70b-99.9", "reference") }} === "Nvidia" ## Nvidia MLPerf Implementation - LLAMA2-70b-99 {{ mlperf_inference_implementation_readme (4, "llama2-70b-99", "nvidia") }} - LLAMA2-70b-99.9 {{ mlperf_inference_implementation_readme (4, "llama2-70b-99.9", "nvidia") }} +=== "Neural Magic" + ## Neural Magic MLPerf Implementation + +{{ mlperf_inference_implementation_readme (4, "llama2-70b-99", "neuralmagic") }} -=== "Qualcomm" - ## Qualcomm AI100 MLPerf Implementation - - LLAMA2-70b-99 -{{ mlperf_inference_implementation_readme (4, "llama2-70b-99", "qualcomm") }} - +{{ mlperf_inference_implementation_readme (4, "llama2-70b-99.9", "neuralmagic") }} \ No newline at end of file diff --git a/docs/benchmarks/language/mixtral-8x7b.md b/docs/benchmarks/language/mixtral-8x7b.md index eb138dc46..9f3bf2992 100644 --- a/docs/benchmarks/language/mixtral-8x7b.md +++ b/docs/benchmarks/language/mixtral-8x7b.md @@ -1,6 +1,9 @@ +--- +hide: + - toc +--- === "MLCommons-Python" ## MLPerf Reference Implementation in Python - MIXTRAL-8x7b {{ mlperf_inference_implementation_readme (4, "mixtral-8x7b", "reference") }} \ No newline 
at end of file diff --git a/docs/benchmarks/medical_imaging/3d-unet.md b/docs/benchmarks/medical_imaging/3d-unet.md index b58ea7f2e..01a54c63e 100644 --- a/docs/benchmarks/medical_imaging/3d-unet.md +++ b/docs/benchmarks/medical_imaging/3d-unet.md @@ -1,33 +1,32 @@ +--- +hide: + - toc +--- + # Medical Imaging using 3d-unet (KiTS 2019 kidney tumor segmentation task) === "MLCommons-Python" ## MLPerf Reference Implementation in Python - 3d-unet-99 {{ mlperf_inference_implementation_readme (4, "3d-unet-99", "reference") }} - 3d-unet-99.9 {{ mlperf_inference_implementation_readme (4, "3d-unet-99.9", "reference") }} === "Nvidia" ## Nvidia MLPerf Implementation - 3d-unet-99 {{ mlperf_inference_implementation_readme (4, "3d-unet-99", "nvidia") }} - 3d-unet-99.9 {{ mlperf_inference_implementation_readme (4, "3d-unet-99.9", "nvidia") }} === "Intel" ## Intel MLPerf Implementation - 3d-unet-99 {{ mlperf_inference_implementation_readme (4, "3d-unet-99", "intel") }} - 3d-unet-99.9 {{ mlperf_inference_implementation_readme (4, "3d-unet-99.9", "intel") }} diff --git a/docs/benchmarks/medical_imaging/get-3d-unet-data.md b/docs/benchmarks/medical_imaging/get-3d-unet-data.md index efc6ce6ed..6c361f6f1 100644 --- a/docs/benchmarks/medical_imaging/get-3d-unet-data.md +++ b/docs/benchmarks/medical_imaging/get-3d-unet-data.md @@ -1,3 +1,8 @@ +--- +hide: + - toc +--- + # Medical Imaging using 3d-unet (KiTS 2019 kidney tumor segmentation task) ## Dataset @@ -7,9 +12,14 @@ The benchmark implementation run command will automatically download the validat === "Validation" 3d-unet validation run uses the KiTS19 dataset performing [KiTS 2019](https://kits19.grand-challenge.org/) kidney tumor segmentation task - ### Get Validation Dataset + ### Get Validation Dataset(Original) + ``` + cm run script --tags=get,dataset,kits19,_validation -j + ``` + + ### Get Validation Dataset(Preprocessed) ``` - cm run script --tags=get,dataset,kits19,validation -j + cm run script --tags=get,dataset,kits19,preprocessed -j ``` ## Model diff --git a/docs/benchmarks/object_detection/get-retinanet-data.md b/docs/benchmarks/object_detection/get-retinanet-data.md index f2d432210..900fd572a 100644 --- a/docs/benchmarks/object_detection/get-retinanet-data.md +++ b/docs/benchmarks/object_detection/get-retinanet-data.md @@ -1,3 +1,8 @@ +--- +hide: + - toc +--- + # Object Detection using Retinanet ## Dataset diff --git a/docs/benchmarks/object_detection/retinanet.md b/docs/benchmarks/object_detection/retinanet.md index 383a2ec1b..699d92050 100644 --- a/docs/benchmarks/object_detection/retinanet.md +++ b/docs/benchmarks/object_detection/retinanet.md @@ -1,3 +1,8 @@ +--- +hide: + - toc +--- + # Object Detection using Retinanet === "MLCommons-Python" diff --git a/docs/benchmarks/recommendation/dlrm-v2.md b/docs/benchmarks/recommendation/dlrm-v2.md index 18266615f..ce3081077 100644 --- a/docs/benchmarks/recommendation/dlrm-v2.md +++ b/docs/benchmarks/recommendation/dlrm-v2.md @@ -1,3 +1,8 @@ +--- +hide: + - toc +--- + # Recommendation using DLRM v2 @@ -5,18 +10,20 @@ === "MLCommons-Python" ## MLPerf Reference Implementation in Python - DLRM-v2-99 -{{ mlperf_inference_implementation_readme (4, "dlrm_v2-99", "reference") }} +{{ mlperf_inference_implementation_readme (4, "dlrm-v2-99", "reference") }} - DLRM-v2-99.9 -{{ mlperf_inference_implementation_readme (4, "dlrm_v2-99.9", "reference") }} +{{ mlperf_inference_implementation_readme (4, "dlrm-v2-99.9", "reference") }} === "Nvidia" ## Nvidia MLPerf Implementation - - DLRM-v2-99 -{{ 
mlperf_inference_implementation_readme (4, "dlrm_v2-99", "nvidia") }} - DLRM-v2-99.9 -{{ mlperf_inference_implementation_readme (4, "dlrm_v2-99.9", "nvidia") }} +{{ mlperf_inference_implementation_readme (4, "dlrm-v2-99", "nvidia") }} + +{{ mlperf_inference_implementation_readme (4, "dlrm-v2-99.9", "nvidia") }} + +=== "Intel" + ## Intel MLPerf Implementation + +{{ mlperf_inference_implementation_readme (4, "dlrm-v2-99", "intel") }} +{{ mlperf_inference_implementation_readme (4, "dlrm-v2-99.9", "intel") }} \ No newline at end of file diff --git a/docs/benchmarks/recommendation/get-dlrm_v2-data.md b/docs/benchmarks/recommendation/get-dlrm-v2-data.md similarity index 83% rename from docs/benchmarks/recommendation/get-dlrm_v2-data.md rename to docs/benchmarks/recommendation/get-dlrm-v2-data.md index 97464a164..1c44ec471 100644 --- a/docs/benchmarks/recommendation/get-dlrm_v2-data.md +++ b/docs/benchmarks/recommendation/get-dlrm-v2-data.md @@ -1,3 +1,8 @@ +--- +hide: + - toc +--- + # Recommendation using DLRM v2 ## Dataset @@ -9,7 +14,7 @@ The benchmark implementation run command will automatically download the validat ### Get Validation Dataset ``` - cm run script --tags=get,dataset,criteo,validation -j + cm run script --tags=get,dataset,criteo,_validation -j ``` ## Model The benchmark implementation run command will automatically download the required model and do the necessary conversions. In case you want to only download the official model, you can use the below commands. @@ -20,6 +25,6 @@ Get the Official MLPerf DLRM v2 Model ### Pytorch ``` - cm run script --tags=get,ml-model,dlrm_v2,_pytorch -j + cm run script --tags=get,ml-model,dlrm,_pytorch -j ``` diff --git a/docs/benchmarks/text_to_image/get-sdxl-data.md b/docs/benchmarks/text_to_image/get-sdxl-data.md index 830465d44..f0d1376bd 100644 --- a/docs/benchmarks/text_to_image/get-sdxl-data.md +++ b/docs/benchmarks/text_to_image/get-sdxl-data.md @@ -1,3 +1,8 @@ +--- +hide: + - toc +--- + # Text to Image using Stable Diffusion ## Dataset diff --git a/docs/benchmarks/text_to_image/sdxl.md b/docs/benchmarks/text_to_image/sdxl.md index 2d84838d4..575f4cabb 100644 --- a/docs/benchmarks/text_to_image/sdxl.md +++ b/docs/benchmarks/text_to_image/sdxl.md @@ -1,3 +1,8 @@ +--- +hide: + - toc +--- + # Text to Image using Stable Diffusion @@ -15,9 +20,3 @@ ## Intel MLPerf Implementation {{ mlperf_inference_implementation_readme (4, "sdxl", "intel") }} - -=== "Qualcomm" - ## Qualcomm AI100 MLPerf Implementation - -{{ mlperf_inference_implementation_readme (4, "sdxl", "qualcomm") }} - diff --git a/docs/changelog/index.md b/docs/changelog/index.md index f68abc5b1..6033a4826 100644 --- a/docs/changelog/index.md +++ b/docs/changelog/index.md @@ -1,2 +1,7 @@ +--- +hide: + - toc +--- + # What's New, What's Coming diff --git a/docs/demos/index.md b/docs/demos/index.md index 1c23a5f60..a9e6424ea 100644 --- a/docs/demos/index.md +++ b/docs/demos/index.md @@ -1,2 +1,7 @@ +--- +hide: + - toc +--- + # Demos diff --git a/docs/index.md b/docs/index.md deleted file mode 120000 index 32d46ee88..000000000 --- a/docs/index.md +++ /dev/null @@ -1 +0,0 @@ -../README.md \ No newline at end of file diff --git a/docs/index.md b/docs/index.md new file mode 100644 index 000000000..11f2a52c2 --- /dev/null +++ b/docs/index.md @@ -0,0 +1,168 @@ +# MLPerf Inference Benchmarks + +## Overview +The currently valid [MLPerf Inference Benchmarks](index_gh.md) as of MLPerf inference v4.0 round are listed below, categorized by tasks. 
Under each model, you can find details such as the dataset used, the reference accuracy, and the server latency constraints.
+
+---
+
+## Image Classification
+### [ResNet50-v1.5](benchmarks/image_classification/resnet50.md)
+- **Dataset**: Imagenet-2012 (224x224) Validation
+    - **Dataset Size**: 50,000
+    - **QSL Size**: 1,024
+- **Number of Parameters**: 25.6 million
+- **FLOPs**: 3.8 billion
+- **Reference Model Accuracy**: 76.46% ACC
+- **Server Scenario Latency Constraint**: 15ms
+- **Equal Issue mode**: False
+- **High accuracy variant**: No
+- **Submission Category**: Datacenter, Edge
+
+---
+
+## Text to Image
+### [Stable Diffusion](benchmarks/text_to_image/sdxl.md)
+- **Dataset**: Subset of Coco2014
+    - **Dataset Size**: 5,000
+    - **QSL Size**: 5,000
+- **Number of Parameters**: 3.5 billion
+- **FLOPs**: 1.28 - 2.4 trillion
+- **Reference Model Accuracy (fp32)**: CLIP: 31.74981837, FID: 23.48046692
+- **Required Accuracy (Closed Division)**:
+    - CLIP: 31.68631873 ≤ CLIP ≤ 31.81331801 (within 0.2% of the reference model CLIP score)
+    - FID: 23.01085758 ≤ FID ≤ 23.95007626 (within 2% of the reference model FID score)
+- **Equal Issue mode**: False
+- **High accuracy variant**: No
+- **Submission Category**: Datacenter, Edge
+
+---
+
+## Object Detection
+### [Retinanet](benchmarks/object_detection/retinanet.md)
+- **Dataset**: OpenImages
+    - **Dataset Size**: 24,781
+    - **QSL Size**: 64
+- **Number of Parameters**: TBD
+- **Reference Model Accuracy (fp32)**: 0.3755 mAP
+- **Server Scenario Latency Constraint**: 100ms
+- **Equal Issue mode**: False
+- **High accuracy variant**: No
+- **Submission Category**: Datacenter, Edge
+
+---
+
+## Medical Image Segmentation
+### [3d-unet](benchmarks/medical_imaging/3d-unet.md)
+- **Dataset**: KiTS2019
+    - **Dataset Size**: 42
+    - **QSL Size**: 42
+- **Number of Parameters**: 32.5 million
+- **FLOPs**: 100-300 billion
+- **Reference Model Accuracy (fp32)**: 0.86330 Mean DICE Score
+- **Server Scenario**: Not Applicable
+- **Equal Issue mode**: True
+- **High accuracy variant**: Yes
+- **Submission Category**: Datacenter, Edge
+
+---
+
+## Language Tasks
+
+### Question Answering
+
+#### [Bert-Large](benchmarks/language/bert.md)
+- **Dataset**: Squad v1.1 (384 Sequence Length)
+    - **Dataset Size**: 10,833
+    - **QSL Size**: 10,833
+- **Number of Parameters**: 340 million
+- **FLOPs**: ~128 billion
+- **Reference Model Accuracy (fp32)**: F1 Score = 90.874%
+- **Server Scenario Latency Constraint**: 130ms
+- **Equal Issue mode**: False
+- **High accuracy variant**: Yes
+- **Submission Category**: Datacenter, Edge
+
+#### [LLAMA2-70B](benchmarks/language/llama2-70b.md)
+- **Dataset**: OpenORCA (GPT-4 split, max_seq_len=1024)
+    - **Dataset Size**: 24,576
+    - **QSL Size**: 24,576
+- **Number of Parameters**: 70 billion
+- **FLOPs**: ~500 trillion
+- **Reference Model Accuracy (fp32)**:
+    - Rouge1: 44.4312
+    - Rouge2: 22.0352
+    - RougeL: 28.6162
+    - Tokens_per_sample: 294.45
+- **Server Scenario Latency Constraint**:
+    - TTFT: 2000ms
+    - TPOT: 200ms
+- **Equal Issue mode**: True
+- **High accuracy variant**: Yes
+- **Submission Category**: Datacenter
+
+### Text Summarization
+
+#### [GPT-J](benchmarks/language/gpt-j.md)
+- **Dataset**: CNN Daily Mail v3.0.0
+    - **Dataset Size**: 13,368
+    - **QSL Size**: 13,368
+- **Number of Parameters**: 6 billion
+- **FLOPs**: ~148 billion
+- **Reference Model Accuracy (fp32)**:
+    - Rouge1: 42.9865
+    - Rouge2: 20.1235
+    - RougeL: 29.9881
+    - Gen_len: 4,016,878
+- **Server Scenario Latency Constraint**: 20s
+- **Equal
Issue mode**: True +- **High accuracy variant**: Yes +- **Submission Category**: Datacenter, Edge + +### Mixed Tasks (Question Answering, Math, and Code Generation) + +#### [Mixtral-8x7B](benchmarks/language/mixtral-8x7b.md) +- **Datasets**: + - OpenORCA (5k samples of GPT-4 split, max_seq_len=2048) + - GSM8K (5k samples of the validation split, max_seq_len=2048) + - MBXP (5k samples of the validation split, max_seq_len=2048) + - **Dataset Size**: 15,000 + - **QSL Size**: 15,000 +- **Number of Parameters**: 47 billion +- **Reference Model Accuracy (fp16) **: + - OpenORCA + - Rouge1: 45.4911 + - Rouge2: 23.2829 + - RougeL: 30.3615 + - GSM8K Accuracy: 73.78% + - MBXP Accuracy: 60.12% + - Tokens_per_sample: 294.45 +- **Server Scenario Latency Constraint**: + - TTFT: 2000ms + - TPOT: 200ms +- **Equal Issue mode**: True +- **High accuracy variant**: Yes +- **Submission Category**: Datacenter + +--- + +## Recommendation +### [DLRM_v2](benchmarks/recommendation/dlrm-v2.md) +- **Dataset**: Synthetic Multihot Criteo + - **Dataset Size**: 204,800 + - **QSL Size**: 204,800 +- **Number of Parameters**: ~23 billion +- **Reference Model Accuracy**: AUC = 80.31% +- **Server Scenario Latency Constraint**: 60ms +- **Equal Issue mode**: False +- **High accuracy variant**: Yes +- **Submission Category**: Datacenter + +--- + +## Submission Categories +- **Datacenter Category**: All the current inference benchmarks are applicable to the datacenter category. +- **Edge Category**: All benchmarks except DLRMv2, LLAMA2-70B, and Mixtral-8x7B are applicable to the edge category. + +## High Accuracy Variants +- **Benchmarks**: `bert`, `llama2-70b`, `gpt-j`, `dlrm_v2`, and `3d-unet` have a normal accuracy variant as well as a high accuracy variant. +- **Requirement**: Must achieve at least 99.9% of the reference model accuracy, compared to the default 99% accuracy requirement. diff --git a/docs/index_gh.md b/docs/index_gh.md new file mode 120000 index 000000000..32d46ee88 --- /dev/null +++ b/docs/index_gh.md @@ -0,0 +1 @@ +../README.md \ No newline at end of file diff --git a/docs/install/index.md b/docs/install/index.md index 9575e86c8..60377adee 100644 --- a/docs/install/index.md +++ b/docs/install/index.md @@ -1,3 +1,8 @@ +--- +hide: + - toc +--- + # Installation We use MLCommons CM Automation framework to run MLPerf inference benchmarks. @@ -15,5 +20,12 @@ CM needs `git`, `python3-pip` and `python3-venv` installed on your system. If an pip install cm4mlops ``` +## To work on custom GitHub repo and branch + +```bash + pip install cmind && cm init --quiet --repo=mlcommons@cm4mlops --branch=mlperf-inference +``` + +Here, repo is in the format `githubUsername@githubRepo`. -Now, you are ready to use the `cm` commands to run MLPerf inference as given in the [benchmarks](../benchmarks/index.md) page +Now, you are ready to use the `cm` commands to run MLPerf inference as given in the [benchmarks](../index.md) page diff --git a/docs/submission/index.md b/docs/submission/index.md index 94287f5d2..a75bc3259 100644 --- a/docs/submission/index.md +++ b/docs/submission/index.md @@ -1,67 +1,122 @@ -If you follow the `cm run` commands under the individual model pages in the [benchmarks](../benchmarks/index.md) directory, all the valid results will get aggregated to the `cm cache` folder. 
Once all the results across all the modelsare ready you can use the following command to generate a valid submission tree compliant with the [MLPerf requirements](https://github.com/mlcommons/policies/blob/master/submission_rules.adoc#inference-1). +--- +hide: + - toc +--- + +=== "CM based benchmark" + If you have followed the `cm run` commands under the individual model pages in the [benchmarks](../index.md) directory, all the valid results will get aggregated to the `cm cache` folder. The following command could be used to browse the structure of inference results folder generated by CM. + ### Get results folder structure + ```bash + cm find cache --tags=get,mlperf,inference,results,dir | xargs tree + ``` +=== "Non CM based benchmark" + If you have not followed the `cm run` commands under the individual model pages in the [benchmarks](../index.md) directory, please make sure that the result directory is structured in the following way. + ``` + └── System description ID(SUT Name) + ├── system_meta.json + └── Benchmark + └── Scenario + ├── Performance + | └── run_x/#1 run for all scenarios + | ├── mlperf_log_summary.txt + | └── mlperf_log_detail.txt + ├── Accuracy + | ├── mlperf_log_summary.txt + | ├── mlperf_log_detail.txt + | ├── mlperf_log_accuracy.json + | └── accuracy.txt + └── Compliance_Test_ID + ├── Performance + | └── run_x/#1 run for all scenarios + | ├── mlperf_log_summary.txt + | └── mlperf_log_detail.txt + ├── Accuracy + | ├── baseline_accuracy.txt + | ├── compliance_accuracy.txt + | ├── mlperf_log_accuracy.json + | └── accuracy.txt + ├── verify_performance.txt + └── verify_accuracy.txt #for TEST01 only + ``` + +
+    <details>
+    <summary> Click here if you are submitting in open division </summary>
+
+    * A `model_mapping.json` file should be included inside the SUT folder; it is used to map each custom model's full name to the official model name. The format of the JSON file is:
+
+    ```
+    {
+      "custom_model_name_for_model1":"official_model_name_for_model1",
+      "custom_model_name_for_model2":"official_model_name_for_model2"
+    }
+    ```
+    </details>
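+
+    If you are assembling the results folder manually, the short sketch below can help catch missing per-scenario files before running the submission generation command. It is not part of the official CM tooling; the folder and file names are simply taken from the layout shown above, and the script name is illustrative.
+
+    ```python
+    # check_results_tree.py -- hypothetical helper, not part of the MLPerf tooling.
+    # Walks a manually prepared <SUT>/<Benchmark>/<Scenario> tree and lists missing files.
+    from pathlib import Path
+    import sys
+
+    PERFORMANCE_FILES = ["mlperf_log_summary.txt", "mlperf_log_detail.txt"]
+    ACCURACY_FILES = PERFORMANCE_FILES + ["mlperf_log_accuracy.json", "accuracy.txt"]
+
+    def missing_in_scenario(scenario):
+        """Return the required paths that are absent under one scenario folder."""
+        missing = []
+        runs = sorted((scenario / "Performance").glob("run_*")) or [scenario / "Performance" / "run_1"]
+        for run in runs:
+            missing += [str(run / f) for f in PERFORMANCE_FILES if not (run / f).is_file()]
+        accuracy = scenario / "Accuracy"
+        missing += [str(accuracy / f) for f in ACCURACY_FILES if not (accuracy / f).is_file()]
+        return missing  # compliance-test folders are not checked in this sketch
+
+    if __name__ == "__main__":
+        sut = Path(sys.argv[1])  # path to the "System description ID (SUT Name)" folder
+        problems = [] if (sut / "system_meta.json").is_file() else [str(sut / "system_meta.json")]
+        for scenario in (p for p in sut.glob("*/*") if p.is_dir()):  # <Benchmark>/<Scenario>
+            problems += missing_in_scenario(scenario)
+        print("\n".join(problems) if problems else "Result folder structure looks complete.")
+    ```
+
+    Run it as `python3 check_results_tree.py <path to SUT folder>`, and adapt the required file lists if your structure differs.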
+ +Once all the results across all the models are ready you can use the following command to generate a valid submission tree compliant with the [MLPerf requirements](https://github.com/mlcommons/policies/blob/master/submission_rules.adoc#inference-1). ## Generate actual submission tree === "Closed Edge" ### Closed Edge Submission ```bash - cm run script -tags=generate,inference,submission \ - --clean \ - --preprocess_submission=yes \ - --run-checker \ - --submitter=MLCommons \ - --tar=yes \ - --env.CM_TAR_OUTFILE=submission.tar.gz \ - --division=closed \ - --category=edge \ - --env.CM_DETERMINE_MEMORY_CONFIGURATION=yes \ - --quiet + cm run script --tags=generate,inference,submission \ + --clean \ + --preprocess_submission=yes \ + --run-checker \ + --submitter=MLCommons \ + --tar=yes \ + --env.CM_TAR_OUTFILE=submission.tar.gz \ + --division=closed \ + --category=edge \ + --env.CM_DETERMINE_MEMORY_CONFIGURATION=yes \ + --quiet ``` === "Closed Datacenter" ### Closed Datacenter Submission ```bash - cm run script -tags=generate,inference,submission \ - --clean \ - --preprocess_submission=yes \ - --run-checker \ - --submitter=MLCommons \ - --tar=yes \ - --env.CM_TAR_OUTFILE=submission.tar.gz \ - --division=closed \ - --category=datacenter \ - --env.CM_DETERMINE_MEMORY_CONFIGURATION=yes \ - --quiet + cm run script --tags=generate,inference,submission \ + --clean \ + --preprocess_submission=yes \ + --run-checker \ + --submitter=MLCommons \ + --tar=yes \ + --env.CM_TAR_OUTFILE=submission.tar.gz \ + --division=closed \ + --category=datacenter \ + --env.CM_DETERMINE_MEMORY_CONFIGURATION=yes \ + --quiet ``` === "Open Edge" ### Open Edge Submission ```bash - cm run script -tags=generate,inference,submission \ - --clean \ - --preprocess_submission=yes \ - --run-checker \ - --submitter=MLCommons \ - --tar=yes \ - --env.CM_TAR_OUTFILE=submission.tar.gz \ - --division=open \ - --category=edge \ - --env.CM_DETERMINE_MEMORY_CONFIGURATION=yes \ - --quiet + cm run script --tags=generate,inference,submission \ + --clean \ + --preprocess_submission=yes \ + --run-checker \ + --submitter=MLCommons \ + --tar=yes \ + --env.CM_TAR_OUTFILE=submission.tar.gz \ + --division=open \ + --category=edge \ + --env.CM_DETERMINE_MEMORY_CONFIGURATION=yes \ + --quiet ``` === "Open Datacenter" ### Closed Datacenter Submission ```bash - cm run script -tags=generate,inference,submission \ - --clean \ - --preprocess_submission=yes \ - --run-checker \ - --submitter=MLCommons \ - --tar=yes \ - --env.CM_TAR_OUTFILE=submission.tar.gz \ - --division=open \ - --category=datacenter \ - --env.CM_DETERMINE_MEMORY_CONFIGURATION=yes \ - --quiet + cm run script --tags=generate,inference,submission \ + --clean \ + --preprocess_submission=yes \ + --run-checker \ + --submitter=MLCommons \ + --tar=yes \ + --env.CM_TAR_OUTFILE=submission.tar.gz \ + --division=open \ + --category=datacenter \ + --env.CM_DETERMINE_MEMORY_CONFIGURATION=yes \ + --quiet ``` * Use `--hw_name="My system name"` to give a meaningful system name. 
Examples can be seen [here](https://github.com/mlcommons/inference_results_v3.0/tree/main/open/cTuning/systems) diff --git a/docs/usage/index.md b/docs/usage/index.md index 2e92a6e00..01d54c3aa 100644 --- a/docs/usage/index.md +++ b/docs/usage/index.md @@ -1 +1,6 @@ +--- +hide: + - toc +--- + # Using CM for MLPerf Inference diff --git a/language/llama2-70b/SUT_API.py b/language/llama2-70b/SUT_API.py new file mode 100644 index 000000000..da28d98cd --- /dev/null +++ b/language/llama2-70b/SUT_API.py @@ -0,0 +1,465 @@ +import os +import time +import numpy as np +import array +import torch +from torch.nn.functional import pad +from torch.utils.data import DataLoader +from transformers import AutoModelForCausalLM, AutoTokenizer, LlamaForCausalLM +from transformers.generation.streamers import BaseStreamer + +import json +import pickle +import time +import threading +import tqdm +import queue + +import logging +from typing import TYPE_CHECKING, Optional, List +from pathlib import Path + +import more_itertools as mit +from concurrent.futures.thread import ThreadPoolExecutor + +import requests +from urllib3.exceptions import InsecureRequestWarning + +import mlperf_loadgen as lg +from dataset import Dataset + +logging.basicConfig(level=logging.INFO) +log = logging.getLogger("Llama-70B-SUT") + +gen_kwargs = { + "early_stopping": True, + "max_new_tokens": 1024, + "min_new_tokens": 1, + "num_beams": 1, + "do_sample": False +} + + + +class FirstTokenStreamer(BaseStreamer): + """ Streams first tokens to a 'holder' """ + + def __init__(self, first_token, tokens_cache=[], is_first_token=True, response_ids=[] ): + """ Response ids added to 'sign' the first token""" + + self.first_token = first_token # Queue for first token + self.is_first_token = is_first_token + + # Cache for subsequent generated tokens + self.tokens_cache = tokens_cache + + self.response_ids = response_ids + + self.is_prompt = True # The first tokens sent to the streamer are actually the input prompts + + def put(self, value): + """ Caches the tokens as they're generated. 
Assumes bs=1 """ + + # Prompts are streamed first so we need to skip the first time value that arrives + if self.is_prompt: + self.is_prompt = False + return + + value = value.item() + if self.is_first_token: + + # Add generated first token together with its query response_id to first tokens queue + self.first_token.put((value, self.response_ids[0])) + + self.is_first_token = False + return + + self.tokens_cache.append(value) + + + def end(self): + pass + + def get_out_tokens(self): + return self.tokens_cache + + +class SUT(): + def __init__(self, + model_path=None, + api_server=None, + api_model_name=None, + dtype="bfloat16", + device="cpu", + batch_size=None, + total_sample_count=24576, + dataset_path=None, + use_cached_outputs=False, # Set this to True *only for test accuracy runs* in case your prior session was killed partway through + workers=1): + + self.model_path = model_path or "meta-llama/Llama-2-70b-chat-hf" + self.device = device + self.api_servers = [] + if api_server: + self.api_servers.append(api_server) + self.api_model_name = api_model_name + self.device = device + + batch_size = total_sample_count + self.batch_size = batch_size + + # dtype + if dtype == 'bfloat16': + self.amp_enabled = True + self.amp_dtype = torch.bfloat16 + elif dtype == 'float16': + self.amp_enabled = True + self.amp_dtype = torch.float16 + else: + self.amp_enabled = False + self.amp_dtype = torch.float32 + + if 'cuda' in self.device: + assert torch.cuda.is_available(), "torch gpu is not available, exiting..." + + self.dataset_path = dataset_path + self.data_object = Dataset(self.model_path, + dataset_path=self.dataset_path, + total_sample_count=total_sample_count, + device=self.device) + self.qsl = lg.ConstructQSL(self.data_object.total_sample_count, self.data_object.perf_count, + self.data_object.LoadSamplesToRam, self.data_object.UnloadSamplesFromRam) + + #self.load_model() + self.tokenizer = AutoTokenizer.from_pretrained( + self.model_path, + model_max_length=1024, + padding_side="left", + use_fast=True,) #changed from false + + self.tokenizer.pad_token = self.tokenizer.eos_token + + self.num_workers = workers + self.worker_threads = [None] * self.num_workers + self.query_queue = queue.Queue() + + self.use_cached_outputs = use_cached_outputs + self.sample_counter = 0 + self.sample_counter_lock = threading.Lock() + + + def start(self): + # Create worker threads + for j in range(self.num_workers): + worker = threading.Thread(target=self.process_queries) + worker.start() + self.worker_threads[j] = worker + + def stop(self): + for _ in range(self.num_workers): + self.query_queue.put(None) + + for worker in self.worker_threads: + worker.join() + + def query_api_vllm(self, inputs, idx): + headers = { + 'Content-Type': 'application/json', + } + json_data = { + "model": self.api_model_name, + "prompt": inputs, + "min_tokens": 1, + "max_tokens": 1024 + } + + response_code = 0 + print(f"Server path {self.api_servers[idx]}/v1/completions") + while response_code != 200: + try: + response = requests.post(f"{self.api_servers[idx]}/v1/completions", headers=headers, json=json_data, verify=False) + response_code = response.status_code + except Exception as e: + print(e) + print("connection failure") + break + return [resp["text"] for resp in json.loads(response.text)["choices"]] + + def api_action_handler(self, chunk, server_idx): + output = self.query_api_vllm(chunk, server_idx) + return output + + def process_queries(self): + """Processor of the queued queries. 
User may choose to add batching logic """ + + while True: + qitem = self.query_queue.get() + if qitem is None: + break + + query_ids = [q.index for q in qitem] + + fname = "q" + "_".join([str(i) for i in query_ids]) + fname = f"run_outputs/{fname}.pkl" + _p = Path(fname) + if self.use_cached_outputs and _p.exists(): + # Read cache + with _p.open(mode="rb") as f: + d = pickle.load(f) + processed_output = d["outputs"] + tik1 = None + tik2 = None + tik3 = None + tok = None + else: + # Construct / collate batch + max_seq_len = 1024 + + tik1 = time.time() + + # OpenAI-API servers don't require padding and can take input tokens + # directly, so we build our input_ids_tensor as a jagged list + input_ids_tensor = [] + for q in qitem: + #input_ids_tensor.append(self.data_object.input_ids[q.index].tolist()) + input_ids_tensor += self.data_object.input_ids[q.index].tolist() + + # NOTE(mgoin): I don't think this has to be a torch tensor + #input_ids_tensor = torch.cat(input_ids_tensor) + + #print(input_ids_tensor) + + assert len(input_ids_tensor) <= self.batch_size + + tik2 = time.time() + + # NOTE(mgoin): I don't think threading is necessary since we are submitting all queries in one request + # The API server should take care of mini-batches and scheduling + if self.api_servers: + ''' + decoded = self.tokenizer.batch_decode(input_ids_tensor) + cleaned = [entry.replace('','').replace('','') for entry in decoded] + cleaned_chunks = [list(c) for c in mit.divide(len(self.api_servers), cleaned)] + ''' + cleaned_chunks = [input_ids_tensor] + with ThreadPoolExecutor(max_workers=len(self.api_servers)) as executor: + #needs to be tested + output_chunks = list(executor.map(self.api_action_handler,cleaned_chunks,range(len(self.api_servers)))) + output = [] + for row in output_chunks: + output += row + else: + print("Error: Specify at least one API to which the request is to be sent!") + exit(1) + + tik3 = time.time() + + processed_output = self.tokenizer(output)['input_ids'] + #for i in range(len(qitem)): + for i in range(len(processed_output)): + # NOTE(mgoin): Not optimal to make numpy arrays just to serialize + unpadded = np.array(processed_output[i]) + n_tokens = unpadded.shape[0] + response_array = array.array("B", unpadded.tobytes()) + bi = response_array.buffer_info() + response = [lg.QuerySampleResponse(qitem[i].id, bi[0], bi[1], n_tokens)] + lg.QuerySamplesComplete(response) + + tok = time.time() + + with self.sample_counter_lock: + self.sample_counter += len(qitem) + print(f"Samples run: {self.sample_counter}") + if tik1: + print(f"\tBatchMaker time: {tik2 - tik1}") + print(f"\tInference time: {tik3 - tik2}") + print(f"\tPostprocess time: {tok - tik3}") + print(f"\t==== Total time: {tok - tik1}") + else: + print(f"\tLoaded from cache: {_p}") + + def get_sut(self): + self.sut = lg.ConstructSUT(self.issue_queries, self.flush_queries) + return self.sut + + def get_qsl(self): + return self.qsl + + + def predict(self,**kwargs): + raise NotImplementedError + + + def issue_queries(self, query_samples): + """ Receives samples from loadgen and adds them to queue. 
Users may choose to batch here""" + + list_prompts_tokens = [] + list_prompts_attn_masks = [] + + print(f"IssueQuery started with {len(query_samples)} samples") + while len(query_samples) > 0: + self.query_queue.put(query_samples[:self.batch_size]) + query_samples = query_samples[self.batch_size:] + print(f"IssueQuery done") + + + def flush_queries(self): + pass + + def __del__(self): + pass + + +class SUTServer(SUT): + def __init__(self, model_path=None, api_server=None, api_model_name=None, dtype="bfloat16", device="cpu", total_sample_count=24576, dataset_path=None, batch_size=None, workers=1): + + super().__init__(model_path=model_path, api_server=None, api_model_name=None, dtype=dtype, device=device, total_sample_count=total_sample_count, dataset_path=dataset_path, workers=workers) + + with open(f"{self.model_path}/tokenizer.json", 'r') as token_file: + llama_tokenizer = json.load(token_file) + self.llama_vocab = llama_tokenizer["model"]["vocab"] + + self.first_token_queue = queue.Queue() + + def start(self): + + # Create worker threads + for j in range(self.num_workers): + worker = threading.Thread(target=self.process_queries) + worker.start() + self.worker_threads[j] = worker + + # Create first token response thread + self.ft_response_thread = threading.Thread(target=self.process_first_tokens) + self.ft_response_thread.start() + + + def process_first_tokens(self): + + while True: + first_token_item = self.first_token_queue.get() + + if first_token_item is None: + log.info("Exiting First token response thread") + break + + first_tokens, response_id = first_token_item + + response_data = array.array("B", np.array(first_tokens, np.float32).tobytes()) + bi = response_data.buffer_info() + response = [lg.QuerySampleResponse(response_id, bi[0], bi[1])] + lg.FirstTokenComplete(response) + + def stream_api_vllm(self, input, response_ids, idx): + headers = { + 'Content-Type': 'application/json', + } + + json_data = { + 'model': '/opt/app-root/share/models', + 'prompt': input, + 'max_tokens': 1024, + 'temperature': 0, + 'stream': True, + 'logprobs': 1 + } + + while True: + try: + token_cache = [] + s = requests.Session() + first = True + with s.post( + f'{self.api_servers[idx]}/v1/completions', + headers=headers, + json=json_data, + verify=False, + stream=True + ) as resp: + for line in resp.iter_lines(): + if line: + decoded = line.decode() + if decoded.startswith("data") and "[DONE]" not in decoded: + inter = json.loads(decoded[6:])["choices"][0]["logprobs"] + if "top_logprobs" in inter: + token_s = list(inter["top_logprobs"][0].keys())[0] + token = self.llama_vocab[token_s] + if first: + self.first_token_queue.put((token, response_ids[0])) + first = False + token_cache.append(token) + s.close() + if token_cache: + return token_cache + except Exception as e: + s.close() + print("Connection failure") + print(f"An exception occurred: {type(e).__name__}") + print(f"Exception details: {e}") + + def async_process_query(self, input_ids_tensor, qitem_id, idx): + decoded = self.tokenizer.decode(input_ids_tensor[0]) + response_ids = [qitem_id] + output_tokens = self.stream_api_vllm(decoded, response_ids, idx) + n_tokens = len(output_tokens) + if n_tokens <= 1: + print("WARNING: caught low token count") + print(input_ids_tensor) + print(output_tokens) + response_array = array.array("B", np.array(output_tokens, np.int32).tobytes()) + bi = response_array.buffer_info() + response = [lg.QuerySampleResponse( + qitem_id, bi[0], bi[1], n_tokens)] + lg.QuerySamplesComplete(response) + sys.exit() + + def 
process_queries(self): + """Processor of the queued queries. User may choose to add batching logic """ + server_idx = 0 + while True: + + qitem = self.query_queue.get() + if qitem is None: + break + + input_ids_tensor = self.data_object.input_ids[qitem.index] + input_masks_tensor = self.data_object.attention_masks[qitem.index] + + if self.api_servers: + threading.Thread(target=self.async_process_query, args=(input_ids_tensor, qitem.id, server_idx)).start() + server_idx = (server_idx + 1) % len(self.api_servers) + else: + #TODO: This PoC is super slow with significant overhead. Best to create a patch to `generate` + tokens_cache = [] + tokens_streamer = FirstTokenStreamer(self.first_token_queue, tokens_cache=tokens_cache, is_first_token=True, response_ids=[qitem.id]) + + _ = self.model.generate( input_ids=input_ids_tensor, + attention_mask=input_masks_tensor, + pad_token_id=self.tokenizer.pad_token_id, + streamer = tokens_streamer, + **gen_kwargs + ) + + output_tokens = tokens_streamer.get_out_tokens() + + n_tokens = len(output_tokens) + response_array = array.array("B", np.array(output_tokens, np.int32).tobytes()) + bi = response_array.buffer_info() + response = [lg.QuerySampleResponse( + qitem.id, bi[0], bi[1], n_tokens)] + lg.QuerySamplesComplete(response) + + def issue_queries(self, query_samples): + + self.query_queue.put(query_samples[0]) + + + def stop(self): + for _ in range(self.num_workers): + self.query_queue.put(None) + + for worker in self.worker_threads: + worker.join() + + self.first_token_queue.put(None) + self.ft_response_thread.join() diff --git a/language/llama2-70b/main.py b/language/llama2-70b/main.py index 15dbdd2ed..c1af88b95 100644 --- a/language/llama2-70b/main.py +++ b/language/llama2-70b/main.py @@ -4,13 +4,27 @@ import os import logging import sys -from SUT import SUT, SUTServer +import requests +import json sys.path.insert(0, os.getcwd()) logging.basicConfig(level=logging.INFO) log = logging.getLogger("Llama-70B-MAIN") +# function to check the model name in server matches the user specified one +def verify_model_name(user_specified_name, url): + response = requests.get(url) + if response.status_code == 200: + response_dict = response.json() + server_model_name = response_dict["data"][0]["id"] + if user_specified_name == server_model_name: + return {"matched":True, "error":False} + else: + return {"matched":False, "error":f"User specified {user_specified_name} and server model name {server_model_name} mismatch!"} + else: + return {"matched":False, "error":f"Failed to get a valid response. Status code: {response.status_code}"} + def get_args(): parser = argparse.ArgumentParser() parser.add_argument("--scenario", type=str, choices=["Offline", "Server"], default="Offline", help="Scenario") @@ -27,6 +41,9 @@ def get_args(): parser.add_argument("--output-log-dir", type=str, default="output-logs", help="Where logs are saved") parser.add_argument("--enable-log-trace", action="store_true", help="Enable log tracing. 
This file can become quite large") parser.add_argument("--num-workers", type=int, default=1, help="Number of workers to process queries") + parser.add_argument("--vllm", action="store_true", help="vllm mode") + parser.add_argument("--api-model-name", type=str, default="meta-llama/Llama-2-70b-chat-hf", help="Model name(specified in llm server)") + parser.add_argument("--api-server", type=str, default=None, help="Specify an api endpoint call to use api mode") args = parser.parse_args() return args @@ -37,13 +54,15 @@ def get_args(): "server": lg.TestScenario.Server, } -sut_map = { - "offline": SUT, - "server": SUTServer - } - def main(): args = get_args() + + if args.vllm: + resp = verify_model_name(args.api_model_name, args.api_server+"/v1/models") + if resp["error"]: + print(f"\n\n\033[91mError:\033[0m", end=" ") + print(resp["error"]) + sys.exit(1) settings = lg.TestSettings() settings.scenario = scenario_map[args.scenario.lower()] @@ -64,16 +83,40 @@ def main(): log_settings.log_output = log_output_settings log_settings.enable_trace = args.enable_log_trace + if args.vllm: + from SUT_API import SUT, SUTServer + else: + from SUT import SUT, SUTServer + + sut_map = { + "offline": SUT, + "server": SUTServer + } + sut_cls = sut_map[args.scenario.lower()] - sut = sut_cls( - model_path=args.model_path, - dtype=args.dtype, - batch_size=args.batch_size, - dataset_path=args.dataset_path, - total_sample_count=args.total_sample_count, - device=args.device, - ) + if args.vllm: + sut = sut_cls( + model_path=args.model_path, + dtype=args.dtype, + batch_size=args.batch_size, + dataset_path=args.dataset_path, + total_sample_count=args.total_sample_count, + device=args.device, + api_server=args.api_server, + api_model_name=args.api_model_name, + workers=args.num_workers + ) + else: + sut = sut_cls( + model_path=args.model_path, + dtype=args.dtype, + batch_size=args.batch_size, + dataset_path=args.dataset_path, + total_sample_count=args.total_sample_count, + device=args.device, + workers=args.num_workers + ) # Start sut before loadgen starts sut.start() diff --git a/main.py b/main.py old mode 100644 new mode 100755 index 2f1c036fe..c9e3e1b56 --- a/main.py +++ b/main.py @@ -12,30 +12,44 @@ def mlperf_inference_implementation_readme(spaces, model, implementation): content="" scenarios = [] execution_envs = ["Docker","Native"] - code_version="r4.1" + code_version="r4.1-dev" if model == "rnnt": code_version="r4.0" if implementation == "reference": + # Tip + if "99.9" not in model: + content += f"\n{pre_space}!!! tip\n\n" + content += f"{pre_space} - MLCommons reference implementations are only meant to provide a rules compliant reference implementation for the submitters and in most cases are not best performing. 
If you want to benchmark any system, it is advisable to use the vendor MLPerf implementation for that system like Nvidia, Intel etc.\n\n" + devices = [ "CPU", "CUDA", "ROCm" ] if model.lower() == "resnet50": frameworks = [ "Onnxruntime", "Tensorflow", "Deepsparse" ] elif model.lower() == "retinanet": frameworks = [ "Onnxruntime", "Pytorch" ] elif "bert" in model.lower(): - frameworks = [ "Onnxruntime", "Pytorch", "Tensorflow" ] + frameworks = [ "Pytorch", "Deepsparse" ] else: frameworks = [ "Pytorch" ] elif implementation == "nvidia": - if model in [ "sdxl", "llama2-70b-99", "llama2-70b-99.9", "mixtral-8x7b" ]: + if model in [ "mixtral-8x7b" ]: return pre_space+" WIP" devices = [ "CUDA" ] frameworks = [ "TensorRT" ] + + elif implementation == "neuralmagic": + devices = [ "CUDA" ] + frameworks = [ "pytorch" ] elif implementation == "intel": - if model not in [ "bert-99", "bert-99.9", "gptj-99", "gptj-99.9", "resnet50", "retinanet", "3d-unet-99", "3d-unet-99.9" ]: + # Tip + if "99.9" not in model: + content += f"\n{pre_space}!!! tip\n\n" + content += f"{pre_space} - Intel MLPerf inference implementation is available only for datacenter category and has been tested only on a limited number of systems. Most of the benchmarks using Intel implementation require at least Intel Sapphire Rapids or higher CPU generation.\n\n" + + if model not in [ "bert-99", "bert-99.9", "gptj-99", "gptj-99.9", "resnet50", "retinanet", "3d-unet-99", "3d-unet-99.9", "dlrm-v2-99", "dlrm-v2-99.9", "sdxl" ]: return pre_space+" WIP" if model in [ "bert-99", "bert-99.9", "retinanet", "3d-unet-99", "3d-unet-99.9" ]: code_version="r4.0" @@ -68,6 +82,8 @@ def mlperf_inference_implementation_readme(spaces, model, implementation): else: categories = [ "Edge", "Datacenter" ] + # model name + content += f"{pre_space}{model.upper()}\n\n" for category in categories: if category == "Edge" and not scenarios: scenarios = [ "Offline", "SingleStream" ] @@ -100,6 +116,9 @@ def mlperf_inference_implementation_readme(spaces, model, implementation): content += f"{cur_space1}=== \"{device}\"\n" content += f"{cur_space2}##### {device} device\n\n" + # minimum system requirements + content += get_min_system_requirements(cur_space2, model, implementation, device) + # to select the execution environments(currently Docker and Native) for execution_env in execution_envs: if (device == "ROCm" or implementation == "qualcomm") and execution_env == "Docker": @@ -108,9 +127,18 @@ def mlperf_inference_implementation_readme(spaces, model, implementation): continue # Nvidia implementation only supports execution through docker content += f"{cur_space2}=== \"{execution_env}\"\n" content += f"{cur_space3}###### {execution_env} Environment\n\n" + # ref to cm installation + content += f"{cur_space3}Please refer to the [installation page](../../install/index.md) to install CM for running the automated benchmark commands.\n\n" test_query_count=get_test_query_count(model, implementation, device) if "99.9" not in model: #not showing docker command as it is already done for the 99% variant + if implementation == "neuralmagic": + content += f"{cur_space3}####### Run the Inference Server\n" + content += get_inference_server_run_cmd(spaces+16,implementation) + # tips regarding the running of nural magic server + content += f"\n{cur_space3}!!! tip\n\n" + content += f"{cur_space3} - Host and Port number of the server can be configured through `--host` and `--port`. 
Otherwise, server will run on default host `localhost` and port `8000`.\n\n" + if execution_env == "Native": # Native implementation steps through virtual environment content += f"{cur_space3}####### Setup a virtual environment for Python\n" content += get_venv_command(spaces+16) @@ -126,7 +154,7 @@ def mlperf_inference_implementation_readme(spaces, model, implementation): content += f"{cur_space3}The above command should get you to an interactive shell inside the docker container and do a quick test run for the Offline scenario. Once inside the docker container please do the below commands to do the accuracy + performance runs for each scenario.\n\n" content += f"{cur_space3}
\n" content += f"{cur_space3} Please click here to see more options for the docker launch \n\n" - content += f"{cur_space3}* `--docker_cm_repo `: to use a custom fork of cm4mlops repository inside the docker image\n\n" + content += f"{cur_space3}* `--docker_cm_repo=`: to use a custom fork of cm4mlops repository inside the docker image\n\n" content += f"{cur_space3}* `--docker_cache=no`: to not use docker cache during the image build\n" if device.lower() not in [ "cuda" ]: @@ -146,7 +174,28 @@ def mlperf_inference_implementation_readme(spaces, model, implementation): run_suffix += f"{cur_space3} Please click here to see more options for the RUN command\n\n" run_suffix += f"{cur_space3}* Use `--division=closed` to do a closed division submission which includes compliance runs\n\n" run_suffix += f"{cur_space3}* Use `--rerun` to do a rerun even when a valid run exists\n" - run_suffix += f"{cur_space3}
\n" + run_suffix += f"{cur_space3}\n\n" + + if "bert" in model.lower() and framework == "deepsparse": + run_suffix += f"{cur_space3}
\n" + run_suffix += f"{cur_space3} Please click here for generic model stubs for bert deepsparse\n\n" + run_suffix += f"{cur_space3}* zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/pruned95_quant-none-vnni\n\n" + run_suffix += f"{cur_space3}* zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/14layer_pruned50_quant-none-vnni\n\n" + run_suffix += f"{cur_space3}* zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none\n\n" + run_suffix += f"{cur_space3}* zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned95_obs_quant-none\n\n" + run_suffix += f"{cur_space3}* zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/14layer_pruned50-none-vnni\n\n" + run_suffix += f"{cur_space3}* zoo:nlp/question_answering/obert-base/pytorch/huggingface/squad/pruned90-none\n\n" + run_suffix += f"{cur_space3}* zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/pruned97_quant-none\n\n" + run_suffix += f"{cur_space3}* zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/pruned90-none\n\n" + run_suffix += f"{cur_space3}* zoo:nlp/question_answering/bert-large/pytorch/huggingface/squad/pruned80_quant-none-vnni\n\n" + run_suffix += f"{cur_space3}* zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/pruned95-none-vnni\n\n" + run_suffix += f"{cur_space3}* zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/pruned97-none\n\n" + run_suffix += f"{cur_space3}* zoo:nlp/question_answering/bert-large/pytorch/huggingface/squad/base-none\n\n" + run_suffix += f"{cur_space3}* zoo:nlp/question_answering/obert-large/pytorch/huggingface/squad/base-none\n\n" + run_suffix += f"{cur_space3}* zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base-none\n" + run_suffix += f"{cur_space3}
\n" + + for scenario in scenarios: content += f"{cur_space3}=== \"{scenario}\"\n{cur_space4}###### {scenario}\n\n" @@ -155,7 +204,7 @@ def mlperf_inference_implementation_readme(spaces, model, implementation): #content += run_suffix content += f"{cur_space3}=== \"All Scenarios\"\n{cur_space4}###### All Scenarios\n\n" - run_cmd = mlperf_inference_run_command(spaces+21, model, implementation, framework.lower(), category.lower(), "All Scenarios", device.lower(), "valid", scenarios, code_version) + run_cmd = mlperf_inference_run_command(spaces+21, model, implementation, framework.lower(), category.lower(), "All Scenarios", device.lower(), "valid", 0, False, scenarios, code_version) content += run_cmd content += run_suffix @@ -181,6 +230,55 @@ def get_test_query_count(model, implementation, device, num_devices=1): p_range *= num_devices return p_range + + def get_min_system_requirements(spaces, model, implementation, device): + model = model.lower() + min_sys_req_content = "" + min_sys_req_content += f"{spaces}
\n" + min_sys_req_content += f"{spaces}Please click here to see the minimum system requirements for running the benchmark\n\n" + # device memory + if device.lower() == "cuda" and (implementation.lower() == "nvidia" or implementation.lower() == "reference"): + if implementation.lower() == "nvidia": + if "dlrm" in model: + device_memory = "24GB" + elif "llama2-70b" in model or "mixtral" in model: + device_memory = "80GB" + elif "sdxl" in model or "gptj" in model: + device_memory = "16GB" + else: + device_memory = "8GB" + elif implementation.lower() == "reference": + if "dlrm" in model: + device_memory = "2x80GB" + elif "llama2-70b" in model: + device_memory = "8x80GB" + elif "mixtral" in model: + device_memory = "4x80GB" + elif "sdxl" in model: + device_memory = "24GB(fp32), 16GB(fp16)" + elif "gptj" in model: + device_memory = "80GB(fp32). 40GB(fp16)" + else: + device_memory = "8GB" + min_sys_req_content += f"{spaces}* **Device Memory**: {device_memory}\n\n" + # disk space + if "dlrm" in model: + disk_space = "500GB" + elif "llama2-70b" in model: + disk_space = "700GB" + elif "mixtral" in model: + disk_space = "100GB" + elif "retinanet" in model: + disk_space = "200GB" + else: + disk_space = "50GB" + min_sys_req_content += f"{spaces}* **Disk Space**: {disk_space}\n\n" + # System memory + if "dlrm" in model: + system_memory = "512GB" + min_sys_req_content += f"{spaces}* **System Memory(RAM+SWAP)**: {system_memory}\n\n" + min_sys_req_content += f"{spaces}
\n" + return min_sys_req_content def get_readme_prefix(spaces, model, implementation): readme_prefix = "" @@ -191,6 +289,18 @@ def get_readme_prefix(spaces, model, implementation): return readme_prefix + def get_inference_server_run_cmd(spaces, implementation): + indent = " "*spaces + " " + if implementation == "neuralmagic": + pre_space = " "*spaces + return f"""\n +{pre_space}```bash +{pre_space}cm run script --tags=run,vllm-server \\ +{indent}--model=nm-testing/Llama-2-70b-chat-hf-FP8 \\ +{indent}--vllm_model_name=nm-testing/Llama-2-70b-chat-hf-FP8 \\ +{indent}--quiet +{pre_space}```\n""" + def get_venv_command(spaces): pre_space = " "*spaces return f"""\n @@ -208,7 +318,7 @@ def get_docker_info(spaces, model, implementation, device): #pre_space = " " if implementation == "nvidia": info += f"\n{pre_space}!!! tip\n\n" - info+= f"{pre_space} All the Nvidia benchmarks, except GPT-J and LLAMA2-70B, use the same Docker container. Therefore, if you have already executed the Docker setup command for any benchmark, you can skip the Docker setup command below and run the commands inside the existing Docker container. The Docker container for GPT-J and LLAMA2-70B is the same and can be used for the other benchmarks, but not vice versa. This is because TensorRT-LLM is built specifically for the LLM benchmarks. If you are already inside a Docker container, execute the below Docker setup command without the --docker option for performance estimation.\n\n" + info+= f"{pre_space} If ran with `--all_models=yes`, all the benchmark models of NVIDIA implementation could be run within the same container.\n\n" return info def get_readme_suffix(spaces, model, implementation): @@ -241,7 +351,7 @@ def get_run_cmd_extra(f_pre_space, model, implementation, device, scenario, scen return extra_content @env.macro - def mlperf_inference_run_command(spaces, model, implementation, framework, category, scenario, device="cpu", execution_mode="test", test_query_count="20", docker=False, scenarios = [], code_version="r4.1"): + def mlperf_inference_run_command(spaces, model, implementation, framework, category, scenario, device="cpu", execution_mode="test", test_query_count="20", docker=False, scenarios = [], code_version="r4.1-dev"): pre_space = "" for i in range(1,spaces): pre_space = pre_space + " " @@ -256,13 +366,27 @@ def mlperf_inference_run_command(spaces, model, implementation, framework, categ scenario_option = f"\\\n{pre_space} --scenario={scenario}" if scenario == "Server" or (scenario == "All Scenarios" and "Server" in scenarios): - scenario_option = f"\\\n{pre_space} --server_target_qps=" + scenario_option += f"\\\n{pre_space} --server_target_qps=" run_cmd_extra = get_run_cmd_extra(f_pre_space, model, implementation, device, scenario, scenarios) if docker: docker_cmd_suffix = f" \\\n{pre_space} --docker --quiet" docker_cmd_suffix += f" \\\n{pre_space} --test_query_count={test_query_count}" + + if "bert" in model.lower() and framework == "deepsparse": + docker_cmd_suffix += f"\\\n{pre_space} --env.CM_MLPERF_NEURALMAGIC_MODEL_ZOO_STUB=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none" + if "llama2-70b" in model.lower(): + if implementation == "nvidia": + docker_cmd_suffix += f" \\\n{pre_space} --tp_size=2" + docker_cmd_suffix += f" \\\n{pre_space} --nvidia_llama2_dataset_file_path=" + elif implementation == "neuralmagic": + docker_cmd_suffix += f" \\\n{pre_space} --api_server=http://localhost:8000" + docker_cmd_suffix += f" \\\n{pre_space} 
--vllm_model_name=nm-testing/Llama-2-70b-chat-hf-FP8" + docker_cmd_suffix += f" \\\n{pre_space} --adr.mlperf-implementation.tags=_repo.https://github.com/neuralmagic/inference,_branch.vllm" + + if "dlrm-v2" in model.lower() and implementation == "nvidia": + docker_cmd_suffix += f" \\\n{pre_space} --criteo_day23_raw_data_path=" docker_setup_cmd = f"""\n {f_pre_space}```bash @@ -283,6 +407,20 @@ def mlperf_inference_run_command(spaces, model, implementation, framework, categ if execution_mode == "test": cmd_suffix += f" \\\n {pre_space} --test_query_count={test_query_count}" + if "bert" in model.lower() and framework == "deepsparse": + cmd_suffix += f"\\\n{pre_space} --env.CM_MLPERF_NEURALMAGIC_MODEL_ZOO_STUB=zoo:nlp/question_answering/mobilebert-none/pytorch/huggingface/squad/base_quant-none" + if "llama2-70b" in model.lower(): + if implementation == "nvidia": + cmd_suffix += f" \\\n{pre_space} --tp_size=" + cmd_suffix += f" \\\n{pre_space} --nvidia_llama2_dataset_file_path=" + elif implementation == "neuralmagic": + cmd_suffix += f" \\\n{pre_space} --api_server=http://localhost:8000" + cmd_suffix += f" \\\n{pre_space} --vllm_model_name=nm-testing/Llama-2-70b-chat-hf-FP8" + cmd_suffix += f" \\\n{pre_space} --adr.mlperf-implementation.tags=_repo.https://github.com/neuralmagic/inference,_branch.vllm" + + if "dlrm-v2" in model and implementation == "nvidia": + cmd_suffix += f" \\\n{pre_space} --criteo_day23_raw_data_path=" + run_cmd = f"""\n {f_pre_space}```bash {f_pre_space}cm run script --tags=run-mlperf,inference,_{code_version}{scenario_variation_tag} \\ diff --git a/mkdocs.yml b/mkdocs.yml index 90a3ce9fd..8e59acf63 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -22,7 +22,7 @@ nav: - Install: - install/index.md - Benchmarks: - - benchmarks/index.md + - index.md - Image Classification: - ResNet50: benchmarks/image_classification/resnet50.md - Text to Image: