This project is no longer actively maintained. While existing releases remain available, there are no planned updates, bug fixes, new features, or security patches. Users should be aware that vulnerabilities may not be addressed.
In this example, we show how to serve a Huggingface Transformers model with TorchServe locally using KServe.
Clone the pytorch/serve repository.
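For example:
git clone https://github.com/pytorch/serve.git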
Copy the Transformer_kserve_handler.py handler file to the examples/Huggingface_Transformers folder.
Navigate to examples/Huggingface_Transformers
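From the repository root, these two steps look roughly like the following; the source location of Transformer_kserve_handler.py is an assumption here, so adjust the path to wherever the handler file lives in your checkout:
cp /path/to/Transformer_kserve_handler.py examples/Huggingface_Transformers/
cd examples/Huggingface_Transformers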
Run the following command to download the model
python Download_Transformer_models.py
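Download_Transformer_models.py reads setup_config.json (the same file packaged into the .mar below) to decide which pretrained model and task to download. A minimal sketch of a sequence-classification configuration is shown here; the field values are illustrative assumptions, so check the setup_config.json shipped with the example:
{
  "model_name": "bert-base-uncased",
  "mode": "sequence_classification",
  "do_lower_case": true,
  "num_labels": "2",
  "save_mode": "pretrained",
  "max_length": "150"
}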
torch-model-archiver --model-name BERTSeqClassification --version 1.0 \
--serialized-file Transformer_model/pytorch_model.bin \
--handler ./Transformer_kserve_handler.py \
--extra-files "Transformer_model/config.json,./setup_config.json,./Seq_classification_artifacts/index_to_name.json,./Transformer_handler_generalized.py"
The command will create a BERTSeqClassification.mar file in the current directory.
Move the mar file to model-store
sudo mv BERTSeqClassification.mar /mnt/models/model-store
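If /mnt/models/model-store does not exist yet, create it (along with the config directory used in the next step; both paths are assumed from this example's layout) before running the move above:
sudo mkdir -p /mnt/models/model-store /mnt/models/config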
and use the following config.properties file, placed in /mnt/models/config:
inference_address=http://127.0.0.1:8085
management_address=http://127.0.0.1:8085
metrics_address=http://127.0.0.1:8082
enable_envvars_config=true
install_py_dep_per_model=true
enable_metrics_api=true
service_envelope=kservev2
metrics_mode=prometheus
NUM_WORKERS=1
number_of_netty_threads=4
job_queue_size=10
model_store=/mnt/models/model-store
model_snapshot={"name":"startup.cfg","modelCount":1,"models":{"BERTSeqClassification":{"1.0":{"defaultVersion":true,"marName":"BERTSeqClassification.mar","minWorkers":1,"maxWorkers":5,"batchSize":1,"maxBatchDelay":5000,"responseTimeout":120}}}}
Use bert_bytes_v2.json or bert_tensor_v2.json as the sample request payload.
For new sample text, follow the instructions below
For bytes input, use the tobytes utility:
python tobytes.py --input_text "this year business is good"
For tensor input, use the bert_tokenizer utility:
python bert_tokenizer.py --input_text "this year business is good"
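Both utilities emit a request that follows the KServe v2 (Open Inference Protocol) envelope. A minimal sketch of a bytes-type request is shown here; the input name and id are illustrative, and the exact fields written by tobytes.py may differ slightly from this sketch:
{
  "id": "example-request-id",
  "inputs": [
    {
      "name": "input-0",
      "shape": [1],
      "datatype": "BYTES",
      "data": ["this year business is good"]
    }
  ]
}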
Start TorchServe
torchserve --start --ts-config /mnt/models/config/config.properties --ncs
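To shut TorchServe down again later, run:
torchserve --stop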
To test locally, clone the TorchServe repository and navigate to the kubernetes/kserve/kserve_wrapper folder.
Start the KServe wrapper
python __main__.py
Navigate to kubernetes/kserve/kf_request_json/v2/bert
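For example:
cd kubernetes/kserve/kf_request_json/v2/bert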
Run the following command
curl -v -H "Content-Type: application/json" http://localhost:8080/v2/models/BERTSeqClassification/infer -d @./bert_bytes_v2.json
Expected Output
{"id": "d3b15cad-50a2-4eaf-80ce-8b0a428bd298", "model_name": "BERTSeqClassification", "model_version": "1.0", "outputs": [{"name": "predict", "shape": [], "datatype": "BYTES", "data": ["Not Accepted"]}]}
Run the following command
curl -v -H "Content-Type: application/json" http://localhost:8080/v2/models/BERTSeqClassification/infer -d @./bert_tensor_v2.json
Expected output
{"id": "33abc661-7265-42fc-b7d9-44e5f79a7a67", "model_name": "BERTSeqClassification", "model_version": "1.0", "outputs": [{"name": "predict", "shape": [], "datatype": "BYTES", "data": ["Not Accepted"]}]}