Get started with KServe ModelMesh for multi-model serving
The following tutorial was originally posted on IBM Developer.
ModelMesh is a mature, general-purpose model serving management and routing layer. Optimized for high volume, high density, and frequently changing model use cases, ModelMesh intelligently loads and unloads models to and from memory to strike a balance between responsiveness and compute.
You can read more about ModelMesh features in this blog post, but here are some of them at a glance:
- Cache management
  - Pods are managed as a distributed least recently used (LRU) cache.
  - Copies of models are loaded and unloaded based on usage recency and current request volumes.
- Intelligent placement and loading
  - Model placement is balanced by both the cache age across the pods and the request load.
  - Queues are used to handle concurrent model loads and minimize impact to runtime traffic.
- Resiliency
  - Failed model loads are automatically retried in different pods.
- Operational simplicity
  - Rolling model updates are handled automatically and seamlessly.
IBM used ModelMesh in production for several years before it was contributed to the open source community as part of KServe. ModelMesh has served as the backbone for most Watson services, such as Watson NLU and watsonx Assistant, and more recently it underpins the upcoming enterprise-ready AI and data platform watsonx.
In this tutorial, you’ll be guided to:
- Install ModelMesh Serving, the controller for managing ModelMesh clusters.
- Deploy a model using the `InferenceService` resource and check its status.
- Perform an inference on the deployed model using gRPC and REST.
Prerequisites
The following prerequisites are required to install ModelMesh Serving:
- A Kubernetes cluster (v1.16+) with administrative privileges.
- kubectl and kustomize (v3.2.0+).
- At least 4 vCPUs and 8 GB of memory. If you'd like more detail about resource requirements, check out the details about each deployed component of ModelMesh Serving.
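If you're not sure whether your environment meets these requirements, a quick sanity check of your client tooling and cluster access might look like this (assuming `kubectl` and `kustomize` are already on your PATH):

kubectl version --client   # check the kubectl client version
kustomize version          # should report v3.2.0 or newer
kubectl get nodes          # confirms you can reach the cluster with your current context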
Step 1. Install ModelMesh Serving
Assuming the latest release is `v0.10`, start by cloning the release branch:
RELEASE=release-0.10
git clone -b $RELEASE --depth 1 --single-branch https://github.com/kserve/modelmesh-serving.git
cd modelmesh-serving
From within our cloned repository, we can run the installation script. Create a namespace called `modelmesh-serving` to deploy ModelMesh to. You'll pass the namespace as a parameter for the script. You'll also include the flag `--namespace-scope-mode`, which means that the ModelMesh Serving instance and its components will exist only within a single namespace. (The alternative, and the default, is cluster-scoped.) Lastly, the `--quickstart` flag will make sure that etcd and MinIO instances with sample models are included in the installation too. These sample models are intended for development and experimentation, not for production. If you want to dive deeper, check out the installation documentation.
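Putting that together, the install typically boils down to creating the namespace and running the script with those flags. The invocation below is a sketch based on the flags described above; double-check the installation documentation for your release if it differs:

kubectl create namespace modelmesh-serving
./scripts/install.sh --namespace modelmesh-serving --namespace-scope-mode --quickstart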
The script should take only a few minutes to run and will output the message `Successfully installed ModelMesh Serving!` when it completes successfully.
Let’s verify the installation.
First, make sure that the controller, MinIO, and etcd pods are running:
kubectl get pods
NAME READY STATUS RESTARTS AGE
pod/etcd 1/1 Running 0 5m
pod/minio 1/1 Running 0 5m
pod/modelmesh-controller-547bfb64dc-mrgrq 1/1 Running 0 5m
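If you'd rather block until the controller is fully rolled out instead of polling the pod list, you can also wait on its Deployment (the Deployment name here is inferred from the pod name above):

kubectl rollout status deployment/modelmesh-controller -n modelmesh-serving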
Now, let's confirm that the `ServingRuntime` resources are available:
kubectl get servingruntimes
NAME DISABLED MODELTYPE CONTAINERS AGE
mlserver-1.x sklearn mlserver 5m
ovms-1.x openvino_ir ovms 5m
torchserve-0.x pytorch-mar torchserve 5m
triton-2.x tensorflow triton 5m
A `ServingRuntime` defines the templates for pods that can serve one or more particular model formats. Pods for each runtime are automatically provisioned depending on the framework of the model deployed. ModelMesh Serving currently includes several runtimes by default:
| ServingRuntime | Supported Frameworks |
| --- | --- |
| mlserver-1.x | sklearn, xgboost, lightgbm |
| ovms-1.x | openvino_ir, onnx |
| torchserve-0.x | pytorch-mar |
| triton-2.x | tensorflow, pytorch, onnx, tensorrt |
You can learn more about serving runtimes here, or if these model servers don’t meet all of your specific requirements, you can learn how to build your own custom serving runtime in this tutorial, “Creating a custom serving runtime in KServe ModelMesh”.
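If you're curious how a runtime declares the model formats it supports and the server container it runs, you can dump one of the installed runtimes and browse its spec, for example the `mlserver-1.x` runtime from the listing above:

kubectl get servingruntime mlserver-1.x -n modelmesh-serving -o yaml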
Step 2. Deploy a model
With ModelMesh Serving now installed, you can deploy a model using the KServe `InferenceService` custom resource definition. It's the main interface that KServe and ModelMesh use for managing models, representing the model's logical endpoint for serving inferences. The ModelMesh controller will only handle those that include the `serving.kserve.io/deploymentMode: ModelMesh` annotation.
As mentioned earlier, deploying ModelMesh Serving using the `--quickstart` flag includes a set of sample models to get started with. Below, we'll deploy the sample SKLearn MNIST model served from the local MinIO container:
kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: example-sklearn-isvc
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storage:
        key: localMinIO
        path: sklearn/mnist-svm.joblib
EOF
The above YAML uses the `InferenceService` predictor storage spec, where the `key` is the credential key for the destination storage in the common secret and the `path` is the model path inside the bucket. You could also use `storageUri` instead:
kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: example-sklearn-isvc
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
    serving.kserve.io/secretKey: localMinIO
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://modelmesh-example-models/sklearn/mnist-svm.joblib
EOF
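In both variants, `localMinIO` refers to an entry in the common storage secret that the quickstart install creates (named `storage-config` in a default installation). If you want to see which storage entries are available, you can inspect that secret:

kubectl describe secret storage-config -n modelmesh-serving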
Either way you do it, after creating this `InferenceService`, you'll likely see that it's not yet ready:
kubectl get isvc
NAME URL READY PREV LATEST PREVROLLEDOUTREVISION LATESTREADYREVISION AGE
example-sklearn-isvc False 3s
If you do, it's probably because the `ServingRuntime` pods that will host the SKLearn model are still spinning up. You can check your pods to confirm this, and eventually you'll see them in the `Running` state:
kubectl get pods
...
modelmesh-serving-mlserver-1.x-7db675f677-twrwd 3/3 Running 0 2m
modelmesh-serving-mlserver-1.x-7db675f677-xvd8q 3/3 Running 0 2m
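If you'd rather not poll manually, you can also wait on the `InferenceService`'s `Ready` condition (the same condition that appears in the describe output later in this step):

kubectl wait --for=condition=Ready isvc/example-sklearn-isvc --timeout=300s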
Then, checking on the `InferenceService` again, it is now ready with a provided URL:
kubectl get isvc
NAME URL READY PREV LATEST PREVROLLEDOUTREVISION LATESTREADYREVISION AGE
example-sklearn-isvc grpc://modelmesh-serving.modelmesh-serving:8033 True 97s
You can describe the `InferenceService` to get more detailed status information too. If you had checked it before it was ready, you would have seen useful debugging information such as a `Waiting for runtime Pod to become available` message, but not in this case:
kubectl describe isvc example-sklearn-isvc
Name: example-sklearn-isvc
...
Status:
  Components:
    Predictor:
      Grpc URL:  grpc://modelmesh-serving.modelmesh-serving:8033
      Rest URL:  http://modelmesh-serving.modelmesh-serving:8008
      URL:       grpc://modelmesh-serving.modelmesh-serving:8033
  Conditions:
    Last Transition Time:  2022-07-18T18:01:54Z
    Status:                True
    Type:                  PredictorReady
    Last Transition Time:  2022-07-18T18:01:54Z
    Status:                True
    Type:                  Ready
  Model Status:
    Copies:
      Failed Copies:  0
      Total Copies:   2
    States:
      Active Model State:  Loaded
      Target Model State:
    Transition Status:     UpToDate
  URL:  grpc://modelmesh-serving.modelmesh-serving:8033
...
More details on the `InferenceService` CRD and its status information can be found in the ModelMesh Serving docs.
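If you want to script against this status rather than read the describe output, the same fields are available on the resource itself. For example, a jsonpath query like the following (field names converted to their camelCase form from the status block above) pulls out the active model state:

kubectl get isvc example-sklearn-isvc -o jsonpath='{.status.modelStatus.states.activeModelState}'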
Step 3. Perform an inference request
Now that a model is loaded and available, you can perform inferences! Currently, ModelMesh itself only supports gRPC inference requests, but REST support is provided through a REST proxy container. By default, ModelMesh Serving uses a headless Service, since load balancing gRPC requests requires special attention.
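You can confirm this by looking at the Service that the controller created; for a headless Service, the CLUSTER-IP column shows None:

kubectl get service modelmesh-serving -n modelmesh-serving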
Inferencing using a gRPC request
To test out gRPC inference requests, you can port-forward the headless service in a separate terminal window:
kubectl port-forward --address 0.0.0.0 service/modelmesh-serving 8033 -n modelmesh-serving
Using grpcurl and the gRPC client generated from the KServe grpc_predict_v2.proto file, we can test an inference on `localhost:8033`. Make sure to run the following command from the root directory of the cloned repository, and set `MODEL_NAME` to the name of the deployed `InferenceService`:
MODEL_NAME=example-sklearn-isvc
grpcurl \
-plaintext \
-proto fvt/proto/kfs_inference_v2.proto \
-d '{ "model_name": "'"${MODEL_NAME}"'", "inputs": [{ "name": "predict", "shape": [1, 64], "datatype": "FP32", "contents": { "fp32_contents": [0.0, 0.0, 1.0, 11.0, 14.0, 15.0, 3.0, 0.0, 0.0, 1.0, 13.0, 16.0, 12.0, 16.0, 8.0, 0.0, 0.0, 8.0, 16.0, 4.0, 6.0, 16.0, 5.0, 0.0, 0.0, 5.0, 15.0, 11.0, 13.0, 14.0, 0.0, 0.0, 0.0, 0.0, 2.0, 12.0, 16.0, 13.0, 0.0, 0.0, 0.0, 0.0, 0.0, 13.0, 16.0, 16.0, 6.0, 0.0, 0.0, 0.0, 0.0, 16.0, 16.0, 16.0, 7.0, 0.0, 0.0, 0.0, 0.0, 11.0, 13.0, 12.0, 1.0, 0.0] }}]}' \
localhost:8033 \
inference.GRPCInferenceService.ModelInfer
The response output will look like:
{
  "modelName": "example-sklearn-isvc__isvc-3642375d03",
  "outputs": [
    {
      "name": "predict",
      "datatype": "INT64",
      "shape": ["1"],
      "contents": {
        "int64Contents": ["8"]
      }
    }
  ]
}
Inferencing using a REST request
While the REST proxy is currently in an alpha state, we can still use `curl` to test an inference through it as well.
First you’ll need to port-forward a different port for REST:
kubectl port-forward --address 0.0.0.0 service/modelmesh-serving 8008 -n modelmesh-serving
Once again, make sure that `MODEL_NAME` is set to the name of your `InferenceService` when you run the following command:
MODEL_NAME=example-sklearn-isvc
curl -X POST -k http://localhost:8008/v2/models/${MODEL_NAME}/infer -d '{"inputs": [{ "name": "predict", "shape": [1, 64], "datatype": "FP32", "data": [0.0, 0.0, 1.0, 11.0, 14.0, 15.0, 3.0, 0.0, 0.0, 1.0, 13.0, 16.0, 12.0, 16.0, 8.0, 0.0, 0.0, 8.0, 16.0, 4.0, 6.0, 16.0, 5.0, 0.0, 0.0, 5.0, 15.0, 11.0, 13.0, 14.0, 0.0, 0.0, 0.0, 0.0, 2.0, 12.0, 16.0, 13.0, 0.0, 0.0, 0.0, 0.0, 0.0, 13.0, 16.0, 16.0, 6.0, 0.0, 0.0, 0.0, 0.0, 16.0, 16.0, 16.0, 7.0, 0.0, 0.0, 0.0, 0.0, 11.0, 13.0, 12.0, 1.0, 0.0]}]}'
The response output will look like:
{
  "model_name": "example-sklearn-isvc__ksp-7702c1b55a",
  "outputs": [
    {
      "name": "predict",
      "datatype": "FP32",
      "shape": [1],
      "data": [8]
    }
  ]
}
You've made your first inference requests! As always, more detailed information about sending inference requests to your `InferenceService` can be found in the [ModelMesh Serving docs](https://github.com/kserve/modelmesh-serving/blob/main/docs/predictors/run-inference.md).
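When you're done experimenting, you can clean up by deleting the sample `InferenceService`; ModelMesh will unload the model, and with the default scale-to-zero behavior the runtime pods can eventually scale back down once no models need them:

kubectl delete inferenceservice example-sklearn-isvc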
Summary and next steps
This tutorial is a great starting point toward taking advantage of ModelMesh's effectiveness and reliability to scale as needed. You learned about some of ModelMesh's features and core resources like the `ServingRuntime` and the `InferenceService`, all while deploying and running inference on your first model on your own ModelMesh Serving instance.
You can dive deeper and learn how to create a custom serving runtime to serve any model you’d like in this tutorial, “Creating a custom serving runtime in KServe ModelMesh”.
If you want an enterprise-grade platform for your AI workloads built on top of open source software like ModelMesh, be sure to try watsonx. Explore more articles and tutorials about watsonx on IBM Developer.