Creating a custom serving runtime in KServe ModelMesh
The following tutorial was originally posted on IBM Developer.
ModelMesh is a mature, general-purpose model serving management and routing layer. Optimized for high volume, high density, and frequently changing model use cases, ModelMesh intelligently loads and unloads models to and from memory to strike a balance between responsiveness and compute.
IBM used ModelMesh in production for several years before it was contributed to the open source community as part of KServe. ModelMesh served as the backbone for most Watson services, such as Watson NLU and Watson Assistant, and more recently it underpins the upcoming enterprise-ready AI and data platform watsonx.
You can learn more about ModelMesh’s features and install ModelMesh Serving, the controller for managing ModelMesh clusters, by following our getting started with KServe ModelMesh tutorial. You’ll even be guided through deploying your first model and running inference against it!
ModelMesh Serving provides many model servers by default, such as:
- Triton Inference Server, NVIDIA’s server for frameworks like TensorFlow, PyTorch, TensorRT, or ONNX.
- MLServer, Seldon’s Python-based server for frameworks like SKLearn, XGBoost, or LightGBM.
- OpenVINO Model Server, Intel’s server for frameworks such as Intel OpenVINO or ONNX.
- TorchServe, the PyTorch model server, with support for eager-mode models.
However, these model servers might not meet all of your specific requirements. Your model might have custom functionality, it might have custom logic for inferencing, or the framework your model needs might not be supported yet (but feel free to request it!).
In this tutorial, you’ll learn how to serve your custom models by using ModelMesh Serving.
Serving runtimes
The namespace-scoped ServingRuntime (and its cluster-scoped counterpart, ClusterServingRuntime) defines the template for pods that can serve one or more particular model formats. It includes key information such as the runtime’s container image and a list of supported model formats, while other configuration settings for the runtime can be passed through environment variables in the specification.
The ServingRuntime CRDs allow for flexibility and extensibility, which helps you define or customize reusable runtimes without touching any of the ModelMesh controller code or other resources in the controller namespace. This means you can easily build a custom runtime to support your desired framework.
Custom serving runtimes are created by building a new container image with support for the desired framework and then creating a ServingRuntime resource that uses that image. This is especially easy if the desired framework has Python bindings. For that scenario, there’s a simplified process that uses MLServer’s extension point for adding additional frameworks: MLServer provides the serving interface, you provide the framework, and ModelMesh Serving provides the glue to integrate it as a ServingRuntime.
Build a Python-based custom serving runtime
At a high level, the steps required to build a custom runtime are:
- Implement a class that inherits from MLServer’s MLModel class.
- Package the model class and dependencies into a container image.
- Create the new ServingRuntime resource using that image.
Implement the MLModel class
MLServer can be extended by adding an implementation of the MLModel class. The two main functions to implement are load() and predict(). The following code is a template implementation of an MLModel class in MLServer that includes a recommended structure, along with TODOs where runtime-specific changes might need to be made. Another example implementation of this class can be found in the MLServer documentation.
from typing import List

from mlserver import MLModel, types
from mlserver.utils import get_model_uri


class CustomMLModel(MLModel):
    async def load(self) -> bool:
        model_uri = await get_model_uri(self._settings)
        self._load_model_from_file(model_uri)
        self.ready = True
        return self.ready

    async def predict(self, payload: types.InferenceRequest) -> types.InferenceResponse:
        payload = self._check_request(payload)
        return types.InferenceResponse(
            model_name=self.name,
            model_version=self.version,
            outputs=self._predict_outputs(payload),
        )

    def _load_model_from_file(self, file_uri):
        # TODO: load model from file and instantiate class data
        return

    def _check_request(self, payload: types.InferenceRequest) -> types.InferenceRequest:
        # TODO: validate request: number of inputs, input tensor names/types, etc.
        return payload

    def _predict_outputs(self, payload: types.InferenceRequest) -> List[types.ResponseOutput]:
        inputs = payload.inputs
        # TODO: transform inputs into internal data structures
        # TODO: send data through the model's prediction logic
        outputs = []
        return outputs
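As a concrete illustration, the following is a minimal sketch that fills in the TODOs for a hypothetical scikit-learn classifier serialized with joblib. The class name SentimentClassifier, the output tensor name predictions, and the assumption that the storage path points to a single joblib file are illustrative choices, not requirements of MLServer or ModelMesh; adapt the loading and prediction logic to your own framework.

from typing import List

import joblib
import numpy as np

from mlserver import MLModel, types
from mlserver.utils import get_model_uri


class SentimentClassifier(MLModel):
    async def load(self) -> bool:
        # Resolve the model file that ModelMesh downloads and mounts for this model
        model_uri = await get_model_uri(self._settings)
        self._model = joblib.load(model_uri)  # assumes a single joblib file
        self.ready = True
        return self.ready

    async def predict(self, payload: types.InferenceRequest) -> types.InferenceResponse:
        return types.InferenceResponse(
            model_name=self.name,
            model_version=self.version,
            outputs=self._predict_outputs(payload),
        )

    def _predict_outputs(self, payload: types.InferenceRequest) -> List[types.ResponseOutput]:
        # Rebuild a NumPy array from the first input tensor's flat data and declared shape
        first_input = payload.inputs[0]
        features = np.asarray(list(first_input.data), dtype=np.float32).reshape(first_input.shape)

        predictions = self._model.predict(features)

        # Return the predictions as a single output tensor in KServe V2 form
        # (INT64 assumes integer class labels; adjust the datatype for your model)
        return [
            types.ResponseOutput(
                name="predictions",
                shape=list(predictions.shape),
                datatype="INT64",
                data=predictions.tolist(),
            )
        ]

Whichever class you implement, the next step is to put it on the Python path of a container image and point MLServer at it.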
Create the runtime image
Now that we have our model class implemented, we need to package its dependencies, including MLServer, into an image that can be used in a ServingRuntime resource. There are a variety of ways to do this: MLServer provides helpers to build the image for you with the mlserver build command, or you can build off of the set of directives included in the Dockerfile snippet in the following code block. (You can learn more about Dockerfiles in this Docker tutorial.)
# TODO: choose appropriate base image, install Python, MLServer, and
# dependencies of your MLModel implementation
FROM python:3.8-slim-buster
RUN pip install mlserver
# ...

# The custom `MLModel` implementation should be on the Python search path
# instead of relying on the working directory of the image. If using a
# single-file module, this can be accomplished with:
COPY --chown=${USER} ./custom_model.py /opt/custom_model.py
ENV PYTHONPATH=/opt/

# The environment variables here are for compatibility with ModelMesh Serving.
# These can also be set in the ServingRuntime, but setting them here is
# recommended for consistency when building and testing.
ENV MLSERVER_MODELS_DIR=/models/_mlserver_models \
    MLSERVER_GRPC_PORT=8001 \
    MLSERVER_HTTP_PORT=8002 \
    MLSERVER_LOAD_MODELS_AT_STARTUP=false \
    MLSERVER_MODEL_NAME=dummy-model

# With this setting, the implementation field is not required in the model
# settings, which eases integration by allowing the built-in adapter to
# generate a basic model settings file
ENV MLSERVER_MODEL_IMPLEMENTATION=custom_model.CustomMLModel

# Shell form is used (instead of exec form) so that ${MLSERVER_MODELS_DIR}
# is expanded by the shell when the container starts
CMD mlserver start ${MLSERVER_MODELS_DIR}
Create the ServingRuntime resource
Now you can make a new ServingRuntime resource using the YAML template in the following code block and point it to the image you just created. In this YAML code:
- `<serving-runtime-name>` is the name you want to give your runtime (for example, my-custom-runtime-0.x).
- `<model-format-name>` is a list of model formats that this runtime will support. Behind the scenes, this is what ModelMesh will look for when deploying a model of that format and finding a suitable runtime for it.
- `<container-image>` refers to the image you created in the previous step.
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: <serving-runtime-name>
spec:
  supportedModelFormats:
    - name: <model-format-name>
      version: "1"
      autoSelect: true
  multiModel: true
  grpcDataEndpoint: port:8001
  grpcEndpoint: port:8085
  containers:
    - name: mlserver
      image: <container-image>
      env:
        - name: MLSERVER_MODELS_DIR
          value: "/models/_mlserver_models/"
        - name: MLSERVER_GRPC_PORT
          value: "8001"
        - name: MLSERVER_HTTP_PORT
          value: "8002"
        - name: MLSERVER_LOAD_MODELS_AT_STARTUP
          value: "false"
        - name: MLSERVER_MODEL_NAME
          value: dummy-model
        - name: MLSERVER_HOST
          value: "127.0.0.1"
        - name: MLSERVER_GRPC_MAX_MESSAGE_LENGTH
          value: "-1"
      resources:
        requests:
          cpu: 500m
          memory: 1Gi
        limits:
          cpu: "5"
          memory: 1Gi
  builtInAdapter:
    serverType: mlserver
    runtimeManagementPort: 8001
    memBufferBytes: 134217728
    modelLoadingTimeoutMillis: 90000
And that’s it! Create the ServingRuntime resource using the kubectl apply command, and you’ll see your new custom runtime in your ModelMesh deployment. In the following example output, the custom runtime’s name is custom-runtime-0.x and it supports the model format custom_model.
kubectl get servingruntimes
NAME DISABLED MODELTYPE CONTAINERS AGE
custom-runtime-0.x custom_model mlserver 5m
mlserver-1.x sklearn mlserver 32m
ovms-1.x openvino_ir ovms 32m
torchserve-0.x pytorch-mar torchserve 32m
triton-2.x keras triton 31m
Deploy your model
To deploy a model using your newly created runtime, you’ll need to create an InferenceService resource to serve the model. This resource is the main interface that KServe and ModelMesh use for managing models, representing the model’s logical endpoint for serving inferences.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sentiment-analyzer
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: custom_model
      runtime: custom-runtime-0.x # OPTIONAL
      storage:
        key: localMinIO
        path: sentiment-model/models.py
The InferenceService in the previous code block names the model sentiment-analyzer and declares its model format custom_model, the same format that the example custom runtime created earlier supports. An optional field, runtime, is passed as well, explicitly telling ModelMesh to use the custom-runtime-0.x serving runtime to deploy this model. Lastly, the storage field points to where the model resides, in this case the localMinIO instance that is deployed as part of ModelMesh Serving’s quickstart guide.
After creating the InferenceService, you’ll be able to watch it become available, as shown in the following sample output:
kubectl get isvc
NAME URL READY ...
sentiment-analyzer grpc://modelmesh-serving.modelmesh-serving:8033 True ...
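Once the InferenceService is ready, you can send it an inference request using the KServe V2 protocol. The following is a minimal sketch, assuming the REST proxy is enabled in your ModelMesh Serving installation and that you have port-forwarded it (for example, kubectl port-forward service/modelmesh-serving 8008 in the modelmesh-serving namespace); the input tensor name, shape, and values are placeholders for whatever your custom model expects.

import requests

# KServe V2 REST inference call against the port-forwarded ModelMesh REST proxy.
# The input tensor below is a placeholder; match it to what your model expects.
url = "http://localhost:8008/v2/models/sentiment-analyzer/infer"
payload = {
    "inputs": [
        {
            "name": "input-0",
            "shape": [1, 4],
            "datatype": "FP32",
            "data": [0.1, 0.2, 0.3, 0.4],
        }
    ]
}

response = requests.post(url, json=payload)
response.raise_for_status()
print(response.json()["outputs"])

The same request shape also works over the gRPC endpoint shown in the URL above (port 8033), for example with grpcurl or generated KServe V2 gRPC stubs.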
Summary and next steps
Now that you can build your own custom runtime to deploy any model you like, there’s nothing stopping you from taking advantage of ModelMesh’s effectiveness and reliability to scale as needed.
If you want an enterprise-grade platform for your AI workloads built on top of open source software like ModelMesh, be sure to check out watsonx. Explore more articles and tutorials about watsonx on IBM Developer.