Vegeta is a versatile HTTP load testing tool built to generate load against HTTP services at a constant request rate.

In this guide, we show how to run a load test against a KServe Endpoint using Vegeta, a simple but powerful tool for load testing HTTP services. Load tests assess how a service performs under sustained stress. For KServe Endpoints in particular, which serve complex machine learning models, these tests are critical for identifying potential bottlenecks and ensuring optimal performance and reliability.

You can use this approach to create a variety of tests covering different scenarios, such as quick smoke tests, tests of different batch sizes, or “soak tests” that run for hours to verify stability.


Project repository: https://github.com/tsenart/vegeta
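The Kubernetes Job below reads its targets from a mounted file, so it helps to know the targets-file format Vegeta expects: one “METHOD URL” line per target, optionally followed by an `@file` line pointing at a request-body file. A minimal sketch (the service URL and file names here are placeholders, not from this guide):

```shell
# Write a Vegeta targets file: HTTP method, URL, and a body reference.
cat > targets <<'EOF'
POST http://example-service.default.svc.cluster.local/v2/models/example/infer
@payload
EOF

# The referenced file holds the request body sent with each request.
echo '{"inputs": []}' > payload

# With the vegeta binary installed, the attack would be run as:
#   vegeta attack -duration=30s -rate=50/1s -targets=targets | vegeta report -type=text
cat targets
```

In the Job below, this same two-line format is mounted from a ConfigMap at /var/vegeta/cfg.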

Here’s an example YAML for testing a KServe endpoint:

vegeta-inference-service-test.yaml

apiVersion: batch/v1
kind: Job
metadata:
  generateName: load-test-
spec:
  backoffLimit: 6
  parallelism: 1
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      restartPolicy: OnFailure
      containers:
      - args:
        - vegeta -cpus=4 attack -duration=2m -rate=100/1s -targets=/var/vegeta/cfg
          | vegeta report -type=text
        command:
        - sh
        - -c
        image: peterevans/vegeta:latest
        imagePullPolicy: Always
        name: vegeta
        volumeMounts:
        - mountPath: /var/vegeta
          name: vegeta-cfg
      volumes:
      - configMap:
          defaultMode: 420
          name: vegeta-cfg
        name: vegeta-cfg
---
apiVersion: v1
data:
  cfg: |
    POST http://churn-gbm-onnx.my-namespace.svc.cluster.local/v2/models/churn-gbm-onnx/infer
    @/var/vegeta/payload
  payload: |
    {"inputs": [{"name": "float_input", "datatype": "FP32", "shape": [10, 13], "data": [[14200.0, 14200.0, 14200.0, 195175.0, 1, 1, 195175, 95, 386.0, 2022, 11, 48, 2015], [14200.0, 14200.0, 14200.0, 195175.0, 1, 1, 195175, 95, 386.0, 2022, 11, 48, 2015], [14200.0, 14200.0, 14200.0, 195175.0, 1, 1, 195175, 95, 386.0, 2022, 11, 48, 2015], [14200.0, 14200.0, 14200.0, 195175.0, 1, 1, 195175, 95, 386.0, 2022, 11, 48, 2015], [14200.0, 14200.0, 14200.0, 195175.0, 1, 1, 195175, 95, 386.0, 2022, 11, 48, 2015], [14200.0, 14200.0, 14200.0, 195175.0, 1, 1, 195175, 95, 386.0, 2022, 11, 48, 2015], [14200.0, 14200.0, 14200.0, 195175.0, 1, 1, 195175, 95, 386.0, 2022, 11, 48, 2015], [14200.0, 14200.0, 14200.0, 195175.0, 1, 1, 195175, 95, 386.0, 2022, 11, 48, 2015], [14200.0, 14200.0, 14200.0, 195175.0, 1, 1, 195175, 95, 386.0, 2022, 11, 48, 2015], [14200.0, 14200.0, 14200.0, 195175.0, 1, 1, 195175, 95, 386.0, 2022, 11, 48, 2015]]}]}
kind: ConfigMap
metadata:
  name: vegeta-cfg

To customize this test for your endpoint:

  1. Update the -cpus, -duration, and -rate flags based on the load you want to generate. (For example, -rate=100/1s sends 100 requests per second.)
  2. Alter the POST URL to match your Endpoint. You can find the endpoint URL in the Kubeflow Dashboard in the Endpoints section. Click the endpoint to view the details page, and use the “URL internal” value.
  3. Alter the payload based on your Endpoint’s input requirements. (In this case, we are passing a batch with 10 rows of data, with 13 float values each.)
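To change the batch size (step 3), the payload’s repeated rows can be generated rather than hand-edited. A sketch in POSIX shell, reusing the example feature row and tensor name from the ConfigMap above (13 features per row); adjust both for your own model:

```shell
# Sketch: build a KServe V2 inference payload for an arbitrary batch size.
# ROW and the "float_input" tensor name mirror the example ConfigMap; both
# are assumptions to replace with your model's real inputs.
BATCH=10
ROW='[14200.0, 14200.0, 14200.0, 195175.0, 1, 1, 195175, 95, 386.0, 2022, 11, 48, 2015]'

# Repeat the row BATCH times, comma-separated.
DATA=$ROW
i=1
while [ "$i" -lt "$BATCH" ]; do
  DATA="$DATA, $ROW"
  i=$((i + 1))
done

# Assemble the V2 inference request with a matching shape.
printf '{"inputs": [{"name": "float_input", "datatype": "FP32", "shape": [%d, 13], "data": [%s]}]}\n' "$BATCH" "$DATA" > payload
cat payload
```

Paste the result into the ConfigMap’s payload key; the shape field is kept in sync with the batch size automatically.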

To run the test:

$ kubectl create -f vegeta-inference-service-test.yaml

When the Job finishes, Vegeta writes its report (request counts, latency percentiles, success ratio) to the pod’s logs; view it with kubectl logs job/<job-name>, substituting the name generated for the Job.