This page is a work in progress. Feedback is greatly appreciated.

Welcome to Cake AI’s help center! We're here to answer your questions. Can't find what you're looking for? Send our support team a note at [email protected]!

Components

Kubeflow
Kubeflow simplifies scaling and deploying ML models using Kubernetes, allowing easy deployments on diverse infrastructures, managing microservices, and scaling on demand. It supports notebooks, model training, model serving, hyperparameter and neural architecture search, pipelines, monitoring and observability, and more.
Resources: Main Site, Kubeflow Architecture, Documentation

Kubeflow Pipelines
Kubeflow Pipelines is a platform for creating and deploying scalable ML workflows using Docker containers. It offers a user interface for managing experiments, an engine for scheduling workflows, an SDK for pipeline manipulation, and notebooks for SDK interaction. The platform streamlines end-to-end orchestration, simplifies experimentation, and facilitates easy reuse of components and pipelines.
Resources: Main Site, Documentation

Guides: Getting Started with Kubeflow Pipelines; Creating Kubeflow Pipelines with Elyra; Kubeflow Pipeline Runtime Parameterization; Setting up recurring pipelines; Alibi Detect KSDrift Pipeline Component; Deploying a model from MLflow to KServe

JupyterLab Notebooks
JupyterLab is a web-based interactive development environment for notebooks, code, and data. Its flexible interface allows users to configure and arrange workflows in data science, scientific computing, computational journalism, and machine learning. A modular design invites extensions to expand and enrich functionality.
Resources: Main Site, Documentation

Guides: Kubeflow Notebooks - Quick Start Guide; Launching Notebooks; Troubleshooting Notebooks; Custom Environments for Notebook Servers; VS Code and Kubeflow Notebooks; JetBrains IDEs and Kubeflow Notebooks; How to set up k9s in a notebook

Elyra
Elyra is a collection of AI-focused JupyterLab Notebook extensions, featuring a visual pipeline editor, batch job support, reusable code snippets, hybrid runtime support, Python and R script editors, integrated debugging, notebook outlining, Language Server Protocol integration, and Git version control.
Resources: Main Site, Documentation

KServe
KServe is a Kubernetes-based solution for serving ML models on various frameworks, streamlining production use cases with features like auto-scaling, health checking, and server configuration. It offers advanced serving features such as GPU auto-scaling and canary roll-outs, simplifying ML deployments and making them more efficient.
Resources: Main Site, Documentation
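A minimal InferenceService manifest gives a feel for KServe's deployment model (a sketch based on KServe's documented sklearn example; the name and storageUri are illustrative):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
```

Applying this with kubectl creates an auto-scaled HTTP endpoint that serves the model.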

Guides: KServe Inference Service Deployment API; Building an InferenceGraph; Load Testing KServe with Vegeta; Load testing Inference Services with BlazeMeter/Taurus

MLflow
MLflow is an open-source platform for managing the ML lifecycle, offering four key components: MLflow Tracking for recording and querying experiments, MLflow Projects for packaging data science code, MLflow Models for deploying ML models across various environments, and Model Registry for centralized storage and management of models.
Resources: Main Site, Documentation

Guides: Auto-logging experiment tracking for your training code

Feast
Feast is a standalone, open-source feature store that serves features to ML projects in a consistent way for both offline training and online inference.
Resources: Main Site, Documentation, Videos

Guides: Feast Intro; Building Entities; Building Views

Airflow
Airflow is a platform to programmatically author, schedule, and monitor workflows. Airflow excels at data engineering workflows with its large set of data connectors, but it can also be used in place of Kubeflow Pipelines for data science flows.
Resources: Main Site, Documentation

Alibi Detect
Alibi Detect is an open-source Python library focused on outlier, adversarial, and drift detection. The package covers both online and offline detectors for tabular data, text, images, and time series.
Resources: Main Site, Documentation

Argo CD
Argo CD is a declarative continuous delivery tool for Kubernetes that follows the GitOps pattern of using Git repositories as the source of truth for defining the desired application state. Argo CD automates the deployment process and continuously monitors running applications, comparing the current live state against the desired target state specified in Git. It reports and visualizes any differences and can automatically sync the live state back to the desired target state.
Resources: Main Site, Documentation, Source Code

Dex
Dex acts as an authentication portal that supports pluggable authentication against many different identity providers while presenting a unified OpenID Connect (OIDC) interface.
Resources: Main Site, Documentation

DVC
DVC is a version control system for machine learning projects, enabling version control of models, datasets, and intermediate files. It offers full code and data provenance, simplified ML experiment management with Git branching, and improved deployment and collaboration using push/pull commands. DVC also introduces lightweight, language-agnostic pipelines as a first-class Git mechanism, streamlining the process of getting code into production.
Resources: Main Site, Documentation

Evidently
Evidently helps evaluate, test, and monitor data and ML models from validation to production. It works with tabular data, text data, and embeddings.
Resources: Main Site, Documentation

featurewiz
featurewiz is an automated feature selection library for boosting model performance with minimal effort and maximum relevance using the MRMR algorithm.
Resources: Main Site, Documentation

Featuretools
Featuretools is a framework for performing automated feature engineering. It excels at transforming temporal and relational datasets into feature matrices for machine learning.
Resources: Main Site, Documentation

Grafana
Grafana is a versatile data visualization tool that unifies data from various sources (like Prometheus), enabling users to create and share dynamic dashboards. It supports a wide range of data sources, fostering a data-driven culture by making data accessible to everyone in an organization. Grafana offers flexibility, advanced querying, and transformation capabilities to create customized and insightful visualizations.
Resources: Main Site, Documentation

Great Expectations
Great Expectations is a tool for data validation, documentation, and profiling, ensuring data quality and enhancing team communication. It enables data science and engineering teams to create Expectations, which are like unit tests for data, to catch issues quickly, validate data transformations, prevent data quality problems, and streamline knowledge capture. This helps maintain data validity and creates shared data documentation.
Resources: Main Site, Documentation

Guides: Building Expectations

H2O
H2O's AutoML framework can be used to automate the machine learning workflow, including automatic training and tuning of many models within a user-specified time limit. H2O is most useful for tuning neural net models.
Resources: Main Site, Documentation

Istio
Istio extends Kubernetes to establish a programmable, application-aware network using the powerful Envoy service proxy. Istio brings standard, universal traffic management, telemetry, and security to complex deployments.
Resources: Main Site, Documentation

Katib (AutoML)
Katib is a Kubernetes-native project for automated machine learning (AutoML). Katib supports distributed hyperparameter tuning, early stopping, and neural architecture search (NAS). It supports numerous AutoML algorithms, such as Bayesian optimization, Tree of Parzen Estimators, random search, Covariance Matrix Adaptation Evolution Strategy, Hyperband, Efficient Neural Architecture Search, Differentiable Architecture Search, and many more.
Resources: Documentation
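The shape of a Katib Experiment spec looks like the following (a sketch modeled on the Katib documentation; the trialTemplate that defines the training job each trial runs is omitted for brevity, and the metric and parameter names are illustrative):

```yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-search-example
spec:
  objective:
    type: maximize
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: random
  maxTrialCount: 12
  parallelTrialCount: 3
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.1"
  # trialTemplate: defines the training job each trial runs (omitted here)
```

Katib launches trials with sampled parameter values, reads back the objective metric, and reports the best configuration found.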

Guides: Katib Trial Data

Knative
Knative is an open-source, enterprise-grade platform designed for creating serverless and event-driven applications. It streamlines networking, auto-scaling, revision tracking, and event management, while offering universal event subscription and delivery. With its developer-friendly object models and declarative event connectivity, Knative enables the construction of modern applications by seamlessly connecting compute resources to data streams.
Resources: Main Site, Documentation

Kubeflow Training Operator
The Kubeflow Training Operator makes it easy to run distributed and non-distributed TensorFlow, PyTorch, Apache MXNet, XGBoost, and MPI jobs on Kubernetes. It streamlines machine learning model training on Kubernetes, providing consistent support for multiple frameworks. Major features include easy configuration, GPU support, fault tolerance, and distributed training capabilities, making it an efficient tool for managing ML training tasks.
Resources: Main Site, Documentation

Kubernetes
Kubernetes, also known as K8s, is an open-source system for automating deployment, scaling, and management of containerized applications, and it is supported on the major cloud platforms (Amazon AWS, Google Cloud, Azure, etc.). (The name "Kubernetes" originates from Greek, meaning "helmsman" or "pilot".) Kubernetes offers a robust container orchestration platform with features such as automated rollouts and rollbacks, service discovery, load balancing, storage orchestration, self-healing, secret and configuration management, automatic bin packing, batch execution, horizontal scaling, and extensibility. These capabilities ensure efficient deployment, management, and scaling of containerized applications. Kubernetes aims to support an extremely diverse variety of workloads, including stateless, stateful, and data-processing workloads.
Resources: Main Site, Documentation
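For a concrete sense of Kubernetes' declarative model, here is a minimal Deployment manifest (the names and image are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3                # Kubernetes keeps three pods running
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.25
          ports:
            - containerPort: 80
```

Running kubectl apply -f on this file creates the Deployment, and Kubernetes continuously self-heals toward the declared state (e.g., replacing crashed pods to keep three replicas).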

Guides: Kubernetes Basics for Data Scientists; Using GPUs on Google GKE

Milvus
Milvus is a vector database that stores, indexes, and manages massive embedding vectors generated by deep neural networks and other machine learning (ML) models.
Resources: Main Site, Documentation

ModelMesh
ModelMesh is a model-serving management and routing framework designed for high-scale, high-density, and frequently changing model use cases. It functions as a distributed LRU cache for serving runtime models and works with existing or custom-built model servers. ModelMesh Serving provides full Kubernetes-based deployment and management of ModelMesh clusters and models, including integrations with existing open-source model servers and abstracted handling of model repository storage.
Resources: Main Site, Documentation

OIDC
OpenID Connect (OIDC) is an identity layer built on top of the OAuth 2.0 framework. It allows third-party applications to verify the identity of the end user and to obtain basic user profile information. OIDC uses JSON Web Tokens (JWTs), which can be obtained using flows conforming to the OAuth 2.0 specifications.
Resources: Main Site, Documentation

Optuna
Optuna is an automatic hyperparameter optimization software framework, particularly designed for machine learning. It features an imperative, define-by-run style user API. Optuna underpins various other AutoML solutions, such as Katib and PyCaret.
Resources: Main Site, Documentation

ONNX
ONNX is an open format built to represent machine learning models. ONNX defines a common set of operators (the building blocks of machine learning and deep learning models) and a common file format to enable AI developers to use models with a variety of frameworks, tools, runtimes, and compilers.
Resources: Main Site, Documentation

Prometheus
Prometheus is an open-source monitoring and alerting toolkit designed to collect and store time series data with timestamps and labels. This enables efficient systems monitoring, providing insights and alerts based on metrics and performance. Prometheus provides a user interface, but it is commonly used in conjunction with Grafana for advanced charting capabilities.
Resources: Main Site, Documentation

PyCaret
PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows. It simplifies AutoML tasks like hyperparameter tuning and model selection, and as an end-to-end machine learning and model management tool it greatly speeds up the experiment cycle. PyCaret is most useful for traditional linear, tree, and boosted models. It is essentially a Python wrapper around several machine learning libraries and frameworks, such as scikit-learn, XGBoost, LightGBM, CatBoost, spaCy, Optuna, Hyperopt, Ray, and a few more.
Resources: Main Site, Documentation

Ray
Ray is an open-source unified framework for scaling AI and Python applications. It provides the compute layer for parallel processing so that you don't need to be a distributed systems expert. The Ray ecosystem includes various add-on libraries, like Train and Tune, that interact with Ray Core to simplify and scale various ML tasks. Ray Serve is an alternative model serving framework that is simpler than KServe. Cake AI deploys Ray via the KubeRay Operator.
Resources: Main Site, Documentation

Guides: Ray Core; Ray Tune; Ray Train; Ray Serve; KubeRay

Terraform
Terraform is an infrastructure-as-code tool that lets you define both cloud and on-prem resources in human-readable configuration files that you can version, reuse, and share. You can then use a consistent workflow to provision and manage all of your infrastructure throughout its lifecycle. Terraform can manage low-level components like compute, storage, and networking resources, as well as high-level components like DNS entries and SaaS features. The Terraform community has written thousands of providers to manage many different types of resources and services. You can find all publicly available providers on the Terraform Registry, including Amazon AWS, Azure, Google Cloud, Kubernetes, Helm, GitHub, and many more.
Resources: Main Site, Documentation, Terraform Registry

TensorBoard
TensorBoard is a tool that provides the measurements and visualizations needed during the machine learning workflow. It enables tracking experiment metrics like loss and accuracy, visualizing the model graph, projecting embeddings to a lower-dimensional space, working with audio/video/images, and much more. TensorBoard is often used with TensorFlow, but it can also be used with any ML framework (PyTorch, scikit-learn, etc.).
Resources: Main Site, Documentation

Guides: Creating a TensorBoard in Kubeflow

Triton Inference Server
NVIDIA Triton Inference Server streamlines AI inference, allowing deployment and scaling of ML/DL models from any framework on various infrastructures. It provides flexibility in choosing frameworks without affecting deployment and delivers high-performance inference across different environments. Triton supports major frameworks like TensorFlow, NVIDIA TensorRT™, PyTorch, Python, ONNX, XGBoost, scikit-learn, RandomForest, and OpenVINO, and offers features like dynamic batching, concurrent execution, optimal configuration, model ensembles, and streaming inputs for maximized throughput and utilization.
Resources: Main Site, Documentation, Release Notes, Release Notes (GitHub), Support Matrix, Client Libraries, Running Triton
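Triton loads models from a model repository directory, where each model version sits alongside a config.pbtxt describing its inputs and outputs. A sketch for a hypothetical ONNX image classifier stored at model_repository/resnet50/config.pbtxt, with the model file at model_repository/resnet50/1/model.onnx (all names and shapes here are illustrative):

```
# config.pbtxt for a hypothetical ONNX classifier
name: "resnet50"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
```

Pointing the Triton server at the repository directory makes the model available over HTTP and gRPC inference endpoints.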


General Troubleshooting


Spellbook

Cake AI Documents

Kubernetes Basics for Data Scientists

Launching Kubeflow Notebooks

Custom Environments in Kubeflow Notebook Servers

VS Code Kubeflow Jupyter Setup

JetBrains IntelliJ IDEA With a Jupyter Notebook Server and GitHub Copilot / AI Assistant

Getting started with Kubeflow Pipelines

How to set up a recurring deploy pipeline

Load Testing KServe Inference Services with Vegeta