1. Gemma on GKE now supports TPUs for optimized inference performance
2. JetStream is the recommended stack for LLM inference on Google Cloud TPUs
3. GKE also supports GPUs with frameworks such as vLLM, Text Generation Inference, and TensorRT-LLM for serving LLMs
Google Cloud now supports Gemma on GKE with both TPUs and GPUs, giving users a choice of accelerators and serving frameworks for optimizing inference performance. JetStream is the recommended stack for TPU inference, offering high throughput and low latency for LLM workloads. For users who prefer GPU accelerators, the vLLM, Text Generation Inference, and TensorRT-LLM frameworks are available to improve serving throughput, as sketched below.
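As a minimal sketch of the GPU path, here is what serving Gemma with vLLM's offline Python API might look like on a GKE GPU node. The model ID `google/gemma-7b` and the sampling settings are illustrative assumptions, and access to the model's Hugging Face repository is assumed.

```python
# Minimal sketch: serving Gemma with vLLM's offline inference API on a GPU node.
# Assumes vLLM is installed (pip install vllm) and a supported GPU is attached;
# the model ID and sampling parameters below are illustrative assumptions.
from vllm import LLM, SamplingParams

# Load Gemma weights from Hugging Face (requires accepting the model license there).
llm = LLM(model="google/gemma-7b")

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

prompts = [
    "Explain what Kubernetes does in one sentence.",
    "Write a haiku about TPUs.",
]

# vLLM batches the prompts and applies continuous batching under the hood.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```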
Gemma on GKE lets developers build and deploy AI models on a self-managed, versatile, cost-effective, and performant platform. With integrations into major AI model repositories and support for both Cloud GPUs and Cloud TPUs, GKE offers flexible deployment and serving options for Gemma. Tutorials are available to help users get started with Gemma on GKE and choose among the available frameworks for optimizing inference performance based on their preferences and needs. Google Cloud is committed to giving users a variety of options to train and serve AI workloads efficiently and effectively on GKE.
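Once one of these serving frameworks is deployed on GKE, clients typically reach it over HTTP through a Kubernetes Service. The sketch below assumes a Text Generation Inference deployment exposed locally (for example via `kubectl port-forward`); the service name, host, port, and request parameters are assumptions for illustration, not a prescribed setup.

```python
# Minimal sketch: querying a Text Generation Inference endpoint serving Gemma on GKE.
# Assumes the Service has been forwarded locally first, e.g.:
#   kubectl port-forward svc/tgi-gemma 8080:8080
# The service name, host, and port are illustrative assumptions.
import requests

ENDPOINT = "http://localhost:8080/generate"

payload = {
    "inputs": "What is Google Kubernetes Engine?",
    "parameters": {"max_new_tokens": 128, "temperature": 0.7},
}

response = requests.post(ENDPOINT, json=payload, timeout=60)
response.raise_for_status()

# TGI's /generate route returns the completion under "generated_text".
print(response.json()["generated_text"])
```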