'KoboldCPP GPU Layers -1: What It Means and How to Set It'

Learn what setting GPU layers to -1 does in KoboldCPP, how it offloads all layers

May 27, 2026 · 10 min read

KoboldCPP is the closest thing local AI has to a no-friction starting point. It's a single executable that runs your language model, ships a usable web interface, speaks the OpenAI API so other tools can connect to it, and runs on hardware ranging from a current GPU down to a decade-old CPU. No installation, no dependencies, no Python environment to wrangle. Download one file, point it at a model, and you're running local AI. This guide covers the full setup, the configuration options that actually matter, and how to pair it with SillyTavern for a complete companion or roleplay stack.

What KoboldCPP is and why it's the default

KoboldCPP is a local inference server built on llama.cpp, packaged as a single portable binary. In 2026 it's the de facto choice for single-user creative and companion use, and the reason is breadth of compatibility combined with zero setup friction. Where other backends require installing Python, managing dependencies, or configuring environments, KoboldCPP is one file you double-click.

It does more than run models. The current version includes a built-in chat interface (Kobold Lite), image generation support, text-to-speech, embeddings, an OpenAI-compatible API endpoint, and full sampler support including the advanced DRY and XTC samplers that improve output quality. For a creative-writing or companion stack, it covers the backend completely, and it has the broadest hardware compatibility of any local LLM server. Even integrated GPUs and old CPUs run small models on it, which makes it the right starting point regardless of your hardware.

Compared to the main alternative, oobabooga's text-generation-webui, KoboldCPP trades some advanced configurability for dramatically simpler setup. Oobabooga is more flexible for users who want to experiment with different loaders, training, and extensions, but it requires a proper installation and more configuration. For most people who want to run a model and chat with it, KoboldCPP is the faster path, and it's what we recommend for self-hosted companion setups.

Step one: download and launch

Download koboldcpp.exe from the official GitHub releases page at github.com/LostRuins/koboldcpp/releases. On Windows, that single file is everything. On Linux and Mac, grab the appropriate binary or build from source per the repository instructions. There's no installation step. Put the file wherever convenient, the same folder as your models works fine.

Double-click koboldcpp.exe to launch the GUI loader, or run it from the command line for more control. The GUI loader presents the configuration options visually, which is the easier path for a first run. The CLI gives you scriptable, repeatable launches once you know your preferred settings.

Step two: get a model

KoboldCPP runs GGUF-format models, the standard quantized format you'll find on Hugging Face. If you haven't chosen a model yet, our guide to the best uncensored models in 2026 covers the field. For a first run, download a model sized to your VRAM: a 7B-8B model for 8-12GB cards, a 13B model for 12-16GB, larger as your hardware allows. Choose the Q4_K_M quantization unless you have a specific reason otherwise; it's the gold standard balance of quality and size.

Save the GGUF file somewhere you'll remember. KoboldCPP will load it directly from disk.

Step three: configure the launch

This is where the settings that matter live. In the GUI loader, the key options are:

Model: point it at your downloaded GGUF file. This is the only strictly required setting.

GPU layers (offloading): this controls how much of the model runs on your GPU versus your CPU. The goal is to offload as many layers to the GPU as your VRAM allows, because GPU inference is far faster than CPU. KoboldCPP can estimate this automatically, but if you hit out-of-memory errors, reduce the layer count. If you have VRAM to spare, increase it. Getting all layers onto the GPU is the difference between fast and slow generation, and it's the single most important performance setting. Our GPU guide covers how much VRAM different models need.

Context size: this sets how much conversation the model can hold in working memory, measured in tokens. Larger contexts let your companion remember more of the current conversation but consume more VRAM. A common starting point is 8192 or 16384 tokens, raised if your VRAM allows and your use case needs longer memory. This setting interacts directly with how memory works in practice.

Most other settings can stay at defaults for a first run. KoboldCPP's defaults are sensible, and you can tune later once you understand your hardware's behavior.

Launch it, and KoboldCPP loads the model into VRAM, opens the Kobold Lite interface in your browser, and starts serving its API. The loading takes a few seconds to a minute depending on model size and disk speed.

Step four: test in Kobold Lite

Kobold Lite, KoboldCPP's built-in interface, opens automatically and lets you chat with the model immediately. This is the fastest way to confirm everything works. Type a message, get a response, and you've verified your model, your offloading, and your context settings are functioning.

Kobold Lite is functional for basic use and good for testing, but it's not a full companion interface. For character cards, lorebooks, persistent personas, and the polished experience that makes local AI competitive with cloud platforms, you'll connect SillyTavern to KoboldCPP's API.

Step five: connect SillyTavern

Keep KoboldCPP running in the background; it's the engine, and SillyTavern is the dashboard. Download SillyTavern from its official GitHub repository, extract it, and run Start.bat on Windows or the shell script on Mac and Linux. SillyTavern opens in your browser at localhost:8000.

In SillyTavern's API connection settings, select the KoboldCPP (or KoboldAI) API type and enter the address KoboldCPP is serving on, typically localhost:5001. Connect, and SillyTavern routes everything through KoboldCPP to your local model. Now you have KoboldCPP doing the inference and SillyTavern providing character management, lorebooks, group chats, and the customization depth that makes a self-hosted companion genuinely competitive with anything cloud platforms offer.

The full SillyTavern character and lorebook setup is covered in our local companion guide, but the connection itself is this simple: KoboldCPP serves the API, SillyTavern connects to it, done.

Tuning samplers for better output

Once you're running, samplers control how the model selects its next words, and they have a large effect on output quality. KoboldCPP supports the full modern sampler set, and a few are worth understanding.

Temperature controls randomness. Lower values (0.7-0.9) produce more focused, consistent output; higher values (1.0-1.2) produce more creative, varied output. For companion and roleplay use, slightly higher temperature often reads as more natural, but too high produces incoherence.

The DRY sampler (Don't Repeat Yourself) penalizes repetition, which is one of the most common immersion-breakers in local AI. If your companion keeps reusing the same phrases, DRY addresses it directly. XTC (Exclude Top Choices) occasionally removes the most predictable next words, pushing the model toward less formulaic output, which reduces the generic "AI slop" feel.

A common piece of advice: if your character sounds robotic, fix the repetition and exclusion penalties before you touch temperature. The DRY and XTC samplers often do more for output naturalness than temperature adjustments, and KoboldCPP exposes both. SillyTavern provides preset sampler configurations tuned for roleplay that are a good starting point, and you can adjust from there.

Common problems and fixes

Out-of-memory errors on launch mean you're trying to offload more model than your VRAM holds. Reduce the GPU layer count, reduce the context size, or use a smaller model or more aggressive quantization. The quantization formats give you options here.

Slow generation usually means the model isn't fully on the GPU. Check that all layers offloaded successfully; if some ran on CPU because they didn't fit, generation will be slow. Free VRAM by closing other GPU applications, or reduce context size to make room for more model layers.

The model loads but responses are incoherent: check that you're using a sensible quantization (Q4_K_M or better, not Q2_K unless you're cramming a large model onto a small card) and that your sampler settings aren't extreme. Very high temperature produces incoherence.

SillyTavern won't connect: confirm KoboldCPP is running and note the exact port it reports (usually 5001), then enter that address in SillyTavern's connection settings. The KoboldCPP window shows the address it's serving on.

Beyond text: KoboldCPP's extra features

KoboldCPP is more capable than its single-file simplicity suggests, and a few features are worth knowing about once your basic setup runs.

Image generation: KoboldCPP can load a Stable Diffusion model alongside your language model and generate images directly, no separate tool required. This is lighter-weight than running a full AUTOMATIC1111 or ComfyUI setup, and it's enough for companion selfies and scenario images. The tradeoff is that running an image model alongside your language model needs additional VRAM, so it's most practical on larger cards.

Text-to-speech: KoboldCPP includes TTS support, so your companion can speak responses. Combined with SillyTavern's voice features, this turns a text companion into a spoken one. The built-in TTS is functional; for higher quality, SillyTavern connects to more advanced external TTS engines.

Embeddings: KoboldCPP can serve embeddings, which matters for the vector-based memory that makes a companion remember across long relationships. SillyTavern's vector storage extension can use KoboldCPP's embedding endpoint, keeping your entire memory pipeline local.

OpenAI-compatible API: because KoboldCPP speaks the OpenAI API format, any tool built for OpenAI can connect to it pointed at your local server. This means KoboldCPP isn't limited to SillyTavern; it works as a local drop-in for development, scripting, or any application expecting an OpenAI-compatible endpoint, all running on your hardware with your model.

These features come from the KoboldCPP project bundling capabilities that usually require separate tools, which is a large part of why it's the recommended single-binary starting point for local AI.

KoboldCPP versus the alternatives

For completeness, the main alternatives and where they fit. Oobabooga's text-generation-webui is more configurable and supports more model loaders and extensions, at the cost of a real installation and more complexity; it's the choice for users who want to experiment broadly or do training. Ollama is excellent for command-line and developer workflows, with a clean model library and simple commands, though its UI story is thinner than KoboldCPP's built-in interface. LM Studio offers a polished graphical experience for browsing, downloading, and running models, appealing to users who prefer a more app-like feel.

KoboldCPP's niche is the intersection of zero-setup simplicity, broad hardware compatibility, and the bundled features (image, TTS, embeddings, API) that cover a complete companion stack. For the self-hosted companion use case specifically, that combination is hard to beat.

Why this stack wins for local companions

KoboldCPP plus SillyTavern is the standard for a reason. KoboldCPP handles the hard part (running the model) with the least possible friction and the broadest hardware support, while SillyTavern handles the experience (characters, memory, customization) with depth no cloud platform matches. Together they produce a completely private, unlimited, uncensored local AI setup with no monthly fees and no company reading your conversations.

The setup takes well under an hour for a first-timer: download KoboldCPP, download a Q4_K_M model, launch with sensible GPU offloading and context settings, connect SillyTavern, and build a character. From there you have a local AI server that runs almost any GGUF model on almost any hardware, tuned with the samplers that make output read naturally, serving an interface that turns a forgetful language model into a persistent companion. For users who want local AI without the configuration headaches of more complex backends, KoboldCPP is where to start and, for most people, where to stay.

Keep reading

GUIDE

'Best Abliterated Models in 2026: What Actually Works After the Hype'

10 min read

INSIGHT

We Tested Every Free AI Tier So You Don't Have To

INSIGHT

Free AI Changelog — June: What Changed, What Tightened, What's New

GUIDE

Which Free AI Is Right for You? A Simple Decision Guide