Documentation ¶
Overview ¶
Package zeus provides Go bindings for llama.cpp, enabling local LLM inference with pre-built static libraries for Linux and Windows x86_64.
Zeus is designed for simplicity: load any GGUF model and start generating text with sensible defaults. No compilation required, no external dependencies.
Quick Start ¶
The simplest way to use Zeus is with the Chat API:
model, err := zeus.LoadModel("model.gguf")
if err != nil {
	log.Fatal(err)
}
defer model.Close()

chat := model.NewChat()
chat.AddMessage(zeus.RoleSystem, "You are a helpful assistant.")

for tok, err := range chat.GenerateSequence(ctx, "Hello!") {
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(tok.Text)
}
Core Abstractions ¶
Zeus provides three main abstractions for different use cases:
Model: Load and manage GGUF models. Provides tokenization, embeddings, and model information. Thread-safe for concurrent use.
Session: Token-level generation with state tracking. Use when you need precise control over the prompt format or are working with non-chat models. Supports checkpoint/backtrack for branching conversations (see the sketch below).
Chat: Message-level conversations with automatic template handling. Manages conversation history and applies the model's chat template. Best for chatbot-style applications.
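For example, token-level generation through a Session; a minimal sketch, assuming a loaded model and a ctx as in Quick Start (the prompt is illustrative):

session := model.NewSession()
for tok, err := range session.GenerateSequence(ctx, "Once upon a time",
	zeus.WithMaxTokens(64),
) {
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(tok.Text)
}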
Configuration ¶
Zeus uses Go's functional options pattern for clean, readable configuration:
model, err := zeus.LoadModel("model.gguf",
	zeus.WithContextSize(4096),
	zeus.WithKVCacheType(zeus.KVCacheQ8_0),
)

for tok, err := range session.GenerateSequence(ctx, prompt,
	zeus.WithMaxTokens(512),
	zeus.WithTemperature(0.7),
) {
	// ...
}
Most options have sensible defaults. You only need to configure what you want to change.
Streaming Generation ¶
Zeus provides two streaming interfaces:
iter.Seq2 via GenerateSequence: Returns tokens one at a time using Go 1.23+ range-over-func. Best for token-by-token processing.
io.ReadCloser via Generate: Returns generated text as a byte stream. Best for piping to HTTP responses or other io.Writers.
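For example, the io.ReadCloser form pipes directly into any io.Writer; a minimal sketch, assuming a session, prompt, and ctx as above:

r := session.Generate(ctx, prompt)
defer r.Close()
if _, err := io.Copy(os.Stdout, r); err != nil {
	log.Fatal(err)
}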
GPU Acceleration ¶
Zeus includes pre-built Vulkan support for GPU acceleration. Enable it with WithGPULayers:
model, err := zeus.LoadModel("model.gguf",
	zeus.WithGPULayers(zeus.GPULayersAll),
)
GPU acceleration is optional: Zeus falls back to CPU if Vulkan is unavailable.
Thread Safety ¶
Model is safe for concurrent use from multiple goroutines. The KV cache is protected by a mutex, so generation operations are serialized. Multiple Sessions or Chats can exist simultaneously, but only one can generate at a time.
[Close] is safe to call multiple times and will wait for any ongoing generation to complete.
Error Handling ¶
Zeus provides sentinel errors for common conditions that can be checked with errors.Is:
- ErrModelClosed: Operation attempted on a closed model
- ErrEmbeddingsDisabled: Embeddings requested but model loaded without WithEmbeddings
- ErrPromptTooLong: Prompt exceeds context size
- ErrDecodeFailed: Decode operation failed during generation
Typed errors provide additional context and can be checked with errors.As:
- ModelLoadError: Details about model loading failures
- GenerationError: Details about generation failures
- TokenizeError: Details about tokenization failures
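Both kinds of check compose; a minimal sketch, where the Tokenize call and text variable are just illustrative:

_, err := model.Tokenize(text, true)
if errors.Is(err, zeus.ErrModelClosed) {
	// The model was closed elsewhere; reload or abort.
}
var tokErr *zeus.TokenizeError
if errors.As(err, &tokErr) {
	log.Printf("tokenization failed: %v", tokErr)
}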
For complete API documentation, see DOC.md in the repository.
Index ¶
- Constants
- Variables
- func SetVerbose(verbose bool)
- type AgentConfig
- type AgentEvent
- type AgentEventType
- type Chat
- type ChatConfig
- type ChatFormat
- type ChatMessage
- type ChatOption
- func WithChatFormat(format ChatFormat) ChatOption
- func WithMaxIterations(n int) ChatOption
- func WithMaxToolCalls(n int) ChatOption
- func WithParallelToolCalls(parallel bool) ChatOption
- func WithToolChoice(choice ToolChoice) ChatOption
- func WithToolTimeout(d time.Duration) ChatOption
- func WithTools(tools ...Tool) ChatOption
- type ChatParams
- type ChatTemplateConfig
- type ChatTemplateError
- type ChatTemplateOption
- type EmbeddingError
- type GenerateConfig
- type GenerateOption
- func WithFrequencyPenalty(penalty float32) GenerateOption
- func WithGenerateSeed(seed int) GenerateOption
- func WithGrammar(grammar string) GenerateOption
- func WithIgnoreEOS() GenerateOption
- func WithMaxTokens(n int) GenerateOption
- func WithMinP(p float32) GenerateOption
- func WithMirostat(mode MirostatMode, tau, eta float32) GenerateOption
- func WithPresencePenalty(penalty float32) GenerateOption
- func WithReasoningEnabled(enabled bool) GenerateOption
- func WithRepeatLastN(n int) GenerateOption
- func WithRepeatPenalty(penalty float32) GenerateOption
- func WithStopSequences(seqs ...string) GenerateOption
- func WithTemperature(t float32) GenerateOption
- func WithThreads(n int) GenerateOption
- func WithTopK(k int) GenerateOption
- func WithTopP(p float32) GenerateOption
- type GenerationError
- type GrammarTrigger
- type GrammarTriggerType
- type KVCacheType
- type MirostatMode
- type Model
- type ModelConfig
- type ModelInfo
- type ModelLoadError
- type ModelOption
- func WithBatchSize(n int) ModelOption
- func WithContextSize(n int) ModelOption
- func WithEmbeddings() ModelOption
- func WithGPULayers(n int) ModelOption
- func WithKVCacheType(t KVCacheType) ModelOption
- func WithLoRA(path string) ModelOption
- func WithMMap(enable bool) ModelOption
- func WithMainGPU(gpu int) ModelOption
- func WithMlock(enable bool) ModelOption
- func WithNUMA(enable bool) ModelOption
- func WithRopeFreqBase(base float32) ModelOption
- func WithRopeFreqScale(scale float32) ModelOption
- func WithSeed(seed int) ModelOption
- func WithTensorSplit(split []float32) ModelOption
- func WithWarmup(enable bool) ModelOption
- type ParseResult
- type Role
- type Session
- type SpecialTokens
- type StopReason
- type Token
- type TokenProb
- type TokenWithLogprobs
- type TokenizeError
- type Tool
- type ToolCall
- type ToolChoice
- type ToolExecutionError
- type ToolParameter
- type ToolResult
Constants ¶
const GPULayersAll = 999
GPULayersAll requests that all model layers be offloaded to the GPU. llama.cpp will offload as many layers as fit in available VRAM.
Variables ¶
var (
	ErrModelClosed        = errors.New("zeus: model is closed")
	ErrEmbeddingsDisabled = errors.New("zeus: model loaded without embeddings support")
	ErrPromptTooLong      = errors.New("zeus: prompt exceeds context size")
	ErrDecodeFailed       = errors.New("zeus: decode operation failed")
	ErrSessionIsNil       = errors.New("zeus: session is nil and not defined")
	ErrModelIsNil         = errors.New("zeus: model is nil and not defined")
	ErrChatIsNil          = errors.New("zeus: chat is nil and not defined")

	// Tool-related errors
	ErrNoToolsRegistered       = errors.New("zeus: no tools registered")
	ErrMaxIterationsExceeded   = errors.New("zeus: max agent iterations exceeded")
	ErrMaxToolCallsExceeded    = errors.New("zeus: max tool calls exceeded")
	ErrTemplateApply           = errors.New("zeus: failed to apply chat template with tools")
	ErrToolTemplateUnsupported = errors.New("zeus: model does not support native tool templates")
)
Sentinel errors for use with errors.Is()
Functions ¶
func SetVerbose ¶
func SetVerbose(verbose bool)
SetVerbose enables or disables verbose logging from llama.cpp.
Types ¶
type AgentConfig ¶ added in v1.1.0
type AgentConfig struct {
Tools []Tool // Registered tools
MaxIterations int // Maximum agentic loop iterations (default: 10)
MaxToolCalls int // Maximum total tool calls (default: 25)
ToolTimeout time.Duration // Per-tool execution timeout (default: 30s)
ToolChoice ToolChoice // How the model should use tools (default: Auto)
ParallelToolCalls bool // Allow multiple tool calls in one response (default: true)
}
AgentConfig holds configuration for agentic tool execution.
func DefaultAgentConfig ¶ added in v1.1.0
func DefaultAgentConfig() AgentConfig
DefaultAgentConfig returns an AgentConfig with sensible defaults.
type AgentEvent ¶ added in v1.1.0
type AgentEvent struct {
Type AgentEventType // Type of event
Token *Token // For AgentEventToken: the generated token
ToolCall *ToolCall // For AgentEventToolCallStart/End: the tool call
Result *ToolResult // For AgentEventToolCallEnd: the result
Error error // For AgentEventError: the error that occurred
}
AgentEvent represents an event during agentic loop execution.
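A consumer typically switches on the event type while ranging over GenerateWithTools; a minimal sketch, assuming a chat with registered tools (the prompt is illustrative):

for ev, err := range chat.GenerateWithTools(ctx, "What time is it in Oslo?") {
	if err != nil {
		log.Fatal(err)
	}
	switch ev.Type {
	case zeus.AgentEventToken:
		fmt.Print(ev.Token.Text)
	case zeus.AgentEventToolCallStart:
		log.Printf("calling %s", ev.ToolCall.Name)
	case zeus.AgentEventToolCallEnd:
		log.Printf("result: %s", ev.Result.Content)
	case zeus.AgentEventError:
		log.Printf("agent error: %v", ev.Error)
	}
}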
type AgentEventType ¶ added in v1.1.0
type AgentEventType int
AgentEventType indicates the type of event during agentic loop execution.
const (
	AgentEventToken AgentEventType = iota // A token was generated
	AgentEventToolCallStart               // A tool call is starting
	AgentEventToolCallEnd                 // A tool call completed
	AgentEventError                       // An error occurred
	AgentEventDone                        // The agentic loop completed
)
func (AgentEventType) String ¶ added in v1.1.0
func (t AgentEventType) String() string
String returns the string representation of the event type.
type Chat ¶
type Chat interface {
// Generate sends a user message and returns the assistant's response as a stream.
// The user message and assistant response are added to the conversation.
Generate(ctx context.Context, userMessage string, opts ...GenerateOption) io.ReadCloser
// GenerateSequence sends a user message and returns tokens as an iterator.
// The user message and assistant response are added to the conversation.
GenerateSequence(ctx context.Context, userMessage string, opts ...GenerateOption) iter.Seq2[Token, error]
// GenerateWithTools executes an agentic loop, auto-executing tools until the model
// produces a final response without tool calls. Requires tools to be registered via WithTools.
GenerateWithTools(ctx context.Context, userMessage string, opts ...GenerateOption) iter.Seq2[AgentEvent, error]
// AddMessage adds a message to the conversation without generating.
// Useful for adding system prompts or reconstructing conversation history.
AddMessage(role Role, content string)
// Checkpoint creates a snapshot of the current chat state.
Checkpoint() Chat
// Backtrack returns to the state before the last Generate call.
Backtrack() (Chat, bool)
// Messages returns a copy of the message history.
Messages() []ChatMessage
// MessageCount returns the number of messages in the conversation.
MessageCount() int
// Model returns the parent model.
Model() Model
// Tools returns the registered tools for this chat.
Tools() []Tool
// Compact summarizes older messages and replaces them with the summary, keeping the last 10% of messages (or at least 1).
Compact(ctx context.Context) error
}
Chat represents a conversation that tracks message history. Generate methods mutate the chat, appending assistant responses. Use Checkpoint() before generation to save state for branching. Note: Chat wraps a Session internally for KV cache efficiency.
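For example, branching a conversation by snapshotting before generation; a minimal sketch (the prompt is illustrative):

saved := chat.Checkpoint()
for tok, err := range chat.GenerateSequence(ctx, "Explore option A.") {
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(tok.Text)
}
// Discard that branch and continue from the snapshot instead.
chat = saved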
type ChatConfig ¶
type ChatConfig struct {
ChatTemplateConfig // Embedded - Template, AddAssistant
AgentConfig AgentConfig // Agent/tool configuration
ChatFormat ChatFormat // Tool call format for parsing (default: Hermes2Pro)
}
ChatConfig holds options for creating a Chat.
func DefaultChatConfig ¶
func DefaultChatConfig() ChatConfig
DefaultChatConfig returns the default chat configuration.
type ChatFormat ¶ added in v1.1.0
type ChatFormat int32
ChatFormat represents the tool call format for parsing model output. Different models use different formats for function/tool calling. This is forward-compatible: new formats added to llama.cpp work automatically even if not listed here. Use String() to get a readable name.
const (
	ChatFormatContentOnly             ChatFormat = 0  // No tool calls, content only
	ChatFormatGeneric                 ChatFormat = 1  // Generic format with JSON
	ChatFormatMistralNemo             ChatFormat = 2  // Mistral Nemo format
	ChatFormatMagistral               ChatFormat = 3  // Magistral format
	ChatFormatLlama3X                 ChatFormat = 4  // Llama 3.x format
	ChatFormatLlama3XWithBuiltinTools ChatFormat = 5  // Llama 3.x with builtin tools
	ChatFormatDeepSeekR1              ChatFormat = 6  // DeepSeek R1 format
	ChatFormatFireFunctionV2          ChatFormat = 7  // FireFunction v2 format
	ChatFormatFunctionaryV32          ChatFormat = 8  // Functionary v3.2 format
	ChatFormatFunctionaryV31Llama31   ChatFormat = 9  // Functionary v3.1 Llama 3.1 format
	ChatFormatDeepSeekV31             ChatFormat = 10 // DeepSeek V3.1 format
	ChatFormatHermes2Pro              ChatFormat = 11 // Hermes 2 Pro format (Qwen 2.5, Hermes 2/3)
	ChatFormatCommandR7B              ChatFormat = 12 // Command R7B format
	ChatFormatGranite                 ChatFormat = 13 // Granite format
	ChatFormatGPTOSS                  ChatFormat = 14 // GPT-OSS format
	ChatFormatSeedOSS                 ChatFormat = 15 // Seed-OSS format
	ChatFormatNemotronV2              ChatFormat = 16 // Nemotron V2 format
	ChatFormatApertus                 ChatFormat = 17 // Apertus format
	ChatFormatLFM2WithJSONTools       ChatFormat = 18 // LFM2 with JSON tools format
	ChatFormatGLM45                   ChatFormat = 19 // GLM 4.5 format
	ChatFormatMiniMaxM2               ChatFormat = 20 // MiniMax-M2 format
	ChatFormatKimiK2                  ChatFormat = 21 // Kimi K2 format
	ChatFormatQwen3CoderXML           ChatFormat = 22 // Qwen3 Coder format
	ChatFormatApriel15                ChatFormat = 23 // Apriel 1.5 format
	ChatFormatXiaomiMiMo              ChatFormat = 24 // Xiaomi MiMo format
)
Known chat formats. This list may be incomplete - llama.cpp may support additional formats that will work automatically.
func (ChatFormat) String ¶ added in v1.1.0
func (f ChatFormat) String() string
String returns the name of the chat format from llama.cpp.
type ChatMessage ¶
type ChatMessage struct {
Role Role
Content string
// For assistant messages: tool calls made in this message
ToolCalls []ToolCall
// For tool result messages: identifies which tool call this responds to
ToolName string
ToolCallID string
}
ChatMessage represents a single message in a conversation.
type ChatOption ¶
type ChatOption func(*ChatConfig)
ChatOption is a functional option for NewChat.
func WithChatFormat ¶ added in v1.1.0
func WithChatFormat(format ChatFormat) ChatOption
WithChatFormat sets the tool call format for parsing model output. Different models use different formats. Normally this is auto-detected by llama.cpp when applying tool templates, so you typically don't need to set this explicitly.
func WithMaxIterations ¶ added in v1.1.0
func WithMaxIterations(n int) ChatOption
WithMaxIterations sets the maximum number of agentic loop iterations. Each iteration may contain multiple tool calls. Default is 10.
func WithMaxToolCalls ¶ added in v1.1.0
func WithMaxToolCalls(n int) ChatOption
WithMaxToolCalls sets the maximum total tool calls across all iterations. Default is 25.
func WithParallelToolCalls ¶ added in v1.1.0
func WithParallelToolCalls(parallel bool) ChatOption
WithParallelToolCalls controls whether the model can make multiple tool calls in a single response. Default is true.
func WithToolChoice ¶ added in v1.1.0
func WithToolChoice(choice ToolChoice) ChatOption
WithToolChoice sets how the model should use tools.
- ToolChoiceAuto (default): Model decides when to use tools
- ToolChoiceNone: Never use tools
- ToolChoiceRequired: Must use a tool
func WithToolTimeout ¶ added in v1.1.0
func WithToolTimeout(d time.Duration) ChatOption
WithToolTimeout sets the per-tool execution timeout. Default is 30 seconds.
func WithTools ¶ added in v1.1.0
func WithTools(tools ...Tool) ChatOption
WithTools registers tools for the chat to use with GenerateWithTools.
type ChatParams ¶ added in v1.1.0
type ChatParams struct {
Prompt string // Formatted prompt with tools embedded
Grammar string // GBNF grammar for constraining output (may be empty)
Format ChatFormat // Detected chat format
GrammarLazy bool // Apply grammar only after trigger patterns
GrammarTriggers []GrammarTrigger // Typed patterns that activate grammar
AdditionalStops []string // Extra stop sequences
}
ChatParams contains the result of applying a chat template with tools. This includes the formatted prompt, grammar constraints, and other metadata.
type ChatTemplateConfig ¶
type ChatTemplateConfig struct {
Template string // Empty = use model's embedded template
AddAssistant bool // Add assistant turn prefix (default: true)
AutoCompactThreshold float32 // Ratio after which automatic compact occurs (Chat only)
}
ChatTemplateConfig holds options for ApplyChatTemplate.
func DefaultChatTemplateConfig ¶
func DefaultChatTemplateConfig() ChatTemplateConfig
DefaultChatTemplateConfig returns the default chat template configuration.
type ChatTemplateError ¶
type ChatTemplateError struct {
Message string
}
ChatTemplateError provides details about chat template failures.
func (*ChatTemplateError) Error ¶
func (e *ChatTemplateError) Error() string
type ChatTemplateOption ¶
type ChatTemplateOption func(*ChatTemplateConfig)
ChatTemplateOption is a functional option for ApplyChatTemplate.
func WithAddAssistant ¶
func WithAddAssistant(add bool) ChatTemplateOption
WithAddAssistant controls whether to append the assistant turn prefix.
func WithAutoCompactThreshold ¶
func WithAutoCompactThreshold(threshold float32) ChatTemplateOption
WithAutoCompactThreshold sets the context usage ratio at which Chat automatically compacts the conversation. When context usage exceeds this threshold (e.g., 0.8 = 80%), older messages are summarized to free space. Use 0 to disable auto-compaction.
func WithChatTemplate ¶
func WithChatTemplate(name string) ChatTemplateOption
WithChatTemplate specifies a built-in template name (e.g., "chatml", "llama3").
type EmbeddingError ¶
type EmbeddingError struct {
Message string
}
EmbeddingError provides details about embedding extraction failures.
func (*EmbeddingError) Error ¶
func (e *EmbeddingError) Error() string
type GenerateConfig ¶
type GenerateConfig struct {
MaxTokens int // Maximum tokens to generate (0 = unlimited)
Temperature float32 // Sampling temperature (higher = more random)
TopK int // Top-K sampling (0 = disabled)
TopP float32 // Nucleus sampling probability
MinP float32 // Minimum probability threshold
RepeatPenalty float32 // Repetition penalty (1.0 = no penalty)
RepeatLastN int // Number of tokens to consider for repetition penalty
FrequencyPenalty float32 // Frequency-based penalty
PresencePenalty float32 // Presence-based penalty
Mirostat MirostatMode // Mirostat sampling mode
MirostatTau float32 // Mirostat target entropy
MirostatEta float32 // Mirostat learning rate
StopSequences []string // Sequences that stop generation
IgnoreEOS bool // Continue past end-of-sequence token
Grammar string // GBNF grammar to constrain output
Seed int // Random seed for sampling (-1 = random)
Threads int // Number of threads for generation (0 = autodetect)
ReasoningEnabled bool // Enable model thinking/reasoning (default: true)
// contains filtered or unexported fields
}
GenerateConfig holds configuration for text generation.
func DefaultGenerateConfig ¶
func DefaultGenerateConfig() GenerateConfig
DefaultGenerateConfig returns a GenerateConfig with sensible defaults.
type GenerateOption ¶
type GenerateOption func(*GenerateConfig)
GenerateOption configures text generation.
func WithFrequencyPenalty ¶
func WithFrequencyPenalty(penalty float32) GenerateOption
WithFrequencyPenalty sets the frequency-based penalty. Penalizes tokens based on their frequency in the generated text.
func WithGenerateSeed ¶
func WithGenerateSeed(seed int) GenerateOption
WithGenerateSeed sets the random seed for sampling. Use -1 for random seed.
func WithGrammar ¶
func WithGrammar(grammar string) GenerateOption
WithGrammar constrains output to match a GBNF grammar.
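For example, a GBNF grammar can pin the output to a fixed vocabulary; a minimal sketch (the grammar and prompt are illustrative):

const yesNo = `root ::= "yes" | "no"`
for tok, err := range session.GenerateSequence(ctx, "Is the sky blue? Answer yes or no.",
	zeus.WithGrammar(yesNo),
) {
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(tok.Text)
}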
func WithIgnoreEOS ¶
func WithIgnoreEOS() GenerateOption
WithIgnoreEOS enables generation past the end-of-sequence token.
func WithMaxTokens ¶
func WithMaxTokens(n int) GenerateOption
WithMaxTokens sets the maximum number of tokens to generate. Use 0 for unlimited (generates until EOS or context full).
func WithMinP ¶
func WithMinP(p float32) GenerateOption
WithMinP sets the minimum probability threshold. Tokens with probability below this are excluded.
func WithMirostat ¶
func WithMirostat(mode MirostatMode, tau, eta float32) GenerateOption
WithMirostat enables Mirostat adaptive sampling.
func WithPresencePenalty ¶
func WithPresencePenalty(penalty float32) GenerateOption
WithPresencePenalty sets the presence-based penalty. Penalizes tokens that have already appeared in the generated text.
func WithReasoningEnabled ¶
func WithReasoningEnabled(enabled bool) GenerateOption
WithReasoningEnabled enables or disables model thinking/reasoning. When disabled, thinking tags are closed immediately to prevent reasoning output. This is equivalent to llama.cpp server's --reasoning-budget 0.
func WithRepeatLastN ¶
func WithRepeatLastN(n int) GenerateOption
WithRepeatLastN sets how many recent tokens to consider for repetition penalty.
func WithRepeatPenalty ¶
func WithRepeatPenalty(penalty float32) GenerateOption
WithRepeatPenalty sets the repetition penalty. Values > 1.0 discourage repetition, 1.0 = no penalty.
func WithStopSequences ¶
func WithStopSequences(seqs ...string) GenerateOption
WithStopSequences sets sequences that will stop generation when encountered.
func WithTemperature ¶
func WithTemperature(t float32) GenerateOption
WithTemperature sets the sampling temperature. Higher values (e.g., 1.0) make output more random. Lower values (e.g., 0.2) make output more deterministic.
func WithThreads ¶
func WithThreads(n int) GenerateOption
WithThreads sets the number of threads for generation. The default is -1, which autodetects the number of available cores.
func WithTopK ¶
func WithTopK(k int) GenerateOption
WithTopK sets the top-K sampling value. Only the K most likely tokens are considered. Use 0 to disable.
func WithTopP ¶
func WithTopP(p float32) GenerateOption
WithTopP sets the nucleus sampling probability. Tokens are sampled from the smallest set whose cumulative probability exceeds P.
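These sampling options compose; a minimal sketch of a conservative setup (the values are illustrative, not recommendations):

opts := []zeus.GenerateOption{
	zeus.WithTemperature(0.3),
	zeus.WithTopK(40),
	zeus.WithTopP(0.9),
	zeus.WithRepeatPenalty(1.1),
}
for tok, err := range session.GenerateSequence(ctx, prompt, opts...) {
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(tok.Text)
}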
type GenerationError ¶
GenerationError provides details about text generation failures.
func (*GenerationError) Error ¶
func (e *GenerationError) Error() string
type GrammarTrigger ¶ added in v1.1.3
type GrammarTrigger struct {
Type GrammarTriggerType
Value string // For word/pattern types
Token int32 // For token type
}
GrammarTrigger represents a pattern that activates lazy grammar.
type GrammarTriggerType ¶ added in v1.1.3
type GrammarTriggerType int32
GrammarTriggerType defines how a trigger pattern should be matched.
const (
	TriggerTypeWord        GrammarTriggerType = 0 // Exact word match (auto-escaped)
	TriggerTypePattern     GrammarTriggerType = 1 // Regex pattern (anywhere in output)
	TriggerTypePatternFull GrammarTriggerType = 2 // Full regex pattern
	TriggerTypeToken       GrammarTriggerType = 3 // Token ID trigger
)
type KVCacheType ¶
type KVCacheType int
KVCacheType represents the data type for KV cache storage. Lower precision types use less memory but may reduce quality slightly.
const (
	KVCacheF32 KVCacheType = iota // Full precision (32-bit float)
	KVCacheF16                    // Half precision (16-bit float) - default
	KVCacheQ8_0                   // 8-bit quantized
	KVCacheQ4_0                   // 4-bit quantized - lowest memory
)
func (KVCacheType) String ¶
func (k KVCacheType) String() string
String returns the llama.cpp string representation.
type MirostatMode ¶
type MirostatMode int
MirostatMode controls the Mirostat adaptive sampling algorithm.
const (
	MirostatDisabled MirostatMode = iota // Standard sampling (top-k, top-p, temperature)
	Mirostat1                            // Mirostat v1 algorithm
	Mirostat2                            // Mirostat v2 algorithm
)
type Model ¶
type Model interface {
// NewSession creates a new empty session for text generation.
NewSession() *session
// NewChat creates a new empty chat for conversation.
// Uses the model's embedded template by default.
NewChat(opts ...ChatOption) Chat
// Embeddings extracts embeddings for the given text.
// The model must be loaded with WithEmbeddings() option.
Embeddings(ctx context.Context, text string) ([]float32, error)
// EmbeddingsBatch extracts embeddings for multiple texts in a single call.
// The model must be loaded with WithEmbeddings() option.
EmbeddingsBatch(ctx context.Context, texts []string) ([][]float32, error)
// Tokenize converts text to token IDs.
Tokenize(text string, addSpecial bool) ([]int, error)
// TokenizeCount returns the number of token IDs that the text represents.
TokenizeCount(text string, addSpecial bool) (int, error)
// Detokenize converts token IDs back to text.
Detokenize(tokens []int) (string, error)
// DetokenizeLength returns the length of the string that the token IDs represent.
DetokenizeLength(tokens []int) (int, error)
// BOS returns the beginning-of-sequence token ID.
BOS() int
// EOS returns the end-of-sequence token ID.
EOS() int
// TokenToText converts a single token ID to its text representation.
TokenToText(token int) string
// IsSpecialToken returns true if the token is a special/control token.
IsSpecialToken(token int) bool
// IsEOG returns true if the token is an end-of-generation token.
IsEOG(token int) bool
// SpecialTokens returns all special token IDs.
SpecialTokens() SpecialTokens
// VocabSize returns the vocabulary size.
VocabSize() int
// ContextSize returns the effective context window size.
ContextSize() int
// TrainContextSize returns the model's original training context size.
TrainContextSize() int
// EmbeddingSize returns the embedding dimension.
EmbeddingSize() int
// Info returns model metadata and architecture details.
Info() ModelInfo
// ChatTemplate returns the model's embedded chat template string.
// Returns empty string if no template is embedded.
ChatTemplate() string
// ApplyChatTemplate formats messages using a chat template.
// Uses model's embedded template by default.
ApplyChatTemplate(messages []ChatMessage, opts ...ChatTemplateOption) (string, error)
// Close releases model resources.
Close() error
}
Model represents a loaded LLM model.
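For example, a round trip through the tokenizer; a minimal sketch, assuming a loaded model:

ids, err := model.Tokenize("Hello, world!", true)
if err != nil {
	log.Fatal(err)
}
fmt.Println(ids) // token IDs, including special tokens (addSpecial = true)

text, err := model.Detokenize(ids)
if err != nil {
	log.Fatal(err)
}
fmt.Println(text)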
type ModelConfig ¶
type ModelConfig struct {
ContextSize int // Context window size (0 = model's native context)
Seed int // Random seed for model initialization
BatchSize int // Batch size for prompt processing
GPULayers int // Number of layers to offload to GPU (GPULayersAll for all)
MainGPU int // Primary GPU device index for multi-GPU systems
TensorSplit []float32 // Distribution of layers across GPUs (e.g., [0.5, 0.5])
KVCacheType KVCacheType // Data type for KV cache storage
RopeFreqBase float32 // RoPE frequency base (0 = from model)
RopeFreqScale float32 // RoPE frequency scale (0 = from model)
LoraAdapter string // Path to LoRA adapter file
UseMMap bool // Use memory mapping for model loading
UseMlock bool // Lock model in memory (prevent swapping)
UseNUMA bool // Enable NUMA optimizations
Embeddings bool // Enable embedding extraction mode
Warmup bool // Run warmup in background after loading (default: true)
}
ModelConfig holds configuration for model loading.
func DefaultModelConfig ¶
func DefaultModelConfig() ModelConfig
DefaultModelConfig returns a ModelConfig with sensible defaults.
type ModelInfo ¶
type ModelInfo struct {
Description string // Model description (e.g., "LLaMA v2 7B Q4_K_M")
Architecture string // Architecture name from metadata
QuantType string // Quantization type (e.g., "Q4_K_M")
Parameters uint64 // Total parameter count
Size uint64 // Model size in bytes
Layers int // Number of layers
Heads int // Number of attention heads
HeadsKV int // Number of KV heads
VocabSize int // Vocabulary size
}
ModelInfo provides model metadata and architecture details.
type ModelLoadError ¶
ModelLoadError provides details about model loading failures.
func (*ModelLoadError) Error ¶
func (e *ModelLoadError) Error() string
type ModelOption ¶
type ModelOption func(*ModelConfig)
ModelOption configures model loading.
func WithBatchSize ¶
func WithBatchSize(n int) ModelOption
WithBatchSize sets the batch size for prompt processing.
func WithContextSize ¶
func WithContextSize(n int) ModelOption
WithContextSize sets the context window size. Use 0 to use the model's native context size.
func WithEmbeddings ¶
func WithEmbeddings() ModelOption
WithEmbeddings enables embedding extraction mode. Required to use the Embeddings() method.
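A minimal sketch of embedding extraction (the model path is illustrative):

model, err := zeus.LoadModel("embedding-model.gguf",
	zeus.WithEmbeddings(),
)
if err != nil {
	log.Fatal(err)
}
defer model.Close()

vec, err := model.Embeddings(ctx, "The quick brown fox")
if err != nil {
	log.Fatal(err)
}
fmt.Println(len(vec)) // matches model.EmbeddingSize()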
func WithGPULayers ¶
func WithGPULayers(n int) ModelOption
WithGPULayers sets the number of layers to offload to GPU. Use GPULayersAll to offload all layers, 0 for CPU only.
func WithKVCacheType ¶
func WithKVCacheType(t KVCacheType) ModelOption
WithKVCacheType sets the data type for KV cache storage.
func WithLoRA ¶
func WithLoRA(path string) ModelOption
WithLoRA loads a LoRA adapter from the specified path.
func WithMMap ¶
func WithMMap(enable bool) ModelOption
WithMMap enables or disables memory mapping for model loading. Disabling it forces the entire model to be loaded into memory, which may improve performance.
func WithMainGPU ¶
func WithMainGPU(gpu int) ModelOption
WithMainGPU sets the primary GPU device index for multi-GPU systems.
func WithMlock ¶
func WithMlock(enable bool) ModelOption
WithMlock enables or disables memory locking. When enabled, the model is locked in RAM to prevent swapping.
func WithNUMA ¶
func WithNUMA(enable bool) ModelOption
WithNUMA enables or disables NUMA optimizations.
func WithRopeFreqBase ¶
func WithRopeFreqBase(base float32) ModelOption
WithRopeFreqBase sets the RoPE frequency base. Use 0 to use the model's default value.
func WithRopeFreqScale ¶
func WithRopeFreqScale(scale float32) ModelOption
WithRopeFreqScale sets the RoPE frequency scale. Use 0 to use the model's default value.
func WithSeed ¶
func WithSeed(seed int) ModelOption
WithSeed sets the random seed for model initialization.
func WithTensorSplit ¶
func WithTensorSplit(split []float32) ModelOption
WithTensorSplit sets the distribution of layers across multiple GPUs. For example, []float32{0.5, 0.5} splits evenly between two GPUs.
func WithWarmup ¶ added in v1.2.0
func WithWarmup(enable bool) ModelOption
WithWarmup enables or disables background warmup after model loading. When enabled (default), this runs a minimal decode in a goroutine to initialize GPU kernels and reduce latency on the first real generation.
type ParseResult ¶ added in v1.1.0
type ParseResult struct {
Content string // Non-tool-call text content
ReasoningContent string // Reasoning/thinking content (if any)
ToolCalls []ToolCall // Parsed tool calls
}
ParseResult contains the parsed tool calls from model output.
type Session ¶
type Session interface {
// Generate processes the prompt, returns new tokens and moves the session forward.
Generate(ctx context.Context, prompt string, opts ...GenerateOption) io.ReadCloser
// GenerateSequence processes the prompt, returns new tokens as yield iterator and moves the session forward.
GenerateSequence(ctx context.Context, prompt string, opts ...GenerateOption) iter.Seq2[Token, error]
// GenerateSequenceWithLogprobs is like GenerateSequence but returns tokens with probability information.
// topK specifies how many top alternatives to include (0 = just the selected token's prob/logit).
GenerateSequenceWithLogprobs(ctx context.Context, prompt string, topK int, opts ...GenerateOption) iter.Seq2[TokenWithLogprobs, error]
// Checkpoint creates a snapshot of the current session state.
// Both the original and checkpoint can be used independently for branching.
// Note: All sessions share one KV cache; switching between divergent sessions recomputes tokens from the common prefix.
Checkpoint() Session
// Backtrack returns to the state before the last Generate call.
// Returns ok=false if this is the initial session (no parent).
Backtrack() (Session, bool)
// Tokens returns a copy of the token history for this session.
Tokens() []int
// Text returns the full text of this session (detokenized).
Text() (string, error)
// TokenCount returns the number of tokens in this session.
TokenCount() int
// ContextUsed returns the percentage of context used in this session.
ContextUsed() float64
// Model returns the parent model.
Model() Model
}
Session represents a conversation state that tracks token history. Generate/GenerateSequence mutate the session, appending new tokens. Use Checkpoint() before generation to save state for branching. Note: All sessions share one KV cache. Switching between unrelated sessions recomputes all tokens.
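Token probabilities are available through GenerateSequenceWithLogprobs; a minimal sketch requesting the top 3 alternatives per token, assuming a session, prompt, and ctx as above:

for tok, err := range session.GenerateSequenceWithLogprobs(ctx, prompt, 3) {
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%q p=%.3f\n", tok.Text, tok.Prob)
	for _, alt := range tok.TopK {
		fmt.Printf("  alt %q p=%.3f\n", alt.Text, alt.Prob)
	}
}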
type SpecialTokens ¶
type SpecialTokens struct {
BOS int // Beginning of sequence (-1 if not available)
EOS int // End of sequence
EOT int // End of turn
PAD int // Padding
SEP int // Separator
NL int // Newline
}
SpecialTokens contains all special token IDs for the model.
type StopReason ¶
type StopReason int
StopReason indicates why text generation stopped.
const (
	StopReasonEOS          StopReason = iota // End of sequence token encountered
	StopReasonMaxTokens                      // Reached maximum token limit
	StopReasonStopSequence                   // Matched a stop sequence
	StopReasonCancelled                      // Context was cancelled
	StopReasonError                          // An error occurred
)
func (StopReason) String ¶
func (s StopReason) String() string
type Token ¶
type Token struct {
Text string // The token text (may not be valid UTF-8 for partial tokens)
ID int // Token ID from the vocabulary
}
Token represents a single generated token.
type TokenProb ¶
type TokenProb struct {
Token int // Token ID
Text string // Token text
Prob float32 // Probability (0-1)
Logit float32 // Raw logit value
}
TokenProb represents a token with its probability.
type TokenWithLogprobs ¶
type TokenWithLogprobs struct {
Token // Embedded - ID, Text
Prob float32 // Probability of selected token
Logit float32 // Logit of selected token
TopK []TokenProb // Top-K alternatives (if requested)
}
TokenWithLogprobs extends Token with probability information.
type TokenizeError ¶
TokenizeError provides details about tokenization failures.
func (*TokenizeError) Error ¶
func (e *TokenizeError) Error() string
type Tool ¶ added in v1.1.0
type Tool interface {
// Name returns the unique identifier for this tool.
Name() string
// Description returns a human-readable description of what this tool does.
// This is provided to the model to help it decide when to use the tool.
Description() string
// Parameters returns the list of parameters this tool accepts.
Parameters() []ToolParameter
// Execute runs the tool with the given arguments and returns the result.
// The result should be a string that can be fed back to the model.
// Return an error if the tool execution fails.
Execute(ctx context.Context, args map[string]any) (string, error)
}
Tool defines a callable function that the model can invoke. Implement this interface to create custom tools.
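A minimal sketch of a custom tool; the clock tool is hypothetical, not part of the package:

type clockTool struct{}

func (clockTool) Name() string        { return "current_time" }
func (clockTool) Description() string { return "Returns the current UTC time in RFC 3339 format." }

// No parameters: the model calls this tool with empty arguments.
func (clockTool) Parameters() []zeus.ToolParameter { return nil }

func (clockTool) Execute(ctx context.Context, args map[string]any) (string, error) {
	return time.Now().UTC().Format(time.RFC3339), nil
}

Register it with WithTools (for example, model.NewChat(zeus.WithTools(clockTool{}))) and drive it with GenerateWithTools.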
type ToolCall ¶ added in v1.1.0
type ToolCall struct {
ID string // Unique identifier for this call
Name string // Name of the tool to invoke
Arguments map[string]any // Parsed arguments from the model
}
ToolCall represents a tool invocation requested by the model.
type ToolChoice ¶ added in v1.1.0
type ToolChoice int
ToolChoice controls how the model should use tools during generation.
const (
	ToolChoiceAuto ToolChoice = iota // Model decides when to use tools
	ToolChoiceNone                   // Never use tools
	ToolChoiceRequired               // Must use a tool
)
type ToolExecutionError ¶ added in v1.1.0
ToolExecutionError provides details about tool execution failures.
func (*ToolExecutionError) Error ¶ added in v1.1.0
func (e *ToolExecutionError) Error() string
func (*ToolExecutionError) Unwrap ¶ added in v1.1.0
func (e *ToolExecutionError) Unwrap() error
type ToolParameter ¶ added in v1.1.0
type ToolParameter struct {
Name string // Parameter name
Type string // Type: "string", "number", "boolean", "array", "object"
Description string // Human-readable description
Required bool // Whether this parameter is required
Enum []string // Optional: allowed values for this parameter
}
ToolParameter describes a single parameter for a tool.
type ToolResult ¶ added in v1.1.0
type ToolResult struct {
CallID string // Corresponds to ToolCall.ID
Content string // Result content to feed back to the model
IsError bool // Whether this result represents an error
}
ToolResult represents the outcome of executing a tool.