zeus

package module
v1.4.0 Latest
Published: Dec 31, 2025 License: Unlicense Imports: 14 Imported by: 0

README

Zeus

Go bindings for llama.cpp. Run LLMs locally with zero setup.

What is Zeus?

Zeus brings the power of llama.cpp to Go applications. llama.cpp is a high-performance C++ library for running Large Language Models, known for its efficiency, broad model support, and ability to run on consumer hardware.

Zeus wraps llama.cpp with a clean Go API, handling all the complexity of CGO bindings, memory management, and cross-platform builds. The result is a library that lets you run any GGUF model with just a few lines of Go code.

Key Features
  • Zero Setup - Pre-built static libraries included. No compilation, no cmake, no toolchains.
  • Universal Model Support - Works with any GGUF model: Llama, Mistral, Qwen, Phi, Gemma, and hundreds more.
  • Portable - x86_64 builds for Linux and Windows, ARM64 builds for Raspberry Pi 4/5.
  • GPU Acceleration - Vulkan support for GPU inference, with automatic CPU fallback.
  • Sensible Defaults - Works out of the box. Configure only what you need.
  • Memory Efficient - KV cache quantization to run larger contexts on limited RAM.
  • Developer Friendly - No prior llama.cpp knowledge is assumed, and the API is designed to be hard to use incorrectly.

Quick Start

Add Library
Standard: Windows 10+ / Ubuntu 24.04+ / Debian 12+ / Linux glibc 2.36+
go get github.com/expki/zeus@latest
Legacy: Ubuntu 22.04 / Linux glibc 2.35
go get github.com/expki/[email protected]
Use Library
package main

import (
    "context"
    "fmt"
    "log"

    "github.com/expki/zeus"
)

func main() {
    model, err := zeus.LoadModel("model.gguf")
    if err != nil {
        log.Fatal(err)
    }
    defer model.Close()

    chat := model.NewChat()
    chat.AddMessage(zeus.RoleSystem, "You are a helpful assistant.")

    for token, err := range chat.GenerateSequence(context.Background(), "Hello!") {
        if err != nil {
            log.Fatal(err)
        }
        fmt.Print(token.Text)
    }
}
Build App
CGO_ENABLED=1 go build -o myapp .

Requirements

  • Go 1.25+
  • x86_64 Linux or Windows, ARM64 Linux (Raspberry Pi 4/5)
    • Linux x86_64: libvulkan1 (runtime) or libvulkan-dev (build)
    • Linux ARM64: libvulkan1 (runtime) or libvulkan-dev (build), plus mesa-vulkan-drivers
  • Any GGUF model file

Documentation

  • API Reference - Complete documentation of all interfaces, methods, and options
  • Contributing - Building from source and contributing guidelines

Acknowledgments

This project was inspired by go-skynet/go-llama.cpp, which pioneered Go bindings for llama.cpp. Zeus builds on that foundation with a focus on simplicity, portability, and pre-built binaries.

License

Unlicense

Documentation

Overview

Package zeus provides Go bindings for llama.cpp, enabling local LLM inference with pre-built static libraries for Linux and Windows x86_64.

Zeus is designed for simplicity - load any GGUF model and start generating text with sensible defaults. No compilation required, no external dependencies.

Quick Start

The simplest way to use Zeus is with the Chat API:

model, err := zeus.LoadModel("model.gguf")
if err != nil {
    log.Fatal(err)
}
defer model.Close()

chat := model.NewChat()
chat.AddMessage(zeus.RoleSystem, "You are a helpful assistant.")

for tok, err := range chat.GenerateSequence(ctx, "Hello!") {
    if err != nil {
        log.Fatal(err)
    }
    fmt.Print(tok.Text)
}

Core Abstractions

Zeus provides three main abstractions for different use cases:

  • Model: Load and manage GGUF models. Provides tokenization, embeddings, and model information. Thread-safe for concurrent use.

  • Session: Token-level generation with state tracking. Use when you need precise control over the prompt format or are working with non-chat models (see the sketch after this list). Supports checkpoint/backtrack for branching conversations.

  • Chat: Message-level conversations with automatic template handling. Manages conversation history and applies the model's chat template. Best for chatbot-style applications.
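
For a non-chat model, a minimal sketch (assuming a loaded model and a ctx) of token-level generation with a Session:

session := model.NewSession()

// Token-level generation: the raw prompt is used as-is, no chat template is applied.
for tok, err := range session.GenerateSequence(ctx, "Once upon a time",
    zeus.WithMaxTokens(64),
) {
    if err != nil {
        log.Fatal(err)
    }
    fmt.Print(tok.Text)
}

// The session now holds the prompt plus the generated tokens.
fmt.Println("\ncontext used:", session.ContextUsed())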

Configuration

Zeus uses Go's functional options pattern for clean, readable configuration:

model, err := zeus.LoadModel("model.gguf",
    zeus.WithContextSize(4096),
    zeus.WithKVCacheType(zeus.KVCacheQ8_0),
)

for tok, err := range session.GenerateSequence(ctx, prompt,
    zeus.WithMaxTokens(512),
    zeus.WithTemperature(0.7),
) {
    // ...
}

Most options have sensible defaults. You only need to configure what you want to change.

Streaming Generation

Zeus provides two streaming interfaces:

  • iter.Seq2 via GenerateSequence: Returns tokens one at a time using Go 1.23+ range-over-func. Best for token-by-token processing.

  • io.ReadCloser via Generate: Returns generated text as a byte stream. Best for piping to HTTP responses or other io.Writers (see the sketch below).
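
For example, a minimal sketch (assuming a loaded model and a ctx) of copying the Generate byte stream to an io.Writer such as os.Stdout or an http.ResponseWriter:

chat := model.NewChat()

rc := chat.Generate(ctx, "Write a haiku about Go.")
defer rc.Close()

// Stream the generated text directly to any io.Writer.
if _, err := io.Copy(os.Stdout, rc); err != nil {
    log.Fatal(err)
}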

GPU Acceleration

Zeus includes pre-built Vulkan support for GPU acceleration. Enable it with WithGPULayers:

model, err := zeus.LoadModel("model.gguf",
    zeus.WithGPULayers(zeus.GPULayersAll),
)

GPU acceleration is optional - Zeus falls back to CPU if Vulkan is unavailable.

Thread Safety

Model is safe for concurrent use from multiple goroutines. The KV cache is protected by a mutex, so generation operations are serialized. Multiple Sessions or Chats can exist simultaneously, but only one can generate at a time.

Close is safe to call multiple times and will wait for any ongoing generation to complete.
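
For example, a sketch of two goroutines that each own a Chat; Zeus serializes their generations internally, so no extra locking is required on the caller's side:

var wg sync.WaitGroup
for _, prompt := range []string{"Summarize Go generics.", "Explain goroutines."} {
    wg.Add(1)
    go func(p string) {
        defer wg.Done()
        chat := model.NewChat() // each goroutine owns its own Chat
        var sb strings.Builder
        for tok, err := range chat.GenerateSequence(ctx, p) {
            if err != nil {
                log.Println(err)
                return
            }
            sb.WriteString(tok.Text)
        }
        log.Printf("%q -> %s", p, sb.String()) // the generations themselves run one at a time
    }(prompt)
}
wg.Wait()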

Error Handling

Zeus provides sentinel errors for common conditions that can be checked with errors.Is; the full list appears under Variables below.
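
For example, a minimal sketch (assuming an existing chat, ctx, and userMessage) of handling an over-long prompt:

for tok, err := range chat.GenerateSequence(ctx, userMessage) {
    if err != nil {
        if errors.Is(err, zeus.ErrPromptTooLong) {
            // The prompt does not fit in the context window; trim it or call Compact.
            log.Println("prompt exceeds context size")
            break
        }
        log.Fatal(err)
    }
    fmt.Print(tok.Text)
}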

Typed errors such as ModelLoadError, GenerationError, and ToolExecutionError provide additional context and can be checked with errors.As.
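
For example, a sketch of inspecting a model-load failure with errors.As:

model, err := zeus.LoadModel("model.gguf")
if err != nil {
    var loadErr *zeus.ModelLoadError
    if errors.As(err, &loadErr) {
        // ModelLoadError carries the file path and a reason for the failure.
        log.Fatalf("could not load %s: %s", loadErr.Path, loadErr.Reason)
    }
    log.Fatal(err)
}
defer model.Close()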

For complete API documentation, see DOC.md in the repository.

Constants

const GPULayersAll = 999

GPULayersAll is a constant to offload all model layers to GPU. llama.cpp will offload as many layers as fit in available VRAM.

Variables

var (
	ErrModelClosed        = errors.New("zeus: model is closed")
	ErrEmbeddingsDisabled = errors.New("zeus: model loaded without embeddings support")
	ErrPromptTooLong      = errors.New("zeus: prompt exceeds context size")
	ErrDecodeFailed       = errors.New("zeus: decode operation failed")
	ErrSessionIsNil       = errors.New("zeus: session is nil and not defined")
	ErrModelIsNil         = errors.New("zeus: model is nil and not defined")
	ErrChatIsNil          = errors.New("zeus: chat is nil and not defined")

	// Tool-related errors
	ErrNoToolsRegistered       = errors.New("zeus: no tools registered")
	ErrMaxIterationsExceeded   = errors.New("zeus: max agent iterations exceeded")
	ErrMaxToolCallsExceeded    = errors.New("zeus: max tool calls exceeded")
	ErrTemplateApply           = errors.New("zeus: failed to apply chat template with tools")
	ErrToolTemplateUnsupported = errors.New("zeus: model does not support native tool templates")
)

Sentinel errors for use with errors.Is()

Functions

func SetVerbose

func SetVerbose(verbose bool)

SetVerbose enables or disables verbose logging from llama.cpp.

Types

type AgentConfig added in v1.1.0

type AgentConfig struct {
	Tools             []Tool        // Registered tools
	MaxIterations     int           // Maximum agentic loop iterations (default: 10)
	MaxToolCalls      int           // Maximum total tool calls (default: 25)
	ToolTimeout       time.Duration // Per-tool execution timeout (default: 30s)
	ToolChoice        ToolChoice    // How the model should use tools (default: Auto)
	ParallelToolCalls bool          // Allow multiple tool calls in one response (default: true)
}

AgentConfig holds configuration for agentic tool execution.

func DefaultAgentConfig added in v1.1.0

func DefaultAgentConfig() AgentConfig

DefaultAgentConfig returns an AgentConfig with sensible defaults.

type AgentEvent added in v1.1.0

type AgentEvent struct {
	Type     AgentEventType // Type of event
	Token    *Token         // For AgentEventToken: the generated token
	ToolCall *ToolCall      // For AgentEventToolCallStart/End: the tool call
	Result   *ToolResult    // For AgentEventToolCallEnd: the result
	Error    error          // For AgentEventError: the error that occurred
}

AgentEvent represents an event during agentic loop execution.

type AgentEventType added in v1.1.0

type AgentEventType int

AgentEventType indicates the type of event during agentic loop execution.

const (
	AgentEventToken         AgentEventType = iota // A token was generated
	AgentEventToolCallStart                       // A tool call is starting
	AgentEventToolCallEnd                         // A tool call completed
	AgentEventError                               // An error occurred
	AgentEventDone                                // The agentic loop completed
)

func (AgentEventType) String added in v1.1.0

func (t AgentEventType) String() string

String returns the string representation of the event type.

type Chat

type Chat interface {
	// Generate sends a user message and returns the assistant's response as a stream.
	// The user message and assistant response are added to the conversation.
	Generate(ctx context.Context, userMessage string, opts ...GenerateOption) io.ReadCloser

	// GenerateSequence sends a user message and returns tokens as an iterator.
	// The user message and assistant response are added to the conversation.
	GenerateSequence(ctx context.Context, userMessage string, opts ...GenerateOption) iter.Seq2[Token, error]

	// GenerateWithTools executes an agentic loop, auto-executing tools until the model
	// produces a final response without tool calls. Requires tools to be registered via WithTools.
	GenerateWithTools(ctx context.Context, userMessage string, opts ...GenerateOption) iter.Seq2[AgentEvent, error]

	// AddMessage adds a message to the conversation without generating.
	// Useful for adding system prompts or reconstructing conversation history.
	AddMessage(role Role, content string)

	// Checkpoint creates a snapshot of the current chat state.
	Checkpoint() Chat

	// Backtrack returns to the state before the last Generate call.
	Backtrack() (Chat, bool)

	// Messages returns a copy of the message history.
	Messages() []ChatMessage

	// MessageCount returns the number of messages in the conversation.
	MessageCount() int

	// Model returns the parent model.
	Model() Model

	// Tools returns the registered tools for this chat.
	Tools() []Tool

	// Compact summarizes older messages and replaces them with the summary, keeping the last 10% of messages (or at least 1).
	Compact(ctx context.Context) error
}

Chat represents a conversation that tracks message history. Generate methods mutate the chat, appending assistant responses. Use Checkpoint() before generation to save state for branching. Note: Chat wraps a Session internally for KV cache efficiency.
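
A minimal branching sketch (assuming an existing chat and ctx): checkpoint the state, generate, then continue from the saved snapshot with a different prompt:

saved := chat.Checkpoint()

for tok, err := range chat.GenerateSequence(ctx, "Reply formally.") {
    if err != nil {
        log.Fatal(err)
    }
    fmt.Print(tok.Text)
}

// Branch: continue from the pre-generation snapshot with a different prompt.
chat = saved
for tok, err := range chat.GenerateSequence(ctx, "Reply casually.") {
    if err != nil {
        log.Fatal(err)
    }
    fmt.Print(tok.Text)
}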

type ChatConfig

type ChatConfig struct {
	ChatTemplateConfig             // Embedded - Template, AddAssistant
	AgentConfig        AgentConfig // Agent/tool configuration
	ChatFormat         ChatFormat  // Tool call format for parsing (default: Hermes2Pro)
}

ChatConfig holds options for creating a Chat.

func DefaultChatConfig

func DefaultChatConfig() ChatConfig

DefaultChatConfig returns the default chat configuration.

type ChatFormat added in v1.1.0

type ChatFormat int32

ChatFormat represents the tool call format for parsing model output. Different models use different formats for function/tool calling. This is forward-compatible: new formats added to llama.cpp work automatically even if not listed here. Use String() to get a readable name.

const (
	ChatFormatContentOnly             ChatFormat = 0  // No tool calls, content only
	ChatFormatGeneric                 ChatFormat = 1  // Generic format with JSON
	ChatFormatMistralNemo             ChatFormat = 2  // Mistral Nemo format
	ChatFormatMagistral               ChatFormat = 3  // Magistral format
	ChatFormatLlama3X                 ChatFormat = 4  // Llama 3.x format
	ChatFormatLlama3XWithBuiltinTools ChatFormat = 5  // Llama 3.x with builtin tools
	ChatFormatDeepSeekR1              ChatFormat = 6  // DeepSeek R1 format
	ChatFormatFireFunctionV2          ChatFormat = 7  // FireFunction v2 format
	ChatFormatFunctionaryV32          ChatFormat = 8  // Functionary v3.2 format
	ChatFormatFunctionaryV31Llama31   ChatFormat = 9  // Functionary v3.1 Llama 3.1 format
	ChatFormatDeepSeekV31             ChatFormat = 10 // DeepSeek V3.1 format
	ChatFormatHermes2Pro              ChatFormat = 11 // Hermes 2 Pro format (Qwen 2.5, Hermes 2/3)
	ChatFormatCommandR7B              ChatFormat = 12 // Command R7B format
	ChatFormatGranite                 ChatFormat = 13 // Granite format
	ChatFormatGPTOSS                  ChatFormat = 14 // GPT-OSS format
	ChatFormatSeedOSS                 ChatFormat = 15 // Seed-OSS format
	ChatFormatNemotronV2              ChatFormat = 16 // Nemotron V2 format
	ChatFormatApertus                 ChatFormat = 17 // Apertus format
	ChatFormatLFM2WithJSONTools       ChatFormat = 18 // LFM2 with JSON tools format
	ChatFormatGLM45                   ChatFormat = 19 // GLM 4.5 format
	ChatFormatMiniMaxM2               ChatFormat = 20 // MiniMax-M2 format
	ChatFormatKimiK2                  ChatFormat = 21 // Kimi K2 format
	ChatFormatQwen3CoderXML           ChatFormat = 22 // Qwen3 Coder format
	ChatFormatApriel15                ChatFormat = 23 // Apriel 1.5 format
	ChatFormatXiaomiMiMo              ChatFormat = 24 // Xiaomi MiMo format
)

Known chat formats. This list may be incomplete - llama.cpp may support additional formats that will work automatically.

func (ChatFormat) String added in v1.1.0

func (f ChatFormat) String() string

String returns the name of the chat format from llama.cpp.

type ChatMessage

type ChatMessage struct {
	Role    Role
	Content string
	// For assistant messages: tool calls made in this message
	ToolCalls []ToolCall
	// For tool result messages: identifies which tool call this responds to
	ToolName   string
	ToolCallID string
}

ChatMessage represents a single message in a conversation.

type ChatOption

type ChatOption func(*ChatConfig)

ChatOption is a functional option for NewChat.

func WithChatFormat added in v1.1.0

func WithChatFormat(format ChatFormat) ChatOption

WithChatFormat sets the tool call format for parsing model output. Different models use different formats. Normally this is auto-detected by llama.cpp when applying tool templates, so you typically don't need to set this explicitly.

func WithMaxIterations added in v1.1.0

func WithMaxIterations(n int) ChatOption

WithMaxIterations sets the maximum number of agentic loop iterations. Each iteration may contain multiple tool calls. Default is 10.

func WithMaxToolCalls added in v1.1.0

func WithMaxToolCalls(n int) ChatOption

WithMaxToolCalls sets the maximum total tool calls across all iterations. Default is 25.

func WithParallelToolCalls added in v1.1.0

func WithParallelToolCalls(parallel bool) ChatOption

WithParallelToolCalls controls whether the model can make multiple tool calls in a single response. Default is true.

func WithToolChoice added in v1.1.0

func WithToolChoice(choice ToolChoice) ChatOption

WithToolChoice sets how the model should use tools.

  • ToolChoiceAuto (default): Model decides when to use tools
  • ToolChoiceNone: Never use tools
  • ToolChoiceRequired: Must use a tool

func WithToolTimeout added in v1.1.0

func WithToolTimeout(d time.Duration) ChatOption

WithToolTimeout sets the per-tool execution timeout. Default is 30 seconds.

func WithTools added in v1.1.0

func WithTools(tools ...Tool) ChatOption

WithTools registers tools for the chat to use with GenerateWithTools.
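
A minimal sketch of wiring tools into a chat and consuming the agent event stream; weatherTool here is a hypothetical Tool implementation (see the Tool type below):

chat := model.NewChat(
    zeus.WithTools(weatherTool{}), // weatherTool: a hypothetical zeus.Tool implementation
    zeus.WithMaxIterations(5),
)

for ev, err := range chat.GenerateWithTools(ctx, "What is the weather in Oslo?") {
    if err != nil {
        log.Fatal(err)
    }
    switch ev.Type {
    case zeus.AgentEventToken:
        fmt.Print(ev.Token.Text)
    case zeus.AgentEventToolCallStart:
        log.Printf("calling tool %s", ev.ToolCall.Name)
    case zeus.AgentEventToolCallEnd:
        log.Printf("tool result: %s", ev.Result.Content)
    case zeus.AgentEventError:
        log.Println(ev.Error)
    }
}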

type ChatParams added in v1.1.0

type ChatParams struct {
	Prompt          string           // Formatted prompt with tools embedded
	Grammar         string           // GBNF grammar for constraining output (may be empty)
	Format          ChatFormat       // Detected chat format
	GrammarLazy     bool             // Apply grammar only after trigger patterns
	GrammarTriggers []GrammarTrigger // Typed patterns that activate grammar
	AdditionalStops []string         // Extra stop sequences
}

ChatParams contains the result of applying a chat template with tools. This includes the formatted prompt, grammar constraints, and other metadata.

type ChatTemplateConfig

type ChatTemplateConfig struct {
	Template             string  // Empty = use model's embedded template
	AddAssistant         bool    // Add assistant turn prefix (default: true)
	AutoCompactThreshold float32 // Ratio after which automatic compact occurs (Chat only)
}

ChatTemplateConfig holds options for ApplyChatTemplate.

func DefaultChatTemplateConfig

func DefaultChatTemplateConfig() ChatTemplateConfig

DefaultChatTemplateConfig returns the default chat template configuration.

type ChatTemplateError

type ChatTemplateError struct {
	Message string
}

ChatTemplateError provides details about chat template failures.

func (*ChatTemplateError) Error

func (e *ChatTemplateError) Error() string

type ChatTemplateOption

type ChatTemplateOption func(*ChatTemplateConfig)

ChatTemplateOption is a functional option for ApplyChatTemplate.

func WithAddAssistant

func WithAddAssistant(add bool) ChatTemplateOption

WithAddAssistant controls whether to append the assistant turn prefix.

func WithAutoCompactThreshold

func WithAutoCompactThreshold(threshold float32) ChatTemplateOption

WithAutoCompactThreshold sets the context usage ratio at which Chat automatically compacts the conversation. When context usage exceeds this threshold (e.g., 0.8 = 80%), older messages are summarized to free space. Use 0 to disable auto-compaction.

func WithChatTemplate

func WithChatTemplate(name string) ChatTemplateOption

WithChatTemplate specifies a built-in template name (e.g., "chatml", "llama3").

type EmbeddingError

type EmbeddingError struct {
	Message string
}

EmbeddingError provides details about embedding extraction failures.

func (*EmbeddingError) Error

func (e *EmbeddingError) Error() string

type GenerateConfig

type GenerateConfig struct {
	MaxTokens        int          // Maximum tokens to generate (0 = unlimited)
	Temperature      float32      // Sampling temperature (higher = more random)
	TopK             int          // Top-K sampling (0 = disabled)
	TopP             float32      // Nucleus sampling probability
	MinP             float32      // Minimum probability threshold
	RepeatPenalty    float32      // Repetition penalty (1.0 = no penalty)
	RepeatLastN      int          // Number of tokens to consider for repetition penalty
	FrequencyPenalty float32      // Frequency-based penalty
	PresencePenalty  float32      // Presence-based penalty
	Mirostat         MirostatMode // Mirostat sampling mode
	MirostatTau      float32      // Mirostat target entropy
	MirostatEta      float32      // Mirostat learning rate
	StopSequences    []string     // Sequences that stop generation
	IgnoreEOS        bool         // Continue past end-of-sequence token
	Grammar          string       // GBNF grammar to constrain output
	Seed             int          // Random seed for sampling (-1 = random)
	Threads          int          // Number of threads for generation (0 = autodetect)
	ReasoningEnabled bool         // Enable model thinking/reasoning (default: true)
	// contains filtered or unexported fields
}

GenerateConfig holds configuration for text generation.

func DefaultGenerateConfig

func DefaultGenerateConfig() GenerateConfig

DefaultGenerateConfig returns a GenerateConfig with sensible defaults.

type GenerateOption

type GenerateOption func(*GenerateConfig)

GenerateOption configures text generation.

func WithFrequencyPenalty

func WithFrequencyPenalty(penalty float32) GenerateOption

WithFrequencyPenalty sets the frequency-based penalty. Penalizes tokens based on their frequency in the generated text.

func WithGenerateSeed

func WithGenerateSeed(seed int) GenerateOption

WithGenerateSeed sets the random seed for sampling. Use -1 for random seed.

func WithGrammar

func WithGrammar(grammar string) GenerateOption

WithGrammar constrains output to match a GBNF grammar.
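
For example, a sketch (assuming an existing chat and ctx) constraining a reply to a yes/no answer with a tiny illustrative GBNF grammar:

const yesNoGrammar = `root ::= "yes" | "no"`

for tok, err := range chat.GenerateSequence(ctx, "Is Go a compiled language?",
    zeus.WithGrammar(yesNoGrammar),
    zeus.WithMaxTokens(4),
) {
    if err != nil {
        log.Fatal(err)
    }
    fmt.Print(tok.Text)
}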

func WithIgnoreEOS

func WithIgnoreEOS() GenerateOption

WithIgnoreEOS enables generation past the end-of-sequence token.

func WithMaxTokens

func WithMaxTokens(n int) GenerateOption

WithMaxTokens sets the maximum number of tokens to generate. Use 0 for unlimited (generates until EOS or context full).

func WithMinP

func WithMinP(p float32) GenerateOption

WithMinP sets the minimum probability threshold. Tokens with probability below this are excluded.

func WithMirostat

func WithMirostat(mode MirostatMode, tau, eta float32) GenerateOption

WithMirostat enables Mirostat adaptive sampling.

func WithPresencePenalty

func WithPresencePenalty(penalty float32) GenerateOption

WithPresencePenalty sets the presence-based penalty. Penalizes tokens that have already appeared in the generated text.

func WithReasoningEnabled

func WithReasoningEnabled(enabled bool) GenerateOption

WithReasoningEnabled enables or disables model thinking/reasoning. When disabled, thinking tags are closed immediately to prevent reasoning output. This is equivalent to llama.cpp server's --reasoning-budget 0.

func WithRepeatLastN

func WithRepeatLastN(n int) GenerateOption

WithRepeatLastN sets how many recent tokens to consider for repetition penalty.

func WithRepeatPenalty

func WithRepeatPenalty(penalty float32) GenerateOption

WithRepeatPenalty sets the repetition penalty. Values > 1.0 discourage repetition, 1.0 = no penalty.

func WithStopSequences

func WithStopSequences(seqs ...string) GenerateOption

WithStopSequences sets sequences that will stop generation when encountered.

func WithTemperature

func WithTemperature(t float32) GenerateOption

WithTemperature sets the sampling temperature. Higher values (e.g., 1.0) make output more random. Lower values (e.g., 0.2) make output more deterministic.

func WithThreads

func WithThreads(n int) GenerateOption

WithThreads sets the number of threads for generation. The default is -1, which autodetects the number of CPU cores.

func WithTopK

func WithTopK(k int) GenerateOption

WithTopK sets the top-K sampling value. Only the K most likely tokens are considered. Use 0 to disable.

func WithTopP

func WithTopP(p float32) GenerateOption

WithTopP sets the nucleus sampling probability. Tokens are sampled from the smallest set whose cumulative probability exceeds P.

type GenerationError

type GenerationError struct {
	Stage   string // "tokenize", "decode", "sample"
	Message string
}

GenerationError provides details about text generation failures.

func (*GenerationError) Error

func (e *GenerationError) Error() string

type GrammarTrigger added in v1.1.3

type GrammarTrigger struct {
	Type  GrammarTriggerType
	Value string // For word/pattern types
	Token int32  // For token type
}

GrammarTrigger represents a pattern that activates lazy grammar.

type GrammarTriggerType added in v1.1.3

type GrammarTriggerType int32

GrammarTriggerType defines how a trigger pattern should be matched.

const (
	TriggerTypeWord        GrammarTriggerType = 0 // Exact word match (auto-escaped)
	TriggerTypePattern     GrammarTriggerType = 1 // Regex pattern (anywhere in output)
	TriggerTypePatternFull GrammarTriggerType = 2 // Full regex pattern
	TriggerTypeToken       GrammarTriggerType = 3 // Token ID trigger
)

type KVCacheType

type KVCacheType int

KVCacheType represents the data type for KV cache storage. Lower precision types use less memory but may reduce quality slightly.

const (
	KVCacheF32  KVCacheType = iota // Full precision (32-bit float)
	KVCacheF16                     // Half precision (16-bit float) - default
	KVCacheQ8_0                    // 8-bit quantized
	KVCacheQ4_0                    // 4-bit quantized - lowest memory
)

func (KVCacheType) String

func (k KVCacheType) String() string

String returns the llama.cpp string representation.

type MirostatMode

type MirostatMode int

MirostatMode controls the Mirostat adaptive sampling algorithm.

const (
	MirostatDisabled MirostatMode = iota // Standard sampling (top-k, top-p, temperature)
	Mirostat1                            // Mirostat v1 algorithm
	Mirostat2                            // Mirostat v2 algorithm
)

type Model

type Model interface {
	// NewSession creates a new empty session for text generation.
	NewSession() *session

	// NewChat creates a new empty chat for conversation.
	// Uses the model's embedded template by default.
	NewChat(opts ...ChatOption) Chat

	// Embeddings extracts embeddings for the given text.
	// The model must be loaded with WithEmbeddings() option.
	Embeddings(ctx context.Context, text string) ([]float32, error)

	// EmbeddingsBatch extracts embeddings for multiple texts in a single call.
	// The model must be loaded with WithEmbeddings() option.
	EmbeddingsBatch(ctx context.Context, texts []string) ([][]float32, error)

	// Tokenize converts text to token IDs.
	Tokenize(text string, addSpecial bool) ([]int, error)

	// TokenizeCount returns the number of token IDs the text represents.
	TokenizeCount(text string, addSpecial bool) (int, error)

	// Detokenize converts token IDs back to text.
	Detokenize(tokens []int) (string, error)

	// DetokenizeLength returns the length of the string the token IDs represent.
	DetokenizeLength(tokens []int) (int, error)

	// BOS returns the beginning-of-sequence token ID.
	BOS() int

	// EOS returns the end-of-sequence token ID.
	EOS() int

	// TokenToText converts a single token ID to its text representation.
	TokenToText(token int) string

	// IsSpecialToken returns true if the token is a special/control token.
	IsSpecialToken(token int) bool

	// IsEOG returns true if the token is an end-of-generation token.
	IsEOG(token int) bool

	// SpecialTokens returns all special token IDs.
	SpecialTokens() SpecialTokens

	// VocabSize returns the vocabulary size.
	VocabSize() int

	// ContextSize returns the effective context window size.
	ContextSize() int

	// TrainContextSize returns the model's original training context size.
	TrainContextSize() int

	// EmbeddingSize returns the embedding dimension.
	EmbeddingSize() int

	// Info returns model metadata and architecture details.
	Info() ModelInfo

	// ChatTemplate returns the model's embedded chat template string.
	// Returns empty string if no template is embedded.
	ChatTemplate() string

	// ApplyChatTemplate formats messages using a chat template.
	// Uses model's embedded template by default.
	ApplyChatTemplate(messages []ChatMessage, opts ...ChatTemplateOption) (string, error)

	// Close releases model resources.
	Close() error
}

Model represents a loaded LLM model.

func LoadModel

func LoadModel(path string, opts ...ModelOption) (Model, error)

LoadModel loads a model from a GGUF file.

type ModelConfig

type ModelConfig struct {
	ContextSize   int         // Context window size (0 = model's native context)
	Seed          int         // Random seed for model initialization
	BatchSize     int         // Batch size for prompt processing
	GPULayers     int         // Number of layers to offload to GPU (GPULayersAll for all)
	MainGPU       int         // Primary GPU device index for multi-GPU systems
	TensorSplit   []float32   // Distribution of layers across GPUs (e.g., [0.5, 0.5])
	KVCacheType   KVCacheType // Data type for KV cache storage
	RopeFreqBase  float32     // RoPE frequency base (0 = from model)
	RopeFreqScale float32     // RoPE frequency scale (0 = from model)
	LoraAdapter   string      // Path to LoRA adapter file
	UseMMap       bool        // Use memory mapping for model loading
	UseMlock      bool        // Lock model in memory (prevent swapping)
	UseNUMA       bool        // Enable NUMA optimizations
	Embeddings    bool        // Enable embedding extraction mode
	Warmup        bool        // Run warmup in background after loading (default: true)
}

ModelConfig holds configuration for model loading.

func DefaultModelConfig

func DefaultModelConfig() ModelConfig

DefaultModelConfig returns a ModelConfig with sensible defaults.

type ModelInfo

type ModelInfo struct {
	Description  string // Model description (e.g., "LLaMA v2 7B Q4_K_M")
	Architecture string // Architecture name from metadata
	QuantType    string // Quantization type (e.g., "Q4_K_M")
	Parameters   uint64 // Total parameter count
	Size         uint64 // Model size in bytes
	Layers       int    // Number of layers
	Heads        int    // Number of attention heads
	HeadsKV      int    // Number of KV heads
	VocabSize    int    // Vocabulary size
}

ModelInfo provides model metadata and architecture details.

type ModelLoadError

type ModelLoadError struct {
	Path   string
	Reason string
}

ModelLoadError provides details about model loading failures.

func (*ModelLoadError) Error

func (e *ModelLoadError) Error() string

type ModelOption

type ModelOption func(*ModelConfig)

ModelOption configures model loading.

func WithBatchSize

func WithBatchSize(n int) ModelOption

WithBatchSize sets the batch size for prompt processing.

func WithContextSize

func WithContextSize(n int) ModelOption

WithContextSize sets the context window size. Use 0 to use the model's native context size.

func WithEmbeddings

func WithEmbeddings() ModelOption

WithEmbeddings enables embedding extraction mode. Required to use the Embeddings() method.
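
A minimal sketch (assuming an embedding-capable GGUF file and a ctx) of loading a model for embedding extraction:

model, err := zeus.LoadModel("embedding-model.gguf", zeus.WithEmbeddings())
if err != nil {
    log.Fatal(err)
}
defer model.Close()

vec, err := model.Embeddings(ctx, "The quick brown fox")
if err != nil {
    log.Fatal(err)
}
fmt.Println("dimensions:", len(vec)) // equals model.EmbeddingSize()

For multiple texts, EmbeddingsBatch extracts all embeddings in a single call.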

func WithGPULayers

func WithGPULayers(n int) ModelOption

WithGPULayers sets the number of layers to offload to GPU. Use GPULayersAll to offload all layers, 0 for CPU only.

func WithKVCacheType

func WithKVCacheType(t KVCacheType) ModelOption

WithKVCacheType sets the data type for KV cache storage.

func WithLoRA

func WithLoRA(path string) ModelOption

WithLoRA loads a LoRA adapter from the specified path.

func WithMMap

func WithMMap(enable bool) ModelOption

WithMMap enables or disables memory mapping for model loading. Disabling it forces the entire model to be loaded into memory, which may improve performance.

func WithMainGPU

func WithMainGPU(gpu int) ModelOption

WithMainGPU sets the primary GPU device index for multi-GPU systems.

func WithMlock

func WithMlock(enable bool) ModelOption

WithMlock enables or disables memory locking. When enabled, the model is locked in RAM to prevent swapping.

func WithNUMA

func WithNUMA(enable bool) ModelOption

WithNUMA enables or disables NUMA optimizations.

func WithRopeFreqBase

func WithRopeFreqBase(base float32) ModelOption

WithRopeFreqBase sets the RoPE frequency base. Use 0 to use the model's default value.

func WithRopeFreqScale

func WithRopeFreqScale(scale float32) ModelOption

WithRopeFreqScale sets the RoPE frequency scale. Use 0 to use the model's default value.

func WithSeed

func WithSeed(seed int) ModelOption

WithSeed sets the random seed for model initialization.

func WithTensorSplit

func WithTensorSplit(split []float32) ModelOption

WithTensorSplit sets the distribution of layers across multiple GPUs. []float32{0.5, 0.5} splits evenly between two GPUs.

func WithWarmup added in v1.2.0

func WithWarmup(enable bool) ModelOption

WithWarmup enables or disables background warmup after model loading. When enabled (default), this runs a minimal decode in a goroutine to initialize GPU kernels and reduce latency on the first real generation.

type ParseResult added in v1.1.0

type ParseResult struct {
	Content          string     // Non-tool-call text content
	ReasoningContent string     // Reasoning/thinking content (if any)
	ToolCalls        []ToolCall // Parsed tool calls
}

ParseResult contains the parsed tool calls from model output.

type Role

type Role string

Role represents the role of a message sender in a conversation.

const (
	RoleSystem    Role = "system"
	RoleUser      Role = "user"
	RoleAssistant Role = "assistant"
	RoleTool      Role = "tool" // Tool result messages
)

Standard chat roles.

type Session

type Session interface {
	// Generate processes the prompt, returns new tokens and moves the session forward.
	Generate(ctx context.Context, prompt string, opts ...GenerateOption) io.ReadCloser

	// GenerateSequence processes the prompt, returns new tokens as yield iterator and moves the session forward.
	GenerateSequence(ctx context.Context, prompt string, opts ...GenerateOption) iter.Seq2[Token, error]

	// GenerateSequenceWithLogprobs is like GenerateSequence but returns tokens with probability information.
	// topK specifies how many top alternatives to include (0 = just the selected token's prob/logit).
	GenerateSequenceWithLogprobs(ctx context.Context, prompt string, topK int, opts ...GenerateOption) iter.Seq2[TokenWithLogprobs, error]

	// Checkpoint creates a snapshot of the current session state.
	// Both the original and checkpoint can be used independently for branching.
	// Note: All sessions share one KV cache; switching between divergent sessions recomputes tokens from the common prefix.
	Checkpoint() Session

	// Backtrack returns to the state before the last Generate call.
	// Returns ok=false if this is the initial session (no parent).
	Backtrack() (Session, bool)

	// Tokens returns a copy of the token history for this session.
	Tokens() []int

	// Text returns the full text of this session (detokenized).
	Text() (string, error)

	// TokenCount returns the number of tokens in this session.
	TokenCount() int

	// ContextUsed returns the percentage of context used in this session.
	ContextUsed() float64

	// Model returns the parent model.
	Model() Model
}

Session represents a conversation state that tracks token history. Generate/GenerateSequence mutate the session, appending new tokens. Use Checkpoint() before generation to save state for branching. Note: All sessions share one KV cache. Switching between unrelated sessions recomputes all tokens.
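
For example, a sketch (assuming a loaded model and a ctx) of inspecting per-token probabilities with GenerateSequenceWithLogprobs, requesting the top 3 alternatives:

session := model.NewSession()

for tok, err := range session.GenerateSequenceWithLogprobs(ctx, "The capital of France is", 3,
    zeus.WithMaxTokens(8),
) {
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("%q p=%.3f\n", tok.Text, tok.Prob)
    for _, alt := range tok.TopK {
        fmt.Printf("    alt %q p=%.3f\n", alt.Text, alt.Prob)
    }
}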

type SpecialTokens

type SpecialTokens struct {
	BOS int // Beginning of sequence (-1 if not available)
	EOS int // End of sequence
	EOT int // End of turn
	PAD int // Padding
	SEP int // Separator
	NL  int // Newline
}

SpecialTokens contains all special token IDs for the model.

type StopReason

type StopReason int

StopReason indicates why text generation stopped.

const (
	StopReasonEOS          StopReason = iota // End of sequence token encountered
	StopReasonMaxTokens                      // Reached maximum token limit
	StopReasonStopSequence                   // Matched a stop sequence
	StopReasonCancelled                      // Context was cancelled
	StopReasonError                          // An error occurred
)

func (StopReason) String

func (s StopReason) String() string

type Token

type Token struct {
	Text string // The token text (may not be valid UTF-8 for partial tokens)
	ID   int    // Token ID from the vocabulary
}

Token represents a single generated token.

type TokenProb

type TokenProb struct {
	Token int     // Token ID
	Text  string  // Token text
	Prob  float32 // Probability (0-1)
	Logit float32 // Raw logit value
}

TokenProb represents a token with its probability.

type TokenWithLogprobs

type TokenWithLogprobs struct {
	Token             // Embedded - ID, Text
	Prob  float32     // Probability of selected token
	Logit float32     // Logit of selected token
	TopK  []TokenProb // Top-K alternatives (if requested)
}

TokenWithLogprobs extends Token with probability information.

type TokenizeError

type TokenizeError struct {
	Text    string
	Message string
}

TokenizeError provides details about tokenization failures.

func (*TokenizeError) Error

func (e *TokenizeError) Error() string

type Tool added in v1.1.0

type Tool interface {
	// Name returns the unique identifier for this tool.
	Name() string

	// Description returns a human-readable description of what this tool does.
	// This is provided to the model to help it decide when to use the tool.
	Description() string

	// Parameters returns the list of parameters this tool accepts.
	Parameters() []ToolParameter

	// Execute runs the tool with the given arguments and returns the result.
	// The result should be a string that can be fed back to the model.
	// Return an error if the tool execution fails.
	Execute(ctx context.Context, args map[string]any) (string, error)
}

Tool defines a callable function that the model can invoke. Implement this interface to create custom tools.
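
For illustration, a minimal sketch of a custom tool; clockTool and its parameter are hypothetical names:

// clockTool is a hypothetical tool that reports the current time.
type clockTool struct{}

func (clockTool) Name() string        { return "current_time" }
func (clockTool) Description() string { return "Returns the current local time." }

func (clockTool) Parameters() []zeus.ToolParameter {
    return []zeus.ToolParameter{
        {Name: "format", Type: "string", Description: "Go time layout string", Required: false},
    }
}

func (clockTool) Execute(ctx context.Context, args map[string]any) (string, error) {
    layout := time.RFC3339
    if f, ok := args["format"].(string); ok && f != "" {
        layout = f
    }
    return time.Now().Format(layout), nil
}

Registered with model.NewChat(zeus.WithTools(clockTool{})), the tool can then be invoked by the model through GenerateWithTools.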

type ToolCall added in v1.1.0

type ToolCall struct {
	ID        string         // Unique identifier for this call
	Name      string         // Name of the tool to invoke
	Arguments map[string]any // Parsed arguments from the model
}

ToolCall represents a tool invocation requested by the model.

type ToolChoice added in v1.1.0

type ToolChoice int

ToolChoice controls how the model should use tools during generation.

const (
	ToolChoiceAuto     ToolChoice = iota // Model decides when to use tools
	ToolChoiceNone                       // Never use tools
	ToolChoiceRequired                   // Must use a tool
)

type ToolExecutionError added in v1.1.0

type ToolExecutionError struct {
	ToolName string
	CallID   string
	Err      error
}

ToolExecutionError provides details about tool execution failures.

func (*ToolExecutionError) Error added in v1.1.0

func (e *ToolExecutionError) Error() string

func (*ToolExecutionError) Unwrap added in v1.1.0

func (e *ToolExecutionError) Unwrap() error

type ToolParameter added in v1.1.0

type ToolParameter struct {
	Name        string   // Parameter name
	Type        string   // Type: "string", "number", "boolean", "array", "object"
	Description string   // Human-readable description
	Required    bool     // Whether this parameter is required
	Enum        []string // Optional: allowed values for this parameter
}

ToolParameter describes a single parameter for a tool.

type ToolResult added in v1.1.0

type ToolResult struct {
	CallID  string // Corresponds to ToolCall.ID
	Content string // Result content to feed back to the model
	IsError bool   // Whether this result represents an error
}

ToolResult represents the outcome of executing a tool.
