Documentation ¶
Overview ¶
Package zeus provides Go bindings for llama.cpp, enabling local LLM inference with pre-built static libraries for Linux and Windows x86_64.
Zeus is designed for simplicity: load any GGUF model and start generating text with sensible defaults. No compilation required, no external dependencies.
Quick Start ¶
The simplest way to use Zeus is with the Chat API:
model, err := zeus.LoadModel("model.gguf")
if err != nil {
	log.Fatal(err)
}
defer model.Close()

chat := model.NewChat()
chat.AddMessage(zeus.RoleSystem, "You are a helpful assistant.")

for tok, err := range chat.GenerateSequence(ctx, "Hello!") {
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(tok.Text)
}
Core Abstractions ¶
Zeus provides three main abstractions for different use cases:
Model: Load and manage GGUF models. Provides tokenization, embeddings, and model information. Thread-safe for concurrent use.
Session: Token-level generation with state tracking. Use when you need precise control over the prompt format or are working with non-chat models. Supports checkpoint/backtrack for branching conversations (see the sketch below).
Chat: Message-level conversations with automatic template handling. Manages conversation history and applies the model's chat template. Best for chatbot-style applications.
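For example, token-level generation through a Session; a minimal sketch, assuming a loaded model and a ctx as in Quick Start (the prompt is illustrative):

session := model.NewSession()
for tok, err := range session.GenerateSequence(ctx, "Once upon a time",
	zeus.WithMaxTokens(64),
) {
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(tok.Text)
}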
Configuration ¶
Zeus uses Go's functional options pattern for clean, readable configuration:
model, err := zeus.LoadModel("model.gguf",
	zeus.WithContextSize(4096),
	zeus.WithKVCacheType(zeus.KVCacheQ8_0),
)

for tok, err := range session.GenerateSequence(ctx, prompt,
	zeus.WithMaxTokens(512),
	zeus.WithTemperature(0.7),
) {
	// ...
}
Most options have sensible defaults. You only need to configure what you want to change.
Streaming Generation ¶
Zeus provides two streaming interfaces:
iter.Seq2 via GenerateSequence: Returns tokens one at a time using Go 1.23+ range-over-func. Best for token-by-token processing.
io.ReadCloser via Generate: Returns generated text as a byte stream. Best for piping to HTTP responses or other io.Writers.
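For example, the io.ReadCloser form pipes directly into any io.Writer; a minimal sketch, assuming a session, prompt, and ctx as above:

r := session.Generate(ctx, prompt)
defer r.Close()
if _, err := io.Copy(os.Stdout, r); err != nil {
	log.Fatal(err)
}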
GPU Acceleration ¶
Zeus includes pre-built Vulkan support for GPU acceleration. Enable it with WithGPULayers:
model, err := zeus.LoadModel("model.gguf",
	zeus.WithGPULayers(zeus.GPULayersAll),
)
GPU acceleration is optional: Zeus falls back to CPU if Vulkan is unavailable.
Thread Safety ¶
Model is safe for concurrent use from multiple goroutines. The KV cache is protected by a mutex, so generation operations are serialized. Multiple Sessions or Chats can exist simultaneously, but only one can generate at a time.
[Close] is safe to call multiple times and will wait for any ongoing generation to complete.
Error Handling ¶
Zeus provides sentinel errors for common conditions that can be checked with errors.Is:
- ErrModelClosed: Operation attempted on a closed model
- ErrEmbeddingsDisabled: Embeddings requested but model loaded without WithEmbeddings
- ErrPromptTooLong: Prompt exceeds context size
- ErrDecodeFailed: Decode operation failed during generation
Typed errors provide additional context and can be checked with errors.As:
- ModelLoadError: Details about model loading failures
- GenerationError: Details about generation failures
- TokenizeError: Details about tokenization failures
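Both kinds of check compose; a minimal sketch, where the Tokenize call and text variable are just illustrative:

_, err := model.Tokenize(text, true)
if errors.Is(err, zeus.ErrModelClosed) {
	// The model was closed elsewhere; reload or abort.
}
var tokErr *zeus.TokenizeError
if errors.As(err, &tokErr) {
	log.Printf("tokenization failed: %v", tokErr)
}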
For complete API documentation, see DOC.md in the repository.
Index ¶
- Constants
- Variables
- func SetVerbose(verbose bool)
- type AgentConfig
- type AgentEvent
- type AgentEventType
- type Chat
- type ChatConfig
- type ChatFormat
- type ChatMessage
- type ChatOption
- func WithChatFormat(format ChatFormat) ChatOption
- func WithMaxIterations(n int) ChatOption
- func WithMaxToolCalls(n int) ChatOption
- func WithParallelToolCalls(parallel bool) ChatOption
- func WithToolChoice(choice ToolChoice) ChatOption
- func WithToolTimeout(d time.Duration) ChatOption
- func WithTools(tools ...Tool) ChatOption
- type ChatParams
- type ChatTemplateConfig
- type ChatTemplateError
- type ChatTemplateOption
- type EmbeddingError
- type GenerateConfig
- type GenerateOption
- func WithFrequencyPenalty(penalty float32) GenerateOption
- func WithGenerateSeed(seed int) GenerateOption
- func WithGrammar(grammar string) GenerateOption
- func WithIgnoreEOS() GenerateOption
- func WithMaxTokens(n int) GenerateOption
- func WithMinP(p float32) GenerateOption
- func WithMirostat(mode MirostatMode, tau, eta float32) GenerateOption
- func WithPresencePenalty(penalty float32) GenerateOption
- func WithReasoningEnabled(enabled bool) GenerateOption
- func WithRepeatLastN(n int) GenerateOption
- func WithRepeatPenalty(penalty float32) GenerateOption
- func WithStopSequences(seqs ...string) GenerateOption
- func WithTemperature(t float32) GenerateOption
- func WithThreads(n int) GenerateOption
- func WithTopK(k int) GenerateOption
- func WithTopP(p float32) GenerateOption
- type GenerationError
- type GrammarTrigger
- type GrammarTriggerType
- type KVCacheType
- type MirostatMode
- type Model
- type ModelConfig
- type ModelInfo
- type ModelLoadError
- type ModelOption
- func WithBatchSize(n int) ModelOption
- func WithContextSize(n int) ModelOption
- func WithEmbeddings() ModelOption
- func WithGPULayers(n int) ModelOption
- func WithKVCacheType(t KVCacheType) ModelOption
- func WithLoRA(path string) ModelOption
- func WithMMap(enable bool) ModelOption
- func WithMainGPU(gpu int) ModelOption
- func WithMlock(enable bool) ModelOption
- func WithNUMA(enable bool) ModelOption
- func WithRopeFreqBase(base float32) ModelOption
- func WithRopeFreqScale(scale float32) ModelOption
- func WithSeed(seed int) ModelOption
- func WithTensorSplit(split []float32) ModelOption
- func WithWarmup(enable bool) ModelOption
- type ParseResult
- type Role
- type Session
- type SpecialTokens
- type StopReason
- type Token
- type TokenProb
- type TokenWithLogprobs
- type TokenizeError
- type Tool
- type ToolCall
- type ToolChoice
- type ToolExecutionError
- type ToolParameter
- type ToolResult
Constants ¶
const GPULayersAll = 999
GPULayersAll requests that all model layers be offloaded to the GPU. llama.cpp will offload as many layers as fit in available VRAM.
Variables ¶
var (
	ErrModelClosed        = errors.New("zeus: model is closed")
	ErrEmbeddingsDisabled = errors.New("zeus: model loaded without embeddings support")
	ErrPromptTooLong      = errors.New("zeus: prompt exceeds context size")
	ErrDecodeFailed       = errors.New("zeus: decode operation failed")
	ErrSessionIsNil       = errors.New("zeus: session is nil and not defined")
	ErrModelIsNil         = errors.New("zeus: model is nil and not defined")
	ErrChatIsNil          = errors.New("zeus: chat is nil and not defined")

	// Tool-related errors
	ErrNoToolsRegistered       = errors.New("zeus: no tools registered")
	ErrMaxIterationsExceeded   = errors.New("zeus: max agent iterations exceeded")
	ErrMaxToolCallsExceeded    = errors.New("zeus: max tool calls exceeded")
	ErrTemplateApply           = errors.New("zeus: failed to apply chat template with tools")
	ErrToolTemplateUnsupported = errors.New("zeus: model does not support native tool templates")
)
Sentinel errors for use with errors.Is()
Functions ¶
func SetVerbose ¶
func SetVerbose(verbose bool)
SetVerbose enables or disables verbose logging from llama.cpp.
Types ¶
type AgentConfig ¶ added in v1.1.0
type AgentConfig struct {
Tools []Tool // Registered tools
MaxIterations int // Maximum agentic loop iterations (default: 10)
MaxToolCalls int // Maximum total tool calls (default: 25)
ToolTimeout time.Duration // Per-tool execution timeout (default: 30s)
ToolChoice ToolChoice // How the model should use tools (default: Auto)
ParallelToolCalls bool // Allow multiple tool calls in one response (default: true)
}
AgentConfig holds configuration for agentic tool execution.
func DefaultAgentConfig ¶ added in v1.1.0
func DefaultAgentConfig() AgentConfig
DefaultAgentConfig returns an AgentConfig with sensible defaults.
type AgentEvent ¶ added in v1.1.0
type AgentEvent struct {
Type AgentEventType // Type of event
Token *Token // For AgentEventToken: the generated token
ToolCall *ToolCall // For AgentEventToolCallStart/End: the tool call
Result *ToolResult // For AgentEventToolCallEnd: the result
Error error // For AgentEventError: the error that occurred
}
AgentEvent represents an event during agentic loop execution.
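A consumer typically switches on the event type while ranging over GenerateWithTools; a minimal sketch, assuming a chat with registered tools (the prompt is illustrative):

for ev, err := range chat.GenerateWithTools(ctx, "What time is it in Oslo?") {
	if err != nil {
		log.Fatal(err)
	}
	switch ev.Type {
	case zeus.AgentEventToken:
		fmt.Print(ev.Token.Text)
	case zeus.AgentEventToolCallStart:
		log.Printf("calling %s", ev.ToolCall.Name)
	case zeus.AgentEventToolCallEnd:
		log.Printf("result: %s", ev.Result.Content)
	case zeus.AgentEventError:
		log.Printf("agent error: %v", ev.Error)
	}
}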
type AgentEventType ¶ added in v1.1.0
type AgentEventType int
AgentEventType indicates the type of event during agentic loop execution.
const (
	AgentEventToken AgentEventType = iota // A token was generated
	AgentEventToolCallStart               // A tool call is starting
	AgentEventToolCallEnd                 // A tool call completed
	AgentEventError                       // An error occurred
	AgentEventDone                        // The agentic loop completed
)
func (AgentEventType) String ¶ added in v1.1.0
func (t AgentEventType) String() string
String returns the string representation of the event type.
type Chat ¶
type Chat interface {
// Generate sends a user message and returns the assistant's response as a stream.
// The user message and assistant response are added to the conversation.
Generate(ctx context.Context, userMessage string, opts ...GenerateOption) io.ReadCloser
// GenerateSequence sends a user message and returns tokens as an iterator.
// The user message and assistant response are added to the conversation.
GenerateSequence(ctx context.Context, userMessage string, opts ...GenerateOption) iter.Seq2[Token, error]
// GenerateWithTools executes an agentic loop, auto-executing tools until the model
// produces a final response without tool calls. Requires tools to be registered via WithTools.
GenerateWithTools(ctx context.Context, userMessage string, opts ...GenerateOption) iter.Seq2[AgentEvent, error]
// AddMessage adds a message to the conversation without generating.
// Useful for adding system prompts or reconstructing conversation history.
AddMessage(role Role, content string)
// Checkpoint creates a snapshot of the current chat state.
Checkpoint() Chat
// Backtrack returns to the state before the last Generate call.
Backtrack() (Chat, bool)
// Messages returns a copy of the message history.
Messages() []ChatMessage
// MessageCount returns the number of messages in the conversation.
MessageCount() int
// Model returns the parent model.
Model() Model
// Tools returns the registered tools for this chat.
Tools() []Tool
// Compact summarizes older messages and replaces them with the summary, keeping the last 10% of messages (or at least 1).
Compact(ctx context.Context) error
}
Chat represents a conversation that tracks message history. Generate methods mutate the chat, appending assistant responses. Use Checkpoint() before generation to save state for branching. Note: Chat wraps a Session internally for KV cache efficiency.
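For example, branching a conversation by snapshotting before generation; a minimal sketch (the prompt is illustrative):

saved := chat.Checkpoint()
for tok, err := range chat.GenerateSequence(ctx, "Explore option A.") {
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(tok.Text)
}
// Discard that branch and continue from the snapshot instead.
chat = saved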
type ChatConfig ¶
type ChatConfig struct {
ChatTemplateConfig // Embedded - Template, AddAssistant
AgentConfig AgentConfig // Agent/tool configuration
ChatFormat ChatFormat // Tool call format for parsing (default: Hermes2Pro)
}
ChatConfig holds options for creating a Chat.
func DefaultChatConfig ¶
func DefaultChatConfig() ChatConfig
DefaultChatConfig returns the default chat configuration.
type ChatFormat ¶ added in v1.1.0
type ChatFormat int32
ChatFormat represents the tool call format for parsing model output. Different models use different formats for function/tool calling. This is forward-compatible: new formats added to llama.cpp work automatically even if not listed here. Use String() to get a readable name.
const (
	ChatFormatContentOnly             ChatFormat = 0  // No tool calls, content only
	ChatFormatGeneric                 ChatFormat = 1  // Generic format with JSON
	ChatFormatMistralNemo             ChatFormat = 2  // Mistral Nemo format
	ChatFormatMagistral               ChatFormat = 3  // Magistral format
	ChatFormatLlama3X                 ChatFormat = 4  // Llama 3.x format
	ChatFormatLlama3XWithBuiltinTools ChatFormat = 5  // Llama 3.x with builtin tools
	ChatFormatDeepSeekR1              ChatFormat = 6  // DeepSeek R1 format
	ChatFormatFireFunctionV2          ChatFormat = 7  // FireFunction v2 format
	ChatFormatFunctionaryV32          ChatFormat = 8  // Functionary v3.2 format
	ChatFormatFunctionaryV31Llama31   ChatFormat = 9  // Functionary v3.1 Llama 3.1 format
	ChatFormatDeepSeekV31             ChatFormat = 10 // DeepSeek V3.1 format
	ChatFormatHermes2Pro              ChatFormat = 11 // Hermes 2 Pro format (Qwen 2.5, Hermes 2/3)
	ChatFormatCommandR7B              ChatFormat = 12 // Command R7B format
	ChatFormatGranite                 ChatFormat = 13 // Granite format
	ChatFormatGPTOSS                  ChatFormat = 14 // GPT-OSS format
	ChatFormatSeedOSS                 ChatFormat = 15 // Seed-OSS format
	ChatFormatNemotronV2              ChatFormat = 16 // Nemotron V2 format
	ChatFormatApertus                 ChatFormat = 17 // Apertus format
	ChatFormatLFM2WithJSONTools       ChatFormat = 18 // LFM2 with JSON tools format
	ChatFormatGLM45                   ChatFormat = 19 // GLM 4.5 format
	ChatFormatMiniMaxM2               ChatFormat = 20 // MiniMax-M2 format
	ChatFormatKimiK2                  ChatFormat = 21 // Kimi K2 format
	ChatFormatQwen3CoderXML           ChatFormat = 22 // Qwen3 Coder format
	ChatFormatApriel15                ChatFormat = 23 // Apriel 1.5 format
	ChatFormatXiaomiMiMo              ChatFormat = 24 // Xiaomi MiMo format
)
Known chat formats. This list may be incomplete - llama.cpp may support additional formats that will work automatically.
func (ChatFormat) String ¶ added in v1.1.0
func (f ChatFormat) String() string
String returns the name of the chat format from llama.cpp.
type ChatMessage ¶
type ChatMessage struct {
Role Role
Content string
// For assistant messages: tool calls made in this message
ToolCalls []ToolCall
// For tool result messages: identifies which tool call this responds to
ToolName string
ToolCallID string
}
ChatMessage represents a single message in a conversation.
type ChatOption ¶
type ChatOption func(*ChatConfig)
ChatOption is a functional option for NewChat.
func WithChatFormat ¶ added in v1.1.0
func WithChatFormat(format ChatFormat) ChatOption
WithChatFormat sets the tool call format for parsing model output. Different models use different formats. Normally this is auto-detected by llama.cpp when applying tool templates, so you typically don't need to set this explicitly.
func WithMaxIterations ¶ added in v1.1.0
func WithMaxIterations(n int) ChatOption
WithMaxIterations sets the maximum number of agentic loop iterations. Each iteration may contain multiple tool calls. Default is 10.
func WithMaxToolCalls ¶ added in v1.1.0
func WithMaxToolCalls(n int) ChatOption
WithMaxToolCalls sets the maximum total tool calls across all iterations. Default is 25.
func WithParallelToolCalls ¶ added in v1.1.0
func WithParallelToolCalls(parallel bool) ChatOption
WithParallelToolCalls controls whether the model can make multiple tool calls in a single response. Default is true.
func WithToolChoice ¶ added in v1.1.0
func WithToolChoice(choice ToolChoice) ChatOption
WithToolChoice sets how the model should use tools.
- ToolChoiceAuto (default): Model decides when to use tools
- ToolChoiceNone: Never use tools
- ToolChoiceRequired: Must use a tool
func WithToolTimeout ¶ added in v1.1.0
func WithToolTimeout(d time.Duration) ChatOption
WithToolTimeout sets the per-tool execution timeout. Default is 30 seconds.
func WithTools ¶ added in v1.1.0
func WithTools(tools ...Tool) ChatOption
WithTools registers tools for the chat to use with GenerateWithTools.
type ChatParams ¶ added in v1.1.0
type ChatParams struct {
Prompt string // Formatted prompt with tools embedded
Grammar string // GBNF grammar for constraining output (may be empty)
Format ChatFormat // Detected chat format
GrammarLazy bool // Apply grammar only after trigger patterns
GrammarTriggers []GrammarTrigger // Typed patterns that activate grammar
AdditionalStops []string // Extra stop sequences
}
ChatParams contains the result of applying a chat template with tools. This includes the formatted prompt, grammar constraints, and other metadata.
type ChatTemplateConfig ¶
type ChatTemplateConfig struct {
Template string // Empty = use model's embedded template
AddAssistant bool // Add assistant turn prefix (default: true)
AutoCompactThreshold float32 // Ratio after which automatic compact occurs (Chat only)
}
ChatTemplateConfig holds options for ApplyChatTemplate.
func DefaultChatTemplateConfig ¶
func DefaultChatTemplateConfig() ChatTemplateConfig
DefaultChatTemplateConfig returns the default chat template configuration.
type ChatTemplateError ¶
type ChatTemplateError struct {
Message string
}
ChatTemplateError provides details about chat template failures.
func (*ChatTemplateError) Error ¶
func (e *ChatTemplateError) Error() string
type ChatTemplateOption ¶
type ChatTemplateOption func(*ChatTemplateConfig)
ChatTemplateOption is a functional option for ApplyChatTemplate.
func WithAddAssistant ¶
func WithAddAssistant(add bool) ChatTemplateOption
WithAddAssistant controls whether to append the assistant turn prefix.
func WithAutoCompactThreshold ¶
func WithAutoCompactThreshold(threshold float32) ChatTemplateOption
WithAutoCompactThreshold sets the context usage ratio at which Chat automatically compacts the conversation. When context usage exceeds this threshold (e.g., 0.8 = 80%), older messages are summarized to free space. Use 0 to disable auto-compaction.
func WithChatTemplate ¶
func WithChatTemplate(name string) ChatTemplateOption
WithChatTemplate specifies a built-in template name (e.g., "chatml", "llama3").
type EmbeddingError ¶
type EmbeddingError struct {
Message string
}
EmbeddingError provides details about embedding extraction failures.
func (*EmbeddingError) Error ¶
func (e *EmbeddingError) Error() string
type GenerateConfig ¶
type GenerateConfig struct {
MaxTokens int // Maximum tokens to generate (0 = unlimited)
Temperature float32 // Sampling temperature (higher = more random)
TopK int // Top-K sampling (0 = disabled)
TopP float32 // Nucleus sampling probability
MinP float32 // Minimum probability threshold
RepeatPenalty float32 // Repetition penalty (1.0 = no penalty)
RepeatLastN int // Number of tokens to consider for repetition penalty
FrequencyPenalty float32 // Frequency-based penalty
PresencePenalty float32 // Presence-based penalty
Mirostat MirostatMode // Mirostat sampling mode
MirostatTau float32 // Mirostat target entropy
MirostatEta float32 // Mirostat learning rate
StopSequences []string // Sequences that stop generation
IgnoreEOS bool // Continue past end-of-sequence token
Grammar string // GBNF grammar to constrain output
Seed int // Random seed for sampling (-1 = random)
Threads int // Number of threads for generation (0 = autodetect)
ReasoningEnabled bool // Enable model thinking/reasoning (default: true)
// contains filtered or unexported fields
}
GenerateConfig holds configuration for text generation.
func DefaultGenerateConfig ¶
func DefaultGenerateConfig() GenerateConfig
DefaultGenerateConfig returns a GenerateConfig with sensible defaults.
type GenerateOption ¶
type GenerateOption func(*GenerateConfig)
GenerateOption configures text generation.
func WithFrequencyPenalty ¶
func WithFrequencyPenalty(penalty float32) GenerateOption
WithFrequencyPenalty sets the frequency-based penalty. Penalizes tokens based on their frequency in the generated text.
func WithGenerateSeed ¶
func WithGenerateSeed(seed int) GenerateOption
WithGenerateSeed sets the random seed for sampling. Use -1 for random seed.
func WithGrammar ¶
func WithGrammar(grammar string) GenerateOption
WithGrammar constrains output to match a GBNF grammar.
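For example, a GBNF grammar can pin the output to a fixed vocabulary; a minimal sketch (the grammar and prompt are illustrative):

const yesNo = `root ::= "yes" | "no"`
for tok, err := range session.GenerateSequence(ctx, "Is the sky blue? Answer yes or no.",
	zeus.WithGrammar(yesNo),
) {
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(tok.Text)
}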
func WithIgnoreEOS ¶
func WithIgnoreEOS() GenerateOption
WithIgnoreEOS enables generation past the end-of-sequence token.
func WithMaxTokens ¶
func WithMaxTokens(n int) GenerateOption
WithMaxTokens sets the maximum number of tokens to generate. Use 0 for unlimited (generates until EOS or context full).
func WithMinP ¶
func WithMinP(p float32) GenerateOption
WithMinP sets the minimum probability threshold. Tokens with probability below this are excluded.
func WithMirostat ¶
func WithMirostat(mode MirostatMode, tau, eta float32) GenerateOption
WithMirostat enables Mirostat adaptive sampling.
func WithPresencePenalty ¶
func WithPresencePenalty(penalty float32) GenerateOption
WithPresencePenalty sets the presence-based penalty. Penalizes tokens that have already appeared in the generated text.
func WithReasoningEnabled ¶
func WithReasoningEnabled(enabled bool) GenerateOption
WithReasoningEnabled enables or disables model thinking/reasoning. When disabled, thinking tags are closed immediately to prevent reasoning output. This is equivalent to llama.cpp server's --reasoning-budget 0.
func WithRepeatLastN ¶
func WithRepeatLastN(n int) GenerateOption
WithRepeatLastN sets how many recent tokens to consider for repetition penalty.
func WithRepeatPenalty ¶
func WithRepeatPenalty(penalty float32) GenerateOption
WithRepeatPenalty sets the repetition penalty. Values > 1.0 discourage repetition, 1.0 = no penalty.
func WithStopSequences ¶
func WithStopSequences(seqs ...string) GenerateOption
WithStopSequences sets sequences that will stop generation when encountered.
func WithTemperature ¶
func WithTemperature(t float32) GenerateOption
WithTemperature sets the sampling temperature. Higher values (e.g., 1.0) make output more random. Lower values (e.g., 0.2) make output more deterministic.
func WithThreads ¶
func WithThreads(n int) GenerateOption
WithThreads sets the number of threads for generation. The default is -1, which autodetects the number of available cores.
func WithTopK ¶
func WithTopK(k int) GenerateOption
WithTopK sets the top-K sampling value. Only the K most likely tokens are considered. Use 0 to disable.
func WithTopP ¶
func WithTopP(p float32) GenerateOption
WithTopP sets the nucleus sampling probability. Tokens are sampled from the smallest set whose cumulative probability exceeds P.
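These sampling options compose; a minimal sketch of a conservative setup (the values are illustrative, not recommendations):

opts := []zeus.GenerateOption{
	zeus.WithTemperature(0.3),
	zeus.WithTopK(40),
	zeus.WithTopP(0.9),
	zeus.WithRepeatPenalty(1.1),
}
for tok, err := range session.GenerateSequence(ctx, prompt, opts...) {
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(tok.Text)
}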
type GenerationError ¶
GenerationError provides details about text generation failures.
func (*GenerationError) Error ¶
func (e *GenerationError) Error() string
type GrammarTrigger ¶ added in v1.1.3
type GrammarTrigger struct {
Type GrammarTriggerType
Value string // For word/pattern types
Token int32 // For token type
}
GrammarTrigger represents a pattern that activates lazy grammar.
type GrammarTriggerType ¶ added in v1.1.3
type GrammarTriggerType int32
GrammarTriggerType defines how a trigger pattern should be matched.
const (
	TriggerTypeWord        GrammarTriggerType = 0 // Exact word match (auto-escaped)
	TriggerTypePattern     GrammarTriggerType = 1 // Regex pattern (anywhere in output)
	TriggerTypePatternFull GrammarTriggerType = 2 // Full regex pattern
	TriggerTypeToken       GrammarTriggerType = 3 // Token ID trigger
)
type KVCacheType ¶
type KVCacheType int
KVCacheType represents the data type for KV cache storage. Lower precision types use less memory but may reduce quality slightly.
const (
	KVCacheF32 KVCacheType = iota // Full precision (32-bit float)
	KVCacheF16                    // Half precision (16-bit float) - default
	KVCacheQ8_0                   // 8-bit quantized
	KVCacheQ4_0                   // 4-bit quantized - lowest memory
)
func (KVCacheType) String ¶
func (k KVCacheType) String() string
String returns the llama.cpp string representation.
type MirostatMode ¶
type MirostatMode int
MirostatMode controls the Mirostat adaptive sampling algorithm.
const (
	MirostatDisabled MirostatMode = iota // Standard sampling (top-k, top-p, temperature)
	Mirostat1                            // Mirostat v1 algorithm
	Mirostat2                            // Mirostat v2 algorithm
)
type Model ¶
type Model interface {
// NewSession creates a new empty session for text generation.
NewSession() *session
// NewChat creates a new empty chat for conversation.
// Uses the model's embedded template by default.
NewChat(opts ...ChatOption) Chat
// Embeddings extracts embeddings for the given text.
// The model must be loaded with WithEmbeddings() option.
Embeddings(ctx context.Context, text string) ([]float32, error)
// EmbeddingsBatch extracts embeddings for multiple texts in a single call.
// The model must be loaded with WithEmbeddings() option.
EmbeddingsBatch(ctx context.Context, texts []string) ([][]float32, error)
// Tokenize converts text to token IDs.
Tokenize(text string, addSpecial bool) ([]int, error)
// TokenizeCount returns the number of token IDs that the text represents.
TokenizeCount(text string, addSpecial bool) (int, error)
// Detokenize converts token IDs back to text.
Detokenize(tokens []int) (string, error)
// DetokenizeLength returns the length of the string that the token IDs represent.
DetokenizeLength(tokens []int) (int, error)
// BOS returns the beginning-of-sequence token ID.
BOS() int
// EOS returns the end-of-sequence token ID.
EOS() int
// TokenToText converts a single token ID to its text representation.
TokenToText(token int) string
// IsSpecialToken returns true if the token is a special/control token.
IsSpecialToken(token int) bool
// IsEOG returns true if the token is an end-of-generation token.
IsEOG(token int) bool
// SpecialTokens returns all special token IDs.
SpecialTokens() SpecialTokens
// VocabSize returns the vocabulary size.
VocabSize() int
// ContextSize returns the effective context window size.
ContextSize() int
// TrainContextSize returns the model's original training context size.
TrainContextSize() int
// EmbeddingSize returns the embedding dimension.
EmbeddingSize() int
// Info returns model metadata and architecture details.
Info() ModelInfo
// ChatTemplate returns the model's embedded chat template string.
// Returns empty string if no template is embedded.
ChatTemplate() string
// ApplyChatTemplate formats messages using a chat template.
// Uses model's embedded template by default.
ApplyChatTemplate(messages []ChatMessage, opts ...ChatTemplateOption) (string, error)
// Close releases model resources.
Close() error
}
Model represents a loaded LLM model.
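For example, a round trip through the tokenizer; a minimal sketch, assuming a loaded model:

ids, err := model.Tokenize("Hello, world!", true)
if err != nil {
	log.Fatal(err)
}
fmt.Println(ids) // token IDs, including special tokens (addSpecial = true)

text, err := model.Detokenize(ids)
if err != nil {
	log.Fatal(err)
}
fmt.Println(text)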
type ModelConfig ¶
type ModelConfig struct {
ContextSize int // Context window size (0 = model's native context)
Seed int // Random seed for model initialization
BatchSize int // Batch size for prompt processing
GPULayers int // Number of layers to offload to GPU (GPULayersAll for all)
MainGPU int // Primary GPU device index for multi-GPU systems
TensorSplit []float32 // Distribution of layers across GPUs (e.g., [0.5, 0.5])
KVCacheType KVCacheType // Data type for KV cache storage
RopeFreqBase float32 // RoPE frequency base (0 = from model)
RopeFreqScale float32 // RoPE frequency scale (0 = from model)
LoraAdapter string // Path to LoRA adapter file
UseMMap bool // Use memory mapping for model loading
UseMlock bool // Lock model in memory (prevent swapping)
UseNUMA bool // Enable NUMA optimizations
Embeddings bool // Enable embedding extraction mode
Warmup bool // Run warmup in background after loading (default: true)
}
ModelConfig holds configuration for model loading.
func DefaultModelConfig ¶
func DefaultModelConfig() ModelConfig
DefaultModelConfig returns a ModelConfig with sensible defaults.
type ModelInfo ¶
type ModelInfo struct {
Description string // Model description (e.g., "LLaMA v2 7B Q4_K_M")
Architecture string // Architecture name from metadata
QuantType string // Quantization type (e.g., "Q4_K_M")
Parameters uint64 // Total parameter count
Size uint64 // Model size in bytes
Layers int // Number of layers
Heads int // Number of attention heads
HeadsKV int // Number of KV heads
VocabSize int // Vocabulary size
}
ModelInfo provides model metadata and architecture details.
type ModelLoadError ¶
ModelLoadError provides details about model loading failures.
func (*ModelLoadError) Error ¶
func (e *ModelLoadError) Error() string
type ModelOption ¶
type ModelOption func(*ModelConfig)
ModelOption configures model loading.
func WithBatchSize ¶
func WithBatchSize(n int) ModelOption
WithBatchSize sets the batch size for prompt processing.
func WithContextSize ¶
func WithContextSize(n int) ModelOption
WithContextSize sets the context window size. Use 0 to use the model's native context size.
func WithEmbeddings ¶
func WithEmbeddings() ModelOption
WithEmbeddings enables embedding extraction mode. Required to use the Embeddings() method.
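A minimal sketch of embedding extraction (the model path is illustrative):

model, err := zeus.LoadModel("embedding-model.gguf",
	zeus.WithEmbeddings(),
)
if err != nil {
	log.Fatal(err)
}
defer model.Close()

vec, err := model.Embeddings(ctx, "The quick brown fox")
if err != nil {
	log.Fatal(err)
}
fmt.Println(len(vec)) // matches model.EmbeddingSize()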
func WithGPULayers ¶
func WithGPULayers(n int) ModelOption
WithGPULayers sets the number of layers to offload to GPU. Use GPULayersAll to offload all layers, 0 for CPU only.
func WithKVCacheType ¶
func WithKVCacheType(t KVCacheType) ModelOption
WithKVCacheType sets the data type for KV cache storage.
func WithLoRA ¶
func WithLoRA(path string) ModelOption
WithLoRA loads a LoRA adapter from the specified path.
func WithMMap ¶
func WithMMap(enable bool) ModelOption
WithMMap enables or disables memory mapping for model loading. Disabling it forces the entire model to be loaded into memory, which may improve performance.
func WithMainGPU ¶
func WithMainGPU(gpu int) ModelOption
WithMainGPU sets the primary GPU device index for multi-GPU systems.
func WithMlock ¶
func WithMlock(enable bool) ModelOption
WithMlock enables or disables memory locking. When enabled, the model is locked in RAM to prevent swapping.
func WithNUMA ¶
func WithNUMA(enable bool) ModelOption
WithNUMA enables or disables NUMA optimizations.
func WithRopeFreqBase ¶
func WithRopeFreqBase(base float32) ModelOption
WithRopeFreqBase sets the RoPE frequency base. Use 0 to use the model's default value.
func WithRopeFreqScale ¶
func WithRopeFreqScale(scale float32) ModelOption
WithRopeFreqScale sets the RoPE frequency scale. Use 0 to use the model's default value.
func WithSeed ¶
func WithSeed(seed int) ModelOption
WithSeed sets the random seed for model initialization.
func WithTensorSplit ¶
func WithTensorSplit(split []float32) ModelOption
WithTensorSplit sets the distribution of layers across multiple GPUs. For example, []float32{0.5, 0.5} splits evenly between two GPUs.
func WithWarmup ¶ added in v1.2.0
func WithWarmup(enable bool) ModelOption
WithWarmup enables or disables background warmup after model loading. When enabled (default), this runs a minimal decode in a goroutine to initialize GPU kernels and reduce latency on the first real generation.
type ParseResult ¶ added in v1.1.0
type ParseResult struct {
Content string // Non-tool-call text content
ReasoningContent string // Reasoning/thinking content (if any)
ToolCalls []ToolCall // Parsed tool calls
}
ParseResult contains the parsed tool calls from model output.
type Session ¶
type Session interface {
// Generate processes the prompt, returns new tokens and moves the session forward.
Generate(ctx context.Context, prompt string, opts ...GenerateOption) io.ReadCloser
// GenerateSequence processes the prompt, returns new tokens as yield iterator and moves the session forward.
GenerateSequence(ctx context.Context, prompt string, opts ...GenerateOption) iter.Seq2[Token, error]
// GenerateSequenceWithLogprobs is like GenerateSequence but returns tokens with probability information.
// topK specifies how many top alternatives to include (0 = just the selected token's prob/logit).
GenerateSequenceWithLogprobs(ctx context.Context, prompt string, topK int, opts ...GenerateOption) iter.Seq2[TokenWithLogprobs, error]
// Checkpoint creates a snapshot of the current session state.
// Both the original and checkpoint can be used independently for branching.
// Note: All sessions share one KV cache; switching between divergent sessions recomputes tokens from the common prefix.
Checkpoint() Session
// Backtrack returns to the state before the last Generate call.
// Returns ok=false if this is the initial session (no parent).
Backtrack() (Session, bool)
// Tokens returns a copy of the token history for this session.
Tokens() []int
// Text returns the full text of this session (detokenized).
Text() (string, error)
// TokenCount returns the number of tokens in this session.
TokenCount() int
// ContextUsed returns the percentage of context used in this session.
ContextUsed() float64
// Model returns the parent model.
Model() Model
}
Session represents a conversation state that tracks token history. Generate/GenerateSequence mutate the session, appending new tokens. Use Checkpoint() before generation to save state for branching. Note: All sessions share one KV cache. Switching between unrelated sessions recomputes all tokens.
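Token probabilities are available through GenerateSequenceWithLogprobs; a minimal sketch requesting the top 3 alternatives per token, assuming a session, prompt, and ctx as above:

for tok, err := range session.GenerateSequenceWithLogprobs(ctx, prompt, 3) {
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%q p=%.3f\n", tok.Text, tok.Prob)
	for _, alt := range tok.TopK {
		fmt.Printf("  alt %q p=%.3f\n", alt.Text, alt.Prob)
	}
}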
type SpecialTokens ¶
type SpecialTokens struct {
BOS int // Beginning of sequence (-1 if not available)
EOS int // End of sequence
EOT int // End of turn
PAD int // Padding
SEP int // Separator
NL int // Newline
}
SpecialTokens contains all special token IDs for the model.
type StopReason ¶
type StopReason int
StopReason indicates why text generation stopped.
const (
	StopReasonEOS          StopReason = iota // End of sequence token encountered
	StopReasonMaxTokens                      // Reached maximum token limit
	StopReasonStopSequence                   // Matched a stop sequence
	StopReasonCancelled                      // Context was cancelled
	StopReasonError                          // An error occurred
)
func (StopReason) String ¶
func (s StopReason) String() string
type Token ¶
type Token struct {
Text string // The token text (may not be valid UTF-8 for partial tokens)
ID int // Token ID from the vocabulary
}
Token represents a single generated token.
type TokenProb ¶
type TokenProb struct {
Token int // Token ID
Text string // Token text
Prob float32 // Probability (0-1)
Logit float32 // Raw logit value
}
TokenProb represents a token with its probability.
type TokenWithLogprobs ¶
type TokenWithLogprobs struct {
Token // Embedded - ID, Text
Prob float32 // Probability of selected token
Logit float32 // Logit of selected token
TopK []TokenProb // Top-K alternatives (if requested)
}
TokenWithLogprobs extends Token with probability information.
type TokenizeError ¶
TokenizeError provides details about tokenization failures.
func (*TokenizeError) Error ¶
func (e *TokenizeError) Error() string
type Tool ¶ added in v1.1.0
type Tool interface {
// Name returns the unique identifier for this tool.
Name() string
// Description returns a human-readable description of what this tool does.
// This is provided to the model to help it decide when to use the tool.
Description() string
// Parameters returns the list of parameters this tool accepts.
Parameters() []ToolParameter
// Execute runs the tool with the given arguments and returns the result.
// The result should be a string that can be fed back to the model.
// Return an error if the tool execution fails.
Execute(ctx context.Context, args map[string]any) (string, error)
}
Tool defines a callable function that the model can invoke. Implement this interface to create custom tools.
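A minimal sketch of a custom tool; the clock tool is hypothetical, not part of the package:

type clockTool struct{}

func (clockTool) Name() string        { return "current_time" }
func (clockTool) Description() string { return "Returns the current UTC time in RFC 3339 format." }

// No parameters: the model calls this tool with empty arguments.
func (clockTool) Parameters() []zeus.ToolParameter { return nil }

func (clockTool) Execute(ctx context.Context, args map[string]any) (string, error) {
	return time.Now().UTC().Format(time.RFC3339), nil
}

Register it with WithTools (for example, model.NewChat(zeus.WithTools(clockTool{}))) and drive it with GenerateWithTools.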
type ToolCall ¶ added in v1.1.0
type ToolCall struct {
ID string // Unique identifier for this call
Name string // Name of the tool to invoke
Arguments map[string]any // Parsed arguments from the model
}
ToolCall represents a tool invocation requested by the model.
type ToolChoice ¶ added in v1.1.0
type ToolChoice int
ToolChoice controls how the model should use tools during generation.
const (
	ToolChoiceAuto ToolChoice = iota // Model decides when to use tools
	ToolChoiceNone                   // Never use tools
	ToolChoiceRequired               // Must use a tool
)
type ToolExecutionError ¶ added in v1.1.0
ToolExecutionError provides details about tool execution failures.
func (*ToolExecutionError) Error ¶ added in v1.1.0
func (e *ToolExecutionError) Error() string
func (*ToolExecutionError) Unwrap ¶ added in v1.1.0
func (e *ToolExecutionError) Unwrap() error
type ToolParameter ¶ added in v1.1.0
type ToolParameter struct {
Name string // Parameter name
Type string // Type: "string", "number", "boolean", "array", "object"
Description string // Human-readable description
Required bool // Whether this parameter is required
Enum []string // Optional: allowed values for this parameter
}
ToolParameter describes a single parameter for a tool.
type ToolResult ¶ added in v1.1.0
type ToolResult struct {
CallID string // Corresponds to ToolCall.ID
Content string // Result content to feed back to the model
IsError bool // Whether this result represents an error
}
ToolResult represents the outcome of executing a tool.