Model Configuration

Complete guide to AI model settings and optimization

This page explains the various AI model configuration parameters available in Sapientia. Each setting can be customized to optimize performance, creativity, and resource usage according to your needs.


Base Models

AI Model

AI Model is the primary language model used to generate responses. Sapientia supports models in GGUF format that can be run locally.

How It Works:

  • This model acts as the "brain" of your AI assistant
  • Processes questions and generates relevant answers
  • Larger models are generally more accurate in their responses

How to Use:

  1. Click the "Change Model" button to select a model file from your local storage
  2. Select a file with .gguf format
  3. The system will display model compatibility information
  4. Click "Apply" to apply the changes

Tips:

  • Choose a model that matches your computer's RAM/VRAM capacity
  • Smaller models are faster but may be less accurate
  • Larger models are more accurate but require more resources
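
If you want to reproduce the same setup from a script, a local runtime such as llama-cpp-python can load the same .gguf file. This is a minimal sketch under that assumption; the model path is a placeholder, and Sapientia's own backend may differ:

```python
# Minimal sketch using llama-cpp-python; the model path is a placeholder.
from llama_cpp import Llama

llm = Llama(model_path="models/my-model.gguf")  # load the GGUF "brain"
out = llm.create_completion("Explain what a GGUF file is.", max_tokens=128)
print(out["choices"][0]["text"])
```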

Embedding Model

Embedding Model is a specialized model that converts text into vector representations for semantic search and RAG (Retrieval-Augmented Generation) capabilities.

How It Works:

  • Converts text into numerical vectors
  • Enables AI to understand context and meaning of text
  • Used to search for relevant information from the knowledge base

How to Use:

  1. Click the "Change Model" button in the Embedding Model section
  2. Select an embedding model file in .gguf format
  3. Ensure the model supports the appropriate embedding dimensions (default: 2048)
  4. Click "Apply" to apply the changes

Tips:

  • Embedding models are typically smaller than the main AI model
  • A good embedding model improves document search accuracy
  • Choose a model that supports the languages you use
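
To illustrate what the embedding model contributes, the sketch below (again assuming a llama-cpp-python backend and a placeholder model path) converts two sentences into vectors and compares them with cosine similarity, which is essentially what a knowledge-base search does:

```python
# Sketch: turn text into vectors with a GGUF embedding model (path is a placeholder).
import math
from llama_cpp import Llama

emb_model = Llama(model_path="models/my-embedding.gguf", embedding=True)

def embed(text):
    # create_embedding returns one vector per input text
    return emb_model.create_embedding(text)["data"][0]["embedding"]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Semantically related sentences should score higher than unrelated ones
print(cosine(embed("How do I reset my password?"),
             embed("Steps to recover a forgotten account password")))
```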

Inference Parameters

Temperature

Temperature controls the level of creativity and randomness in AI responses.

Value Range: 0.00 - 1.00

How It Works:

  • Low Values (0.0 - 0.3): More consistent and predictable responses. Suitable for tasks requiring high accuracy such as code, mathematics, or facts.
  • Medium Values (0.4 - 0.7): Balance between creativity and consistency. Ideal for general conversations.
  • High Values (0.8 - 1.0): More creative and varied responses. Suitable for creative writing, brainstorming, or idea exploration.

Usage Examples:

  • Temperature 0.1: "Calculate 2+2" → Always "4"
  • Temperature 0.8: "Tell me about dogs" → Various creative stories
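
The effect is easiest to see on a toy distribution. The sketch below uses plain Python and made-up scores for three candidate words: a low temperature concentrates almost all probability on the top word, while a higher temperature spreads it out:

```python
import math

def softmax_with_temperature(logits, temperature):
    # Lower temperature -> sharper distribution; higher -> flatter
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate words
print(softmax_with_temperature(logits, 0.1))  # nearly all mass on the first word
print(softmax_with_temperature(logits, 1.0))  # noticeably more spread out
```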

Top-P (Nucleus Sampling)

Top-P controls response diversity by limiting word choices based on cumulative probability.

Value Range: 0.00 - 1.00

How It Works:

  • The model selects from the smallest set of words whose cumulative probability reaches the Top-P value
  • Low Values (0.1 - 0.5): More focused and on-topic responses. AI selects from the most probable words.
  • High Values (0.6 - 1.0): More exploratory and diverse responses. AI considers more word choices.

Recommendations:

  • Use 0.9 - 1.0 for general conversations
  • Use 0.5 - 0.7 for specific tasks
  • Use < 0.5 for highly deterministic output
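
To make the cumulative cutoff concrete, the sketch below (plain Python, made-up probabilities) keeps the smallest set of words that reaches the Top-P value:

```python
def top_p_filter(probs, top_p):
    # Keep the smallest set of words whose cumulative probability reaches top_p
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for word, p in ranked:
        kept.append(word)
        total += p
        if total >= top_p:
            break
    return kept

probs = {"dog": 0.5, "cat": 0.3, "hamster": 0.15, "xylophone": 0.05}
print(top_p_filter(probs, 0.5))   # ['dog'] - very focused
print(top_p_filter(probs, 0.95))  # ['dog', 'cat', 'hamster'] - more diverse
```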

Top-K

Top-K limits the number of word choices the AI considers at each generation step.

Value Range: 0 - unlimited (practical: 10-100)

How It Works:

  • The model only considers K words with the highest probability
  • Value 0: No limit (all words may be considered)
  • Low Values (10-30): Very safe and predictable responses
  • Medium Values (40-60): Balance between variation and consistency
  • High Values (70-100): More vocabulary variation

Difference from Top-P:

  • Top-K: Fixed number of words limit
  • Top-P: Limit based on cumulative probability (more dynamic)
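
The same made-up distribution shows the difference in one line: Top-K always keeps a fixed number of candidates, no matter how the probability is spread:

```python
def top_k_filter(probs, k):
    # Keep only the k most probable words (k = 0 means no limit)
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in (ranked if k == 0 else ranked[:k])]

probs = {"dog": 0.5, "cat": 0.3, "hamster": 0.15, "xylophone": 0.05}
print(top_k_filter(probs, 2))  # ['dog', 'cat'] - always exactly two candidates
print(top_k_filter(probs, 0))  # all four words - no limit
```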

Min-P

Min-P filters out low-quality word choices based on a minimum probability threshold.

Value Range: 0.00 - 1.00

How It Works:

  • Removes words whose probability falls below the threshold (measured relative to the most probable word)
  • Low Values (0.0 - 0.3): More permissive, allows more variation
  • Medium Values (0.4 - 0.6): Balance between quality and variation
  • High Values (0.7 - 1.0): Strict quality control, only words close in probability to the top choice

Benefits:

  • Prevents AI from choosing nonsensical words
  • Improves response coherence
  • Reduces irrelevant output
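
Assuming Sapientia follows the usual llama.cpp-style definition of Min-P, the threshold is measured relative to the most probable word. A sketch with made-up probabilities:

```python
def min_p_filter(probs, min_p):
    # Keep words whose probability is at least min_p times the top word's probability
    threshold = min_p * max(probs.values())
    return [word for word, p in probs.items() if p >= threshold]

probs = {"dog": 0.5, "cat": 0.3, "hamster": 0.15, "xylophone": 0.05}
print(min_p_filter(probs, 0.05))  # permissive: all four words survive
print(min_p_filter(probs, 0.5))   # strict: only 'dog' and 'cat' remain
```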

Seed

Seed is a value that controls the reproducibility of AI output.

Value Range: -1 or a non-negative integer (0 - 2³¹)

How It Works:

  • -1: Random seed (different response each time for the same question)
  • Specific Number: Fixed seed (identical response for identical input with same parameters)

When to Use:

  • Random Seed (-1): Normal conversations, brainstorming, response variation
  • Fixed Seed: Testing, debugging, demos, reproducible research

Notes:

  • Same seed with different parameters will produce different output
  • Useful for comparing the effects of other parameter changes
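
If you drive the model from a script (the sketch below assumes a llama-cpp-python backend and a placeholder model path), a fixed seed combined with identical parameters should reproduce the same completion:

```python
# Sketch: reproducible output with a fixed seed (model path is a placeholder)
from llama_cpp import Llama

def generate(seed):
    llm = Llama(model_path="models/my-model.gguf", seed=seed)
    out = llm.create_completion("Name one planet.", max_tokens=16, temperature=0.8)
    return out["choices"][0]["text"]

print(generate(42) == generate(42))  # expected True: same seed, same parameters
print(generate(42) == generate(7))   # usually False: different seeds
```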

Hardware Configuration

GPU Offload

GPU Offload determines how many model layers are moved to the GPU for acceleration.

Value Range: -1 (auto) or 0 - 40+

How It Works:

  • -1 (Auto): System automatically detects and allocates GPU optimally
  • 0: CPU only (no GPU acceleration)
  • 1-40+: Number of model layers processed on GPU

Guidelines:

  • Higher values = more GPU usage
  • Higher values = faster responses but higher VRAM usage
  • Start with -1 (auto) for optimal results
  • If VRAM runs out, reduce the value gradually

Recommendations Based on VRAM:

  • 4GB VRAM: 10-20 layers (small models)
  • 6-8GB VRAM: 20-30 layers (medium models)
  • 12GB+ VRAM: 30-40 layers or auto (large models)
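
In llama.cpp-based tooling this setting corresponds to an n_gpu_layers value; the sketch below shows the equivalents (the model path is a placeholder, and Sapientia may wire this differently):

```python
from llama_cpp import Llama

# -1 = offload as many layers as the runtime can, 0 = CPU only,
# a positive number = offload exactly that many layers to the GPU
llm = Llama(model_path="models/my-model.gguf", n_gpu_layers=20)
```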

Context Length

Context Length is the maximum number of tokens the AI can remember in one conversation.

Value Range: 0 (auto) or minimum 512

How It Works:

  • Determines the AI's "memory" in conversations
  • 1 token ≈ 0.75 words (English) or 1-2 characters (Indonesian)
  • Higher values = AI remembers more conversation context

Settings:

  • 0: Auto-scaling based on model and system
  • 512-2048: Short conversations, simple chatbot
  • 2048-4096: Normal conversations with medium context
  • 4096-8192: Long conversations, document analysis
  • 8192+: Large document analysis, very long context

Considerations:

  • Larger context = higher RAM/VRAM usage
  • Larger context = slightly longer processing time
  • Adjust according to your hardware capacity
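
As a rough sanity check, you can see how much of the window a prompt consumes by tokenizing it. The sketch below assumes a llama-cpp-python backend with a placeholder model path:

```python
from llama_cpp import Llama

# Reserve a 4096-token context window
llm = Llama(model_path="models/my-model.gguf", n_ctx=4096)

prompt = "Summarize the attached report in three bullet points."
n_tokens = len(llm.tokenize(prompt.encode("utf-8")))
print(f"{n_tokens} tokens used out of 4096")
```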

Batch Size

Batch Size determines how many tokens are processed simultaneously.

Value Range: 32 - 2048 (common: 512)

How It Works:

  • Determines the number of tokens batched for parallel computation
  • Low Values (32-256): Slower but memory-efficient
  • Medium Values (512): Balance of speed and memory (recommended)
  • High Values (1024-2048): Faster but requires more memory

Recommendations:

  • Start with 512
  • If memory is sufficient and you want faster processing, increase to 1024
  • If memory is limited, decrease to 256
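
In llama.cpp-based tooling this corresponds to the n_batch parameter (a sketch, with a placeholder model path):

```python
from llama_cpp import Llama

# n_batch controls how many prompt tokens are processed per step;
# 512 is a safe default, raise it only if memory allows
llm = Llama(model_path="models/my-model.gguf", n_batch=512)
```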

CPU Threads

CPU Threads determines how many CPU threads are used for model inference.

Value Range: 1 - your CPU thread count

How It Works:

  • More threads = more parallel processing
  • Low Values (1-4): For systems with limited CPU
  • Medium Values (4-8): Balance for mainstream CPUs
  • High Values (8-16+): For high-end CPUs with many cores

Guidelines:

  • Don't exceed your CPU's total logical thread count
  • Recommendation: 50-75% of total CPU threads
  • Example: 8-core/16-thread CPU → use 8-12 threads
  • Leave threads for operating system and other applications

How to Find Your CPU Thread Count:

  • Windows: Task Manager → Performance → CPU
  • Linux: lscpu or nproc
  • macOS: System Information → Hardware
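
A small script can pick a sensible thread count automatically. The sketch below follows the 50-75% rule of thumb above and assumes a llama-cpp-python backend with a placeholder model path:

```python
import os
from llama_cpp import Llama

logical_threads = os.cpu_count() or 4            # total logical threads on this machine
n_threads = max(1, int(logical_threads * 0.75))  # leave headroom for the OS

llm = Llama(model_path="models/my-model.gguf", n_threads=n_threads)
print(f"Using {n_threads} of {logical_threads} threads")
```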

Flash Attention

Flash Attention is a more memory-efficient attention mechanism.

Options: Enabled/Disabled (Toggle)

How It Works:

  • Optimized attention algorithm to reduce VRAM usage
  • Increases speed on large models
  • Allows running larger models with the same VRAM

When to Enable:

  • ✅ If your GPU supports it (NVIDIA RTX 30/40 series, AMD RDNA2+)
  • ✅ When running large models with limited VRAM
  • ✅ To increase inference speed

When to Disable:

  • ❌ If experiencing errors or crashes
  • ❌ If GPU doesn't support it
  • ❌ If model is incompatible with Flash Attention

Notes:

  • Not all models support Flash Attention
  • If enabled and causing errors, disable this option
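
Recent llama-cpp-python builds expose this as a flash_attn flag; the sketch below assumes that backend and a placeholder model path (Sapientia may only expose the toggle in its UI):

```python
from llama_cpp import Llama

# Enable the optimized attention path; fall back to the default attention
# (flash_attn=False) if your GPU or model rejects it
llm = Llama(model_path="models/my-model.gguf", n_gpu_layers=-1, flash_attn=True)
```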

Use Mmap

Use Mmap (Memory Mapping) loads the model by mapping the file directly into memory instead of reading it all up front, which speeds up model loading.

Options: Enabled/Disabled (Toggle)

How It Works:

  • Maps model file directly to virtual memory
  • Reduces application startup time
  • Operating system manages model loading efficiently

Benefits:

  • ✅ Faster model loading
  • ✅ More responsive application startup
  • ✅ Memory efficiency on some systems

Considerations:

  • Requires sufficient virtual memory space
  • On some systems, may not provide significant benefits
  • Keep enabled unless experiencing issues

Recommendations:

  • Enable by default for optimal performance
  • Disable if experiencing errors during model loading
  • Disable if system has very limited memory
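
In llama.cpp-based tooling this corresponds to the use_mmap flag (a sketch, with a placeholder model path):

```python
from llama_cpp import Llama

# use_mmap=True maps the file into memory for faster loading;
# set it to False if model loading fails or memory is extremely tight
llm = Llama(model_path="models/my-model.gguf", use_mmap=True)
```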