Model Configuration
Complete guide to AI model settings and optimization
This page explains the various AI model configuration parameters available in Sapientia. Each setting can be customized to optimize performance, creativity, and resource usage according to your needs.
Base Models
AI Model
AI Model is the primary language model used to generate responses. Sapientia supports models in GGUF format that can be run locally.
How It Works:
- This model acts as the "brain" of your AI assistant
- Processes questions and generates relevant answers
- Larger models are generally more accurate in their responses
How to Use:
- Click the "Change Model" button to select a model file from your local storage
- Select a file in .gguf format
- The system will display model compatibility information
- Click "Apply" to apply the changes
Tips:
- Choose a model that matches your computer's RAM/VRAM capacity
- Smaller models are faster but may be less accurate
- Larger models are more accurate but require more resources
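For illustration, here is a minimal sketch of loading a GGUF model with the open-source llama-cpp-python library, one common way to run models in this format. The model path is a placeholder; Sapientia itself handles this step through its UI rather than through code.

```python
from llama_cpp import Llama

# Placeholder path: point this at whichever .gguf file you selected.
llm = Llama(model_path="models/your-model.gguf", verbose=False)

# Ask a short question and print the generated text.
out = llm("Explain what a GGUF file is in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```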
Embedding Model
Embedding Model is a specialized model that converts text into vector representations for semantic search and RAG (Retrieval-Augmented Generation) capabilities.
How It Works:
- Converts text into numerical vectors
- Enables AI to understand context and meaning of text
- Used to search for relevant information from the knowledge base
How to Use:
- Click the "Change Model" button in the Embedding Model section
- Select an embedding model file in .gguf format
- Ensure the model supports the appropriate embedding dimensions (default: 2048)
- Click "Apply" to apply the changes
Tips:
- Embedding models are typically smaller than the main AI model
- A good embedding model improves document search accuracy
- Choose a model that supports the languages you use
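As a rough sketch of what an embedding model does behind the scenes, the following assumes the llama-cpp-python library and a placeholder model path; the printed vector length depends on the embedding model you load.

```python
from llama_cpp import Llama

# Placeholder path to a local embedding model in GGUF format.
embedder = Llama(model_path="models/your-embedding-model.gguf",
                 embedding=True, verbose=False)

# Convert a sentence into its numerical vector representation.
vector = embedder.embed("How do I reset my password?")
print(len(vector))   # embedding dimension, e.g. 768, 1024, or 2048
print(vector[:5])    # first few components of the vector
```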
Inference Parameters
Temperature
Temperature controls the level of creativity and randomness in AI responses.
Value Range: 0.00 - 1.00
How It Works:
- Low Values (0.0 - 0.3): More consistent and predictable responses. Suitable for tasks requiring high accuracy such as code, mathematics, or facts.
- Medium Values (0.4 - 0.7): Balance between creativity and consistency. Ideal for general conversations.
- High Values (0.8 - 1.0): More creative and varied responses. Suitable for creative writing, brainstorming, or idea exploration.
Usage Examples:
- Temperature 0.1: "Calculate 2+2" → Always "4"
- Temperature 0.8: "Tell me about dogs" → Various creative stories
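To see how temperature reshapes the probability distribution over candidate words, here is a small standalone Python sketch with toy numbers (not taken from any real model):

```python
import math

def softmax_with_temperature(logits, temperature):
    # Scale logits by 1/temperature before the softmax; lower values
    # sharpen the distribution, higher values flatten it.
    scaled = [score / temperature for score in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # toy scores for three candidate words
print(softmax_with_temperature(logits, 0.1))  # ~[1.00, 0.00, 0.00] - near-deterministic
print(softmax_with_temperature(logits, 1.0))  # ~[0.63, 0.23, 0.14] - more varied
```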
Top-P (Nucleus Sampling)
Top-P controls response diversity by limiting word choices based on cumulative probability.
Value Range: 0.00 - 1.00
How It Works:
- The model selects from a set of words whose cumulative probability reaches the Top-P value
- Low Values (0.1 - 0.5): More focused and on-topic responses. AI selects from the most probable words.
- High Values (0.6 - 1.0): More exploratory and diverse responses. AI considers more word choices.
Recommendations:
- Use 0.9 - 1.0 for general conversations
- Use 0.5 - 0.7 for specific tasks
- Use < 0.5 for highly deterministic output
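The cumulative-probability cutoff can be illustrated with a small standalone Python sketch using made-up probabilities:

```python
def top_p_filter(probs, top_p):
    # Keep the smallest set of words whose cumulative probability
    # reaches top_p; everything after the cutoff is discarded.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for word, p in ranked:
        kept.append(word)
        cumulative += p
        if cumulative >= top_p:
            break
    return kept

probs = {"the": 0.45, "a": 0.30, "dog": 0.15, "xylophone": 0.10}
print(top_p_filter(probs, 0.5))   # ['the', 'a'] - focused
print(top_p_filter(probs, 0.95))  # all four words - exploratory
```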
Top-K
Top-K limits the number of word choices the AI considers at each generation step.
Value Range: 0 - unlimited (practical: 10-100)
How It Works:
- The model only considers K words with the highest probability
- Value 0: No limit (all words may be considered)
- Low Values (10-30): Very safe and predictable responses
- Medium Values (40-60): Balance between variation and consistency
- High Values (70-100): More vocabulary variation
Difference from Top-P:
- Top-K: Fixed number of words limit
- Top-P: Limit based on cumulative probability (more dynamic)
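A toy Python sketch of the fixed-count cutoff, again with made-up probabilities:

```python
def top_k_filter(probs, k):
    # Keep only the k most probable words; k == 0 means no limit.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in (ranked if k == 0 else ranked[:k])]

probs = {"run": 0.40, "walk": 0.25, "sprint": 0.20, "amble": 0.10, "teleport": 0.05}
print(top_k_filter(probs, 2))  # ['run', 'walk'] - safe and predictable
print(top_k_filter(probs, 0))  # all five words - no limit
```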
Min-P
Min-P filters out low-quality word choices based on a minimum probability threshold.
Value Range: 0.00 - 1.00
How It Works:
- Removes words whose probability falls below the threshold, measured relative to the most probable word
- Low Values (0.0 - 0.3): More permissive, allows more word variation (typical settings are around 0.05)
- Medium Values (0.4 - 0.6): Stricter filtering, keeps only reasonably probable words
- High Values (0.7 - 1.0): Very strict, only words nearly as probable as the top choice survive
Benefits:
- Prevents AI from choosing nonsensical words
- Improves response coherence
- Reduces irrelevant output
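A toy Python sketch of this relative-probability filter, using made-up probabilities:

```python
def min_p_filter(probs, min_p):
    # Drop words whose probability falls below min_p times the
    # probability of the single most likely word.
    p_max = max(probs.values())
    return [word for word, p in probs.items() if p >= min_p * p_max]

probs = {"cat": 0.50, "dog": 0.30, "hamster": 0.15, "zeppelin": 0.05}
print(min_p_filter(probs, 0.05))  # keeps all four words - permissive
print(min_p_filter(probs, 0.50))  # ['cat', 'dog'] - strict filtering
```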
Seed
Seed is a value that controls the reproducibility of AI output.
Value Range: -1 or positive number (0 - 2³¹)
How It Works:
- -1: Random seed (different response each time for the same question)
- Specific Number: Fixed seed (identical response for identical input with same parameters)
When to Use:
- Random Seed (-1): Normal conversations, brainstorming, response variation
- Fixed Seed: Testing, debugging, demos, reproducible research
Notes:
- Same seed with different parameters will produce different output
- Useful for comparing the effects of other parameter changes
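As an illustration of fixed versus random seeds, the following sketch assumes the llama-cpp-python library and a placeholder model path:

```python
from llama_cpp import Llama

# Placeholder model path; seed=1234 makes sampling reproducible.
llm = Llama(model_path="models/your-model.gguf", seed=1234, verbose=False)

out = llm("Suggest a name for a coffee shop:", max_tokens=16, temperature=0.8)
print(out["choices"][0]["text"])
# Re-running this script with the same seed, prompt, and parameters should
# reproduce the same text; set seed=-1 to get a different response each run.
```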
Hardware Configuration
GPU Offload
GPU Offload determines how many model layers are moved to the GPU for acceleration.
Value Range: -1 (auto) or 0 - 40+
How It Works:
- -1 (Auto): System automatically detects and allocates GPU optimally
- 0: CPU only (no GPU acceleration)
- 1-40+: Number of model layers processed on GPU
Guidelines:
- Higher values = more GPU usage
- Higher values = faster responses but higher VRAM usage
- Start with -1 (auto) for optimal results
- If VRAM runs out, reduce the value gradually
Recommendations Based on VRAM:
- 4GB VRAM: 10-20 layers (small models)
- 6-8GB VRAM: 20-30 layers (medium models)
- 12GB+ VRAM: 30-40 layers or auto (large models)
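Conceptually, this setting corresponds to the n_gpu_layers option that GGUF runtimes such as llama-cpp-python expose; the sketch below assumes that library and a placeholder model path.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="models/your-model.gguf",
    n_gpu_layers=-1,    # -1 = offload as many layers as possible (auto-like)
    # n_gpu_layers=0,   # CPU only, no GPU acceleration
    # n_gpu_layers=20,  # partial offload for a GPU with limited VRAM
    verbose=False,
)
```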
Context Length
Context Length is the maximum number of tokens the AI can remember in one conversation.
Value Range: 0 (auto) or minimum 512
How It Works:
- Determines the AI's "memory" in conversations
- 1 token ≈ 0.75 words (English) or 1-2 characters (Indonesian)
- Higher values = AI remembers more conversation context
Settings:
- 0: Auto-scaling based on model and system
- 512-2048: Short conversations, simple chatbot
- 2048-4096: Normal conversations with medium context
- 4096-8192: Long conversations, document analysis
- 8192+: Large document analysis, very long context
Considerations:
- Larger context = higher RAM/VRAM usage
- Larger context = slightly longer processing time
- Adjust according to your hardware capacity
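To get a feel for token counts against the context window, here is a sketch assuming the llama-cpp-python library and a placeholder model path:

```python
from llama_cpp import Llama

# Set a 4096-token context window for medium-length conversations.
llm = Llama(model_path="models/your-model.gguf", n_ctx=4096, verbose=False)

prompt = "Summarize the following report: ..."
n_tokens = len(llm.tokenize(prompt.encode("utf-8")))
print(f"{n_tokens} tokens out of a {llm.n_ctx()} token context window")
```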
Batch Size
Batch Size determines how many tokens are processed simultaneously.
Value Range: 32 - 2048 (common: 512)
How It Works:
- Determines the number of tokens batched for parallel computation
- Low Values (32-256): Slower but memory-efficient
- Medium Values (512): Balance of speed and memory (recommended)
- High Values (1024-2048): Faster but requires more memory
Recommendations:
- Start with 512
- If memory is sufficient and you want faster processing, increase to 1024
- If memory is limited, decrease to 256
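A rough way to see the effect of batch size is to time prompt processing at two settings. The sketch below assumes the llama-cpp-python library and a placeholder model path; actual timings vary with your hardware.

```python
import time
from llama_cpp import Llama

# A long prompt so that prompt processing dominates the timing.
prompt = "Review this text: " + "lorem ipsum " * 200

for n_batch in (256, 1024):
    llm = Llama(model_path="models/your-model.gguf",
                n_ctx=2048, n_batch=n_batch, verbose=False)
    start = time.perf_counter()
    llm(prompt, max_tokens=1)
    print(f"n_batch={n_batch}: {time.perf_counter() - start:.2f}s")
```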
CPU Threads
CPU Threads determines how many CPU threads are used for model inference.
Value Range: 1 - your CPU thread count
How It Works:
- More threads = more parallel processing
- Low Values (1-4): For systems with limited CPU
- Medium Values (4-8): Balance for mainstream CPUs
- High Values (8-16+): For high-end CPUs with many cores
Guidelines:
- Don't exceed your CPU's physical thread count
- Recommendation: 50-75% of total CPU threads
- Example: 8-core/16-thread CPU → use 8-12 threads
- Leave threads for operating system and other applications
How to Find Your CPU Thread Count:
- Windows: Task Manager → Performance → CPU
- Linux: lscpu or nproc
- macOS: System Information → Hardware
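A small Python sketch of the 50-75% guideline, assuming the llama-cpp-python library and a placeholder model path:

```python
import os
from llama_cpp import Llama

# Use roughly 75% of available threads, leaving headroom for the OS.
total_threads = os.cpu_count() or 4
n_threads = max(1, int(total_threads * 0.75))

llm = Llama(model_path="models/your-model.gguf", n_threads=n_threads, verbose=False)
print(f"Using {n_threads} of {total_threads} CPU threads")
```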
Flash Attention
Flash Attention is a more memory-efficient attention mechanism.
Options: Enabled/Disabled (Toggle)
How It Works:
- Optimized attention algorithm to reduce VRAM usage
- Increases speed on large models
- Allows running larger models with the same VRAM
When to Enable:
- ✅ If your GPU supports it (NVIDIA RTX 30/40 series, AMD RDNA2+)
- ✅ When running large models with limited VRAM
- ✅ To increase inference speed
When to Disable:
- ❌ If experiencing errors or crashes
- ❌ If GPU doesn't support it
- ❌ If model is incompatible with Flash Attention
Notes:
- Not all models support Flash Attention
- If enabled and causing errors, disable this option
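In GGUF runtimes such as recent versions of llama-cpp-python, this corresponds to a flash_attn flag; the sketch below (placeholder model path) simply retries without it if enabling it fails.

```python
from llama_cpp import Llama

try:
    # Try the memory-efficient attention path first.
    llm = Llama(model_path="models/your-model.gguf", n_gpu_layers=-1,
                flash_attn=True, verbose=False)
except Exception:
    # Fall back to standard attention if the GPU or model is incompatible.
    llm = Llama(model_path="models/your-model.gguf", n_gpu_layers=-1,
                flash_attn=False, verbose=False)
```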
Use Mmap
Use Mmap (Memory Mapping) loads the model by mapping the file directly into memory instead of copying it, which speeds up model loading.
Options: Enabled/Disabled (Toggle)
How It Works:
- Maps model file directly to virtual memory
- Reduces application startup time
- Operating system manages model loading efficiently
Benefits:
- ✅ Faster model loading
- ✅ More responsive application startup
- ✅ Memory efficiency on some systems
Considerations:
- Requires sufficient virtual memory space
- On some systems, may not provide significant benefits
- Keep enabled unless experiencing issues
Recommendations:
- Enable by default for optimal performance
- Disable if experiencing errors during model loading
- Disable if system has very limited memory
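In llama-cpp-python, for example, memory mapping is controlled by a use_mmap flag that is on by default; a minimal sketch with a placeholder model path:

```python
from llama_cpp import Llama

# Default behaviour: map the model file into virtual memory.
llm = Llama(model_path="models/your-model.gguf", use_mmap=True, verbose=False)

# If mmap causes loading problems, read the whole file into RAM instead:
# llm = Llama(model_path="models/your-model.gguf", use_mmap=False, verbose=False)
```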