Model Configuration

Complete guide to AI model settings and optimization

This page explains the various AI model configuration parameters available in Sapientia. Each setting can be customized to optimize performance, creativity, and resource usage according to your needs.


Base Models

AI Model

AI Model is the primary language model used to generate responses. Sapientia supports models in GGUF format that can be run locally.

How It Works:

  • This model acts as the "brain" of your AI assistant
  • Processes questions and generates relevant answers
  • Larger models are generally more accurate in their responses

How to Use:

  1. Click the "Change Model" button to select a model file from your local storage
  2. Select a file with .gguf format
  3. The system will display model compatibility information
  4. Click "Apply" to apply the changes

Tips:

  • Choose a model that matches your computer's RAM/VRAM capacity
  • Smaller models are faster but may be less accurate
  • Larger models are more accurate but require more resources
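
If you want to reproduce the same setup from a script, a local runtime such as llama-cpp-python can load the same .gguf file. This is a minimal sketch under that assumption; the model path is a placeholder, and Sapientia's own backend may differ:

```python
# Minimal sketch using llama-cpp-python; the model path is a placeholder.
from llama_cpp import Llama

llm = Llama(model_path="models/my-model.gguf")  # load the GGUF "brain"
out = llm.create_completion("Explain what a GGUF file is.", max_tokens=128)
print(out["choices"][0]["text"])
```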

Embedding Model

Embedding Model is a specialized model that converts text into vector representations for semantic search and RAG (Retrieval-Augmented Generation) capabilities.

How It Works:

  • Converts text into numerical vectors
  • Enables AI to understand context and meaning of text
  • Used to search for relevant information from the knowledge base

How to Use:

  1. Click the "Change Model" button in the Embedding Model section
  2. Select an embedding model file in .gguf format
  3. Ensure the model supports the appropriate embedding dimensions (default: 2048)
  4. Click "Apply" to apply the changes

Tips:

  • Embedding models are typically smaller than the main AI model
  • A good embedding model improves document search accuracy
  • Choose a model that supports the languages you use
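
To illustrate what the embedding model contributes, the sketch below (again assuming a llama-cpp-python backend and a placeholder model path) converts two sentences into vectors and compares them with cosine similarity, which is essentially what a knowledge-base search does:

```python
# Sketch: turn text into vectors with a GGUF embedding model (path is a placeholder).
import math
from llama_cpp import Llama

emb_model = Llama(model_path="models/my-embedding.gguf", embedding=True)

def embed(text):
    # create_embedding returns one vector per input text
    return emb_model.create_embedding(text)["data"][0]["embedding"]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Semantically related sentences should score higher than unrelated ones
print(cosine(embed("How do I reset my password?"),
             embed("Steps to recover a forgotten account password")))
```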

Inference Parameters

Temperature

Temperature controls the level of creativity and randomness in AI responses.

Value Range: 0.00 - 1.00

How It Works:

  • Low Values (0.0 - 0.3): More consistent and predictable responses. Suitable for tasks requiring high accuracy such as code, mathematics, or facts.
  • Medium Values (0.4 - 0.7): Balance between creativity and consistency. Ideal for general conversations.
  • High Values (0.8 - 1.0): More creative and varied responses. Suitable for creative writing, brainstorming, or idea exploration.

Usage Examples:

  • Temperature 0.1: "Calculate 2+2" → Always "4"
  • Temperature 0.8: "Tell me about dogs" → Various creative stories
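
The effect is easiest to see on a toy distribution. The sketch below uses plain Python and made-up scores for three candidate words: a low temperature concentrates almost all probability on the top word, while a higher temperature spreads it out:

```python
import math

def softmax_with_temperature(logits, temperature):
    # Lower temperature -> sharper distribution; higher -> flatter
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate words
print(softmax_with_temperature(logits, 0.1))  # nearly all mass on the first word
print(softmax_with_temperature(logits, 1.0))  # noticeably more spread out
```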

Top-P (Nucleus Sampling)

Top-P controls response diversity by limiting word choices based on cumulative probability.

Value Range: 0.00 - 1.00

How It Works:

  • The model selects from the smallest set of words whose cumulative probability reaches the Top-P value
  • Low Values (0.1 - 0.5): More focused and on-topic responses. AI selects from the most probable words.
  • High Values (0.6 - 1.0): More exploratory and diverse responses. AI considers more word choices.

Recommendations:

  • Use 0.9 - 1.0 for general conversations
  • Use 0.5 - 0.7 for specific tasks
  • Use < 0.5 for highly deterministic output
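
To make the cumulative cutoff concrete, the sketch below (plain Python, made-up probabilities) keeps the smallest set of words that reaches the Top-P value:

```python
def top_p_filter(probs, top_p):
    # Keep the smallest set of words whose cumulative probability reaches top_p
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for word, p in ranked:
        kept.append(word)
        total += p
        if total >= top_p:
            break
    return kept

probs = {"dog": 0.5, "cat": 0.3, "hamster": 0.15, "xylophone": 0.05}
print(top_p_filter(probs, 0.5))   # ['dog'] - very focused
print(top_p_filter(probs, 0.95))  # ['dog', 'cat', 'hamster'] - more diverse
```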

Top-K

Top-K limits the number of word choices the AI considers at each generation step.

Value Range: 0 - unlimited (practical: 10-100)

How It Works:

  • The model only considers K words with the highest probability
  • Value 0: No limit (all words may be considered)
  • Low Values (10-30): Very safe and predictable responses
  • Medium Values (40-60): Balance between variation and consistency
  • High Values (70-100): More vocabulary variation

Difference from Top-P:

  • Top-K: Fixed number of words limit
  • Top-P: Limit based on cumulative probability (more dynamic)
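
The same made-up distribution shows the difference in one line: Top-K always keeps a fixed number of candidates, no matter how the probability is spread:

```python
def top_k_filter(probs, k):
    # Keep only the k most probable words (k = 0 means no limit)
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in (ranked if k == 0 else ranked[:k])]

probs = {"dog": 0.5, "cat": 0.3, "hamster": 0.15, "xylophone": 0.05}
print(top_k_filter(probs, 2))  # ['dog', 'cat'] - always exactly two candidates
print(top_k_filter(probs, 0))  # all four words - no limit
```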

Min-P

Min-P filters out low-quality word choices based on a minimum probability threshold.

Value Range: 0.00 - 1.00

How It Works:

  • Removes words whose probability falls below the threshold (measured relative to the most probable word)
  • Low Values (0.0 - 0.3): More permissive, allows more variation
  • Medium Values (0.4 - 0.6): Balance between quality and variation
  • High Values (0.7 - 1.0): Strict quality control, only words close in probability to the top choice

Benefits:

  • Prevents AI from choosing nonsensical words
  • Improves response coherence
  • Reduces irrelevant output
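
Assuming Sapientia follows the usual llama.cpp-style definition of Min-P, the threshold is measured relative to the most probable word. A sketch with made-up probabilities:

```python
def min_p_filter(probs, min_p):
    # Keep words whose probability is at least min_p times the top word's probability
    threshold = min_p * max(probs.values())
    return [word for word, p in probs.items() if p >= threshold]

probs = {"dog": 0.5, "cat": 0.3, "hamster": 0.15, "xylophone": 0.05}
print(min_p_filter(probs, 0.05))  # permissive: all four words survive
print(min_p_filter(probs, 0.5))   # strict: only 'dog' and 'cat' remain
```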

Seed

Seed is a value that controls the reproducibility of AI output.

Value Range: -1 or a non-negative integer (0 - 2³¹)

How It Works:

  • -1: Random seed (different response each time for the same question)
  • Specific Number: Fixed seed (identical response for identical input with same parameters)

When to Use:

  • Random Seed (-1): Normal conversations, brainstorming, response variation
  • Fixed Seed: Testing, debugging, demos, reproducible research

Notes:

  • Same seed with different parameters will produce different output
  • Useful for comparing the effects of other parameter changes
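
If you drive the model from a script (the sketch below assumes a llama-cpp-python backend and a placeholder model path), a fixed seed combined with identical parameters should reproduce the same completion:

```python
# Sketch: reproducible output with a fixed seed (model path is a placeholder)
from llama_cpp import Llama

def generate(seed):
    llm = Llama(model_path="models/my-model.gguf", seed=seed)
    out = llm.create_completion("Name one planet.", max_tokens=16, temperature=0.8)
    return out["choices"][0]["text"]

print(generate(42) == generate(42))  # expected True: same seed, same parameters
print(generate(42) == generate(7))   # usually False: different seeds
```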

Hardware Configuration

GPU Offload

GPU Offload determines how many model layers are moved to the GPU for acceleration.

Value Range: -1 (auto) or 0 - 40+

How It Works:

  • -1 (Auto): System automatically detects and allocates GPU optimally
  • 0: CPU only (no GPU acceleration)
  • 1-40+: Number of model layers processed on GPU

Guidelines:

  • Higher values = more GPU usage
  • Higher values = faster responses but higher VRAM usage
  • Start with -1 (auto) for optimal results
  • If VRAM runs out, reduce the value gradually

Recommendations Based on VRAM:

  • 4GB VRAM: 10-20 layers (small models)
  • 6-8GB VRAM: 20-30 layers (medium models)
  • 12GB+ VRAM: 30-40 layers or auto (large models)
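
In llama.cpp-based tooling this setting corresponds to an n_gpu_layers value; the sketch below shows the equivalents (the model path is a placeholder, and Sapientia may wire this differently):

```python
from llama_cpp import Llama

# -1 = offload as many layers as the runtime can, 0 = CPU only,
# a positive number = offload exactly that many layers to the GPU
llm = Llama(model_path="models/my-model.gguf", n_gpu_layers=20)
```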

Context Length

Context Length is the maximum number of tokens the AI can remember in one conversation.

Value Range: 0 (auto) or minimum 512

How It Works:

  • Determines the AI's "memory" in conversations
  • 1 token ≈ 0.75 words (English) or 1-2 characters (Indonesian)
  • Higher values = AI remembers more conversation context

Settings:

  • 0: Auto-scaling based on model and system
  • 512-2048: Short conversations, simple chatbot
  • 2048-4096: Normal conversations with medium context
  • 4096-8192: Long conversations, document analysis
  • 8192+: Large document analysis, very long context

Considerations:

  • Larger context = higher RAM/VRAM usage
  • Larger context = slightly longer processing time
  • Adjust according to your hardware capacity
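
As a rough sanity check, you can see how much of the window a prompt consumes by tokenizing it. The sketch below assumes a llama-cpp-python backend with a placeholder model path:

```python
from llama_cpp import Llama

# Reserve a 4096-token context window
llm = Llama(model_path="models/my-model.gguf", n_ctx=4096)

prompt = "Summarize the attached report in three bullet points."
n_tokens = len(llm.tokenize(prompt.encode("utf-8")))
print(f"{n_tokens} tokens used out of 4096")
```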

Batch Size

Batch Size determines how many tokens are processed simultaneously.

Value Range: 32 - 2048 (common: 512)

How It Works:

  • Determines the number of tokens batched for parallel computation
  • Low Values (32-256): Slower but memory-efficient
  • Medium Values (512): Balance of speed and memory (recommended)
  • High Values (1024-2048): Faster but requires more memory

Recommendations:

  • Start with 512
  • If memory is sufficient and you want faster processing, increase to 1024
  • If memory is limited, decrease to 256
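
In llama.cpp-based tooling this corresponds to the n_batch parameter (a sketch, with a placeholder model path):

```python
from llama_cpp import Llama

# n_batch controls how many prompt tokens are processed per step;
# 512 is a safe default, raise it only if memory allows
llm = Llama(model_path="models/my-model.gguf", n_batch=512)
```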

CPU Threads

CPU Threads determines how many CPU threads are used for model inference.

Value Range: 1 - your CPU thread count

How It Works:

  • More threads = more parallel processing
  • Low Values (1-4): For systems with limited CPU
  • Medium Values (4-8): Balance for mainstream CPUs
  • High Values (8-16+): For high-end CPUs with many cores

Guidelines:

  • Don't exceed your CPU's total logical thread count
  • Recommendation: 50-75% of total CPU threads
  • Example: 8-core/16-thread CPU → use 8-12 threads
  • Leave threads for operating system and other applications

How to Find Your CPU Thread Count:

  • Windows: Task Manager → Performance → CPU
  • Linux: lscpu or nproc
  • macOS: System Information → Hardware
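
A small script can pick a sensible thread count automatically. The sketch below follows the 50-75% rule of thumb above and assumes a llama-cpp-python backend with a placeholder model path:

```python
import os
from llama_cpp import Llama

logical_threads = os.cpu_count() or 4            # total logical threads on this machine
n_threads = max(1, int(logical_threads * 0.75))  # leave headroom for the OS

llm = Llama(model_path="models/my-model.gguf", n_threads=n_threads)
print(f"Using {n_threads} of {logical_threads} threads")
```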

Flash Attention

Flash Attention is a more memory-efficient attention mechanism.

Options: Enabled/Disabled (Toggle)

How It Works:

  • Optimized attention algorithm to reduce VRAM usage
  • Increases speed on large models
  • Allows running larger models with the same VRAM

When to Enable:

  • ✅ If your GPU supports it (NVIDIA RTX 30/40 series, AMD RDNA2+)
  • ✅ When running large models with limited VRAM
  • ✅ To increase inference speed

When to Disable:

  • ❌ If experiencing errors or crashes
  • ❌ If GPU doesn't support it
  • ❌ If model is incompatible with Flash Attention

Notes:

  • Not all models support Flash Attention
  • If enabled and causing errors, disable this option
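
Recent llama-cpp-python builds expose this as a flash_attn flag; the sketch below assumes that backend and a placeholder model path (Sapientia may only expose the toggle in its UI):

```python
from llama_cpp import Llama

# Enable the optimized attention path; fall back to the default attention
# (flash_attn=False) if your GPU or model rejects it
llm = Llama(model_path="models/my-model.gguf", n_gpu_layers=-1, flash_attn=True)
```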

Use Mmap

Use Mmap (Memory Mapping) loads the model by mapping the file directly into memory instead of reading it all up front, which speeds up model loading.

Options: Enabled/Disabled (Toggle)

How It Works:

  • Maps model file directly to virtual memory
  • Reduces application startup time
  • Operating system manages model loading efficiently

Benefits:

  • ✅ Faster model loading
  • ✅ More responsive application startup
  • ✅ Memory efficiency on some systems

Considerations:

  • Requires sufficient virtual memory space
  • On some systems, may not provide significant benefits
  • Keep enabled unless experiencing issues

Recommendations:

  • Enable by default for optimal performance
  • Disable if experiencing errors during model loading
  • Disable if system has very limited memory
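
In llama.cpp-based tooling this corresponds to the use_mmap flag (a sketch, with a placeholder model path):

```python
from llama_cpp import Llama

# use_mmap=True maps the file into memory for faster loading;
# set it to False if model loading fails or memory is extremely tight
llm = Llama(model_path="models/my-model.gguf", use_mmap=True)
```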