Tips and Tricks
A practical guide to optimizing the performance and user experience of the Sapientia application.
RAM Usage Optimization
Reducing Context Length
Reducing context length is an effective way to conserve RAM and increase inference speed:
- Low Context Length (2048-4096 tokens): Suitable for short chats, Q&A, or systems with limited RAM (4-6 GB)
- Medium Context Length (8192-16384 tokens): Ideal for long conversations and document analysis with 8-12 GB of RAM
- High Context Length (32768+ tokens): For large-document processing and long multi-turn conversations; requires 16 GB of RAM or more
Most of the savings come from the KV cache, which grows linearly with context length, so a shorter context both cuts memory consumption and speeds up responses.
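To see the effect, here is a back-of-the-envelope sketch in Python of how KV-cache size scales with context length. The layer count, KV-head count, and head dimension below are illustrative values for a Llama-3-8B-style model, not Sapientia-specific figures, and an fp16 cache (2 bytes per element) is assumed.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV-cache size: one K and one V entry per layer per position."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

# Illustrative geometry: 32 layers, 8 KV heads, head dimension 128, fp16 cache
for n_ctx in (2048, 8192, 32768):
    gib = kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128) / 2**30
    print(f"n_ctx={n_ctx:>6}: ~{gib:.2f} GiB for the KV cache")
```

Under these assumptions, dropping from a 32768-token context to 4096 tokens shrinks the cache from about 4 GiB to 0.5 GiB, on top of the fixed cost of the model weights.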
Choosing the Right AI Model
Based on Use Case
General Purpose & Chat:
- Choose conversational models like Gemma, Llama, Mistral, or Qwen
- Prioritize instruction-tuned variants, which follow conversational prompts more reliably
Coding & Development:
- Use coding-specific models like CodeLlama, DeepSeek-Coder, or Qwen-Coder
- Prefer models with a larger context length when working with large codebases
Analysis & Reasoning:
- Choose models with strong reasoning capabilities
- Consider larger models for the best accuracy on complex reasoning tasks
Multilingual:
- Select models that explicitly support your target language
- Models like Gemma, Qwen, or Aya are known for strong multilingual support
Model Selection Tips
- Test multiple quantizations: Start with Q4 and move up to Q5, Q6, or Q8 if RAM allows
- Monitor performance metrics: Pay attention to time to first token (TTFT) and tokens per second (TPS) to judge whether a model suits your hardware; see the measurement sketch after this list
- Consider tradeoffs: Larger models are generally more accurate, but slower and more resource-hungry
- Update regularly: Newer models often deliver better quality at the same size
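If you want to reproduce TTFT and TPS measurements outside the application, here is a minimal sketch using llama-cpp-python. The library choice, the model path, and the prompt are all assumptions for illustration, not part of Sapientia itself.

```python
import time
from llama_cpp import Llama  # assumption: llama-cpp-python is installed

# Hypothetical path; point this at any local GGUF model file
llm = Llama(model_path="models/example-Q4_K_M.gguf", n_ctx=4096, verbose=False)

start = time.perf_counter()
first_token_time = None
n_tokens = 0
for _chunk in llm("Explain context length in one sentence.",
                  max_tokens=128, stream=True):
    if first_token_time is None:
        first_token_time = time.perf_counter()  # first streamed token: TTFT
    n_tokens += 1
end = time.perf_counter()

ttft = first_token_time - start
# TPS over the generation phase only, excluding prompt processing
tps = (n_tokens - 1) / (end - first_token_time) if n_tokens > 1 else float("nan")
print(f"TTFT: {ttft:.2f} s, TPS: {tps:.1f} tokens/s")
```

A high TTFT usually points to prompt-processing or model-loading overhead, while a low TPS suggests the model is too large, or quantized at too high a precision, for the hardware.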