Tips and Tricks

A practical guide to optimizing performance and user experience in the Sapientia application.

RAM Usage Optimization

Reducing Context Length

Reducing context length is an effective way to conserve RAM and increase inference speed:

  • Low Context Length (2048-4096 tokens): Suitable for short chats, Q&A, or systems with limited RAM (4-6 GB)
  • Medium Context Length (8192-16384 tokens): Ideal for long conversations and document analysis with 8-12 GB RAM
  • High Context Length (32768+ tokens): For large document processing and multi-turn conversations; requires 16 GB RAM or more

A shorter context window shrinks the KV cache the model keeps in memory, which significantly decreases memory consumption and speeds up response time.
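To see why the RAM tiers above line up with context length, note that the KV cache grows linearly with the number of tokens in context. The sketch below estimates its size for a hypothetical Llama-style 7B model at fp16; the layer, head, and precision figures are illustrative assumptions, not Sapientia internals:

  def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=32,
                     head_dim=128, bytes_per_elem=2):
      """Rough KV-cache size for an assumed Llama-2-7B-style model
      at fp16. The factor of 2 accounts for storing both keys and
      values at every layer."""
      return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

  for n_ctx in (2048, 4096, 8192, 16384, 32768):
      gib = kv_cache_bytes(n_ctx) / 2**30
      print(f"context {n_ctx:>6}: ~{gib:.1f} GiB of KV cache")

Under these assumptions the cache doubles with every doubling of context (roughly 1 GiB at 2048 tokens, 16 GiB at 32768), which is why the highest tier calls for 16 GB of RAM before model weights are even counted.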

Choosing the Right AI Model

Based on Use Case

General Purpose & Chat:

  • Choose conversational models like Gemma, Llama, Mistral, or Qwen
  • Prioritize instruction-tuned variants

Coding & Development:

  • Use coding-specific models like CodeLlama, DeepSeek-Coder, or Qwen-Coder
  • Prefer models with a larger context length when working with large codebases

Analysis & Reasoning:

  • Choose models with strong reasoning capabilities
  • Consider larger models for optimal accuracy

Multilingual:

  • Select models that explicitly support your target language
  • Models like Gemma, Qwen, or Aya have better multilingual support

Model Selection Tips

  1. Test various quantizations: Start with Q4 and move up only if RAM allows; a rough sizing formula appears in the first sketch after this list
  2. Monitor performance metrics: Track time to first token (TTFT) and tokens per second (TPS) to judge whether a model suits your hardware; see the measurement sketch below
  3. Consider tradeoffs: Larger models are generally more accurate but slower and require more resources
  4. Update regularly: Newer models often deliver better quality at the same size
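
For tip 1, a back-of-the-envelope way to check whether a quantization fits your RAM is to multiply the parameter count by the effective bits per weight. The bits-per-weight values below are approximate figures for common GGUF quantization levels, used here only as an illustration; exact file sizes vary by quantization variant:

  # Approximate effective bits per weight (illustrative values).
  BITS_PER_WEIGHT = {"Q4": 4.5, "Q5": 5.5, "Q6": 6.6, "Q8": 8.5, "F16": 16.0}

  def weight_gib(n_params_billion, quant):
      """Estimated RAM needed for model weights alone, in GiB."""
      bits = BITS_PER_WEIGHT[quant]
      return n_params_billion * 1e9 * bits / 8 / 2**30

  for quant in ("Q4", "Q5", "Q8", "F16"):
      print(f"7B model at {quant}: ~{weight_gib(7, quant):.1f} GiB")

Add the KV-cache estimate from the earlier sketch, plus headroom for the operating system, to judge whether a given model and quantization will fit.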
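For tip 2, TTFT and TPS can be measured against any backend that streams tokens. This sketch is backend-agnostic: generate_stream is a hypothetical stand-in for whatever streaming call your setup exposes, not a Sapientia API:

  import time

  def measure(stream):
      """Measure TTFT and decode-phase TPS over any iterator
      that yields generated tokens one at a time."""
      start = time.perf_counter()
      first = None
      count = 0
      for _token in stream:
          if first is None:
              first = time.perf_counter()  # first token arrived
          count += 1
      end = time.perf_counter()
      ttft = (first - start) if first is not None else float("inf")
      # (count - 1) / elapsed excludes prefill, so this is pure
      # decode throughput rather than end-to-end speed.
      tps = (count - 1) / (end - first) if count > 1 else 0.0
      return ttft, tps

  # Usage (generate_stream is hypothetical):
  # ttft, tps = measure(generate_stream("Explain KV caches in one line."))
  # print(f"TTFT: {ttft:.2f}s, TPS: {tps:.1f}")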