What are the recommended settings for this model?

#7
by Arete7 - opened


Yeah, I'd also be interested. Not sure how to achieve this (with llama.cpp?):

About 16.5 GB VRAM with Q4_K_M quantization
Full 262K context with no compromises

I can only fit about 220k context into 24 GB with a q8_0 KV cache.

24 GB of VRAM here across two 3060s; I can only run 180k context, and more results in an OOM at runtime on demanding coding tasks:

ikllama/llama-server --host 0.0.0.0 -sm graph --jinja -m ~/models/Qwopus3.5-27B.Q4_K_M.gguf -fa on -ngl 999 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --cache-type-k q8_0 --cache-type-v q4_1 -np 1 -c 181072 -t 20
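For a rough sanity check on how much context fits, KV cache memory grows linearly with context length. A minimal sketch below estimates it from the ggml block sizes (f16 is 2 bytes per element, q8_0 is 34 bytes per 32-element block, q4_1 is 20 bytes per block); the layer/head numbers in the example are hypothetical placeholders, not this model's actual shape, so substitute the values from your GGUF's metadata:

```python
# Rough KV-cache size estimator for llama.cpp-style caches.
# Per-element byte costs come from ggml block layouts:
#   f16  -> 2.0 B/elem
#   q8_0 -> 34 B per 32 elems
#   q4_1 -> 20 B per 32 elems
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_1": 20 / 32}

def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim,
                   k_type="q8_0", v_type="q4_1"):
    """Total bytes for the K and V caches at a given context length."""
    # Elements stored per token, per cache (K or V).
    per_token = n_layers * n_kv_heads * head_dim
    return n_ctx * per_token * (BYTES_PER_ELEM[k_type] + BYTES_PER_ELEM[v_type])

if __name__ == "__main__":
    # HYPOTHETICAL 27B-class shape: 48 layers, 8 KV heads (GQA), head_dim 128.
    gib = kv_cache_bytes(181072, 48, 8, 128, "q8_0", "q4_1") / 2**30
    print(f"~{gib:.1f} GiB KV cache at 181k context")
```

This also shows why mixing `--cache-type-k q8_0` with `--cache-type-v q4_1`, as in the command above, buys extra context: V at q4_1 costs roughly 0.625 bytes per element versus 2 bytes at f16.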
