What are the recommended settings for this model?

#7
by Arete7 - opened


Yeah, I'd also be interested. Not sure how to achieve this (with llama.cpp?):

About 16.5 GB VRAM with Q4_K_M quantization
Full 262K context with no compromises

I can only fit about 220k context into 24 GB with a q8_0 KV cache.

24 GB of VRAM here across two 3060s; I can only run 180k context, and more results in an OOM at runtime on demanding coding tasks:

ikllama/llama-server --host 0.0.0.0 -sm graph --jinja -m ~/models/Qwopus3.5-27B.Q4_K_M.gguf -fa on -ngl 999 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --cache-type-k q8_0 --cache-type-v q4_1 -np 1 -c 181072 -t 20
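For a rough sanity check on how much context fits, KV cache memory grows linearly with context length. A minimal sketch below estimates it from the ggml block sizes (f16 is 2 bytes per element, q8_0 is 34 bytes per 32-element block, q4_1 is 20 bytes per block); the layer/head numbers in the example are hypothetical placeholders, not this model's actual shape, so substitute the values from your GGUF's metadata:

```python
# Rough KV-cache size estimator for llama.cpp-style caches.
# Per-element byte costs come from ggml block layouts:
#   f16  -> 2.0 B/elem
#   q8_0 -> 34 B per 32 elems
#   q4_1 -> 20 B per 32 elems
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_1": 20 / 32}

def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim,
                   k_type="q8_0", v_type="q4_1"):
    """Total bytes for the K and V caches at a given context length."""
    # Elements stored per token, per cache (K or V).
    per_token = n_layers * n_kv_heads * head_dim
    return n_ctx * per_token * (BYTES_PER_ELEM[k_type] + BYTES_PER_ELEM[v_type])

if __name__ == "__main__":
    # HYPOTHETICAL 27B-class shape: 48 layers, 8 KV heads (GQA), head_dim 128.
    gib = kv_cache_bytes(181072, 48, 8, 128, "q8_0", "q4_1") / 2**30
    print(f"~{gib:.1f} GiB KV cache at 181k context")
```

This also shows why mixing `--cache-type-k q8_0` with `--cache-type-v q4_1`, as in the command above, buys extra context: V at q4_1 costs roughly 0.625 bytes per element versus 2 bytes at f16.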
