What models are you all using? #2673

stuckinsnow · 2026-01-18T11:13:33Z

I know that this isn't directly related to CodeCompanion... but now that I hear Claude is banning people and whatnot, I'm curious what LLMs you're all using at the moment?

olimorris · 2026-01-18T11:48:28Z

Maintainer

This is absolutely the type of discussion that I want to encourage on here. Also, I'm keen to understand how people are developing with LLMs these days.

I'll go first:

For general, low-cost, low-impact conversations I use claude-haiku-4.5 via Copilot. It costs 1/3 of a premium request and seems pretty solid. It can function call reliably but it does seem to ignore my asks not to use H1 and H2 headers.

If I'm implementing a new feature (#2600), I enjoy using Claude Code to bounce ideas off and perhaps do an initial implementation. I've been using sonnet-4.5 and opus-4.5. I will write a blog post on how I've used LLMs to accomplish this (similar to this awesome post). Alas, I find there are still a lot of traps that these agents leave behind. That's why I have a rule whereby an agent can implement a feature OR a test but never both. And I always start out riffing on what we'll implement and how it should look from an API perspective. A favorite prompt of mine: "Ask me some questions and challenge me on the plan".

I also really like Gemini. I've used that quite a lot to create and edit the prompts in my prompt library. And occasionally use it for coding tasks to see how it performs. I will do more testing in the coming month.

I've given up using self-hosted LLMs from Ollama. Never found their output to be useful and function calling still feels hit and miss.

Diddie029 · 2026-01-21T17:48:16Z

I’m currently using GPT-5-mini via the OpenAI API for text generation

pixlmint · 2026-01-21T21:54:56Z

Sponsor

I mostly like to play around with local models. They aren't exactly building my next unicorn SaaS, but I just enjoy finding the limits. I have had the most success with CodeCompanion's inline mode (highlight a few lines, call `:CodeCompanion add a comment for this code`). This gave me usable output more often than not. I find that tool calling does work, but having all this info tends to overwhelm lower-parameter models. I mostly use qwen3-coder:30b, but I have just recently downloaded Z.ai's newly released glm-4.7-flash (a 30b model), and I'm looking forward to giving that a spin since I've read a lot of positives about this particular model.

7 replies

Witiko May 8, 2026

Thanks a lot! Using the qwen3-coder:30b Ollama model with a 16k-token context window seems to improve things noticeably:

witiko@witiko-G5-5590:~$ docker exec -it ollama ollama run qwen3-coder:30b
>>> /set parameter num_ctx 16384
Set parameter 'num_ctx' to '16384'
>>> /save qwen3-coder:30b
Created new model 'qwen3-coder:30b'
>>> 
witiko@witiko-G5-5590:~$ docker exec -it ollama ollama stop qwen3-coder:30b

This is the largest context window I was able to comfortably fit into 6 GB VRAM and 32 GB RAM on a Dell G5 15 laptop, while still keeping a browser open and avoiding swapping. I am looking forward to seeing how capable this model turns out to be on more complex tasks.
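
For anyone who would rather not overwrite the model tag, the same `num_ctx` value can also be passed per request through the `options` field of Ollama's REST API. The following is only a minimal sketch: it assumes Ollama is reachable on its default address `http://localhost:11434` (with Docker, the port has to be published), and the helper names `build_generate_request` and `generate` are made up for illustration.

```python
import json
import urllib.request

# Assumption: Ollama's default address; adjust if the container maps another port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str, num_ctx: int) -> urllib.request.Request:
    """Build a POST request for /api/generate with a custom context window."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for a single JSON object instead of a stream
        "options": {"num_ctx": num_ctx},  # per-request context size
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, num_ctx: int = 16384) -> str:
    """Send the request and return the generated text (needs a running Ollama)."""
    request = build_generate_request(model, prompt, num_ctx)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Calling `generate("qwen3-coder:30b", "Explain this function")` would then use a 16k-token context for that request only, leaving the stored model parameters untouched.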

pixlmint May 8, 2026
Sponsor

That's great, glad it's working for you. Nowadays, though, I mainly use gemma4:26b. It's super helpful, especially for answering questions, and its tool use is also improved over qwen3. You might also want to look into qwen3.6, as that's much newer than qwen3-coder.

Witiko May 8, 2026

Thanks for the tips. I am currently playing with gemma4:e2b with the full 128K token context, which is blazingly fast and also fits comfortably into my VRAM and RAM. It also seems able to run tools, although I haven't given it any big tasks yet.

`gemma4:e2b` with the full 128K token context

I also tried both gemma4:26b and qwen3.6:27b with a 16K-token context, but they are both considerably slower than gemma4:e2b, and […] specifically seems to be struggling with tool use, potentially due to the small context window.

`gemma4:26b` with 16K token context

`qwen3.6:27b` with 16K token context

Besides CodeCompanion, I am also experimenting with OpenCode, which seems to provide a significantly better experience outside a text editor.

pixlmint May 9, 2026
Sponsor

What quantization did you get? I've also got 32 GB RAM + 12 GB VRAM and can easily run q4 models with a 60k context window.

Witiko May 10, 2026

Likely none. The official gemma4:e2b Ollama model doesn't seem to be quantized, but I also tried the batiai/gemma4-e2b:q4 one, which fits almost entirely into the 6 GB of VRAM even at the full 128K-token context, with no obvious degradation in quality.
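
As a rough back-of-the-envelope check on why quantization matters for fitting a model into VRAM, the weights alone take roughly parameters × bits per weight / 8 bytes. The sketch below works under that assumption only; the 30B parameter count is borrowed from the qwen3-coder:30b tag mentioned earlier in the thread purely as an illustration, and the function name `weight_memory_gb` is made up.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


# Illustrative figures: a 30B-parameter model at two precisions.
print(weight_memory_gb(30e9, 16))  # 16-bit weights -> 60.0
print(weight_memory_gb(30e9, 4))   # 4-bit weights  -> 15.0
```

Real quantized formats also store scaling data, and the context (KV cache) needs memory on top, so actual usage will be higher than this estimate.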

ElliottLester · 2026-03-08T16:01:09Z

I have been using Qwen3-Coder-Next-IQ4_NL locally on a laptop with llama.cpp (it requires about 46GB of VRAM) but I can actually get work done in chat mode.

I have all the read tools enabled by default (local models shouldn't leak data)

It's able to do edits and pair program with me, it's not Claude but it's great for local edits and scripts and clean up.

List of things it has done:

- Pull shader uniform buffer defs out to a separate file and update CMake to call glslc to use that header. I just asked it if I could stop copy-pasting it into every shader.
- Refactor some Python to use HTTP sessions to fix retry issues.

I realize these are toy cases, but it was able to call and use tools to do the operations by itself and it didn't completely mess everything up, which is new for local models.

1 reply

olimorris Mar 9, 2026
Maintainer

> it requires about 46GB of VRAM

😱

> It's able to do edits and pair program with me, it's not Claude but it's great for local edits and scripts and clean up.
>
> I realize these are toy cases but it was able to call and use tools to do the operations by itself and it didn't completely mess everything up, which is new for local models.

That's really good to hear and the fact you could do these locally is a great sign. The function calling being reliable is a huge step forward.

friesentyler · 2026-03-23T03:50:49Z

I've been playing around with the local Ollama configuration (running qwen2.5 32b) and have noticed issues with tool calling in the chat window. The inline code editor is somewhat spotty: if there is no code in the file it seems to struggle, but if I am simply asking it to edit an existing file it works fine.

Can anyone describe their workflow with local models using Ollama? I'm just trying to understand what it is and is not capable of (obviously agent mode does not work, as evidenced by this thread).

5 replies

pixlmint Mar 23, 2026
Sponsor

qwen2.5 is pretty outdated at this point. You might want to give glm4.7-flash a try if your setup can handle it; I've found it quite capable even using the agent toolkit. Though I still prefer to just use the chat without any tools.

friesentyler Mar 23, 2026

Thank you. Yeah, I just upgraded to qwen3.5 9b actually, and I've gotten quite a bit of improvement. I had it write a basic Flask application that wrapped some calculator functionality I had built in a separate file. It didn't have any issues making tool calls for that when I talked to it in the chat window (although that was a pretty trivial task). I've found that the 9b model performs much more acceptably in terms of tokens/sec: the 27b model was running at 7 tokens/sec and the 9b model at 23 tokens/sec. I'm using an M1 MacBook Pro with the Max chip and 64 GB of RAM.
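
To put those two rates into perspective, here is a small sketch of how long a reply takes at a steady decode speed. The 500-token response length is a hypothetical figure, not something measured in this thread, and prompt processing time is ignored.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to generate num_tokens at a steady decode rate (prompt processing ignored)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


RESPONSE_TOKENS = 500  # hypothetical response length, chosen for illustration

# Rates quoted above: 7 tokens/sec for the 27b model, 23 tokens/sec for the 9b model.
for label, rate in [("27b", 7.0), ("9b", 23.0)]:
    print(f"{label}: {generation_seconds(RESPONSE_TOKENS, rate):.0f} s")
# 27b: 71 s
# 9b: 22 s
```

At these rates the same reply arrives in about a third of the time, which matters a lot once tool calls add several round trips per task.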

pixlmint Mar 23, 2026
Sponsor

I really like using qwen3-coder:30b for inline CodeCompanion, because it's very capable, and feels pretty quick on my hardware, but for anything more complex than these single-shot requests I prefer using some type of reasoning model. As such, gpt-oss:20b is also pretty helpful, especially for non-agentic tasks (like asking it questions about code for example)

friesentyler Mar 23, 2026

What type of reasoning model would you suggest?

pixlmint Mar 24, 2026
Sponsor

Pretty much any of the ones already mentioned: gpt-oss, qwen3.5, and glm4.7-flash are all good options.

yegorich · 2026-06-03T10:25:38Z

I am playing with Gemma4:e4b on a Radeon 780M with 16 GB VRAM and 32 GB RAM. llama.cpp acts as the backend, as Ollama doesn't really support Gemma4 on Vulkan. My problem is that in this setup even simple features such as #buffer are not working. I have to attach my source files via the /files command.

2 replies

olimorris Jun 3, 2026
Maintainer

It should be `#{buffer}`, and it's not dependent on the adapter or LLM.

yegorich Jun 4, 2026

Thanks. It is working now.

Replies: 6 comments · 15 replies
