Hermes 3 vs Gemma 4 vs Qwen3: I A/B tested 6 tasks on a 16GB Mac mini


A month ago I bought a $599 Mac mini M4 and started pulling models with Ollama. After a few weeks of running production daemons that needed actual decisions made — grade this job application, draft this cover letter, write this wiki chapter — I had a question: which model for which job?

I ran six A/B tests on real data. Here are the results, the surprises, and the pattern I think generalizes.

The constraint

16GB RAM. OLLAMA_MAX_LOADED_MODELS=1. One model hot at a time. So model choice actually matters — you can’t keep all of them loaded and pick on the fly.

My candidate models:

  • gemma4:e4b — 4.5B effective params, ~30 tok/s. Marketed as general-purpose. Strong on prose.
  • hermes3:8b — ~22 tok/s. Agentic-tuned, supposed to be good at JSON output.
  • qwen3:14b — ~11 tok/s. Bigger, more reasoning headroom.
  • qwen3.5:35b — UD-IQ3_XXS quant, ~17 tok/s via llama.cpp --mmap trick. Heavy tier.

Same machine, same prompts, same real tasks from my production pipelines. No synthetic benchmarks.
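For anyone wanting to reproduce tok/s figures like the ones above: Ollama's non-streaming responses include timing fields that make this a one-liner. A minimal sketch, assuming the `eval_count` (generated tokens) and `eval_duration` (nanoseconds) fields of an `/api/generate` response; the helper name and the sample numbers are illustrative, not measurements from these tests.

```python
def tokens_per_second(response: dict) -> float:
    """Decode speed from an Ollama generate/chat response dict.

    Assumes `eval_count` (tokens generated) and `eval_duration`
    (nanoseconds spent generating) are present, as in Ollama's
    non-streaming responses.
    """
    tokens = response["eval_count"]
    nanos = response["eval_duration"]
    if nanos <= 0:
        raise ValueError("eval_duration must be positive")
    return tokens / (nanos / 1e9)


# Illustrative numbers only: 300 tokens generated in 10 seconds.
sample = {"eval_count": 300, "eval_duration": 10_000_000_000}
speed = tokens_per_second(sample)
```

This measures generation speed only. Prompt processing time is reported separately, so wall-clock seconds per job will be higher than tokens divided by tok/s.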

Test 1: Career grading

Task. Given a job description, my profile, and my CV, return yes/no/maybe with a confidence number (0–10). The hard part is borderline calls — easy to over-decide in either direction. A model that flips a real “maybe” to a confident “no” is worse than useless.

Results:

  • hermes3:8b — cautious on borderline. When fit was 5–6/10, it returned “maybe” appropriately. ~50s/job.
  • gemma4:e4b — verdict parity with Hermes but ~1.7× slower (~85s/job). Same answers, longer wait.
  • qwen3:14b — over-decisive. Demoted a real “maybe” job (one I’d actually want to revisit) to “no, confidence 9.” Wrong answer with high confidence. ~140s/job.

Winner: hermes3:8b. Cleanest “Hermes is right for constrained decisions” data point I have.
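A grader like this is easier to compare across models if every reply is validated the same way. The sketch below assumes the model is asked for a JSON object with `verdict` and `confidence` keys; that schema, the function names, and the review threshold are my own illustration, not the pipeline's actual format.

```python
import json

ALLOWED_VERDICTS = {"yes", "no", "maybe"}


def parse_verdict(raw: str) -> dict:
    """Validate a grader reply: verdict in yes/no/maybe, confidence 0-10."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    verdict = str(data.get("verdict", "")).strip().lower()
    confidence = data.get("confidence")
    if verdict not in ALLOWED_VERDICTS:
        raise ValueError(f"unexpected verdict: {verdict!r}")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0 <= confidence <= 10:
        raise ValueError("confidence must be between 0 and 10")
    return {"verdict": verdict, "confidence": confidence}


def needs_review(result: dict, threshold: int = 8) -> bool:
    """Hypothetical policy: a very confident 'no' gets a human look."""
    return result["verdict"] == "no" and result["confidence"] >= threshold
```

A rule like `needs_review` would have surfaced the "no, confidence 9" case for a second look instead of letting it silently drop a job.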

Test 2: Cover letter prose

Task. Given the JD, my CV, and my profile, write a 250-word cover letter and a tailoring memo (verbatim CV bullets to lead with). The “verbatim” rule matters — if the model paraphrases a CV bullet, the recruiter checking against my actual CV won’t find what I quoted.

Results:

  • gemma4:e4b — led with concrete CV metrics. Followed the verbatim rule. Wrote natural-sounding prose, not template-y.
  • hermes3:8b — paraphrased my CV’s Professional Summary instead of leading with metrics. Appended ” at Company X” to bullets that were supposed to be verbatim — broke the rule. Also used bullet lists inside the cover letter body even though the prompt forbade them.
  • qwen3:14b — 2.4× slower than gemma. Generic prose without specific CV grounding. Also deferred answerable questions (“what’s your work authorization?”) to [NEEDS USER INPUT] despite the answer being explicitly in my profile.

Winner: gemma4:e4b. The “Hermes is great at decisions, terrible at open prose” pattern showed up here.
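The verbatim rule is mechanical enough to check in code rather than by eye. A minimal sketch, assuming the tailoring memo's bullets have already been extracted into a list; the function name and the sample CV lines are invented for illustration.

```python
def non_verbatim_bullets(cv_text: str, quoted_bullets: list[str]) -> list[str]:
    """Return the bullets that do not appear word-for-word in the CV.

    Whitespace is normalized on both sides so line wrapping in the CV
    does not cause false alarms; everything else must match exactly.
    """
    normalized_cv = " ".join(cv_text.split())
    failures = []
    for bullet in quoted_bullets:
        normalized = " ".join(bullet.split())
        if not normalized or normalized not in normalized_cv:
            failures.append(bullet)
    return failures


# Invented sample data, not the real CV.
cv = "- Cut build time by 40%\n- Led a team of 5 engineers"
memo = ["Cut build time by 40%", "Led a team of 5 engineers at Company X"]
broken = non_verbatim_bullets(cv, memo)
```

The second bullet fails because of the appended " at Company X", which is the same kind of edit Hermes made.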

Test 3: Wiki orderer (prerequisite graph)

Task. Given 8 ML chapters, return the optimal teaching order — what should you read first, what builds on what. The prompt said “prefer minimal prerequisites — only the directly-needed ones, not the whole transitive closure.”

Results:

  • hermes3:8b — 15.9s. Returned 7 valid prereqs, 100% precision against my ground-truth hand-curated list. Strict adherence to “minimal” rule — didn’t include transitive deps.
  • qwen3:14b — 116s (about 7.3× slower). Over-included transitive deps. Prereqs were technically true but violated the “minimal” rule.

Winner: hermes3:8b. When the prompt has rules like “prefer minimal X” or “use exactly these slugs”, Hermes follows them more strictly than qwen3.
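Both checks in this test can be scored automatically: precision against a hand-curated list, and whether an edge is only a transitive shortcut. A sketch under my own assumptions: prerequisites are represented as `(prerequisite, chapter)` pairs, and the slugs below are made up.

```python
def precision(predicted: set, truth: set) -> float:
    """Share of predicted prerequisite edges that are in the ground truth."""
    if not predicted:
        return 0.0
    return len(predicted & truth) / len(predicted)


def redundant_edges(edges: set) -> set:
    """Edges (a, b) where b is still reachable from a without that edge.

    These are the transitive shortcuts a 'minimal prerequisites' rule forbids.
    """
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)

    def reachable(start, goal, skip):
        stack, seen = [start], {start}
        while stack:
            node = stack.pop()
            for nxt in graph.get(node, ()):
                if (node, nxt) == skip or nxt in seen:
                    continue
                if nxt == goal:
                    return True
                seen.add(nxt)
                stack.append(nxt)
        return False

    return {edge for edge in edges if reachable(edge[0], edge[1], edge)}


# Made-up slugs for illustration.
edges = {
    ("linear-algebra", "backprop"),
    ("backprop", "transformers"),
    ("linear-algebra", "transformers"),  # implied by the two edges above
}
extras = redundant_edges(edges)
```

An answer can score 100% on "technically true" and still fail the minimal rule, which is why the two checks are separate.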

Test 4: Wiki clusterer (group sources)

Task. Given a folder of new sources with literal ugly filenames (2026-04-13-gpt3-few-shot-learners.md), cluster them by topic and return cluster + filenames. Critical: the model must use the LITERAL filename, not a cleaned-up version.

Results:

  • qwen3:14b (with /think default mode) — correctly carried literal filenames through. Right answer.
  • qwen3:14b (with /no_think) — invented readable-looking filenames (“GPT-3 Paper Few-Shot Learners.md”). Would have silently broken my pipeline at the file-existence check.
  • hermes3:8b — filenames correct but dropped 2/8 sources entirely. Better filename fidelity than /no_think, worse coverage than /think.

Winner: qwen3:14b with /think. Counterintuitive lesson: /no_think is a speed optimization, but in this test it broke verbatim copying of ugly identifiers. Worth knowing before you flip it on across a pipeline.
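Both failure modes here (invented filenames and dropped sources) can be caught before anything touches the filesystem. A minimal sketch, assuming the clusterer's reply has been parsed into a dict of cluster name to filename list; that shape, the function name, and the second sample filename are assumptions for illustration.

```python
def check_clusters(input_files: list[str], clusters: dict[str, list[str]]) -> dict[str, list[str]]:
    """Compare returned filenames against the literal input filenames."""
    expected = set(input_files)
    returned = {name for names in clusters.values() for name in names}
    return {
        "invented": sorted(returned - expected),  # names the model made up
        "dropped": sorted(expected - returned),   # sources it never placed
    }


inputs = [
    "2026-04-13-gpt3-few-shot-learners.md",
    "2026-04-14-scaling-laws.md",  # hypothetical second source
]
reply = {"few-shot learning": ["GPT-3 Paper Few-Shot Learners.md"]}
report = check_clusters(inputs, reply)
```

An empty `invented` list with a non-empty `dropped` list is the Hermes result; the reverse is the /no_think result.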

Test 5: Wiki drafter (new chapter from sources)

Task. Given a small cluster of sources, draft a new wiki chapter with structured frontmatter and prose body.

Results:

  • qwen3:14b — 3411-char usable draft with 8 wikilinks, proper frontmatter, body heading present.
  • hermes3:8b — 1869-char malformed draft. No body heading, fragmented prose, frontmatter had prerequisites: [self-slug] (self-referencing), only 1 wikilink.
  • gemma4:e4b — 3 characters total. Wait what.

The gemma failure mode was the most informative. The drafter prompt asked for the chapter wrapped in JSON. Gemma loses track of newline escaping inside JSON strings — generates \"\n \n \n \n forever until num_predict cuts it off. Solution: use ===SECTION=== markdown markers instead of JSON for any multi-paragraph LLM output. With sections, gemma works fine.

Winner: qwen3:14b for this task as-structured. But the deeper lesson: don’t put multi-paragraph prose inside JSON if you use gemma anywhere.
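Parsing marker-delimited output takes only a few lines, and the model never has to escape a newline. A sketch, assuming each marker sits on its own line in the form `===NAME===` with an uppercase name; the exact marker names (`FRONTMATTER`, `BODY`) are hypothetical, not the ones from my prompts.

```python
import re

# A marker is a full line such as ===BODY=== (uppercase letters/underscores).
MARKER = re.compile(r"^===([A-Z_]+)===$")


def parse_sections(text: str) -> dict[str, str]:
    """Split model output into named sections delimited by ===NAME=== lines."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in text.splitlines():
        match = MARKER.match(line.strip())
        if match:
            current = match.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
        # Text before the first marker is ignored.
    return {name: "\n".join(lines).strip() for name, lines in sections.items()}


raw = (
    "===FRONTMATTER===\n"
    "title: Attention\n"
    "===BODY===\n"
    "# Attention\n\nFirst paragraph.\n\nSecond paragraph.\n"
)
draft = parse_sections(raw)
```

Blank lines and quotes inside a section pass through untouched, which is the whole point of avoiding JSON strings for prose.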

Test 6: Daily briefing (heavy tier)

Task. Given 4266 chars of context (22 emails + 10 vault changes), synthesize a morning brief.

Results:

  • qwen3.5:35b on llama.cpp via --mmap — 186s, 1162 chars. Cited specific company names. Accurate counts.
  • qwen3:14b — 50s, 1233 chars. Vaguer (“several emails”). Duplicates. Leaked internal paths.
  • hermes3:8b — 20s, 878 chars. Miscategorized — LPL Financial got labeled a “job alert” when it was actually a financial-services email.

Winner: qwen3.5:35b. Brief runs at night. Accuracy >> latency. Smaller models are fast but make categorical errors on heavy synthesis tasks where the context is packed with many distinct items.

The pattern

There is no single “agent model.” The marketing categories (“agentic-tuned”, “general-purpose”) don’t tell you which model wins which job. You have to test.

What I think generalizes:

  1. Hermes 3 wins constrained-output decision tasks. Grading, ordering, classification with rules like “prefer minimal X.” When the prompt says “be cautious,” Hermes is.
  2. Hermes 3 loses open-ended prose generation. Cover letters, drafter output, anything that requires quoting source text verbatim inside free-form prose.
  3. Gemma 4 wins prose generation when output is plain text. Natural-sounding, follows verbatim rules, cites specifics.
  4. Gemma 4 cannot do multi-paragraph prose inside JSON strings. Use ===SECTION=== markers instead.
  5. qwen3:14b wins structured prose with frontmatter (wiki drafts) and comprehensive enumeration (clustering, but only with /think mode).
  6. qwen3:14b is over-decisive on borderline calls. Don’t use it where “maybe” is a valid answer.
  7. qwen3.5:35b via --mmap is the right call for accuracy-critical synthesis that runs overnight. Slow but careful.

How I A/B test now

50-line script in /tmp/<pipeline>_ab.py. Imports the production agent factory, instantiates it with different --model candidates, runs the same input through all of them. Dumps each output to /tmp/<pipeline>_<model>.md for side-by-side reading. Logs latency.

It’s not fancy. The point is every new pipeline gets a documented A/B verdict in code, not a hunch. The A/B harness lives in the repo so the verdict stays with the change.
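The shape of that script is roughly the following. This is a sketch, not my production code: `make_agent` stands in for the real agent factory (here it is any callable that takes a model name and returns a function from input text to output text), and replacing `:` in model names for the filename is my own choice.

```python
import tempfile
import time
from pathlib import Path
from typing import Callable


def run_ab(
    pipeline: str,
    models: list[str],
    make_agent: Callable[[str], Callable[[str], str]],
    task_input: str,
    out_dir: str,
) -> dict[str, float]:
    """Run the same input through each model, save outputs, return latencies."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    latencies = {}
    for model in models:
        agent = make_agent(model)
        start = time.perf_counter()
        output = agent(task_input)
        latencies[model] = time.perf_counter() - start
        safe_name = model.replace(":", "_").replace("/", "_")
        (out / f"{pipeline}_{safe_name}.md").write_text(output, encoding="utf-8")
    return latencies


def fake_factory(model: str) -> Callable[[str], str]:
    """Stub so the sketch runs without Ollama; swap in the real factory."""
    return lambda text: f"[{model}] would answer: {text}"


with tempfile.TemporaryDirectory() as tmp:
    timings = run_ab("grader", ["hermes3:8b", "gemma4:e4b"], fake_factory, "sample job", tmp)
    written = sorted(p.name for p in Path(tmp).iterdir())
```

With only one model loaded at a time, the first call per model also pays the load cost, so it can be worth sending a throwaway warm-up prompt before timing.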

What I’d tell day-1 me

Centralize model names in config.py. Use MODEL_AGENT, MODEL_PRIMARY, MODEL_HEAVY aliases. The named alias documents the A/B verdict — comment with the date and what won. Then when you A/B a new pipeline, you don’t grep through files swapping strings. You add a new alias or change one constant.
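A config.py along those lines might look like this. The alias names come from above; which model sits behind each alias, the pipeline names, and the `model_for` helper are my guesses for illustration, and the dates are placeholders to fill in.

```python
# config.py (sketch): one place for model choices and the A/B verdict behind each.

# A/B verdict YYYY-MM-DD: won constrained decisions (grading, ordering).
MODEL_AGENT = "hermes3:8b"

# A/B verdict YYYY-MM-DD: won plain-text prose (cover letters).
MODEL_PRIMARY = "gemma4:e4b"

# A/B verdict YYYY-MM-DD: won overnight, accuracy-critical synthesis.
MODEL_HEAVY = "qwen3.5:35b"

# Hypothetical pipeline names mapped to aliases, never to raw strings.
PIPELINE_MODELS = {
    "career_grader": MODEL_AGENT,
    "cover_letter": MODEL_PRIMARY,
    "daily_brief": MODEL_HEAVY,
}


def model_for(pipeline: str) -> str:
    """Look up the model for a pipeline; unknown names fail loudly."""
    try:
        return PIPELINE_MODELS[pipeline]
    except KeyError:
        raise KeyError(f"no A/B verdict recorded for pipeline {pipeline!r}") from None
```

Failing on an unknown pipeline is deliberate: a new pipeline should get its own A/B run before it gets a model.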

The template

This whole stack — the model picks, the daemon scheduling, the A/B notes, the daemon catalog — is captured in the Notion template I shipped. The 16GB Mac Mini Local AI Setup on Gumroad. Pay what you want, $3 minimum. Buy v1.0, get every future update free.

If you’re starting your own local AI thing, that’s the map.

minillm.dev