Echo Locate
A 27B open-weights model on a 2015 workstation, in the same leaderboard band as the trillion-plus-parameter frontier models hosted in a data center.
Full results, full hardware, full methodology. No cloud. No network. Under $2,000. Scrap eBay hardware.
The Scoreboard
| Rank | System | Backend | Score | Deployment |
|---|---|---|---|---|
| 01 | Mastra OM | GPT-5-mini | 94.87% | Cloud |
| 02 | Mastra OM | Gemini-3-pro-preview | 93.27% | Cloud |
| 03 | Hindsight | Gemini-3-pro-preview | 91.40% | Cloud |
| 04 | Mastra OM | Gemini-3-flash-preview | 89.20% | Cloud |
| 05 | Hindsight | GPT-OSS-120B | 89.00% | Cloud |
| 06 | Echo Locate | Qwen3.6-27B Q4_K_XL | 85.40% | Local, air-gapped |
| 07 | Supermemory | Gemini-3-pro-preview | 85.20% | Cloud |
| 08 | Supermemory | GPT-5 | 84.60% | Cloud |
| 09 | Mastra OM | GPT-4o | 84.23% | Cloud |
| 10 | Hindsight | GPT-OSS-20B | 83.60% | Cloud |
| 11 | EmergenceMem Simple | GPT-4o | 82.40% | Cloud |
| 11 | Echo Locate | Qwen3.5-27B Q4_K_M | 82.40% | Local, air-gapped |
| 13 | Supermemory | GPT-4o | 81.60% | Cloud |
| 14 | Mastra RAG | GPT-4o | 80.05% | Cloud |
| 15 | Zep Graphiti | GPT-4o | 71.20% | Cloud |
| 16 | Full Context | GPT-4o | 60.20% | Cloud |
LongMemEval-S (Wu et al., 2024). GPT-4o judge, temperature 0. Full 500 questions. Standard harness (github.com/xiaowu0162/LongMemEval). No best-of-N, no self-consistency voting, no judge swaps, no recall@N substitutions.
Look at that table for a moment before reading further.
Every system above Echo Locate calls a model with hundreds of billions to multiple trillions of parameters, runs in a cloud data center, and depends on a live internet connection to function. GPT-5-mini, Gemini-3-pro, Gemini-3-flash, GPT-OSS-120B; these are frontier-class systems backed by multi-billion-dollar infrastructure.
Every system below Echo Locate uses GPT-4o or weaker as its backbone. That includes Mastra OM (84.23%), the previous best GPT-4o-class result on this benchmark.
Echo Locate sits between those two tiers with a 27-billion-parameter open-weights model, quantized to 4 bits, running on a 2015 HP Z840 workstation with retired mining GPUs. Total hardware cost: under $2,000. No API calls. No network at inference. The entire deployed system runs in an Obama era workstation we bought off eBay.
It outperforms every GPT-4o-class system on the leaderboard, and sits within single digits of systems running trillion-parameter models in cloud data centers.
The Argument
For the past two years, the AI industry has operated on a quiet assumption: solving long-horizon memory, the kind that lets an assistant actually remember what you told it three months ago, track how facts about you have evolved, and synthesize answers across hundreds of past conversations, requires frontier-scale models. Trillion-parameter systems. Massive context windows. Cloud data centers full of H100s. The bigger the model, the better the memory.
The leaderboard above suggests that assumption is more fragile than it looks.
A quantized 27B open-weights model has no business being within a handful of points of GPT-5-mini on a memory benchmark. Except it is. And the reason it is has nothing to do with the model getting better, and everything to do with the pipeline Echo Locate engineered around it doing the work the model was previously being asked to do alone.
When you stop treating the LLM as the place where memory happens, and start treating it as one carefully-bounded component in an engineered retrieval pipeline (embedding, reranking, fact extraction, session-level aggregation, type-specific routing), the parameter-count race fades into the background. The model stops being asked to remember things. It gets asked to reason over a curated evidence set that has been assembled by purpose-built components running beside it.
The 27B model in Echo Locate is not smarter than GPT-5-mini. It isn't even in the same conversation. What it has is a pipeline that hands it the right facts to reason over, every time, in a format it can handle without hallucinating. The trillion-parameter systems on the leaderboard above are doing long-horizon memory the hard way: by stuffing huge contexts into huge models and hoping the attention mechanism surfaces what matters. That works, and it works well, but it can't be infinitely scaled to solve the memory problem. It's also expensive, cloud-dependent, and, as the leaderboard shows, not actually that much better than a well-engineered quantized 27B pipeline for this specific problem.
This is not a paradigm shift for every AI use case. Frontier models still matter for raw intelligence, novel synthesis, code generation, and tasks where the model's internal world knowledge is the bottleneck. But for the specific problem of long-horizon interactive memory, recalling, updating, and synthesizing facts across long conversation histories, pipeline design is more important than parameter count.
Why This Matters Beyond the Benchmark
For organizations that can use cloud AI freely, the existing commercial memory layers will continue to work fine. The cloud APIs are easy to integrate, the latency is better, and convenience wins most procurement decisions.
For everyone else, the legal firms who can't send privileged client data to OpenAI, the hospital networks bound by HIPAA, the defense contractors operating on classified networks, the financial institutions facing strict data residency requirements, the government agencies whose data simply cannot leave the building, the calculus has shifted.
There is a second procurement path converging on the same architecture: teams whose workloads routinely exceed the one-million-token context limits of frontier cloud models. For them the bottleneck isn't sovereignty, it's that no model's context window is long enough, and brute-forcing memory by scaling context further has hard limits. They end up needing the same thing the regulated buyer needs: a memory layer that runs alongside the model rather than inside it.
The previous option space was:
- Use a cloud-backed memory system and accept the compliance risk, or
- Use a local system and accept dramatically worse performance.
The leaderboard above narrows that gap from "dramatically worse" to "single digits behind the frontier, ahead of most GPT-4o-class deployments." For buyers in regulated industries, that changes the procurement question from "do we accept cloud risk in exchange for AI memory capabilities?" to "can a local deployment now meet our performance bar?" It is a different conversation than the market has been having for the past few years.
System Overview
Echo Locate is a fully self-hosted long-horizon memory system with three components running entirely on local hardware:
- Retrieval layer, Embedding model and Reranker
- Generation layer, Quantized open-weights LLM
- Orchestration layer, Python pipeline coordinating Indexing, Retrieval, Ranking, and Answer Synthesis
All inference happens locally via llama.cpp. No external network calls at any point during operation.
Hardware
Production deployment, the system customers receive:
| Component | Specification | Source | Approx Cost |
|---|---|---|---|
| Workstation | HP Z840 (2015) | eBay | $500 |
| CPU | 2× Intel Xeon E5-2650 v4 (24C/48T total); second CPU sourced separately from eBay | eBay | $15 |
| RAM | 64 GB DDR4 ECC | Included w/ Z840 | — |
| GPU 0 | NVIDIA RTX 5060 Ti 16 GB | Micro Center | $525 |
| GPU 1 | NVIDIA RTX 3090 FTW3 24 GB (retired mining GPU) | Used market | $900 |
| Boot drive | 512 GB NVMe SSD | Used market | $50 |
| Storage | 2× 512 GB SATA SSD | Included w/ Z840 | — |
| Bulk storage | 6 TB HDD | Included w/ Z840 | — |
| PSU | 1125 W (factory) | Included w/ Z840 | — |
| Total | ~$1,990 | ||
The RTX 3090 was released in 2020 and salvaged from a retired crypto mining rig. The RTX 5060 Ti is the only new GPU in the stack and could easily be replaced by any GPU with 12 GB or more of VRAM.
Benchmark configuration, the rig that produced the score:
The evaluation reported here was run on a two-machine topology: the Z840 above with a second RTX 3090 added as a parallel evaluation worker, plus a separate companion machine, a Lenovo S20 with DDR3 ECC memory, hosting the embedding and reranker services on the RTX 5060 Ti. The second 3090 was added solely to parallelize the 500-question run and roughly halve wall-clock benchmarking time during iterative development. The companion-machine split was an artifact of the test bench, not a requirement of the system.
The production deployment described in the table above runs the entire pipeline in a single Z840 chassis with one RTX 3090 and the RTX 5060 Ti. The score in the leaderboard above was achieved with functionally equivalent software and pipeline logic; the only difference is hardware topology.
This matters, and I'd rather flag it here than have someone question it later. Testing pipeline patches with a single machine took too long, so a second 3090 was sourced from eBay.
Software Stack
| Component | Version / Configuration |
|---|---|
| OS | Ubuntu 24.04 LTS |
| Kernel | 6.17.0-19 |
| CUDA | 13.0 (sm_86 + sm_120 builds) |
| Inference engine | llama.cpp (custom build, dual-architecture) |
| Generator runtime | llama-server, 250W power limit on 3090 |
| Embedding runtime | llama-server, systemd-managed |
| Reranker runtime | FastAPI service alongside embedder on 5060 Ti |
| Orchestration | Python 3.12, async pipeline |
| Storage backend | ChromaDB |
| Process management | systemd services for all model runtimes |
All software is open source.
Model Selection
| Role | Model | Quantization | Hardware |
|---|---|---|---|
| Generator | Qwen3.6-27B | Q4_K_XL (4-bit) | RTX 3090 (24 GB) |
| Embedding | Qwen3-Embedding-4B | Q8_0 (8-bit) | RTX 5060 Ti (16 GB) |
| Reranker | Qwen3-Reranker-4B | fp16 (16-bit) | RTX 5060 Ti (16 GB, alongside embedder) |
Why these models:
- Qwen3.6-27B (Q4_K_XL) was selected as the generator after benchmarking 9B, 14B, 27B, and a 35B MoE variant across multiple Qwen generations. The 27B at 4-bit quantization fits on a single 24 GB consumer GPU and represents the optimal trade-off between reasoning capacity and deployment envelope. The 35B MoE ran approximately 3.5× faster than the 27B but scored meaningfully lower on LongMemEval-S, and hallucinated considerably more on multi-fact synthesis tasks; the speed gain did not justify the quality regression. In real-world use cases outside of an intentionally punishing benchmark, the 35B is a model worth testing.
- Qwen3-Embedding-4B (Q8_0) provides high-quality semantic search. 8-bit quantization was chosen so embedder and reranker can both reside on the 16 GB 5060 Ti simultaneously, eliminating model swaps during inference.
- Qwen3-Reranker-4B (fp16) runs at full precision because reranker output drives the final retrieval set. The accuracy cost of quantizing the reranker exceeded the memory savings.
All three models are open-weights, downloadable from HuggingFace, and reproducible with the exact GGUF file hashes provided in the reproduction recipe.
Evaluation Methodology
LongMemEval-S consists of 500 questions across six categories: single-session-user recall, single-session-assistant recall, single-session-preference, knowledge-update, temporal-reasoning, and multi-session synthesis. Each question is paired with a multi-session conversation history and an expected answer.
Evaluation configuration:
- Benchmark: Standard LongMemEval-S, full 500 questions, no subset
- Evaluation harness: Standard LongMemEval-S harness (github.com/xiaowu0162/LongMemEval)
- Judge model: GPT-4o, temperature 0
- Answer generation: Single final-answer generation per question (no best-of-N, no self-consistency voting, no multiple sampling). Indexing and retrieval are multi-call pipeline stages that prepare the evidence set the final-answer call reasons over.
- Run mode: Single run end-to-end
- Fine-tuning: Zero-shot (no training on LongMemEval data)
Every aspect of the run, the question set, the harness, the GPT-4o judge configuration, the temperature, the no-best-of-N rule, was executed exactly as the LongMemEval authors specified in the original benchmark. No protocol deviations, no judge swaps, no relaxed rules. This is the same methodology used by every system in the leaderboard. Direct comparison is apples-to-apples on the judge, the dataset, and the scoring rules.
Per-Category Results
| Category | Echo Locate (Qwen 27B) | Mastra OM (GPT-4o) | Supermemory (GPT-4o) | Zep (GPT-4o) | Full Context (GPT-4o) |
|---|---|---|---|---|---|
| single-session-user | 95.71% | 98.60% | 97.14% | 92.9% | 81.4% |
| single-session-assistant | 85.71% | 82.10% | 96.43% | 80.4% | 94.6% |
| single-session-preference | 60.00% | 73.30% | 70.00% | 56.7% | 20.0% |
| knowledge-update | 89.74% | 85.90% | 88.46% | 83.3% | 78.2% |
| temporal-reasoning | 92.48% | 85.70% | 76.69% | 62.4% | 45.1% |
| multi-session | 75.94% | 79.70% | 71.43% | 57.9% | 44.3% |
| Overall | 85.40% | 84.23% | 81.60% | 71.20% | 60.20% |
Echo Locate's strongest categories are temporal-reasoning (92.48%, +15.8 over Supermemory's GPT-4o), multi-session synthesis (75.94%), and knowledge-update (89.74%). These are the three categories that matter most for production memory deployments: tracking how facts evolve over time, combining information across distant conversations, and reasoning about when events happened relative to each other. Echo Locate trails on single-session-assistant and single-session-preference; both are areas where the comparison system uses prompt engineering specifically tuned for those failure modes, and where additional pipeline iteration would close the gap.
Why the temporal-reasoning result is hard to fake
The most natural critique of any benchmark result is "you tuned the prompts until the score went up." That critique applies cleanly to single-session categories where short questions, narrow context, and simple expected outputs make output-format optimization the dominant lever. It does not apply cleanly to temporal-reasoning.
Temporal-reasoning questions on LongMemEval-S require the system to reconstruct event ordering, compute time differences between facts mentioned in different sessions, distinguish historical from current state, and ground relative time references ("about a month ago") against absolute session dates. The judge does not accept format tricks. It accepts correct dates and correct relative-time conclusions or it does not.
Echo Locate scored 92.48% in this category. The closest published GPT-4o-judged comparison, Supermemory with GPT-4o, scored 76.69%. That is a 15.79-point absolute gap on the category where benchmark gaming is least available as an explanation. The result is consistent with Echo Locate's pipeline doing genuinely better fact extraction, ordering, and time-grounding than its competitors, not with prompt-level optimization against a known evaluator.
Per-category breakdowns for Mastra OM and Hindsight are not published by those systems in a directly-comparable form; overall scores from those systems are reflected in the scoreboard above. Supermemory per-category data sourced from supermemory.ai/research.
Development
The result above is the product of several weeks of iterative pipeline development on top of months of foundational work. The system progressed through multiple internal versions, each targeting specific failure modes observed in benchmark output: temporal reasoning errors, counting drift, knowledge-update confusion between historical and current state, unit-preservation edge cases, and abstention handling. We pushed the model until it broke, assessed why it broke, updated the pipeline accordingly, and then pushed it further until something else broke. Push, break, fix, repeat.
The engineering insight that emerged across iterations: the retrieval and orchestration layer carries far more weight than the generator's parameter count. Targeted improvements to indexing strategy, fact extraction prompting, per-question-type reranker instructions, candidate aggregation, and answer-formation produced cumulative gains that compounded into the final result. The generator itself improved marginally over the development window. The pipeline improved by more than 15 points on the same model.
Pipeline evolution timeline:
| Date | Pipeline version | Generator | Score | Notes |
|---|---|---|---|---|
| Early April | V8.3 baseline | Qwen 3.5-27B Q4_K_M | 81.0% | Baseline |
| Mid April | V8.x.1 | Qwen 3.5-27B Q4_K_M | 82.4% | Unit-preservation patch to pipeline |
| Late April | V8.3 (current) | Qwen3.6-27B Q4_K_XL | 85.40% | Pipeline rewrite, query expansion, 3.6 model upgrade |
The 3.0-point lift between V8.x.1 and the current V8.3 result reflects multiple bundled changes: a generator upgrade from Qwen 3.5 to Qwen 3.6, a switch from local to remote reranker hosting, and category-specific query expansion logic added for temporal-reasoning and counting-math questions. We are not separating the contribution of each change in this report; isolating individual factors would require additional controlled benchmark runs and is planned for future work. What this report demonstrates is that the system as engineered today, on the hardware as configured today, scores 85.40% on the standard LongMemEval-S harness with the standard GPT-4o judge.
A concrete example of the kind of engineering that moves the number: one of the later patches was a unit-preservation guard in the answer parser. On questions asking "how many hours…" or "how many days…", the model would sometimes return a bare numeric answer ("7") when the expected answer format required units ("7 days"). A small post-processor detects numeric-only answers to unitized questions, pulls the appropriate unit from the question text, and appends it before scoring. The patch is trivial. It doesn't involve the model. And it recovered a meaningful number of category-specific errors that had nothing to do with reasoning capability and everything to do with output format. There are dozens of patches like that in the pipeline. Collectively, these patches are why a 27B sits on the same leaderboard as GPT-5-mini.
These patches aren't benchmark tricks. They're the engineering work that lets a 27B model share a leaderboard with trillion-parameter cloud systems on a real memory task, and that translates directly into production agentic workloads where retrieval quality determines whether the system actually behaves like it remembers anything.
The specific implementation of each component constitutes Echo Locate's commercial differentiation and is retained as proprietary IP.
Published vs Proprietary
Published (this document and accompanying livestream):
- Full hardware stack with sources and pricing
- Full software stack with versions and configurations
- Model selections, quantization choices, and rationale
- High-level architectural overview
- Evaluation methodology with judge configuration
- Per-category result breakdown
Retained as proprietary:
- Prompt templates and orchestration logic
- Indexing and fact extraction implementation
- Aggregation and routing algorithms
- Pipeline routing logic
- All productization, deployment, monitoring, and customer-tuning code
This boundary is deliberate. Researchers can verify the result through the published methodology and evaluate the deployment envelope through the hardware and software documentation. Organizations evaluating Echo Locate for procurement can assess fit through the per-category breakdown and architectural overview. The orchestration logic that produces the result is the company's commercial differentiation and is available only through commercial licensing.
Reproduction
The GGUF model file hashes for verification:
ff6941ded525b34eb159496762c29dd0ec6e71dc31b74d57e75d871a03eec259
b60ae5ce2dd6a0b77f82cadf21def1f310a3e10cde380ad0081b07a9d416949d
cf2e87cbf71fa628961532232e04dd6c19702a0a057f5e2aff95ea1aca4fd488
model-00002-of-00002.safetensors:
78946d22b7f6456ea7a5358dbdf3982de36c5bac1f166a5fd58e18e31db8048a
Combined hash of shard hashes:
4cbb849fe2bdfa040a2ba6b5d040f775f1b2619fdc3734bbefd831dc1f494b67
Limitations and Honest Caveats
Single benchmark. This result is on LongMemEval-S only. Strong performance on one benchmark does not automatically transfer to every long-horizon memory workload. The next planned evaluation is LoCoMo (Maharana et al., 2024), a long-horizon conversational memory benchmark with different structure (longer multi-turn dialogues, different question taxonomy, different failure modes). LoCoMo results will be published when complete, regardless of outcome. The strongest defense against benchmark overfitting is generalization to a second evaluation.
Latency. Per-question inference time on the 2015 workstation runs approximately 10-12 minutes on average during LongMemEval-S evaluation. This number reflects the benchmark's worst-case design, which forces the pipeline to re-index dozens of conversation sessions (typically 45 to 55) from scratch per question. Real-world deployment does not look like this. Enterprise queries against an already-indexed conversation history skip the indexing stage entirely (which is the bulk of the per-question runtime) and reduce to a retrieval, reranking, and generation pass against pre-built indices. Concrete latency benchmarks for production-representative workloads will be published alongside the single-box reproduction run.
Model weight licensing. Qwen models are open-weights but subject to the Qwen license terms. Organizations with strict open-source requirements should review the license for their specific use case.
Not beating the frontier, not claiming to. Echo Locate does not beat GPT-5-mini. It does not beat Gemini-3-pro. It does not beat Gemini-3-flash or GPT-OSS-120B. The systems above it on the leaderboard are stronger on this benchmark. What Echo Locate does is outperform every GPT-4o-class system on the leaderboard, while running on consumer hardware that costs less than a refurbished MacBook Pro and never touching the network. The gap between Echo Locate and the trillion-parameter frontier is single digits, not the orders-of-magnitude difference the industry narrative implies and has been scaling toward. That gap is the entire point.
Why This Result Is Defensible
The result was achieved using the standard LongMemEval-S evaluation harness, the same GPT-4o judge, the same 500 questions, the same single-pass scoring rules used by Mastra, Hindsight, Supermemory, Zep, and every other system on the board.
The full hardware stack, software dependencies, model selections, and evaluation methodology are published here. The orchestration logic that produces the result, the prompt templates, the indexing algorithms, the per-category routing, remains proprietary as the company's commercial differentiation. The boundary is deliberate: the result is verifiable, the methodology is transparent, the productized system is what customers buy.
The Unexpected Conclusion
The most interesting result from building Echo Locate is not the score. It's that a 27-billion-parameter quantized open model, running on a workstation built during the Obama administration, can sit on the same leaderboard as systems calling GPT-5-mini and Gemini-3-pro, with nothing between its answers and the user but local disk and a pair of consumer GPUs.
That isn't because the 27B is secretly competitive with trillion-parameter systems. It isn't. It's because the AI Model was never supposed to be where the memory happens. The entire industry is built around this assumption, but when the retrieval pipeline does the work the pipeline should do, the weight class of the AI Model matters less than the entire industry assumed.
Commercial Availability
Echo Locate is available to design partners in regulated industries (legal, healthcare, defense, finance) and other organizations with strict data sovereignty requirements. The system ships as a deployable appliance on customer-controlled infrastructure with no cloud dependencies.
For design partner inquiries, technical questions, or commercial discussions: echolocate@fastmail.com
Acknowledgements
LongMemEval-S benchmark by Wu et al. (2024). Qwen models by Alibaba DAMO Academy. llama.cpp by Georgi Gerganov and contributors. Reference comparison data from Mastra OM, Hindsight, Supermemory, Zep, EmergenceMem, and the original LongMemEval paper.
Echo Locate is a self-hosted, fully air-gapped AI memory system built for organizations that can't send their data to the cloud. Full technical methodology, hardware stack, and evaluation results at echolocate.ai.