DOC_ID: ECHO_LOCATE_WP_V2 STATUS: PUBLISHED CLASSIFICATION: PUBLIC BENCHMARK: LongMemEval-S JUDGE: GPT-4o · T=0

Echo Locate

A 27B open-weights model on a 2015 workstation, in the same leaderboard band as the trillion-plus-parameter frontier models hosted in a data center.

Full results, full hardware, full methodology. No cloud. No network. Under $2,000. Scrap eBay hardware.

HEADLINE_RESULT
SCORE85.40%
RANK#06
HARDWARE~$1,990
MODELQWEN_3.6_27B

The Scoreboard

Rank System Backend Score Deployment
01Mastra OMGPT-5-mini94.87%Cloud
02Mastra OMGemini-3-pro-preview93.27%Cloud
03HindsightGemini-3-pro-preview91.40%Cloud
04Mastra OMGemini-3-flash-preview89.20%Cloud
05HindsightGPT-OSS-120B89.00%Cloud
06Echo LocateQwen3.6-27B Q4_K_XL85.40%Local, air-gapped
07SupermemoryGemini-3-pro-preview85.20%Cloud
08SupermemoryGPT-584.60%Cloud
09Mastra OMGPT-4o84.23%Cloud
10HindsightGPT-OSS-20B83.60%Cloud
11EmergenceMem SimpleGPT-4o82.40%Cloud
11Echo LocateQwen3.5-27B Q4_K_M82.40%Local, air-gapped
13SupermemoryGPT-4o81.60%Cloud
14Mastra RAGGPT-4o80.05%Cloud
15Zep GraphitiGPT-4o71.20%Cloud
16Full ContextGPT-4o60.20%Cloud

LongMemEval-S (Wu et al., 2024). GPT-4o judge, temperature 0. Full 500 questions. Standard harness (github.com/xiaowu0162/LongMemEval). No best-of-N, no self-consistency voting, no judge swaps, no recall@N substitutions.

Look at that table for a moment before reading further.

Every system above Echo Locate calls a model with hundreds of billions to multiple trillions of parameters, runs in a cloud data center, and depends on a live internet connection to function. GPT-5-mini, Gemini-3-pro, Gemini-3-flash, GPT-OSS-120B; these are frontier-class systems backed by multi-billion-dollar infrastructure.

Every system below Echo Locate uses GPT-4o or weaker as its backbone. That includes Mastra OM (84.23%), the previous best GPT-4o-class result on this benchmark.

Echo Locate sits between those two tiers with a 27-billion-parameter open-weights model, quantized to 4 bits, running on a 2015 HP Z840 workstation with retired mining GPUs. Total hardware cost: under $2,000. No API calls. No network at inference. The entire deployed system runs in an Obama era workstation we bought off eBay.

It outperforms every GPT-4o-class system on the leaderboard, and sits within single digits of systems running trillion-parameter models in cloud data centers.

That isn't a claim about beating the frontier outright. It's a claim about the assumption that AI Memory is a model problem that can only be solved by scaling. We approached it as an Engineering problem, and closed the gap on a benchmark explicitly designed to test long-horizon memory.

The Argument

For the past two years, the AI industry has operated on a quiet assumption: solving long-horizon memory, the kind that lets an assistant actually remember what you told it three months ago, track how facts about you have evolved, and synthesize answers across hundreds of past conversations, requires frontier-scale models. Trillion-parameter systems. Massive context windows. Cloud data centers full of H100s. The bigger the model, the better the memory.

The leaderboard above suggests that assumption is more fragile than it looks.

A quantized 27B open-weights model has no business being within a handful of points of GPT-5-mini on a memory benchmark. Except it is. And the reason it is has nothing to do with the model getting better, and everything to do with the pipeline Echo Locate engineered around it doing the work the model was previously being asked to do alone.

Memory is an indexing problem, not a parameter problem; we proved it.

When you stop treating the LLM as the place where memory happens, and start treating it as one carefully-bounded component in an engineered retrieval pipeline (embedding, reranking, fact extraction, session-level aggregation, type-specific routing), the parameter-count race fades into the background. The model stops being asked to remember things. It gets asked to reason over a curated evidence set that has been assembled by purpose-built components running beside it.

Parameter count is where Intelligence lies, not memory.

The 27B model in Echo Locate is not smarter than GPT-5-mini. It isn't even in the same conversation. What it has is a pipeline that hands it the right facts to reason over, every time, in a format it can handle without hallucinating. The trillion-parameter systems on the leaderboard above are doing long-horizon memory the hard way: by stuffing huge contexts into huge models and hoping the attention mechanism surfaces what matters. That works, and it works well, but it can't be infinitely scaled to solve the memory problem. It's also expensive, cloud-dependent, and, as the leaderboard shows, not actually that much better than a well-engineered quantized 27B pipeline for this specific problem.

This is not a paradigm shift for every AI use case. Frontier models still matter for raw intelligence, novel synthesis, code generation, and tasks where the model's internal world knowledge is the bottleneck. But for the specific problem of long-horizon interactive memory, recalling, updating, and synthesizing facts across long conversation histories, pipeline design is more important than parameter count.

Why This Matters Beyond the Benchmark

For organizations that can use cloud AI freely, the existing commercial memory layers will continue to work fine. The cloud APIs are easy to integrate, the latency is better, and convenience wins most procurement decisions.

For everyone else, the legal firms who can't send privileged client data to OpenAI, the hospital networks bound by HIPAA, the defense contractors operating on classified networks, the financial institutions facing strict data residency requirements, the government agencies whose data simply cannot leave the building, the calculus has shifted.

Your data never leaves the building. You can pull the Cat-5 out of the machine and the system keeps working.

There is a second procurement path converging on the same architecture: teams whose workloads routinely exceed the one-million-token context limits of frontier cloud models. For them the bottleneck isn't sovereignty, it's that no model's context window is long enough, and brute-forcing memory by scaling context further has hard limits. They end up needing the same thing the regulated buyer needs: a memory layer that runs alongside the model rather than inside it.

The previous option space was:

  1. Use a cloud-backed memory system and accept the compliance risk, or
  2. Use a local system and accept dramatically worse performance.

The leaderboard above narrows that gap from "dramatically worse" to "single digits behind the frontier, ahead of most GPT-4o-class deployments." For buyers in regulated industries, that changes the procurement question from "do we accept cloud risk in exchange for AI memory capabilities?" to "can a local deployment now meet our performance bar?" It is a different conversation than the market has been having for the past few years.

System Overview

Echo Locate is a fully self-hosted long-horizon memory system with three components running entirely on local hardware:

  1. Retrieval layer, Embedding model and Reranker
  2. Generation layer, Quantized open-weights LLM
  3. Orchestration layer, Python pipeline coordinating Indexing, Retrieval, Ranking, and Answer Synthesis

All inference happens locally via llama.cpp. No external network calls at any point during operation.

Hardware

Production deployment, the system customers receive:

ComponentSpecificationSourceApprox Cost
WorkstationHP Z840 (2015)eBay$500
CPU2× Intel Xeon E5-2650 v4 (24C/48T total); second CPU sourced separately from eBayeBay$15
RAM64 GB DDR4 ECCIncluded w/ Z840
GPU 0NVIDIA RTX 5060 Ti 16 GBMicro Center$525
GPU 1NVIDIA RTX 3090 FTW3 24 GB (retired mining GPU)Used market$900
Boot drive512 GB NVMe SSDUsed market$50
Storage2× 512 GB SATA SSDIncluded w/ Z840
Bulk storage6 TB HDDIncluded w/ Z840
PSU1125 W (factory)Included w/ Z840
Total~$1,990

The RTX 3090 was released in 2020 and salvaged from a retired crypto mining rig. The RTX 5060 Ti is the only new GPU in the stack and could easily be replaced by any GPU with 12 GB or more of VRAM.

Benchmark configuration, the rig that produced the score:

The evaluation reported here was run on a two-machine topology: the Z840 above with a second RTX 3090 added as a parallel evaluation worker, plus a separate companion machine, a Lenovo S20 with DDR3 ECC memory, hosting the embedding and reranker services on the RTX 5060 Ti. The second 3090 was added solely to parallelize the 500-question run and roughly halve wall-clock benchmarking time during iterative development. The companion-machine split was an artifact of the test bench, not a requirement of the system.

The production deployment described in the table above runs the entire pipeline in a single Z840 chassis with one RTX 3090 and the RTX 5060 Ti. The score in the leaderboard above was achieved with functionally equivalent software and pipeline logic; the only difference is hardware topology.

This matters, and I'd rather flag it here than have someone question it later. Testing pipeline patches with a single machine took too long, so a second 3090 was sourced from eBay.

Software Stack

ComponentVersion / Configuration
OSUbuntu 24.04 LTS
Kernel6.17.0-19
CUDA13.0 (sm_86 + sm_120 builds)
Inference enginellama.cpp (custom build, dual-architecture)
Generator runtimellama-server, 250W power limit on 3090
Embedding runtimellama-server, systemd-managed
Reranker runtimeFastAPI service alongside embedder on 5060 Ti
OrchestrationPython 3.12, async pipeline
Storage backendChromaDB
Process managementsystemd services for all model runtimes

All software is open source.

Model Selection

RoleModelQuantizationHardware
GeneratorQwen3.6-27BQ4_K_XL (4-bit)RTX 3090 (24 GB)
EmbeddingQwen3-Embedding-4BQ8_0 (8-bit)RTX 5060 Ti (16 GB)
RerankerQwen3-Reranker-4Bfp16 (16-bit)RTX 5060 Ti (16 GB, alongside embedder)

Why these models:

All three models are open-weights, downloadable from HuggingFace, and reproducible with the exact GGUF file hashes provided in the reproduction recipe.

Evaluation Methodology

LongMemEval-S consists of 500 questions across six categories: single-session-user recall, single-session-assistant recall, single-session-preference, knowledge-update, temporal-reasoning, and multi-session synthesis. Each question is paired with a multi-session conversation history and an expected answer.

Evaluation configuration:

Every aspect of the run, the question set, the harness, the GPT-4o judge configuration, the temperature, the no-best-of-N rule, was executed exactly as the LongMemEval authors specified in the original benchmark. No protocol deviations, no judge swaps, no relaxed rules. This is the same methodology used by every system in the leaderboard. Direct comparison is apples-to-apples on the judge, the dataset, and the scoring rules.

Per-Category Results

Category Echo Locate (Qwen 27B) Mastra OM (GPT-4o) Supermemory (GPT-4o) Zep (GPT-4o) Full Context (GPT-4o)
single-session-user95.71%98.60%97.14%92.9%81.4%
single-session-assistant85.71%82.10%96.43%80.4%94.6%
single-session-preference60.00%73.30%70.00%56.7%20.0%
knowledge-update89.74%85.90%88.46%83.3%78.2%
temporal-reasoning92.48%85.70%76.69%62.4%45.1%
multi-session75.94%79.70%71.43%57.9%44.3%
Overall85.40%84.23%81.60%71.20%60.20%
FIG_01 · PER_CATEGORY_BREAKDOWN 8_SYSTEMS · 6_CATEGORIES
ECHO_LOCATE · QWEN_3.6_27B COMPETITORS
TEMPORAL_REASONINGCAT_01
01MASTRA_OM · GPT-5-MINI95.50%
02ECHO_LOCATE · QWEN_3.6_27B92.48%
03MASTRA_OM · GPT-4o85.70%
04SUPERMEMORY · GEMINI-3-PRO81.95%
05SUPERMEMORY · GPT-581.20%
06SUPERMEMORY · GPT-4o76.69%
07ZEP · GPT-4o62.40%
08FULL_CONTEXT45.10%
KNOWLEDGE_UPDATECAT_02
01MASTRA_OM · GPT-5-MINI96.20%
02ECHO_LOCATE · QWEN_3.6_27B89.74%
02SUPERMEMORY · GEMINI-3-PRO89.74%
04SUPERMEMORY · GPT-4o88.46%
05SUPERMEMORY · GPT-587.18%
06MASTRA_OM · GPT-4o85.90%
07ZEP · GPT-4o83.30%
08FULL_CONTEXT78.20%
MULTI_SESSIONCAT_03
01MASTRA_OM · GPT-5-MINI87.20%
02MASTRA_OM · GPT-4o79.70%
03SUPERMEMORY · GEMINI-3-PRO76.69%
04ECHO_LOCATE · QWEN_3.6_27B75.94%
05SUPERMEMORY · GPT-575.19%
06SUPERMEMORY · GPT-4o71.43%
07ZEP · GPT-4o57.90%
08FULL_CONTEXT44.30%
SINGLE_SESSION_USERCAT_04
01MASTRA_OM · GPT-4o98.60%
02SUPERMEMORY · GEMINI-3-PRO98.57%
03SUPERMEMORY · GPT-597.14%
03SUPERMEMORY · GPT-4o97.14%
05ECHO_LOCATE · QWEN_3.6_27B95.71%
06MASTRA_OM · GPT-5-MINI95.70%
07ZEP · GPT-4o92.90%
08FULL_CONTEXT81.40%
SINGLE_SESSION_ASSISTANTCAT_05
01SUPERMEMORY · GPT-5100.00%
02SUPERMEMORY · GEMINI-3-PRO98.21%
03SUPERMEMORY · GPT-4o96.43%
04MASTRA_OM · GPT-5-MINI94.60%
04FULL_CONTEXT94.60%
06ECHO_LOCATE · QWEN_3.6_27B85.71%
07MASTRA_OM · GPT-4o82.10%
08ZEP · GPT-4o80.40%
SINGLE_SESSION_PREFERENCECAT_06
01MASTRA_OM · GPT-5-MINI100.00%
02SUPERMEMORY · GPT-576.67%
03MASTRA_OM · GPT-4o73.30%
04SUPERMEMORY · GEMINI-3-PRO70.00%
04SUPERMEMORY · GPT-4o70.00%
06ECHO_LOCATE · QWEN_3.6_27B60.00%
07ZEP · GPT-4o56.70%
08FULL_CONTEXT20.00%

Echo Locate's strongest categories are temporal-reasoning (92.48%, +15.8 over Supermemory's GPT-4o), multi-session synthesis (75.94%), and knowledge-update (89.74%). These are the three categories that matter most for production memory deployments: tracking how facts evolve over time, combining information across distant conversations, and reasoning about when events happened relative to each other. Echo Locate trails on single-session-assistant and single-session-preference; both are areas where the comparison system uses prompt engineering specifically tuned for those failure modes, and where additional pipeline iteration would close the gap.

Why the temporal-reasoning result is hard to fake

The most natural critique of any benchmark result is "you tuned the prompts until the score went up." That critique applies cleanly to single-session categories where short questions, narrow context, and simple expected outputs make output-format optimization the dominant lever. It does not apply cleanly to temporal-reasoning.

Temporal-reasoning questions on LongMemEval-S require the system to reconstruct event ordering, compute time differences between facts mentioned in different sessions, distinguish historical from current state, and ground relative time references ("about a month ago") against absolute session dates. The judge does not accept format tricks. It accepts correct dates and correct relative-time conclusions or it does not.

Echo Locate scored 92.48% in this category. The closest published GPT-4o-judged comparison, Supermemory with GPT-4o, scored 76.69%. That is a 15.79-point absolute gap on the category where benchmark gaming is least available as an explanation. The result is consistent with Echo Locate's pipeline doing genuinely better fact extraction, ordering, and time-grounding than its competitors, not with prompt-level optimization against a known evaluator.

The +15.79 temporal gap is the single strongest piece of evidence that the overall 85.40% is real engineering, not a benchmark artifact.

Per-category breakdowns for Mastra OM and Hindsight are not published by those systems in a directly-comparable form; overall scores from those systems are reflected in the scoreboard above. Supermemory per-category data sourced from supermemory.ai/research.

Development

The result above is the product of several weeks of iterative pipeline development on top of months of foundational work. The system progressed through multiple internal versions, each targeting specific failure modes observed in benchmark output: temporal reasoning errors, counting drift, knowledge-update confusion between historical and current state, unit-preservation edge cases, and abstention handling. We pushed the model until it broke, assessed why it broke, updated the pipeline accordingly, and then pushed it further until something else broke. Push, break, fix, repeat.

The engineering insight that emerged across iterations: the retrieval and orchestration layer carries far more weight than the generator's parameter count. Targeted improvements to indexing strategy, fact extraction prompting, per-question-type reranker instructions, candidate aggregation, and answer-formation produced cumulative gains that compounded into the final result. The generator itself improved marginally over the development window. The pipeline improved by more than 15 points on the same model.

Pipeline evolution timeline:

DatePipeline versionGeneratorScoreNotes
Early AprilV8.3 baselineQwen 3.5-27B Q4_K_M81.0%Baseline
Mid AprilV8.x.1Qwen 3.5-27B Q4_K_M82.4%Unit-preservation patch to pipeline
Late AprilV8.3 (current)Qwen3.6-27B Q4_K_XL85.40%Pipeline rewrite, query expansion, 3.6 model upgrade

The 3.0-point lift between V8.x.1 and the current V8.3 result reflects multiple bundled changes: a generator upgrade from Qwen 3.5 to Qwen 3.6, a switch from local to remote reranker hosting, and category-specific query expansion logic added for temporal-reasoning and counting-math questions. We are not separating the contribution of each change in this report; isolating individual factors would require additional controlled benchmark runs and is planned for future work. What this report demonstrates is that the system as engineered today, on the hardware as configured today, scores 85.40% on the standard LongMemEval-S harness with the standard GPT-4o judge.

A concrete example of the kind of engineering that moves the number: one of the later patches was a unit-preservation guard in the answer parser. On questions asking "how many hours…" or "how many days…", the model would sometimes return a bare numeric answer ("7") when the expected answer format required units ("7 days"). A small post-processor detects numeric-only answers to unitized questions, pulls the appropriate unit from the question text, and appends it before scoring. The patch is trivial. It doesn't involve the model. And it recovered a meaningful number of category-specific errors that had nothing to do with reasoning capability and everything to do with output format. There are dozens of patches like that in the pipeline. Collectively, these patches are why a 27B sits on the same leaderboard as GPT-5-mini.

These patches aren't benchmark tricks. They're the engineering work that lets a 27B model share a leaderboard with trillion-parameter cloud systems on a real memory task, and that translates directly into production agentic workloads where retrieval quality determines whether the system actually behaves like it remembers anything.

The specific implementation of each component constitutes Echo Locate's commercial differentiation and is retained as proprietary IP.

Published vs Proprietary

Published (this document and accompanying livestream):

Retained as proprietary:

This boundary is deliberate. Researchers can verify the result through the published methodology and evaluate the deployment envelope through the hardware and software documentation. Organizations evaluating Echo Locate for procurement can assess fit through the per-category breakdown and architectural overview. The orchestration logic that produces the result is the company's commercial differentiation and is available only through commercial licensing.

Reproduction

The GGUF model file hashes for verification:

QWEN3.6-27B Q4_K_XL (generator)
ff6941ded525b34eb159496762c29dd0ec6e71dc31b74d57e75d871a03eec259

QWEN3-EMBEDDING-4B Q8_0 (embedder)
b60ae5ce2dd6a0b77f82cadf21def1f310a3e10cde380ad0081b07a9d416949d

QWEN3-RERANKER-4B fp16 (reranker, sharded)
model-00001-of-00002.safetensors:
cf2e87cbf71fa628961532232e04dd6c19702a0a057f5e2aff95ea1aca4fd488
model-00002-of-00002.safetensors:
78946d22b7f6456ea7a5358dbdf3982de36c5bac1f166a5fd58e18e31db8048a
Combined hash of shard hashes:
4cbb849fe2bdfa040a2ba6b5d040f775f1b2619fdc3734bbefd831dc1f494b67

Limitations and Honest Caveats

Single benchmark. This result is on LongMemEval-S only. Strong performance on one benchmark does not automatically transfer to every long-horizon memory workload. The next planned evaluation is LoCoMo (Maharana et al., 2024), a long-horizon conversational memory benchmark with different structure (longer multi-turn dialogues, different question taxonomy, different failure modes). LoCoMo results will be published when complete, regardless of outcome. The strongest defense against benchmark overfitting is generalization to a second evaluation.

Latency. Per-question inference time on the 2015 workstation runs approximately 10-12 minutes on average during LongMemEval-S evaluation. This number reflects the benchmark's worst-case design, which forces the pipeline to re-index dozens of conversation sessions (typically 45 to 55) from scratch per question. Real-world deployment does not look like this. Enterprise queries against an already-indexed conversation history skip the indexing stage entirely (which is the bulk of the per-question runtime) and reduce to a retrieval, reranking, and generation pass against pre-built indices. Concrete latency benchmarks for production-representative workloads will be published alongside the single-box reproduction run.

Model weight licensing. Qwen models are open-weights but subject to the Qwen license terms. Organizations with strict open-source requirements should review the license for their specific use case.

Not beating the frontier, not claiming to. Echo Locate does not beat GPT-5-mini. It does not beat Gemini-3-pro. It does not beat Gemini-3-flash or GPT-OSS-120B. The systems above it on the leaderboard are stronger on this benchmark. What Echo Locate does is outperform every GPT-4o-class system on the leaderboard, while running on consumer hardware that costs less than a refurbished MacBook Pro and never touching the network. The gap between Echo Locate and the trillion-parameter frontier is single digits, not the orders-of-magnitude difference the industry narrative implies and has been scaling toward. That gap is the entire point.

Why This Result Is Defensible

The result was achieved using the standard LongMemEval-S evaluation harness, the same GPT-4o judge, the same 500 questions, the same single-pass scoring rules used by Mastra, Hindsight, Supermemory, Zep, and every other system on the board.

The full hardware stack, software dependencies, model selections, and evaluation methodology are published here. The orchestration logic that produces the result, the prompt templates, the indexing algorithms, the per-category routing, remains proprietary as the company's commercial differentiation. The boundary is deliberate: the result is verifiable, the methodology is transparent, the productized system is what customers buy.

We tried to break it and pushed it this far. We are asking that you try to break it as well.

The Unexpected Conclusion

The most interesting result from building Echo Locate is not the score. It's that a 27-billion-parameter quantized open model, running on a workstation built during the Obama administration, can sit on the same leaderboard as systems calling GPT-5-mini and Gemini-3-pro, with nothing between its answers and the user but local disk and a pair of consumer GPUs.

That isn't because the 27B is secretly competitive with trillion-parameter systems. It isn't. It's because the AI Model was never supposed to be where the memory happens. The entire industry is built around this assumption, but when the retrieval pipeline does the work the pipeline should do, the weight class of the AI Model matters less than the entire industry assumed.

These assumptions are wrong. Memory is an engineering problem, and we have the numbers to prove it.

Commercial Availability

Echo Locate is available to design partners in regulated industries (legal, healthcare, defense, finance) and other organizations with strict data sovereignty requirements. The system ships as a deployable appliance on customer-controlled infrastructure with no cloud dependencies.

For design partner inquiries, technical questions, or commercial discussions: echolocate@fastmail.com

Acknowledgements

LongMemEval-S benchmark by Wu et al. (2024). Qwen models by Alibaba DAMO Academy. llama.cpp by Georgi Gerganov and contributors. Reference comparison data from Mastra OM, Hindsight, Supermemory, Zep, EmergenceMem, and the original LongMemEval paper.

Echo Locate is a self-hosted, fully air-gapped AI memory system built for organizations that can't send their data to the cloud. Full technical methodology, hardware stack, and evaluation results at echolocate.ai.

PUBLISHED: April 25, 2026 · REVISED: May 1, 2026
ECHO_LOCATE · ZERO_CLOUD_DEPENDENCY · AIR_GAPPED · RETURN_TO_MAIN