CORTEX: Neural Memory for AI Agents

00 Status, Up Front

I am putting this section first because I want to be honest about what you are reading. CORTEX is a research project I run by myself, on my own hardware, in my own time. It is not a product. There is no API you can call, no service you can sign up for, and no claim that I have solved memory for AI agents.

What I do have is a working system, a real benchmark with real numbers, and a clear-eyed list of what is and is not validated. The benchmark results are positive on one important axis. They do not mean CORTEX is ready to drop into a production agent. I list the limitations in section 05. Please read them before quoting the numbers elsewhere.

I am publishing this case study because the work is real, the methodology is honest, and the result is a useful data point in the wider conversation about how AI agents should remember things. If you are looking for a turnkey memory layer for your agent, this is not it. If you want to talk about the approach, I would love to.

01 The Problem I Was Trying To Solve

AI agents that run locally on Ollama forget everything between turns unless you paste the history back in. If you do paste the history back in, the prompt grows and the model slows down. If you do not, the agent has no idea what you talked about five minutes ago.

The standard fix is RAG, usually some flavour of K nearest neighbours over embeddings of past messages. It works, but it is blunt. It retrieves the top-k most similar past messages whether or not they are actually relevant to the current question. You end up injecting noise into the prompt and the agent gets confused.
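To make the baseline concrete, here is a minimal sketch of top-k retrieval by cosine similarity. It is illustrative only: the function names, the stored texts, and the toy 3-dim vectors are assumptions standing in for real sentence embeddings, and nothing here is taken from the CORTEX source.

```python
import math
from typing import List, Sequence, Tuple

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query: Sequence[float],
          memory: List[Tuple[str, Sequence[float]]],
          k: int) -> List[Tuple[float, str]]:
    """Return the k most similar (score, text) pairs, best first."""
    scored = [(cosine(query, vec), text) for text, vec in memory]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Hypothetical memory store: (text, embedding) pairs.
memory = [
    ("we chose Postgres for the project", [1.0, 0.0, 0.0]),
    ("the deploy script lives in tools/", [0.0, 1.0, 0.0]),
    ("lunch was good", [0.0, 0.0, 1.0]),
]
hits = top_k([0.9, 0.1, 0.0], memory, k=2)
for score, text in hits:
    print(f"{score:.2f}  {text}")
```

The second hit is returned even though its similarity is low, because top-k always fills its quota. That is the bluntness described above: the retrieval step has no way to say "nothing here is relevant".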

I wanted something more careful. A memory layer that knows when retrieval would help and when it would just clutter the context window. A small controller, sitting between the agent and the memory store, that decides for each turn: retrieve or not, and if so, what to retrieve.

02 The Approach

CORTEX is built in four parts.

1. Sparse history compression. Past conversation history goes through a SparseHistoryCompressor that uses a fast GRU for recent turns and a slow GRU for older turns, then projects to a 1024-dim overcomplete representation with BatchTopK keeping the 48 most active slots. The result is a fixed-size, sparse vector that summarises a long conversation without losing the parts that matter.

2. Fast-weight controller. A StableFastWeightController reads the sparse state and emits eight decision signals, among them valence, arousal, verbosity, certainty, topic shift, whether to retrieve memory, and the context budget to allocate. The fast weights are computed as a weighted outer product of past states with a learned exponential decay. There is a history= parameter that supports differentiable training of the decay rate, which is a small research contribution in its own right.

3. Context assembly. A ContextAssembler takes the controller's decision signals, decides whether to query the memory store, picks the right memory budget, sets temperature and token limits, and writes the final system prompt. It is the bit that actually talks to the LLM.

4. End-to-end agent loop. A CORTEXAgent that wraps the whole thing: take a user message, compress, decide, assemble context, call Ollama, stream the response, store the new turn, repeat. It also includes an anti-loop system that injects a checkpoint message if the model starts calling the same tool ten times in a row.
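Two of the ideas in parts 1 and 2 can be sketched in a few lines of plain Python: keeping only the most active slots of a vector, and building a fast-weight matrix as an exponentially decayed sum of outer products. This is a simplified sketch under stated assumptions, not the CORTEX implementation. It uses per-vector top-k rather than a batch-level selection, a fixed decay rather than a learned one, and the outer product of each state with itself. All function names are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]

def top_k_sparsify(x: Vector, k: int) -> Vector:
    """Keep the k largest activations and zero the rest."""
    if k >= len(x):
        return list(x)
    order = sorted(range(len(x)), key=lambda i: x[i], reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(x)]

def fast_weight_matrix(states: List[Vector], decay: float) -> Matrix:
    """W = sum over t of decay**(T-1-t) * outer(s_t, s_t), states given oldest first."""
    dim = len(states[0])
    w = [[0.0] * dim for _ in range(dim)]
    for s in states:
        for i in range(dim):
            for j in range(dim):
                # Fade what is already stored, then write the new state.
                w[i][j] = decay * w[i][j] + s[i] * s[j]
    return w

def read_out(w: Matrix, query: Vector) -> Vector:
    """Matrix-vector product: how strongly the stored states respond to a query."""
    return [sum(w_ij * q for w_ij, q in zip(row, query)) for row in w]

sparse = top_k_sparsify([0.1, 0.9, 0.3, 0.7], k=2)        # keeps 0.9 and 0.7
w = fast_weight_matrix([[1.0, 0.0], [0.0, 1.0]], decay=0.5)
response = read_out(w, [1.0, 1.0])                         # older state counts half
```

With a decay of 0.5 the older state contributes half as much as the newer one, which is the sense in which the memory is "fast": recent states dominate and old ones fade without any gradient step.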

The full pipeline, from a user message to a streamed response, is about 250 lines of well-commented Python in the src/cortex/agent/ directory. The data labelling and training code is another 800 lines, mostly the kimi-k2 integration and the ChatML formatters.
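The anti-loop check in part 4 is simple enough to sketch. The version below is an assumption about how such a check could work, not the code in src/cortex/agent/: it compares tool names only (not arguments), and the checkpoint wording and function names are hypothetical.

```python
from typing import List, Optional

# Hypothetical wording for the injected checkpoint message.
CHECKPOINT = "You have called the same tool repeatedly. Stop and reassess."

def trailing_repeats(tool_calls: List[str]) -> int:
    """Length of the run of identical tool names at the end of the list."""
    if not tool_calls:
        return 0
    last = tool_calls[-1]
    run = 0
    for name in reversed(tool_calls):
        if name != last:
            break
        run += 1
    return run

def checkpoint_if_looping(tool_calls: List[str], limit: int = 10) -> Optional[str]:
    """Return a checkpoint message once the same tool has run `limit` times in a row."""
    return CHECKPOINT if trailing_repeats(tool_calls) >= limit else None

print(checkpoint_if_looping(["search"] * 9))    # below the limit
print(checkpoint_if_looping(["search"] * 10))   # at the limit
```

Counting only the trailing run means a single different tool call resets the counter, so an agent that alternates between two tools would not trip this check. Whether that is the right behaviour depends on the failure mode being guarded against.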

03 The Benchmark

I benchmarked three configurations on 30 real conversations, scored by kimi-k2 acting as a judge (response quality, 1 to 10). All three use the same MiniLM embeddings, the same Ollama model, the same conversations. The only difference is the memory layer.

No memory (baseline): the agent sees only the current turn. 7.700 ± 2.704.
KNN-RAG: the agent retrieves the top-5 most similar past messages and injects them. 7.100 ± 1.921.
CORTEX: the agent uses the controller to decide whether to retrieve, and what to retrieve. 7.700 ± 1.735.

On mean score, CORTEX ties no-memory (7.700 each) and beats KNN-RAG by about 8.5 percent (7.700 vs 7.100). My reading is that it matches no-memory where retrieval is not needed and avoids the irrelevant context that KNN-RAG would have injected. The spread (the ± figure) is also lower: 1.735 for CORTEX against 1.921 for KNN-RAG and 2.704 for no-memory, which suggests the controller is making more consistent decisions than the blunt top-k retrieval.
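The 8.5 percent figure is the relative difference in mean judge score, which can be checked directly from the three means reported above. The variable names are just for illustration.

```python
# Mean judge scores from the three configurations above.
no_memory, knn_rag, cortex = 7.700, 7.100, 7.700

gain_vs_knn = (cortex - knn_rag) / knn_rag * 100
gain_vs_none = (cortex - no_memory) / no_memory * 100

print(f"CORTEX vs KNN-RAG:   {gain_vs_knn:+.1f}%")
print(f"CORTEX vs no memory: {gain_vs_none:+.1f}%")
```

In absolute terms the gap to KNN-RAG is 0.6 points on a 10-point scale.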

The interpretation: KNN-RAG retrieves indiscriminately. CORTEX retrieves when the controller thinks it will help, and does not when it will not. The win is not in retrieving better memories. The win is in knowing when not to retrieve at all.
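The "knowing when not to retrieve" idea reduces to a gate in front of the memory store. Here is a minimal sketch, assuming the controller's retrieve signal is a number in [0, 1] and the budget is a count of memories. The Decision class, the 0.5 threshold, and the fake search function are all hypothetical and only there to make the example runnable.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Decision:
    retrieve: float  # assumed: retrieve signal squashed into [0, 1]
    budget: int      # assumed: maximum number of memories to inject

def gated_retrieve(decision: Decision,
                   search: Callable[[str, int], List[str]],
                   query: str,
                   threshold: float = 0.5) -> List[str]:
    """Query the store only when the retrieve signal clears the threshold."""
    if decision.retrieve < threshold or decision.budget <= 0:
        return []  # skip the store entirely; nothing is injected
    return search(query, decision.budget)

calls: List[str] = []

def fake_search(query: str, k: int) -> List[str]:
    """Stand-in for a real memory store; records that it was called."""
    calls.append(query)
    return [f"memory {i} for {query!r}" for i in range(k)]

skipped = gated_retrieve(Decision(retrieve=0.2, budget=5), fake_search, "what time is it?")
fetched = gated_retrieve(Decision(retrieve=0.9, budget=2), fake_search, "which database did we pick?")
print(len(skipped), len(fetched), len(calls))
```

The point of the gate is that the low-signal turn never touches the store, so there is nothing irrelevant to inject. A plain top-k baseline has no equivalent of the first branch.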

Seventy tests pass and one is xfailed (the cascade network's XOR test, see section 05). The training data is committed to the repo: conversations_real.jsonl (36 MB) and conversations_real_signals.jsonl (37 MB) are the labelled conversations, training_chat.jsonl (110 MB) is the larger unlabelled set, and the trained model checkpoint is cortex_boot.pt (13.9 MB).

04 What I Learned

The single most useful result from the benchmark is that retrieving more context can hurt. The KNN-RAG baseline scored lower than no memory at all (7.100 vs 7.700), which suggests the noise from bad retrievals outweighed whatever signal the retrieved history carried. That is a useful thing to know regardless of whether CORTEX itself is the right answer.

The second thing I learned is that labelling matters more than architecture. The first version of the controller used hand-derived heuristic labels (high verbosity, low certainty, and so on) and the trained controller was noisy. Switching to kimi-k2 as the labeller, asking it for structured decision signals per turn, and training on those gave a much cleaner controller. The lesson: when you do not have ground truth, get the best labelling model you can and treat the labels as soft targets.
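One way to read "soft targets" for a yes/no signal such as retrieve is a cross-entropy loss whose target is a probability rather than a hard 0 or 1. The sketch below shows that idea in isolation. It is an assumption about what soft-target training can look like, not a description of the loss CORTEX uses, and the function name is hypothetical.

```python
import math

def soft_bce(pred: float, target: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy where the target may be any value in [0, 1]."""
    p = min(max(pred, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

# Suppose the labeller says "retrieve" with confidence 0.8.
target = 0.8
for pred in (0.5, 0.8, 0.99):
    print(f"pred={pred:.2f}  loss={soft_bce(pred, target):.3f}")
```

With a soft target of 0.8 the loss is lowest when the prediction is also 0.8, so the controller is not pushed towards overconfident outputs on labels the labeller itself was unsure about.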

The third thing is that sparse overcomplete representations work well for memory. The 48 active slots in a 1024-dim vector are not a magic number, but the pattern (compress, sparsify, then learn from the sparse state) is much more sample-efficient than feeding dense embeddings straight into a controller. I would start here for any new memory project.

05 What I Have Not Validated

This section is here because I would rather list the gaps myself than have someone else find them. If you are considering CORTEX for something serious, these are the things I cannot tell you yet.

I have not compared against MemGPT, streaming memory, or any other contemporary memory architecture. KNN-RAG and no-memory are the only baselines. CORTEX beats KNN-RAG, which is interesting, but it might still lose to a system I have not tested.
The cascade network's XOR test is xfailed. The growable cascade is for regression over decision targets and it works in that role, but it does not capture XOR-style parity. That is a benchmark limitation, not a workflow failure, but it is a real gap in the validation.
I have not measured end-to-end agent quality on long, real conversations. The benchmark uses 30 conversations of moderate length. I have not run CORTEX through a 200-turn test. Behaviour at long horizons is unvalidated.
I have not scaled the labelled dataset beyond the current size. Generating more conversations via kimi-k2 is slow (about 40 minutes for 30). Scaling requires either batching, a faster model, or human labelling. None of those are in place.
The system is not tested under adversarial inputs. A user deliberately trying to make the controller misfire is unvalidated. I would expect it to fail in interesting ways and I have not stress-tested it.

The honest summary: CORTEX is a working research prototype with one promising benchmark result and a clear set of open questions. It is not a production system. I do not sell it, license it, or recommend anyone deploy it without significant additional work.

06 What Is Next

The next round of work has three parts.

Long-context validation. Run CORTEX, KNN-RAG, and no-memory on 100+ turn conversations and measure how retrieval quality and response quality change as the conversation grows. This is the test that matters most for real agents.
Adversarial probing. Build a small set of test inputs designed to fool the controller (deliberately misleading memory entries, contradictory context, off-topic chaff) and see how often CORTEX retrieves the wrong thing.
Faster labelling. Replace the single-conversation kimi-k2 calls with batched generation so I can build a labelled set of 1,000+ conversations. The current 30 is a starting point, not a conclusion.
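For a sense of scale on the labelling target, the rate quoted in section 05 (about 40 minutes for 30 conversations) can be extrapolated to 1,000 conversations. This assumes the time scales linearly and that calls stay sequential, which is a simplification.

```python
minutes_per_run, conversations_per_run = 40, 30  # rate quoted in section 05
target = 1000

minutes_each = minutes_per_run / conversations_per_run
hours_needed = target * minutes_each / 60

print(f"{minutes_each:.2f} min per conversation, about {hours_needed:.1f} h for {target}")
```

That is roughly a day of continuous generation at the current rate, which is why batching comes before any attempt to grow the dataset.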

None of those are trivial. The long-context test in particular requires a real workload and a real agent to run against, not a synthetic benchmark. I am open to collaboration there.

07 Why I Built It

Because I run a lot of agents locally and I got tired of watching them forget things that mattered. The first version was a quick script. The benchmark came after, because I wanted to know if the controller was actually doing anything useful or if I was just feeling productive.

The honest answer from the benchmark is: yes, the controller is doing something, but only on the tasks where it is supposed to. On the tasks where retrieval is the wrong call, the controller correctly does not retrieve. That is the result I am most pleased with, because it is the result that says CORTEX knows the limits of its own knowledge.

If you are working on agent memory, sparse compression, fast-weight controllers, or evaluating RAG alternatives, I would like to talk. The code is not currently published as a public repo, but it is real, and the numbers in section 03 can be reproduced from the data files committed to the private repo.