From a billion tokens to the few that count: how Querio tames context so your AI stops hallucinating
Aug 8, 2025
TL;DR
Big context windows ≠ better answers. Querio tags each table/column, pulls only what a query needs with RAG, then prunes the rest—so the LLM sees just the vital tokens and stops hallucinating.
Context is everything. We need context for every single thing we do in our lives; our personality, opinions, and actions are dictated by the context we have. That can be dangerous, because we can form opinions based on lies and then act on those opinions. In the wrong context, with the wrong people, we can pick up bad habits and a bad personality. It may sound like I'm talking nonsense, but LLMs try to mimic our brain and how we think (LLMs can't actually think, btw, but that's a topic for another blog post), so they have the same problems, and that's what we're going to discuss today.
Today's blog is going to be a bit boring because my boss said my last one sucked, it wasn't technical enough, so be prepared. The focus will be on what we do at Querio, but it can serve as food for thought for building AI agents that have nothing to do with data analysis. So let's get started.
The billion-dollar problem
I'm not joking, this is a billion-dollar problem. Entire companies are built around it, and it's still not a solved problem. Every robust AI agent implements its context handling differently, because of one simple constraint: context is limited.
The physical context limit
Every LLM has a fixed context length, i.e., how much data we can send to it without it imploding. I remember when GPT-3 came out with its amazing 2048-token limit; considering each token is around 3 characters, GPT-3 could understand roughly 6,000 characters. That's not much, but it was a start. Every AI company understood that the LLM architecture (aka transformers) was good but the context limit was a problem, so they set out to solve it. I'm not going to go through the whole history, just know that Google, Microsoft, OpenAI, and Anthropic entered the race pretty quickly: the race to the biggest context length.
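If you want to check that token math yourself, here's a quick sketch using OpenAI's tiktoken library (my choice of tokenizer here, not something this post prescribes); r50k_base is the encoding the original GPT-3 models used:

```python
# Rough characters-per-token math with tiktoken. The sample text is
# arbitrary; swap in anything you like.
import tiktoken

enc = tiktoken.get_encoding("r50k_base")  # encoding used by the original GPT-3

text = "Context is everything, we need context for every single thing." * 50
tokens = enc.encode(text)

print(f"{len(text)} characters -> {len(tokens)} tokens")
print(f"~{len(text) / len(tokens):.1f} characters per token")
```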
This race is nowhere near over; in 2023 we already had papers talking about a 1-billion-token context length. That's 1,000,000,000 tokens, or ~500,000x more than the original GPT-3, but it's not yet good enough to ship in a commercial model. Currently, models usually cap out at 1 million tokens, which is pretty good but nowhere close to 1 billion. The top models from the biggest AI companies, including the ones we use at Querio, have a 1M limit. That means we could dump the whole schema of basically any database into the LLM if we wanted to, so why do we spend so much time on this topic?
The practical context limit
Every LLM has ADHD, and it's not something we can change. The LLM's attention span is very limited, which means that if we send it too much data it's going to forget information and, in the worst case, hallucinate. That's why context is capped at around 1 million tokens: the current architecture just can't deliver good results with very large context windows. All the biggest AI companies (Meta, Google, Microsoft, DeepSeek, Anthropic, the list goes on) are aware of this, and when a new model is released, the company usually publishes how well it performs at specific context lengths. This is typically tested by dumping a massive amount of text, like the whole Harry Potter franchise (yes, that's a real way to test this), and asking the LLM to find and report where some piece of information is located. Every high-end LLM currently performs very well on this test, but retrieving data is different from actually using it.
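For the curious, the shape of that retrieval test looks roughly like this. A bare-bones sketch with the OpenAI Python SDK; the model name and filler text are placeholders, and the labs obviously run far more elaborate versions:

```python
# Needle-in-a-haystack probe: bury one fact in a wall of filler and ask
# the model to find it.
from openai import OpenAI

client = OpenAI()

needle = "The secret launch code is 7-4-1-9."
filler = "The rain in Spain stays mainly in the plain. " * 5_000  # scale up to stress longer contexts
haystack = filler[: len(filler) // 2] + needle + filler[len(filler) // 2 :]

response = client.chat.completions.create(
    model="gpt-4.1",  # placeholder: any long-context model
    messages=[{"role": "user", "content": haystack + "\n\nWhat is the secret launch code?"}],
)
print(response.choices[0].message.content)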
Basically everyone who has tried to build an AI agent that requires too many tokens knows that the agent gets worse exponentially. It starts to hallucinate, use the wrong tools, and make up data. So even though the LLM can accept 1M tokens, one of the hardest parts of building an agent is sending it the fewest tokens possible. This is usually done with semantic search, but there are better ways, which we'll discuss later.
Giving the LLM the right context
This is where things become Querio-specific; you won't be able to do a carbon copy and apply everything to every agent. There's no “one size fits all” solution. So let's get started.
The right context
What is the right context? For us, the focus is the database structure. We're lucky because the data in a database is already structured (we don't support NoSQL databases), so we just need to ask: what does the LLM need to know about that structure? I can think of three main things: the tables, the columns of those tables, and the relations between them. The first two are easy to get; the third is much harder, but those are the pieces of information we need from the schema. But what if a column or table doesn't have a straightforward name and the agent doesn't know how to use it? That's where descriptions come in handy.
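Before we get to descriptions, here's roughly what pulling that structure out of Postgres can look like. This isn't Querio's actual introspection code, just the shape of it, sketched with psycopg and information_schema:

```python
# Pull the three things the agent needs: tables, their columns, and the
# foreign keys that relate them.
import psycopg

SCHEMA_SQL = """
SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'public'
ORDER BY table_name, ordinal_position;
"""

FOREIGN_KEY_SQL = """
SELECT
    tc.table_name,
    kcu.column_name,
    ccu.table_name  AS foreign_table,
    ccu.column_name AS foreign_column
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
  ON tc.constraint_name = kcu.constraint_name
JOIN information_schema.constraint_column_usage ccu
  ON tc.constraint_name = ccu.constraint_name
WHERE tc.constraint_type = 'FOREIGN KEY';
"""

with psycopg.connect("postgresql://localhost/mydb") as conn:  # placeholder DSN
    columns = conn.execute(SCHEMA_SQL).fetchall()       # (table, column, type)
    foreign_keys = conn.execute(FOREIGN_KEY_SQL).fetchall()  # the relations
```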
If we save a small description for every table and column, we dramatically improve the agent's performance. But imagine this scenario (a real one, btw): you connect to the database, get the schema, and the database has more than 200 tables, most of them with more than 10 columns. That's a massive amount of data to send to the agent. So we use a technique called RAG. I'm not going to explain what RAG is because this blog is already massive, but I'll explain what we do.
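To make that concrete, here's a sketch of the kind of metadata you'd keep per table and how it might be rendered into text for the agent. The field names are illustrative, not Querio's real schema:

```python
# Per-table metadata plus a helper that renders it as the text block the
# agent would actually see.
from dataclasses import dataclass, field


@dataclass
class ColumnInfo:
    name: str
    data_type: str
    description: str  # short, human-written or LLM-generated summary


@dataclass
class TableInfo:
    name: str
    description: str
    columns: list[ColumnInfo] = field(default_factory=list)
    relations: list[str] = field(default_factory=list)  # e.g. "orders.user_id -> users.id"


def to_context(table: TableInfo) -> str:
    """Render one table as a compact text block for the LLM."""
    cols = "\n".join(f"  - {c.name} ({c.data_type}): {c.description}" for c in table.columns)
    rels = "\n".join(f"  - {r}" for r in table.relations) or "  (none)"
    return f"Table {table.name}: {table.description}\nColumns:\n{cols}\nRelations:\n{rels}"
```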
Semantic search and the problems with it
Imagine the user asks “Users registration in the last 30 days”. You just need the users table, not the other X tables, so we can use something called vector search: we create embeddings of the question and of all the descriptions, then use cosine similarity to find the embeddings closest to the question. Again, I'm not going to explain it in detail; there are thousands of papers explaining what semantic search/RAG is.
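In code, that search can look something like this. A minimal sketch; the embedding model is an assumption on my part, not necessarily what Querio runs:

```python
# Embed the question and every table description, then rank by cosine
# similarity.
import numpy as np
from openai import OpenAI

client = OpenAI()


def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in resp.data])


question = "Users registration in the last 30 days"
descriptions = {  # table name -> description, built from the metadata above
    "users": "One row per registered user, with signup timestamps.",
    "orders": "Purchases placed by users.",
    "events": "Raw product analytics events.",
}

q_vec = embed([question])[0]
d_vecs = embed(list(descriptions.values()))

# Cosine similarity = dot product of L2-normalized vectors.
q_vec /= np.linalg.norm(q_vec)
d_vecs /= np.linalg.norm(d_vecs, axis=1, keepdims=True)
scores = d_vecs @ q_vec

for table, score in sorted(zip(descriptions, scores), key=lambda x: -x[1]):
    print(f"{table}: {score:.2f}")
```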
This will almost certainly (almost) return the users table with high confidence, say 0.8, but it will also bring back a lot of other tables with high confidence. With semantic search it's very hard to get exactly what you want. We usually set a threshold, but what if the right table isn't picked up by that threshold? Then we need a looser one, which pulls in even more data, and the agent gets exponentially worse.
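Continuing the sketch above, the threshold trap looks like this: a hard cutoff either misses the right table or, once loosened, drags half the schema back in:

```python
# Threshold selection over the `descriptions`/`scores` from the previous
# sketch. 0.5 is illustrative, not a recommendation.
THRESHOLD = 0.5

selected = [t for t, s in zip(descriptions, scores) if s >= THRESHOLD]
if not selected:
    # The right table scored below the cutoff, so the only generic fix is a
    # looser threshold, which is exactly how the context balloons again.
    selected = [t for t, s in zip(descriptions, scores) if s >= THRESHOLD - 0.2]
```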
Agent pruning
We have a step where we use Gemini 2.0 Flash-Lite to analyze all the tables that came back from the semantic search and prune the list down to what the query actually needs.
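A rough sketch of what that pruning pass can look like; the prompt and the parsing are illustrative, not our production code:

```python
# Hand a small, fast model the candidate tables plus the user's question
# and ask it to keep only what's needed, using the google-generativeai SDK.
import google.generativeai as genai

model = genai.GenerativeModel("gemini-2.0-flash-lite")


def prune_tables(question: str, candidates: dict[str, str]) -> list[str]:
    """candidates maps table name -> its rendered description block."""
    listing = "\n\n".join(candidates.values())
    prompt = (
        "You pick database tables for a SQL-writing agent.\n\n"
        f"Question: {question}\n\n"
        f"Candidate tables:\n{listing}\n\n"
        "Return only the names of the tables strictly needed to answer the "
        "question, one per line, and nothing else."
    )
    response = model.generate_content(prompt)
    return [line.strip() for line in response.text.splitlines() if line.strip()]
```

A small, cheap model keeps this step fast, and the bigger model downstream only ever sees the survivors.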