From a billion tokens to the few that count: how Querio tames context so your AI stops hallucinating
Aug 8, 2025
TL;DR
Big context windows ≠ better answers. Querio tags each table/column, pulls only what a query needs with RAG, then prunes the rest—so the LLM sees just the vital tokens and stops hallucinating.
Context is everything. We need context for every single thing we do in our lives; our personality, opinions, and actions are dictated by the context we have. That can be dangerous, because we can form opinions based on lies and then act on those opinions. In the wrong context, with the wrong people, we can pick up bad habits and a bad personality. It may sound like I'm talking nonsense, but LLMs try to mimic our brain and how we think (LLMs can't actually think, btw, but that's a topic for another blog post), so they have the same problems, and that's what we're going to discuss today.
Today's blog is going to be a bit boring because my boss said my last one sucked, it wasn't technical enough, so be prepared. The focus will be on what we do at Querio, but it can serve as food for thought for building AI agents that have nothing to do with data analysis. So let's get started.
The billion-dollar problem
I'm not joking, this is a billion-dollar problem. Entire companies are built around it, and it's still not a solved problem. Every robust AI agent implements its context handling differently, because of one simple constraint: context is limited.
The physical context limit
Every LLM has a fixed context length, i.e., how much data we can send to it without it imploding. I remember when GPT-3 came out with its amazing 2048-token limit; considering each token is around 3 characters, GPT-3 could understand roughly 6,000 characters. That's not much, but it was a start. Every AI company understood that the LLM architecture (aka transformers) was good but the context limit was a problem, so they set out to solve it. I'm not going to go through the whole history, just know that Google, Microsoft, OpenAI, and Anthropic entered the race pretty quickly: the race to the biggest context length.
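If you want to check that token math yourself, here's a quick sketch using OpenAI's tiktoken library (my choice of tokenizer here, not something this post prescribes); r50k_base is the encoding the original GPT-3 models used:

```python
# Rough characters-per-token math with tiktoken. The sample text is
# arbitrary; swap in anything you like.
import tiktoken

enc = tiktoken.get_encoding("r50k_base")  # encoding used by the original GPT-3

text = "Context is everything, we need context for every single thing." * 50
tokens = enc.encode(text)

print(f"{len(text)} characters -> {len(tokens)} tokens")
print(f"~{len(text) / len(tokens):.1f} characters per token")
```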
This race is nowhere near over; in 2023 we already had papers talking about a 1-billion-token context length. That's 1,000,000,000 tokens, or ~500,000x more than the original GPT-3, but it's not yet good enough to ship in a commercial model. Currently, models usually cap out at 1 million tokens, which is pretty good but nowhere close to 1 billion. The top models from the biggest AI companies, including the ones we use at Querio, have a 1M limit. That means we could dump the whole schema of basically any database into the LLM if we wanted to, so why do we spend so much time on this topic?
The practical context limit
Every LLM has ADHD, and it's not something we can change. The LLM's attention span is very limited, which means that if we send it too much data it's going to forget information and, in the worst case, hallucinate. That's why context is capped at around 1 million tokens: the current architecture just can't deliver good results with very large context windows. All the biggest AI companies (Meta, Google, Microsoft, DeepSeek, Anthropic, the list goes on) are aware of this, and when a new model is released, the company usually publishes how well it performs at specific context lengths. This is typically tested by dumping a massive amount of text, like the whole Harry Potter franchise (yes, that's a real way to test this), and asking the LLM to find and report where some piece of information is located. Every high-end LLM currently performs very well on this test, but retrieving data is different from actually using it.
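For the curious, the shape of that retrieval test looks roughly like this. A bare-bones sketch with the OpenAI Python SDK; the model name and filler text are placeholders, and the labs obviously run far more elaborate versions:

```python
# Needle-in-a-haystack probe: bury one fact in a wall of filler and ask
# the model to find it.
from openai import OpenAI

client = OpenAI()

needle = "The secret launch code is 7-4-1-9."
filler = "The rain in Spain stays mainly in the plain. " * 5_000  # scale up to stress longer contexts
haystack = filler[: len(filler) // 2] + needle + filler[len(filler) // 2 :]

response = client.chat.completions.create(
    model="gpt-4.1",  # placeholder: any long-context model
    messages=[{"role": "user", "content": haystack + "\n\nWhat is the secret launch code?"}],
)
print(response.choices[0].message.content)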
Basically everyone who has tried to build an AI agent that requires too many tokens knows that the agent gets worse exponentially. It starts to hallucinate, use the wrong tools, and make up data. So even though the LLM can accept 1M tokens, one of the hardest parts of building an agent is sending it the fewest tokens possible. This is usually done with semantic search, but there are better ways, which we'll discuss later.
Giving the LLM the right context
This is where things become Querio-specific; you won't be able to do a carbon copy and apply everything to every agent. There's no “one size fits all” solution. So let's get started.
The right context
What is the right context? For us, the focus is the database structure. We're lucky because the data in a database is already structured (we don't support NoSQL databases), so we just need to ask: what does the LLM need to know about that structure? I can think of three main things: the tables, the columns of those tables, and the relations between them. The first two are easy to get; the third is much harder, but those are the pieces of information we need from the schema. But what if a column or table doesn't have a straightforward name and the agent doesn't know how to use it? That's where descriptions come in handy.
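Before we get to descriptions, here's roughly what pulling that structure out of Postgres can look like. This isn't Querio's actual introspection code, just the shape of it, sketched with psycopg and information_schema:

```python
# Pull the three things the agent needs: tables, their columns, and the
# foreign keys that relate them.
import psycopg

SCHEMA_SQL = """
SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'public'
ORDER BY table_name, ordinal_position;
"""

FOREIGN_KEY_SQL = """
SELECT
    tc.table_name,
    kcu.column_name,
    ccu.table_name  AS foreign_table,
    ccu.column_name AS foreign_column
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
  ON tc.constraint_name = kcu.constraint_name
JOIN information_schema.constraint_column_usage ccu
  ON tc.constraint_name = ccu.constraint_name
WHERE tc.constraint_type = 'FOREIGN KEY';
"""

with psycopg.connect("postgresql://localhost/mydb") as conn:  # placeholder DSN
    columns = conn.execute(SCHEMA_SQL).fetchall()       # (table, column, type)
    foreign_keys = conn.execute(FOREIGN_KEY_SQL).fetchall()  # the relations
```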
If we save a small description for every table and column, we dramatically improve the agent's performance. But imagine this scenario (a real one, btw): you connect to the database, get the schema, and the database has more than 200 tables, most of them with more than 10 columns. That's a massive amount of data to send to the agent. So we use a technique called RAG. I'm not going to explain what RAG is because this blog is already massive, but I'll explain what we do.
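To make that concrete, here's a sketch of the kind of metadata you'd keep per table and how it might be rendered into text for the agent. The field names are illustrative, not Querio's real schema:

```python
# Per-table metadata plus a helper that renders it as the text block the
# agent would actually see.
from dataclasses import dataclass, field


@dataclass
class ColumnInfo:
    name: str
    data_type: str
    description: str  # short, human-written or LLM-generated summary


@dataclass
class TableInfo:
    name: str
    description: str
    columns: list[ColumnInfo] = field(default_factory=list)
    relations: list[str] = field(default_factory=list)  # e.g. "orders.user_id -> users.id"


def to_context(table: TableInfo) -> str:
    """Render one table as a compact text block for the LLM."""
    cols = "\n".join(f"  - {c.name} ({c.data_type}): {c.description}" for c in table.columns)
    rels = "\n".join(f"  - {r}" for r in table.relations) or "  (none)"
    return f"Table {table.name}: {table.description}\nColumns:\n{cols}\nRelations:\n{rels}"
```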
Semantic search and the problems with it
Imagine the user asks “Users registration in the last 30 days”. You just need the users table, not the other X tables, so we can use something called vector search: we create embeddings of the question and of all the descriptions, then use cosine similarity to find the embeddings closest to the question. Again, I'm not going to explain it in detail; there are thousands of papers explaining what semantic search/RAG is.
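In code, that search can look something like this. A minimal sketch; the embedding model is an assumption on my part, not necessarily what Querio runs:

```python
# Embed the question and every table description, then rank by cosine
# similarity.
import numpy as np
from openai import OpenAI

client = OpenAI()


def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in resp.data])


question = "Users registration in the last 30 days"
descriptions = {  # table name -> description, built from the metadata above
    "users": "One row per registered user, with signup timestamps.",
    "orders": "Purchases placed by users.",
    "events": "Raw product analytics events.",
}

q_vec = embed([question])[0]
d_vecs = embed(list(descriptions.values()))

# Cosine similarity = dot product of L2-normalized vectors.
q_vec /= np.linalg.norm(q_vec)
d_vecs /= np.linalg.norm(d_vecs, axis=1, keepdims=True)
scores = d_vecs @ q_vec

for table, score in sorted(zip(descriptions, scores), key=lambda x: -x[1]):
    print(f"{table}: {score:.2f}")
```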
This will almost certainly (almost) return the users table with high confidence, say 0.8, but it will also bring back a lot of other tables with high confidence. With semantic search it's very hard to get exactly what you want. We usually set a threshold, but what if the right table isn't picked up by that threshold? Then we need a looser one, which pulls in even more data, and the agent gets exponentially worse.
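Continuing the sketch above, the threshold trap looks like this: a hard cutoff either misses the right table or, once loosened, drags half the schema back in:

```python
# Threshold selection over the `descriptions`/`scores` from the previous
# sketch. 0.5 is illustrative, not a recommendation.
THRESHOLD = 0.5

selected = [t for t, s in zip(descriptions, scores) if s >= THRESHOLD]
if not selected:
    # The right table scored below the cutoff, so the only generic fix is a
    # looser threshold, which is exactly how the context balloons again.
    selected = [t for t, s in zip(descriptions, scores) if s >= THRESHOLD - 0.2]
```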
Agent pruning
We have a step where we use Gemini 2.0 Flash-Lite to analyze all the tables that came back from the semantic search and prune the list down to what the query actually needs.
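A rough sketch of what that pruning pass can look like; the prompt and the parsing are illustrative, not our production code:

```python
# Hand a small, fast model the candidate tables plus the user's question
# and ask it to keep only what's needed, using the google-generativeai SDK.
import google.generativeai as genai

model = genai.GenerativeModel("gemini-2.0-flash-lite")


def prune_tables(question: str, candidates: dict[str, str]) -> list[str]:
    """candidates maps table name -> its rendered description block."""
    listing = "\n\n".join(candidates.values())
    prompt = (
        "You pick database tables for a SQL-writing agent.\n\n"
        f"Question: {question}\n\n"
        f"Candidate tables:\n{listing}\n\n"
        "Return only the names of the tables strictly needed to answer the "
        "question, one per line, and nothing else."
    )
    response = model.generate_content(prompt)
    return [line.strip() for line in response.text.splitlines() if line.strip()]
```

A small, cheap model keeps this step fast, and the bigger model downstream only ever sees the survivors.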