How to Evaluate an AEO Agency: The Northquery Framework
The only publicly available scoring framework for evaluating AEO agencies. 7 criteria, weighted by actual impact on AI citation outcomes. Built by an NLP researcher, not a marketer.
By Adem Ajvazi, Northquery · Updated April 2026 · 4,100 words
Most AEO agency reviews have no methodology behind them. Someone picks a handful of agencies they know, writes 200 words each, and calls it a ranking. This page is different. Here is the full scoring framework I built, how it works, and why the weights are what they are.
The short version: The Northquery AEO Agency Scoring Framework scores agencies across 7 criteria totaling 100 points. The highest-weighted criterion is Retrieval Architecture Knowledge (25 points) because AEO is fundamentally a retrieval problem, and agencies that do not understand how LLMs retrieve content cannot fix it. The full framework, scoring rubric, and example evaluation are below.
Why Most AEO Agency Reviews Are Worthless
I have read dozens of “best AEO agency” lists. Almost none of them explain how they arrived at their conclusions. They list six or eight agencies, add a paragraph about each one’s “proven track record,” and move on. That is not a ranking. That is a sponsored post without the disclosure.
The problem goes deeper than transparency. Even reviews written in good faith tend to measure the wrong things. They reward brand recognition, website design, and content volume. None of those things have any direct relationship with whether an agency can get your content cited by ChatGPT or Perplexity.
AEO is a technical discipline. It sits at the intersection of information retrieval, natural language processing, and structured data engineering. Evaluating AEO agencies using marketing criteria is like evaluating surgeons by how good their waiting room looks.
I built this framework because I needed to think clearly about what actually predicts citation outcomes. It is grounded in how language models retrieve and generate content, not in who has the nicest case study deck.
Why I Built This Framework
My background is in NLP. I hold an MSc from the University of Copenhagen with a specialization in natural language processing, and I have published research in the ACL Anthology, the main venue for computational linguistics work. Before starting Northquery, I spent years doing technical SEO, content systems, and semantic architecture across multiple sites.
When AI Overviews launched, then expanded, then became the default experience for hundreds of millions of searches, I started paying close attention to which content was being cited and why. The patterns are not random. There are structural, technical, and linguistic properties that predict whether a piece of content gets pulled into a retrieval window or gets ignored entirely.
Most AEO agencies were built by SEO practitioners who rebranded their content services. A few were built by people who genuinely understand retrieval mechanics. The difference in output quality between those two groups is enormous. I needed a framework to tell them apart. This is it.
The 7 Criteria and Their Weights
The framework scores agencies out of 100 points across 7 criteria. The weights reflect the relative impact of each criterion on actual citation outcomes, based on my own observation of what separates content that gets cited from content that does not.
| Criterion | Weight | Max Points |
|---|---|---|
| Retrieval Architecture Knowledge | 25% | 25 |
| Technical Schema Implementation | 20% | 20 |
| Proven Citation Track Record | 20% | 20 |
| Measurement and Reporting Infrastructure | 15% | 15 |
| Content Hierarchy for LLM Ingestion | 10% | 10 |
| Vertical Specialization | 5% | 5 |
| Strategist Access and Contract Terms | 5% | 5 |
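Since the weights are absolute point caps rather than multipliers, an agency's total is simple addition. To keep that arithmetic unambiguous, here is a small Python sketch; the criterion names and caps come from the table above, while the function itself and the example scores (which match the hypothetical agency walked through later on this page) are purely illustrative.

```python
# Minimal sketch of the framework arithmetic. The criteria and point caps come
# from the table above; the example scores are the hypothetical agency scored
# later on this page. Purely illustrative, not an official scoring tool.

MAX_POINTS = {
    "Retrieval Architecture Knowledge": 25,
    "Technical Schema Implementation": 20,
    "Proven Citation Track Record": 20,
    "Measurement and Reporting Infrastructure": 15,
    "Content Hierarchy for LLM Ingestion": 10,
    "Vertical Specialization": 5,
    "Strategist Access and Contract Terms": 5,
}

def total_score(scores: dict) -> int:
    """Sum awarded points after checking each criterion stays within its cap."""
    total = 0
    for criterion, cap in MAX_POINTS.items():
        awarded = scores.get(criterion, 0)
        if not 0 <= awarded <= cap:
            raise ValueError(f"{criterion}: {awarded} is outside 0-{cap}")
        total += awarded
    return total

example_agency = {
    "Retrieval Architecture Knowledge": 14,
    "Technical Schema Implementation": 12,
    "Proven Citation Track Record": 9,
    "Measurement and Reporting Infrastructure": 7,
    "Content Hierarchy for LLM Ingestion": 7,
    "Vertical Specialization": 3,
    "Strategist Access and Contract Terms": 3,
}

print(total_score(example_agency))  # 55
```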
Criterion 1: Retrieval Architecture Knowledge — 25 points
What it measures. Whether the agency understands how large language models retrieve, chunk, embed, and rank documents before generating a response. This includes knowledge of dense retrieval, context window constraints, semantic chunking strategies, and why some content surfaces consistently while structurally similar content does not.
Why it has the highest weight. Because if an agency does not understand retrieval mechanics, every other effort is built on a broken foundation. Writing better content does not help if the content is structured in a way that falls outside the retrieval window. Adding schema markup does not help if the underlying semantic architecture makes the document a poor candidate for citation. Retrieval architecture is the substrate. Everything else sits on top of it.
Most SEO agencies have no background in information retrieval. They know keyword density, E-E-A-T signals, and link acquisition. Those skills are real and valuable in traditional SEO. They are insufficient for AEO. You are looking for someone who can explain dense passage retrieval, knows what a vector embedding is, and understands why answer completeness per token matters.
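To make "chunking strategy" concrete, here is a minimal sketch of heading-aware chunking, the kind of passage segmentation a retrieval pipeline might apply before embedding. It is an illustration under assumptions, not a reproduction of any specific system's chunker: the 200-word budget stands in for whatever token limit actually applies, and real pipelines add overlap, tokenization, and embedding steps this sketch omits.

```python
# Minimal sketch of heading-aware chunking: split a document at headings so
# each chunk starts at a semantic boundary, then cap chunk length. Real
# retrieval pipelines differ in tokenization, chunk size, and overlap; the
# 200-word budget is a stand-in for an actual token limit.
import re

def chunk_by_heading(markdown_text: str, max_words: int = 200) -> list:
    chunks = []
    # Split at markdown-style headings (lines starting with #).
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown_text)
    for section in sections:
        words = section.split()
        if not words:
            continue
        # If a section overruns the budget, break it into fixed-size windows.
        for start in range(0, len(words), max_words):
            chunks.append(" ".join(words[start:start + max_words]))
    return chunks

doc = """## What is answer engine optimization?
Answer engine optimization (AEO) is the practice of structuring content so
that AI systems retrieve and cite it. The direct answer sits up front.

## How is AEO measured?
By citation presence across a set of tracked queries, monitored over time."""

for chunk in chunk_by_heading(doc):
    print(len(chunk.split()), "words |", chunk[:50], "...")
```

The design choice worth noticing is splitting at headings rather than at arbitrary character offsets, because it keeps a question and its direct answer inside the same retrievable unit.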
Green flags: Can explain dense vs sparse retrieval without looking it up. Talks about chunking strategy unprompted. References specific LLM retrieval behaviors (window size, positional bias). Has read or can cite published research on RAG (Retrieval Augmented Generation).
Red flags: Calls AEO “SEO for AI” without further elaboration. Describes their approach purely in content terms with no retrieval layer. Cannot explain why two equally high-quality articles might get different citation rates. Confuses AI Overviews optimization with AEO.
Interview questions:
- Walk me through how you think about semantic chunking when structuring a long-form article for AI retrieval.
- If a client’s content scores well on all traditional SEO metrics but is not getting cited, what is your diagnostic process?
- What is the relationship between context window constraints and the length of your target answer blocks?
Criterion 2: Technical Schema Implementation — 20 points
What it measures. Depth and accuracy of structured data deployment across the agency’s client sites. Article, FAQPage, HowTo, Product, Service, BreadcrumbList, SpeakableSpecification, DefinedTerm, and ItemList schemas deployed correctly, validated, and updated as schema standards evolve.
Why it matters. Schema markup is the most direct signal you can send to both search engines and LLMs about what your content contains and how it is structured. Well-formed FAQPage schema, for example, directly increases the probability that a question-and-answer block gets surfaced in an AI response. SpeakableSpecification signals which content is designed for voice and AI assistant extraction. DefinedTerm helps establish entity identity that LLMs can reliably reference.
A significant number of agencies claim to do structured data but are implementing the same three schema types they learned in 2019. The field has moved. Ask specifically about newer schema types and whether they have a workflow for keeping up with schema.org updates.
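For reference, this is roughly what a minimal FAQPage block with a speakable property looks like in JSON-LD, written here as a Python dict so it can be templated and serialized into a script tag. The question text, answer text, and CSS selectors are placeholders; the @type values and property names are standard schema.org vocabulary.

```python
# Minimal FAQPage JSON-LD with a speakable property, built as a Python dict so
# it can be templated and serialized into a script tag. Question text, answer
# text, and CSS selectors are placeholders.
import json

faq_jsonld = {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [
        {
            "@type": "Question",
            "name": "What is answer engine optimization?",
            "acceptedAnswer": {
                "@type": "Answer",
                "text": "Answer engine optimization (AEO) is the practice of "
                        "structuring content so that AI systems retrieve and cite it.",
            },
        }
    ],
    # speakable flags the parts of the page meant for voice/assistant extraction.
    "speakable": {
        "@type": "SpeakableSpecification",
        "cssSelector": [".direct-answer", ".faq-summary"],
    },
}

print('<script type="application/ld+json">')
print(json.dumps(faq_jsonld, indent=2))
print("</script>")
```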
Green flags: Can name five or more schema types relevant to AEO without prompting. Has a validation workflow (Google Rich Results Test plus manual review). Keeps a changelog of schema deployments. Understands the difference between Article and NewsArticle schema for AI citation contexts.
Red flags: Only mentions FAQ schema when asked about structured data. No validation process. Treats schema as a one-time implementation, not an ongoing system. Cannot explain what SpeakableSpecification does or why it matters for AEO.
Interview questions:
- Which schema types do you prioritize for content that targets AI-generated responses, and why?
- How do you validate schema implementations at scale across a large site? (A minimal bulk-check sketch follows this list.)
- Walk me through how you would structure the schema on a comparison page targeting a high-intent AEO query.
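The bulk-validation question above can be partially automated. The sketch below pulls every JSON-LD block from a list of URLs and flags missing types; the URLs and the expected types are placeholders, and it is a smoke test rather than a replacement for the Rich Results Test or a full validator.

```python
# Smoke-test sketch for schema coverage at scale: fetch each URL, pull every
# JSON-LD block, and report which expected @type values are missing. The URL
# list and expected types are placeholders; this does not replace the Rich
# Results Test (it ignores @graph, Microdata, and RDFa).
import json
import re
import urllib.request

EXPECTED_TYPES = {"Article", "FAQPage", "BreadcrumbList"}
JSONLD_RE = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)

def jsonld_types(url: str) -> set:
    html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
    found = set()
    for block in JSONLD_RE.findall(html):
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # malformed JSON-LD is itself worth logging as a finding
        for item in data if isinstance(data, list) else [data]:
            if not isinstance(item, dict):
                continue
            declared = item.get("@type")
            if declared:
                found.update(declared if isinstance(declared, list) else [declared])
    return found

for url in ["https://example.com/guide", "https://example.com/faq"]:
    missing = EXPECTED_TYPES - jsonld_types(url)
    print(url, "-> missing:", ", ".join(sorted(missing)) or "none")
```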
Criterion 3: Proven Citation Track Record — 20 points
What it measures. Documented, verifiable evidence that the agency has gotten client content cited by ChatGPT, Perplexity, Claude, Google AI Overviews, or other AI systems. Screenshots, citation velocity data, before-and-after query comparisons, and client-consented case studies all count. Vague claims do not.
Why it has the same weight as schema. Because strategy without evidence is a pitch. You are hiring an agency to move a metric, and that metric is citation rate. An agency that cannot show you prior citation outcomes is either too new to have a track record or has outcomes they would rather not show you. Either way, you are taking on all the risk.
Be careful about how you interpret case studies. Organic traffic growth, ranking improvements, and domain authority increases are SEO outcomes. They are not AEO outcomes. Ask specifically: does this content appear in AI-generated responses? When? For what queries? How frequently? Those are the questions that matter.
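When an agency does show citation data, ask for it in a form you can interrogate yourself. Below is one way to operationalize "citation velocity" as a monthly citation rate; the log format, queries, and numbers are invented for illustration and do not come from any client.

```python
# One way to operationalize "citation velocity": the share of checks in which
# the client's domain was cited, per month, computed from a monitoring log.
# The log format and the queries here are invented for illustration.
from collections import defaultdict
from datetime import date

log = [
    # (date checked, tracked query, platform, client domain cited?)
    (date(2026, 1, 6), "best crm for startups", "perplexity", False),
    (date(2026, 1, 6), "best crm for startups", "chatgpt", False),
    (date(2026, 2, 6), "best crm for startups", "perplexity", True),
    (date(2026, 3, 6), "best crm for startups", "chatgpt", True),
    (date(2026, 3, 6), "crm pricing comparison", "perplexity", True),
]

def citation_rate_by_month(records):
    """Fraction of checks per (year, month) in which the client was cited."""
    hits, totals = defaultdict(int), defaultdict(int)
    for day, _query, _platform, cited in records:
        key = (day.year, day.month)
        totals[key] += 1
        hits[key] += int(cited)
    return {key: hits[key] / totals[key] for key in sorted(totals)}

print(citation_rate_by_month(log))
# {(2026, 1): 0.0, (2026, 2): 1.0, (2026, 3): 1.0}
```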
Green flags: Has at least two client-consented case studies showing before-and-after citation presence. Can demonstrate citation monitoring with historical data. Distinguishes between AI Overviews citations and conversational AI citations (different mechanics). Can show citation velocity over time, not just a single screenshot.
Red flags: Only has SEO case studies. Uses “we got our clients cited” without any documentation. Cannot name specific queries where client content is cited. Conflates traffic growth with citation success. Defends the lack of case studies by claiming the field is too new.
Interview questions:
- Can you show me two or three specific examples where a client’s content is currently being cited by a major AI system?
- How do you distinguish between a citation driven by your AEO work versus organic citation that would have happened anyway?
- What was the citation rate change for your strongest client case over the first six months of the engagement?
Criterion 4: Measurement and Reporting Infrastructure — 15 points
What it measures. Whether the agency tracks AI visibility as a primary metric, not a secondary one. This includes their toolstack for monitoring citation presence across AI platforms, their ability to detect when a client’s content is added to or removed from AI responses, and how they report this data in a way that connects to business outcomes.
Why it matters. You cannot optimize what you cannot measure. Most agencies still report exclusively on organic rankings, traffic, and backlinks. Those metrics tell you nothing about AEO performance. An agency that reports on AI visibility specifically, with tools like Semrush AI Toolkit, Profound, AirOps, or their own monitoring setup, is structurally different from one that mentions AI citations anecdotally in monthly calls.
The measurement gap is also a risk signal. Agencies without AI visibility measurement cannot prove causality. If they are not tracking which content gets cited and when, they cannot demonstrate that their interventions made any difference. You are paying for activity, not outcomes.
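Whichever tool an agency uses, the underlying monitoring loop is the same: run tracked prompts, capture the cited URLs, check for the client's domain. Here is a minimal sketch of that loop; fetch_ai_answer is a placeholder for whichever platform API or monitoring tool is actually in use, and the domains are hypothetical.

```python
# Sketch of a cross-platform citation-presence check. fetch_ai_answer() is a
# placeholder for whichever platform API or monitoring tool actually returns
# an answer plus its cited URLs; no specific vendor API is assumed. The dummy
# return value only exists so the sketch runs end to end.
from urllib.parse import urlparse

def fetch_ai_answer(prompt: str, platform: str) -> tuple:
    """Placeholder: swap in the real API or monitoring-tool call here."""
    return ("example answer text", ["https://client-site.example/guide",
                                    "https://somewhere-else.example/post"])

def cites_domain(cited_urls: list, domain: str) -> bool:
    return any(urlparse(u).netloc.endswith(domain) for u in cited_urls)

def citation_presence(prompts: list, platforms: list, domain: str) -> dict:
    results = {}
    for prompt in prompts:
        for platform in platforms:
            _answer, urls = fetch_ai_answer(prompt, platform)
            results[(prompt, platform)] = cites_domain(urls, domain)
    return results

report = citation_presence(
    prompts=["best crm for startups"],
    platforms=["chatgpt", "perplexity", "ai_overviews"],
    domain="client-site.example",
)
print(report)
```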
Green flags: Uses at least one dedicated AI visibility tracking tool. Has a reporting format that separates AI citation metrics from traditional SEO metrics. Monitors citation presence across multiple AI platforms, not just one. Can show a client what their citation share is for target query categories.
Red flags: Relies entirely on manual spot checks. Counts Google AI Overviews traffic in Search Console as an AEO metric without deeper citation analysis. Cannot quantify the client’s current AI visibility baseline before starting work. Treats reporting as a monthly PDF with traffic graphs.
Interview questions:
- What tools do you use to track citation presence across ChatGPT, Perplexity, and Google AI Overviews?
- How would you establish a citation baseline for my site before starting an AEO program?
- What does your monthly AEO report look like, and can you show me a redacted example?
Criterion 5: Content Hierarchy for LLM Ingestion — 10 points
What it measures. How the agency structures written content for optimal retrieval. Question-first formatting, direct answer blocks within the first 100 words, sentence-level source attribution, logical heading hierarchy, definition patterns that LLMs recognize as citable, and the avoidance of retrieval-hostile patterns like dense jargon openings or buried answers.
Why it matters. LLMs are not reading your content the way a human does. They are extracting passages from a retrieval index. Content that is structured to answer a specific question directly and completely, with clear attribution and minimal ambiguity, is structurally more likely to be cited than content that buries the answer in paragraph four or wraps every claim in qualifiers.
This criterion is weighted lower than retrieval architecture and schema because it is more widely understood. Many content agencies have internalized question-first formatting. Far fewer have internalized retrieval architecture. But content hierarchy still matters, and a weak score here can drag down an otherwise strong agency profile.
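Some of this is checkable before publication. Below is a minimal lint sketch for two of the patterns described above, a direct answer on the target term within the first 100 words and an early definition pattern; the threshold and regular expression are illustrative, not a validated scoring tool.

```python
# Minimal pre-publish lint for two patterns described above: the target term
# answered within the first 100 words, and an early definition pattern
# ("X is / refers to / means ..."). Heuristics only; the threshold and the
# regular expression are illustrative, not a validated scoring tool.
import re

def answer_block_check(article_text: str, target_term: str) -> dict:
    first_100 = " ".join(article_text.split()[:100]).lower()
    definition = re.compile(
        rf"\b{re.escape(target_term.lower())}\b\s+(is|are|refers to|means)\b"
    )
    first_line = article_text.lstrip().split("\n", 1)[0].rstrip()
    return {
        "term_in_first_100_words": target_term.lower() in first_100,
        "definition_pattern_early": bool(definition.search(first_100)),
        "opens_with_question": first_line.endswith("?"),
    }

draft = """What is answer engine optimization?
Answer engine optimization is the practice of structuring content so that
AI systems retrieve and cite it when generating responses."""

print(answer_block_check(draft, "answer engine optimization"))
# {'term_in_first_100_words': True, 'definition_pattern_early': True,
#  'opens_with_question': True}
```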
Green flags: Has a documented content template for AEO-optimized articles. Structures every major answer block with a direct response in the first two sentences. Uses definition patterns (“X is the process of…”) for key terms. Avoids context-heavy intros that delay the answer.
Red flags: Treats “adding an FAQ section” as their entire AEO content strategy. Does not distinguish between content written for human readers and content structured for AI extraction. Has no documented template for AEO content. Uses the same content framework for AEO that they use for traditional blog posts.
Interview questions:
- Can you walk me through the content template you use specifically for AEO-targeted articles?
- How do you decide where to place the direct answer in a long-form piece, and what signals do you use to determine the right length for that answer block?
- What content patterns do you actively avoid because they hurt retrieval chances?
Criterion 6: Vertical Specialization — 5 points
What it measures. The depth of genuine domain expertise the agency has in your industry. Not “we have worked with SaaS companies” but actual working knowledge of the terminology, buyer journey, competitive landscape, and regulatory context of your sector. AEO in regulated healthcare is a different problem from AEO in ecommerce. Vertical knowledge changes how queries are framed, what counts as a credible source, and what the LLM treats as authoritative in that domain.
Why it has the lowest weight. Because vertical knowledge can be acquired relatively quickly by someone with strong technical foundations. The inverse is not true. A sector expert without retrieval architecture knowledge cannot build a working AEO program by learning the domain. Technical depth transfers. Domain depth is a bonus, not a requirement.
That said, if two agencies score equally on the first five criteria, vertical specialization becomes the differentiator. An agency that already understands your industry’s entity landscape, publication ecosystem, and topic authority signals will move faster than a generalist agency that needs to learn these things.
Green flags: Has published content or case studies in your specific vertical. Can name the key publications and AI-cited sources in your sector without research. Understands the entity relationships that LLMs use to establish authority in your category.
Red flags: Presents a generic case study from an unrelated sector. Claims to work “across all industries” without specialization. Cannot name the five most cited sources in your vertical when AI systems answer questions in your category.
Interview questions:
- What are the three to five entities or publications that AI systems consistently cite when answering questions in our vertical?
- Have you worked with other companies in our sector, and what were the specific AEO challenges you ran into?
Criterion 7: Strategist Access and Contract Terms — 5 points
What it measures. Who actually works on your account and what flexibility you have in the engagement structure. Specifically: whether you have direct access to a senior practitioner with AEO expertise, whether the day-to-day work is done by the person who sold you the engagement or by a junior coordinator, and whether the contract allows you to exit cleanly if results disappoint.
Why it matters. Agency bait-and-switch is real. The founder presents. The coordinator executes. The quality gap between those two people is often enormous. In AEO, where technical judgment calls need to be made constantly, having a junior account manager as your primary contact is a structural problem, not an annoyance.
Contract terms matter for a different reason. AEO is still a maturing discipline. Some programs will not work despite best efforts. Annual lock-in contracts in a field where best practices are still being established shift all the risk to the client. Agencies confident in their results offer month-to-month or rolling quarterly terms. Agencies that front-load onboarding fees and require annual commitments are hedging against the possibility that they cannot deliver.
Green flags: Named senior practitioner on your account with technical AEO credentials. Month-to-month or rolling quarterly contracts available. Transparent about who does what on the team. Willing to put a performance milestone in writing.
Red flags: Annual contracts only, with significant cancellation penalties. Account manager with a generic digital marketing background runs your day-to-day. “Team” is described vaguely. Founder or senior practitioner disappears after onboarding.
Interview questions:
- Who is the named person responsible for my account, and what are their technical credentials specifically in AEO or NLP?
- What are the contract terms if results do not meet expectations in the first 90 days?
- Walk me through who does what on the team that would work on my account.
Why the Weights Are What They Are
A few people have asked why retrieval architecture and citation track record together account for 45 of the 100 points while strategist access gets only 5. The answer comes down to what actually predicts AEO outcomes.
Retrieval architecture knowledge explains variance in citation performance better than any other single variable I have observed. Two pieces of content that are equally well-written, equally well-linked, and equally well-promoted can have radically different citation rates if one is structured with retrieval mechanics in mind and the other is not. The agency that understands this wins. The one that does not loses, regardless of how good their account management is.
Citation track record has the same weight because strategy without evidence is just theory. There is no substitute for documented outcomes.
Measurement infrastructure gets 15 points because it is the mechanism that allows any program to improve over time. Without it, you are running blind.
Content hierarchy gets 10 points, not more, because it is table stakes for most professional agencies. The difference between agencies on this criterion is smaller than the difference on the higher-weighted criteria.
Vertical specialization and strategist access each get 5 points because they are real but secondary. They matter more in competitive agency selection where the top candidates score closely on the primary criteria. They are rarely the deciding factor.
Example Scored Agency: Walking Through the Framework
To show how the framework works in practice, here is how I would walk through scoring a hypothetical mid-sized content marketing agency that has recently rebranded as an AEO specialist.
Example: Hypothetical Content Agency, Rebranded as AEO Specialist
Retrieval Architecture Knowledge — 14 / 25. Knows keyword research and content structure. Has no background in information retrieval theory. Uses “AI-optimized” as a descriptor with no technical definition behind it.
Technical Schema Implementation — 12 / 20. Deploys FAQ and Article schema. Validates with Google’s tool. Has not worked with SpeakableSpecification or DefinedTerm schemas. No changelog for schema updates.
Proven Citation Track Record — 9 / 20. Has one case study with organic traffic data. No citation-specific documentation. Can show one screenshot of a client page appearing in an AI Overview from six months ago.
Measurement Infrastructure — 7 / 15. Tracks AI Overviews traffic in GSC. No dedicated citation monitoring tool. Monthly report is primarily traffic and ranking data with a brief “AI visibility” section added manually.
Content Hierarchy for LLM Ingestion — 7 / 10. Strong on question-first formatting. Has a documented AEO content template. Does not distinguish template from regular editorial output in their process.
Vertical Specialization — 3 / 5. Has worked in three industries. Can demonstrate reasonable vertical knowledge in two of them. Generalist positioning overall.
Strategist Access and Contract Terms — 3 / 5. Senior account manager, not founder. Month-to-month contracts available above a certain spend threshold. Reasonable terms overall.
Total Score: 55 / 100
Verdict: A competent content agency running an AEO program they do not yet fully understand. Worth reconsidering after another 12 months if they invest in retrieval architecture training and citation monitoring tooling. Not ready to be trusted with a primary AEO mandate today.
Common Evaluation Mistakes
These are the errors I see most often when brands try to evaluate AEO agencies without a structured framework.
Mistake 1: Treating SEO case studies as AEO evidence
Traffic growth is not a citation. Ranking improvement is not a citation. Domain authority is not a citation. Always ask for citation-specific documentation and treat anything else as irrelevant to your AEO decision.
Mistake 2: Rewarding brand name over technical depth
The biggest agencies in digital marketing were built on pre-AI SEO. Brand recognition in the AEO space has very low correlation with technical competency because the field is new enough that reputation follows marketing, not results.
Mistake 3: Signing annual contracts before seeing a proof of concept
Start with a defined, three-month paid pilot with specific citation targets. If the agency insists on annual commitment before demonstrating results, that tells you something about their confidence in their own outputs.
Mistake 4: Evaluating without asking who actually does the work
Always get the name and background of the person running your account before signing anything. Ask for their LinkedIn. Ask what they have published. The gap between the person who pitches and the person who executes is the most common source of disappointment in agency relationships.
Mistake 5: Conflating content volume with AEO effectiveness
Ten well-structured, citation-optimized articles will outperform 100 standard blog posts for AEO purposes. Volume is the wrong axis. Citation yield per piece of content is the right one.
Mistake 6: Ignoring the measurement question entirely
If an agency does not have a clear answer to “how will you track whether this is working,” you have no way to hold them accountable. Accountability requires measurement. An agency without AEO measurement tooling is an agency you cannot manage objectively.
Frequently Asked Questions About This Framework
What is the Northquery AEO Agency Scoring Framework?
A 7-criterion, weighted scoring system for evaluating Answer Engine Optimization agencies. Criteria are: Retrieval Architecture Knowledge (25%), Technical Schema Implementation (20%), Proven Citation Track Record (20%), Measurement Infrastructure (15%), Content Hierarchy for LLM Ingestion (10%), Vertical Specialization (5%), and Strategist Access plus Contract Terms (5%). Agencies are scored out of 100.
Why does retrieval architecture knowledge have the highest weight?
Because AEO is fundamentally a retrieval problem. If an agency does not understand how large language models chunk, embed, and retrieve documents, everything else they do is guesswork. Most SEO agencies have no background in information retrieval, which is why so many AEO programs fail to produce citations even after months of content production.
How is this framework different from standard SEO agency evaluation frameworks?
Standard SEO frameworks measure ranking outcomes, backlink profiles, and domain authority. AEO is different. Rankings are irrelevant if the AI does not cite your content. This framework focuses entirely on citation outcomes and the technical competencies that drive them. It was built by a researcher with NLP credentials, not a content marketer rebranding SEO deliverables as AEO.
Can I use this framework to evaluate my current agency?
Yes. The interview questions under each criterion are designed for both prospective and current agency relationships. If your current agency scores below 70 on this framework, you have a structural problem that swapping one person or deliverable will not fix.
How often should I rescore an agency?
Every six months at minimum. AEO is a fast-moving field. An agency that scored 82 last year may have failed to keep pace with LLM changes. The measurement infrastructure criterion, in particular, needs revalidation as new AI visibility tools emerge.
Is a score above 85 out of 100 realistic?
Barely. Fewer than a handful of agencies in the world can honestly document their way to a score above 85. Most agencies will score between 55 and 72. Any agency claiming a self-assessed score above 88 without published methodology behind it is marketing, not measurement.
How long does applying this framework take?
A proper evaluation using the two to three interview questions per criterion takes three to four hours across two sessions. Rushed evaluations produce inflated scores because the agency controls the narrative. Take the time.
Does Northquery publish its own score using this framework?
Yes. Northquery scores between 89 and 92 depending on the vertical, based on honest application of all 7 criteria. The scores are published in the vertical comparison pages. The framework came from the research, not from a desire to win the ranking.
What is the most common mistake brands make when evaluating AEO agencies?
Confusing SEO case studies with AEO results. An agency that grew organic traffic 300% has not proven AEO competency. Those are different problems with different technical requirements. Always ask specifically for citation evidence, not ranking evidence.
Does agency size affect the score?
No criterion directly rewards headcount. Strategist Access actually penalizes large agencies because the probability of working with a senior practitioner drops as headcount rises. Technical depth, not team size, is what drives AEO outcomes.
Will Northquery share the raw scoring rubric?
A downloadable PDF version of the scoring rubric is available on request. It includes the full point allocation per criterion, the red and green flag checklists, and the example interview questions. Email via the contact page to request it.
Where can I see agencies scored using this framework?
See the hub page at northquery.com/best-aeo-agency for the overall ranking, plus vertical breakdowns for SaaS, ecommerce, enterprise, startups, Europe, and UK. Each page uses this exact methodology with the same 7 criteria and weights.
Get the Scoring Rubric PDF
The downloadable rubric includes full point allocation tables, red and green flag checklists for every criterion, and the complete interview question set in one document you can use in your next agency review.
Request the Rubric · See the full agency rankings
About the Author
I am Adem Ajvazi, founder of Northquery. My background is in computational linguistics and applied NLP. I hold an MSc from the University of Copenhagen with a natural language processing specialization, and I have published research in the ACL Anthology, the primary publication venue for the field.
Before building Northquery into an AEO consultancy, I spent over 15 years doing technical SEO, content systems architecture, and semantic site structure across multiple verticals. I started taking AEO seriously when it became clear that the retrieval mechanics driving LLM citations are not reducible to traditional SEO signals.
This framework is not a marketing exercise. It came out of trying to think rigorously about what predicts citation outcomes, using the same analytical discipline I apply to NLP research. The methodology page exists because I believe public, defensible frameworks are better for the industry than black-box rankings.
MSc NLP, University of Copenhagen · ACL Anthology Published Research · 15+ Years Technical SEO · Northquery Founder