How AI Agents Discover Websites
The complete agent discovery sequence — from robots.txt to A2A handshakes
Last updated: March 2026
AI agents discover websites through a layered sequence: first checking robots.txt permissions, then reading discovery files like llms.txt and agents.md for context, scanning structured data (JSON-LD) for machine-readable information, and finally connecting through protocol endpoints like MCP servers and A2A agents. Each AI tool — ChatGPT, Perplexity, Claude, Gemini — discovers differently, but all reward sites that make themselves explicitly machine-readable.
When you ask ChatGPT a question about a product, or Perplexity searches for a comparison, or Claude looks up documentation — how does the AI find the right website? The answer has changed fundamentally in the last 12 months. Traditional SEO optimised for Google's crawler. The agentic web requires optimising for dozens of AI crawlers, each with different discovery mechanisms, different capabilities, and different expectations.
This guide maps the complete discovery sequence — every touchpoint where an AI agent looks for information about your site, from the first robots.txt check to a full agent-to-agent handshake. Understanding this sequence is the difference between your site being invisible to AI and being the authoritative source that every AI tool cites.
The discovery stack has five layers, each building on the last. You can implement just the first two and see significant results. Implement all five and your site becomes a fully programmable node in the agentic web — discoverable, comprehensible, and actionable by any AI agent.
The Five Discovery Touchpoints
Every AI agent that interacts with your website follows a predictable sequence. Not every agent checks every touchpoint, but the order is consistent. Understanding this sequence lets you optimise each layer for maximum discoverability.
1. robots.txt — The Permission Layer
The first thing any well-behaved AI crawler does is check your robots.txt file. This is the same protocol that Googlebot has respected since 1994, but now there are over a dozen AI-specific crawlers that check it.
The critical user-agents to address in 2026:
| User-Agent | Company | Purpose | Default if not specified |
|---|---|---|---|
| GPTBot | OpenAI | Training data + SearchGPT index | Crawls (may train on data) |
| ChatGPT-User | OpenAI | Real-time browsing when user asks | Crawls on demand |
| OAI-SearchBot | OpenAI | SearchGPT search index | Crawls for search |
| ClaudeBot | Anthropic | Web search for Claude | Crawls on demand |
| anthropic-ai | Anthropic | Training data collection | Crawls (may train on data) |
| PerplexityBot | Perplexity | Real-time search and citation | Crawls aggressively |
| Google-Extended | Google | Gemini training data | Crawls (may train on data) |
| Applebot-Extended | Apple | Apple Intelligence features | Crawls |
| Bytespider | ByteDance | Training data for TikTok AI | Crawls aggressively |
| CCBot | Common Crawl | Open dataset (used by many LLMs) | Crawls everything |
| cohere-ai | Cohere | Enterprise AI model training | Crawls |
| Amazonbot | Amazon | Alexa + Amazon AI features | Crawls |
| Meta-ExternalAgent | Meta | Meta AI training data | Crawls |
The recommended robots.txt configuration explicitly allows all AI crawlers while blocking sensitive paths:
# Allow all AI crawlers
User-agent: GPTBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: anthropic-ai
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: Applebot-Extended
Allow: /
User-agent: Amazonbot
Allow: /
User-agent: cohere-ai
Allow: /
User-agent: Meta-ExternalAgent
Allow: /
User-agent: Bytespider
Allow: /
User-agent: CCBot
Allow: /
# Block sensitive paths for all agents
User-agent: *
Disallow: /admin/
Disallow: /api/admin/
Disallow: /dashboard/
Sitemap: https://yoursite.com/sitemap.xml
If your robots.txt blocks these crawlers — either explicitly or through a blanket Disallow: / — you are invisible to AI. Many sites accidentally block AI crawlers because their robots.txt was written before these user-agents existed. Check yours today.
2. llms.txt — The Context Layer
After confirming it has permission to crawl, an AI agent looks for llms.txt at the root of your domain. This is a plain-text file that gives the AI a concise summary of what your site does, what tools or APIs are available, and how to interact with them. Think of it as a README for AI agents.
The llms.txt standard was proposed by Jeremy Howard (fast.ai founder) in late 2024 and has gained rapid adoption. As of March 2026, thousands of sites serve an llms.txt file, and AI tools including ChatGPT, Claude, and Perplexity actively check for it during browsing sessions.
A well-structured llms.txt file answers three questions immediately: what does this site do, what can an AI do with it, and how should the AI interact with it. Here is an example:
# p0stman
> AI-native product studio. We build AI applications, agents, and
> production-ready platforms for businesses.
## What we do
- Build AI agents and multi-agent systems
- Take prototypes from Bolt/Lovable/Replit to production
- Fractional AI leadership for growing companies
- Agentic web implementation (MCP, A2A, discovery)
## Tools available
- MCP server at /api/mcp (Bearer auth)
- A2A agent at /api/agent (JSON-RPC 2.0)
- Public context at /api/ai/context (no auth)
## Key pages
- /services - All services
- /case-studies - Portfolio of client work
- /contact - Get in touch
- /agentic-web - Guide to the agentic web
## Contact
hello@p0stman.com
Keep llms.txt under 100 lines. AI agents scan it quickly to build an initial mental model of your site. The detail comes from deeper discovery files.
3. agents.md and context.md — The Deep Context Layer
While llms.txt is the quick summary, agents.md and context.md provide deep context that AI agents use to fully understand your site's capabilities.
agents.md is an extended instruction manual for AI agents. It includes detailed tool schemas, error handling instructions, rate limits, authentication flows, and usage examples. If llms.txt is the elevator pitch, agents.md is the full technical documentation.
context.md provides business and domain context: company background, ideal customer profile, pricing structure, feature lists, competitive positioning, and technical architecture. This is what AI agents read when they need to answer questions like "What does this company do?" or "Is this the right tool for my use case?"
Together, these files give AI agents enough context to represent your site accurately in conversations. Without them, the AI is guessing based on whatever fragments it can scrape from your HTML — which often leads to outdated or incomplete citations.
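There is no single mandated format for agents.md, but a minimal sketch might look like this (every endpoint, limit, and value below is illustrative, not a real configuration):

```markdown
# agents.md (illustrative sketch)

## Tools
### submit_enquiry
- Endpoint: POST /api/mcp
- Auth: Bearer token in the Authorization header
- Input: { "name": string, "email": string, "message": string }
- Errors: 401 (bad token), 429 (rate limited, retry after 60s)

## Rate limits
- 60 requests/minute per token

## Usage notes
- Call get_services before submit_enquiry so responses stay accurate
```

The point is precision: an agent reading this knows exactly how to call your tools and what to do when a call fails, without guessing from HTML.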
4. mcp.json — The Capability Manifest
The fourth discovery touchpoint is mcp.json, a machine-readable manifest that lists your site's MCP (Model Context Protocol) tools and capabilities. This is where the transition from passive discovery (being read) to active discovery (being used) begins.
An MCP manifest tells AI agents exactly what tools are available, what inputs they accept, what outputs they return, and where the endpoint lives. A typical mcp.json:
{
  "name": "postman-mcp",
  "description": "AI product studio tools and services",
  "endpoint": "https://p0stman.com/api/mcp",
  "protocol": "mcp-http",
  "auth": {
    "type": "bearer",
    "header": "Authorization"
  },
  "tools": [
    {
      "name": "get_services",
      "description": "List all services offered by p0stman",
      "scope": "read"
    },
    {
      "name": "get_case_studies",
      "description": "Retrieve case studies and portfolio examples",
      "scope": "read"
    },
    {
      "name": "submit_enquiry",
      "description": "Submit a project enquiry",
      "scope": "write",
      "input": {
        "name": "string",
        "email": "string",
        "message": "string"
      }
    }
  ]
}
AI agents that support MCP — including Claude (natively), and increasingly ChatGPT and others through plugins — can read this manifest and immediately understand how to interact with your site programmatically.
5. .well-known/agent.json — The Agent Identity Layer
The final discovery touchpoint is the AgentCard at .well-known/agent.json. This is part of the A2A (Agent-to-Agent) protocol created by Google, and it serves as your site's identity in the agent network.
Where mcp.json says "here are tools you can use," the AgentCard says "here is an intelligent agent you can collaborate with." The distinction matters: MCP connects models to tools (like calling a function), while A2A connects agents to agents (like sending a message to a colleague).
An AgentCard describes the agent's name, description, skills, supported input/output modes, authentication requirements, and the endpoint where tasks can be sent. When another AI agent wants to delegate work or request information, it fetches the AgentCard first to understand what your agent can do.
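A minimal AgentCard sketch (the field names follow the shape A2A AgentCards typically use; the values and skill ids here are illustrative):

```json
{
  "name": "p0stman-agent",
  "description": "Answers questions about p0stman's services, case studies and pricing",
  "url": "https://p0stman.com/api/agent",
  "version": "1.0.0",
  "capabilities": { "streaming": false },
  "defaultInputModes": ["text"],
  "defaultOutputModes": ["text"],
  "skills": [
    {
      "id": "services-enquiry",
      "name": "Services enquiry",
      "description": "Describe services, case studies and indicative pricing",
      "examples": ["What services do you offer?"]
    }
  ]
}
```

The skills array is what a requesting agent matches against its task, so skill names and descriptions should read like the queries you want to win.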
How Each AI Tool Discovers Differently
Not all AI tools discover websites the same way. Each has its own crawling infrastructure, indexing strategy, and content preferences. Understanding these differences lets you optimise for the tools your audience actually uses.
ChatGPT and GPTBot
ChatGPT discovers websites through three distinct mechanisms, each serving a different purpose.
GPTBot crawling: OpenAI's GPTBot crawler continuously indexes the web, building a knowledge base that informs ChatGPT's training data and SearchGPT results. GPTBot respects robots.txt, follows sitemaps, and prioritises high-authority content. It crawls at moderate frequency — typically hitting a mid-size site every few days.
Real-time browsing: When a user asks ChatGPT to look something up (or when ChatGPT determines it needs current information), it uses the ChatGPT-User agent to browse the web in real-time. This is a targeted, on-demand crawl — it follows links, reads page content, and can navigate multi-page sites. During real-time browsing, ChatGPT reads llms.txt files, parses JSON-LD schema, and extracts structured data.
SearchGPT indexing: OpenAI's search product (OAI-SearchBot) maintains its own web index, separate from the training data pipeline. This index powers ChatGPT's web search tool and the standalone SearchGPT product. Sites that rank well in SearchGPT tend to have strong traditional SEO signals combined with structured data that makes content easy to extract and cite.
What ChatGPT prioritises: Clear, authoritative content with specific facts and figures. JSON-LD schema (especially FAQPage and HowTo). Direct answers near the top of the page. Well-structured HTML with semantic headings. Sites that allow GPTBot in robots.txt.
Perplexity and PerplexityBot
Perplexity's discovery mechanism is fundamentally different from ChatGPT's because Perplexity performs a real-time web search for every query. Rather than relying on a pre-built index for most answers, it crawls and reads pages live as the user asks their question.
PerplexityBot is one of the most aggressive AI crawlers. It follows links deeply, reads full page content (not just snippets), and processes multiple pages per query. Perplexity's Pro Search feature can read 20+ pages in a single query session, synthesising information across sources.
What Perplexity prioritises: Pages with a clear, direct answer at the top (the "answer capsule" pattern). Structured content with headings that match search intent. Tables and comparison data. Recent publication dates and "last updated" signals. Citations and links to primary sources. Perplexity cites every claim inline, so pages that make it easy to extract citable facts get referenced more often.
The most important optimisation for Perplexity is having your answer in the first 30% of the page. Perplexity's citation algorithm heavily weights content near the top of the document. An answer capsule with a direct, factual response to the page's core question dramatically increases citation frequency.
Claude and ClaudeBot
Anthropic's Claude discovers websites through its web search capability, which uses ClaudeBot as the user-agent. Unlike GPTBot, ClaudeBot does not continuously crawl the web for training data — it performs targeted, on-demand searches when Claude needs current information to answer a question.
Claude's search is powered by a combination of web search APIs and direct page fetching. When Claude searches, it reads full page content, processes structured data, and evaluates source quality. Claude is particularly good at reading and understanding technical documentation, structured formats, and long-form content.
Claude also has native MCP support. When Claude encounters an MCP server manifest (mcp.json), it can connect to the server and invoke tools directly within the conversation. This makes MCP discovery particularly important for Claude users — if your site has an MCP server, Claude can interact with it programmatically, not just read its content.
What Claude prioritises: Well-structured technical content. Markdown-formatted documentation. Clear, factual prose without marketing fluff. MCP server availability for interactive capabilities. JSON-LD schema for machine-readable metadata.
Gemini and Google-Extended
Google's Gemini has a unique advantage: it builds on Google's existing search infrastructure. Gemini doesn't need to crawl your site independently because Google's main index already contains your content. Google-Extended is the separate crawler that collects data specifically for Gemini's training, but Gemini's real-time answers draw from Google's main search index.
This means that traditional Google SEO directly impacts your visibility in Gemini. If you rank well on Google, Gemini is more likely to cite you. But Gemini also has AI Overviews (formerly SGE), which synthesise multi-source answers at the top of search results. Getting featured in AI Overviews requires structured content, authoritative sourcing, and direct answers.
Gemini also interacts with the broader Google ecosystem. Google is the creator of the A2A protocol, and Gemini-powered agents can discover and communicate with other agents via AgentCards. If your site has a .well-known/agent.json, Gemini-based agents can find you through the A2A registry.
What Gemini prioritises: Everything that ranks well on Google, plus structured data (JSON-LD), FAQ schema, HowTo schema, and content that directly answers questions. Google's Knowledge Graph entities. Sites with strong E-E-A-T signals (Experience, Expertise, Authoritativeness, Trustworthiness).
Microsoft Copilot and Bing
Microsoft Copilot discovers websites through Bing's search index, augmented with real-time browsing. Copilot's discovery is Bing-first: if you rank on Bing, you are visible to Copilot. If you have poor Bing SEO, Copilot may not find you even if you rank well on Google.
Copilot also supports real-time browsing through the Bing Chat user-agent (copilot.microsoft.com referrer). During browsing sessions, Copilot reads page content, follows links, and extracts structured data. It supports IndexNow for instant content updates — if you implement the IndexNow protocol, your content changes are reflected in Copilot within minutes rather than days.
What Copilot prioritises: Bing SEO signals. IndexNow submissions for fresh content. Structured data markup. Clear, authoritative content. Bing Webmaster Tools verification. Sites that submit sitemaps to Bing specifically (many sites only submit to Google Search Console).
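The IndexNow submission itself is a small JSON POST. Here is a sketch in JavaScript that builds the request body (the field names host, key, keyLocation, and urlList come from the IndexNow protocol; buildIndexNowBody, the domain, and the key are placeholders):

```javascript
// Build an IndexNow submission body for a batch of changed URLs.
// Field names (host, key, keyLocation, urlList) follow the IndexNow spec;
// the hostname and key below are placeholders, not real credentials.
function buildIndexNowBody(host, key, urls) {
  return {
    host,                                      // bare hostname, no scheme
    key,                                       // the key you serve at /<key>.txt
    keyLocation: `https://${host}/${key}.txt`, // where the engine verifies the key
    urlList: urls                              // the changed URLs to notify
  };
}

const body = buildIndexNowBody("yoursite.com", "abc123", [
  "https://yoursite.com/new-guide",
  "https://yoursite.com/updated-pricing"
]);
// POST this as JSON to https://api.indexnow.org/indexnow
console.log(JSON.stringify(body));
```

Serve the key as a plain-text file at the keyLocation URL so the search engine can verify ownership before accepting submissions.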
Passive Discovery vs Active Discovery
There is a fundamental distinction between being found by AI agents and being used by AI agents. This is the difference between passive and active discovery, and it determines whether your site is a static information source or a dynamic participant in the agentic web.
What is passive discovery?
Passive discovery is when your website is found, crawled, and cited by AI agents without any special machine-readable infrastructure. The AI reads your HTML content, extracts information, and includes it in responses. This is analogous to how Google has always worked — you publish content, a crawler finds it, and it appears in search results.
Most websites today are limited to passive discovery. The AI can read your content and cite it, but cannot interact with your site. It cannot execute a search, check pricing, submit a form, or trigger a workflow. The relationship is read-only.
Passive discovery is driven by:
- Allowing AI crawlers in robots.txt
- Publishing well-structured HTML content with semantic headings
- Adding JSON-LD schema (Article, FAQPage, HowTo) for machine-readable metadata
- Including answer capsules with direct answers
- Providing llms.txt and context.md for AI-specific context
- Having a comprehensive sitemap.xml
Passive discovery is the minimum viable agentic web strategy. It gets you cited. It gets you traffic from AI tools. But it doesn't let AI agents do anything with your site.
What is active discovery?
Active discovery (also called agent handshakes) means your site explicitly advertises programmable capabilities that AI agents can invoke. The site is no longer just a document to be read — it is a service to be used.
Active discovery is enabled by:
- MCP servers: Your site exposes tools (functions) that AI agents can call. A tool might search your product catalogue, check availability, generate a quote, or submit an order.
- A2A endpoints: Your site runs an AI agent that other agents can communicate with. Another agent can send a task ("Find me a web development agency in London that specialises in AI"), and your agent responds with a structured answer.
- WebMCP registration: Your site registers tools with the browser's built-in AI via `navigator.modelContext`, enabling Chrome's AI assistant to interact with your site without any server-side call.
- data-mcp-tool attributes: Interactive HTML elements (forms, buttons) are annotated with machine-readable descriptions so browser AI agents understand what each element does.
Active discovery is where the agentic web gets genuinely transformative. An AI agent helping a user plan a trip can discover your hotel's A2A agent, check availability, negotiate a price, and book a room — all without the user ever visiting your website directly. The agent-to-agent interaction replaces the human-to-website interaction.
| Aspect | Passive Discovery | Active Discovery |
|---|---|---|
| Relationship | Read-only | Interactive |
| Outcome | Cited in AI answers | Used as a service by AI agents |
| Touchpoints | robots.txt, llms.txt, HTML, JSON-LD | mcp.json, MCP endpoint, AgentCard, A2A endpoint, WebMCP |
| Effort | Low (content + markup) | Medium-high (API development) |
| Impact | Traffic and brand visibility | Revenue and automated workflows |
| Analogy | A brochure | A team member |
How Agent-to-Agent Discovery Works
The most advanced form of discovery in the agentic web is agent-to-agent (A2A) discovery. This is where AI agents find, evaluate, and connect to other AI agents autonomously — without a human directing them to a specific URL.
A2A discovery works through a three-step handshake:
Step 1: Registry lookup. An AI agent that needs help with a task queries an agent registry (such as a2aregistry.org or an enterprise-internal registry) to find agents with relevant skills. The registry returns a list of AgentCards matching the query.
Step 2: AgentCard evaluation. The requesting agent fetches each AgentCard from .well-known/agent.json and evaluates the agent's skills, authentication requirements, supported input/output modes, and capabilities. It selects the most appropriate agent for the task.
Step 3: Task delegation. The requesting agent sends a task to the selected agent's A2A endpoint using JSON-RPC 2.0. The receiving agent processes the task, potentially asking for additional input, and returns a result. The entire interaction happens programmatically.
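The step-3 payload can be sketched as a small helper that builds the JSON-RPC 2.0 envelope (buildA2ATask is a hypothetical helper; the tasks/send method and message shape match the testing examples later in this guide):

```javascript
// Build a JSON-RPC 2.0 "tasks/send" request for an A2A endpoint.
// The envelope shape mirrors the curl testing example in this guide;
// taskId and text are supplied by the calling agent.
let nextRequestId = 0;
function buildA2ATask(taskId, text) {
  return {
    jsonrpc: "2.0",
    id: ++nextRequestId,          // JSON-RPC request id (not the task id)
    method: "tasks/send",
    params: {
      id: taskId,                 // stable identifier for this task
      message: {
        role: "user",
        parts: [{ type: "text", text }]
      }
    }
  };
}

const req = buildA2ATask("task-001", "What services do you offer?");
// Deliver with: fetch(agentCard.url, { method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(req) })
```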
Here is a concrete example. Imagine a user asks their personal AI assistant: "Find me a digital agency that can build an AI-powered customer service bot."
- The assistant agent queries agent registries for agents with skills matching "AI development", "digital agency", "customer service".
- The registry returns several AgentCards, including p0stman's at `p0stman.com/.well-known/agent.json`.
- The assistant agent reads the AgentCard, sees skills like "AI agent development" and "prototype to production", and determines p0stman's agent is relevant.
- The assistant sends a task: "The user is looking for a digital agency to build an AI customer service bot. What services do you offer and what is your pricing?"
- p0stman's A2A agent processes the task using its knowledge of services, case studies, and pricing, then returns a structured response.
- The assistant presents this information to the user alongside responses from other agencies.
This entire workflow happens without the user visiting any website. The agents discover each other, negotiate, and exchange information autonomously. This is why having an AgentCard and A2A endpoint is becoming critical for business discovery.
MCP Discovery: How AI Finds Your Tools
The Model Context Protocol (MCP) enables AI models to discover and use tools exposed by your website. MCP discovery works differently from A2A discovery because it's about finding tools (functions), not agents (intelligent participants).
There are three channels for MCP discovery:
Server-side MCP discovery
The primary MCP discovery mechanism is the mcp.json manifest file served from your domain root. AI agents fetch this file to discover available tools, their input schemas, and the endpoint URL.
Once an agent knows about your MCP server, it can send a GET request to the endpoint to retrieve server information (capabilities, protocol version) and then POST requests to invoke specific tools. The discovery flow is:
- Agent fetches `https://yoursite.com/mcp.json`
- Agent reads available tools, authentication requirements, and endpoint URL
- Agent sends GET to `/api/mcp` for server info
- Agent invokes tools via POST with the appropriate input parameters
MCP servers can also be registered in public registries like Smithery, mcp.so, and PulseMCP. These registries act as directories where AI developers and tools discover MCP servers by category, capability, or domain.
Browser-level MCP discovery (WebMCP)
WebMCP is a browser API shipping in Chrome 146 and later that allows websites to register tools directly with the browser's built-in AI. When a user visits your site, your client-side JavaScript registers tools via navigator.modelContext.addTool().
This is a fundamentally different discovery channel because it is initiated by the website, not the AI agent. When the user interacts with Chrome's AI assistant while on your site, the browser AI already knows about your tools and can invoke them directly.
// WebMCP registration example
if (navigator.modelContext?.addTool) {
  navigator.modelContext.addTool({
    name: "get_services",
    description: "Get a list of services offered by p0stman",
    handler: async () => ({
      services: [
        { name: "AI Agent Development", slug: "/services/ai-agents" },
        { name: "Prototype to Production", slug: "/prototype-to-production" },
        { name: "Fractional AI Leadership", slug: "/fractional-ai-leadership" }
      ]
    })
  });
}
WebMCP is the most frictionless form of MCP discovery because it requires no server-side endpoint, no authentication, and no external registry. The tools live in the browser and are available instantly. The limitation is that only lightweight, public tools make sense here — anything requiring server-side logic, authentication, or database access still needs a traditional MCP endpoint.
HTML-level tool hints (data-mcp-tool)
The third MCP discovery channel is HTML annotation. By adding data-mcp-tool attributes to interactive elements, you help browser AI agents understand what each element does:
<form data-mcp-tool="submit_enquiry"
data-mcp-description="Submit a project enquiry to p0stman">
<input name="name" data-mcp-param="name"
data-mcp-description="Contact name" />
<input name="email" data-mcp-param="email"
data-mcp-description="Contact email" />
<textarea name="message" data-mcp-param="message"
data-mcp-description="Project description"></textarea>
<button type="submit">Send</button>
</form>
This is the simplest form of MCP discovery and works even without a dedicated MCP server. Browser AI agents can read these annotations and understand how to interact with your page elements programmatically.
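To illustrate what an agent can derive from those annotations, here is a hypothetical sketch: toToolSchema takes attribute values already read off the DOM (in a real page you would gather them with document.querySelectorAll('[data-mcp-tool]') and the nested [data-mcp-param] elements) and folds them into a tool schema:

```javascript
// Turn data-mcp-* annotations (already read off the DOM) into a tool schema
// a browser agent could reason about. The DOM traversal itself is omitted
// so the transformation can be shown on plain data.
function toToolSchema(toolName, description, params) {
  return {
    name: toolName,
    description,
    input: Object.fromEntries(
      params.map(p => [p.name, p.description]) // param name -> human-readable hint
    )
  };
}

const schema = toToolSchema(
  "submit_enquiry",
  "Submit a project enquiry to p0stman",
  [
    { name: "name", description: "Contact name" },
    { name: "email", description: "Contact email" },
    { name: "message", description: "Project description" }
  ]
);
```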
Testing Your Site's Discoverability
Implementing discovery files is only half the work. You need to verify that each layer is working correctly. Here is a systematic testing checklist for each touchpoint.
Testing robots.txt
Fetch your robots.txt and confirm AI crawler user-agents are explicitly allowed:
curl https://yoursite.com/robots.txt
Verify that GPTBot, ClaudeBot, PerplexityBot, Google-Extended, and Applebot-Extended are all listed with Allow: /. If they are not explicitly mentioned, they follow the default User-agent: * rules. If your default rule is Disallow: /, all AI crawlers are blocked.
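That check can be scripted. Here is a rough sketch that parses a robots.txt string and reports whether a given user-agent is blocked from the whole site (isAgentBlocked is a hypothetical helper; it handles only whole-site Allow/Disallow rules, not path patterns or groups with multiple User-agent lines):

```javascript
// Rough robots.txt check: find the rule group for a user-agent
// (falling back to "*") and report whether "/" is disallowed.
// Limitations: ignores path patterns and assumes one User-agent line per group.
function isAgentBlocked(robotsTxt, agent) {
  const groups = {};   // user-agent (lowercased) -> list of { type, path }
  let current = null;
  for (const raw of robotsTxt.split("\n")) {
    const line = raw.split("#")[0].trim();             // strip comments
    const m = line.match(/^(user-agent|allow|disallow)\s*:\s*(.*)$/i);
    if (!m) continue;
    const field = m[1].toLowerCase();
    const value = m[2].trim();
    if (field === "user-agent") {
      current = groups[value.toLowerCase()] ??= [];
    } else if (current) {
      current.push({ type: field, path: value });
    }
  }
  const rules = groups[agent.toLowerCase()] ?? groups["*"] ?? [];
  return rules.some(r => r.type === "disallow" && r.path === "/");
}
```

Run it against the recommended configuration above and every AI user-agent should come back unblocked; run it against a blanket Disallow and only the explicitly allowed agents will.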
Testing llms.txt
curl https://yoursite.com/llms.txt
Confirm it returns a valid plain-text response with a 200 status code. Verify the content accurately describes your site, lists key pages, and mentions any available tools or APIs.
Testing structured data (JSON-LD)
Use Google's Rich Results Test (search.google.com/test/rich-results) or the Schema.org Validator to check your JSON-LD markup. Every content page should have at minimum an Article or WebPage schema, and content pages with FAQs should include FAQPage schema.
You can also inspect JSON-LD directly in the browser's DevTools:
// In browser console:
document.querySelectorAll('script[type="application/ld+json"]')
.forEach(el => console.log(JSON.parse(el.textContent)));
Testing MCP discovery
# Test mcp.json manifest
curl https://yoursite.com/mcp.json
# Test MCP server info
curl https://yoursite.com/api/mcp
# Test a tool invocation (with auth)
curl -X POST https://yoursite.com/api/mcp \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{"tool": "get_services"}'
Testing A2A discovery
# Test AgentCard
curl https://yoursite.com/.well-known/agent.json
# Test A2A endpoint info
curl https://yoursite.com/api/agent
# Test a task submission
curl -X POST https://yoursite.com/api/agent \
-H "Content-Type: application/json" \
-d '{
"jsonrpc": "2.0",
"id": 1,
"method": "tasks/send",
"params": {
"id": "test-task-001",
"message": {
"role": "user",
"parts": [{"type": "text", "text": "What services do you offer?"}]
}
}
}'
End-to-end testing with real AI tools
The definitive test is asking real AI tools questions that your site should answer. Open each tool and ask a question that targets your content:
- ChatGPT: Ask a specific question that your content answers. Check if ChatGPT cites your site in its response.
- Perplexity: Search for your target keyword. Check if Perplexity lists your site in its citations.
- Claude: Ask Claude to search for information on your topic. Check if Claude finds and cites your site.
- Gemini: Search on Google and check if your content appears in AI Overviews.
If you are not being cited, work backwards through the discovery stack. Is your robots.txt blocking crawlers? Is your content structured for extraction? Do you have JSON-LD schema? Is there an answer capsule? Each missing layer reduces your discoverability.
The Discovery Stack Checklist
Implement the agent discovery stack in this order, from lowest effort and highest impact to highest effort and longest-term impact. Each layer builds on the previous ones.
Level 1: Be visible (30 minutes)
The absolute minimum to stop being invisible to AI agents.
- Update robots.txt to explicitly allow GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Applebot-Extended, and other AI crawlers
- Verify your sitemap.xml is current and submitted to Google Search Console and Bing Webmaster Tools
- Ensure your site loads fast and returns proper status codes (no soft 404s)
Level 2: Be understandable (2-3 hours)
Make your content easy for AI agents to read, extract, and cite correctly.
- Create `llms.txt` with a plain-text site summary
- Add `context.md` with deep business context
- Add JSON-LD schema (Article, FAQPage, HowTo) to every content page
- Add answer capsules to key content pages (direct answer in the first paragraph)
- Structure all content with semantic headings (H1 as the question, H2/H3 as sub-topics)
Level 3: Be actionable (1-2 days)
Expose tools and capabilities that AI agents can invoke programmatically.
- Create `mcp.json` manifest listing available tools
- Build an MCP server endpoint at `/api/mcp`
- Register public tools via WebMCP (`navigator.modelContext.addTool()`)
- Add `data-mcp-tool` attributes to key interactive elements
- Create `/api/ai/context` endpoint for structured site context
Level 4: Be collaborative (2-3 days)
Enable autonomous agent-to-agent communication and task delegation.
- Create an AgentCard at `.well-known/agent.json`
- Build an A2A endpoint at `/api/agent` (JSON-RPC 2.0)
- Define agent skills with clear descriptions and example prompts
- Register your agent in A2A registries
- Create `agents.md` with detailed agent interaction documentation
Level 5: Be monitored (ongoing)
Track agent interactions and optimise based on real data.
- Implement bot crawl tracking (log AI crawler visits to a database)
- Track AI referral traffic separately from organic search
- Monitor MCP tool invocations and A2A task submissions
- Set up IndexNow for instant content update notifications
- Regularly test discoverability by querying AI tools for your target topics
Real-World Example: How p0stman.com Is Discovered by Five AI Tools
To make this concrete, here is exactly how p0stman.com — a live production site with the full discovery stack implemented — is discovered by each major AI tool.
ChatGPT discovers p0stman.com
GPTBot has been allowed in p0stman.com's robots.txt since launch. When ChatGPT searches the web for topics like "AI product studio UK" or "prototype to production agency", it finds p0stman.com through its SearchGPT index. During real-time browsing, ChatGPT reads the llms.txt file for a quick overview, then navigates to relevant pages. The JSON-LD schema (Article, FAQPage) on each content page helps ChatGPT extract structured facts for its responses. The answer capsule at the top of each guide page provides a direct, citable answer.
Perplexity discovers p0stman.com
When a Perplexity user searches for "what is an MCP server" or "how to make website agent-ready", PerplexityBot crawls p0stman.com's content pages in real-time. It reads the answer capsule first, extracts the key fact, and cites it inline in the Perplexity response. The deep content (6,000-10,000 words per page) provides enough material for Perplexity to cite multiple facts from the same page. The structured headings (each one answering a specific question) map directly to sub-queries in Perplexity's multi-step research.
Claude discovers p0stman.com
When Claude searches for agentic web topics, ClaudeBot reads page content directly. But Claude has a second, more powerful discovery path: MCP. Because p0stman.com has a live MCP server at /api/mcp with a published mcp.json manifest, Claude users who configure the p0stman MCP server can invoke tools directly from within Claude. Claude can call get_services, get_case_studies, or submit_enquiry without ever visiting the website's HTML pages.
Gemini discovers p0stman.com
Gemini leverages Google's existing search index, so p0stman.com's Google SEO performance directly impacts Gemini visibility. The FAQPage schema on content pages feeds into Google's rich results and AI Overviews. Google-Extended is allowed in robots.txt, ensuring content is available for Gemini's training. The AgentCard at .well-known/agent.json is discoverable by Gemini-powered agents through the A2A protocol — fitting, since Google created A2A.
Copilot discovers p0stman.com
Microsoft Copilot uses Bing's index, so p0stman.com is discoverable through Bing's crawling. The site submits content updates via IndexNow, which Bing supports natively. This means new content pages are indexed by Copilot within minutes of publication, rather than the days or weeks it might take through standard crawling. The structured data and clear content hierarchy help Copilot extract accurate information for its responses.
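The IndexNow submission mentioned above is a single JSON POST. A minimal sketch in Python follows; the body shape (host, key, keyLocation, urlList) is the one the IndexNow protocol documents, and the host, key, and URL values are placeholders:

```python
import json
from urllib import request

def build_indexnow_request(host: str, key: str, urls: list[str]) -> request.Request:
    """Build (but do not send) an IndexNow submission request.

    The key file is assumed to be hosted at the site root, per the protocol.
    """
    payload = {
        "host": host,
        "key": key,
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": urls,
    }
    return request.Request(
        "https://api.indexnow.org/indexnow",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )

# Placeholder host, key, and URL values; substitute your own before sending.
req = build_indexnow_request(
    "example.com",
    "a1b2c3d4e5f6",
    ["https://example.com/guides/new-page"],
)
# request.urlopen(req) would perform the submission; omitted to keep this offline.
```

Per the IndexNow protocol, a submission to one participating endpoint is shared with the other participating engines, so a single POST covers Bing and the rest.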
Common Discovery Failures and How to Fix Them
If AI agents are not finding or citing your site, the problem is almost always one of these common failures. Work through them in order — the most impactful issues are listed first.
Failure 1: robots.txt blocks AI crawlers
Symptom: No AI tool ever cites your site, even for branded queries.
Cause: A blanket User-agent: * rule with Disallow: /, or specific blocks on GPTBot, ClaudeBot, and other AI crawlers.
Fix: Explicitly allow all AI crawlers in robots.txt. If you have privacy concerns about training data, allow browsing agents (ChatGPT-User, PerplexityBot) while blocking training-specific agents (GPTBot, Google-Extended).
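A robots.txt following the privacy-conscious variant of this fix might look like the sketch below. The user-agent strings are the ones each vendor documents; adjust the training-crawler blocks to your own policy:

```txt
# Browsing and answer agents: allowed
User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Allow: /

# Training-focused crawlers: block only if training use is a concern
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```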
Failure 2: No structured data (JSON-LD)
Symptom: AI tools find your site but extract information inaccurately or incompletely.
Cause: No JSON-LD schema, so the AI relies entirely on parsing unstructured HTML.
Fix: Add Article, FAQPage, HowTo, or WebApplication schema to every content page. The structured data gives AI agents a machine-readable summary alongside the human-readable content.
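As a sketch of this fix, a minimal FAQPage block goes in the page head or body. The question and answer text here are placeholders; replace them with the actual content of the page:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [{
    "@type": "Question",
    "name": "How do AI agents discover websites?",
    "acceptedAnswer": {
      "@type": "Answer",
      "text": "Through a layered sequence: robots.txt permissions, discovery files like llms.txt, structured data, and protocol endpoints such as MCP and A2A."
    }
  }]
}
</script>
```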
Failure 3: Answer buried deep in content
Symptom: Perplexity and ChatGPT find your page but cite a competitor's answer instead.
Cause: Your page starts with a lengthy introduction before answering the core question. AI tools cite content from the first 30% of the page 44% of the time.
Fix: Add an answer capsule — a 1-2 sentence direct answer in a <div class="answer-capsule"> — above the fold. Lead with the answer, then elaborate.
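Structurally, the capsule is just a short block at the top of the article body. A sketch with placeholder copy:

```html
<article>
  <h1>What is an MCP server?</h1>
  <div class="answer-capsule">
    <p>An MCP server exposes a website's data and actions as named tools
    that AI agents can call directly, without scraping HTML pages.</p>
  </div>
  <!-- long-form content follows -->
</article>
```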
Failure 4: Thin content
Symptom: Your page is indexed but rarely cited. Competitors with longer content on the same topic get cited instead.
Cause: Pages under 2,000 words consistently underperform in AI citations. AI tools prefer comprehensive, authoritative content over thin summaries.
Fix: Expand key content pages to 4,000-10,000 words. Add worked examples, comparison tables, step-by-step guides, and FAQ sections. Depth beats breadth for AI discoverability.
Failure 5: JavaScript-rendered content
Symptom: Google indexes your content but AI crawlers see empty pages.
Cause: Content is rendered entirely client-side via JavaScript. Most AI crawlers do not execute JavaScript — they read the initial HTML response.
Fix: Use server-side rendering (SSR) or static site generation (SSG). For Next.js, use Server Components or generateStaticParams(). For SPAs, implement pre-rendering or dynamic rendering for bot user-agents.
Failure 6: No llms.txt or context files
Symptom: AI tools find your site but describe your business inaccurately.
Cause: Without llms.txt and context.md, the AI constructs its understanding of your site by piecing together fragments from individual pages. This often leads to an incomplete or distorted picture.
Fix: Create llms.txt (quick summary) and context.md (deep context) to give AI agents an accurate, authoritative overview of your site. These files are your chance to control how AI represents your business.
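A minimal llms.txt follows the proposed markdown convention: a title, a one-line summary, and curated links. A sketch with placeholder URLs and descriptions:

```txt
# p0stman

> AI product studio: prototypes, production builds, and agentic-web implementation.

## Guides

- [How AI agents discover websites](https://p0stman.com/guides/agent-discovery): the full discovery stack
- [What is an MCP server](https://p0stman.com/guides/mcp-server): MCP explained for site owners
```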
Failure 7: MCP or A2A endpoint returns errors
Symptom: Your mcp.json or AgentCard exists but agents fail when trying to interact.
Cause: The endpoint URL in the manifest is wrong, CORS headers are missing, or the endpoint returns unexpected responses.
Fix: Test every endpoint manually with curl. Verify CORS headers allow cross-origin requests. Ensure the response format matches the protocol specification exactly. Monitor error logs for failed agent requests.
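Part of the manual check can be automated. A sketch in Python that sanity-checks an AgentCard document offline; the required-field list is a minimal assumption, not the full A2A schema:

```python
import json
from urllib.parse import urlparse

# Minimal sanity checks only; not the full A2A AgentCard schema.
REQUIRED_FIELDS = ("name", "description", "url")

def check_agent_card(raw: str) -> list[str]:
    """Return a list of problems found in an AgentCard JSON document."""
    try:
        card = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in card]
    endpoint = card.get("url", "")
    if urlparse(endpoint).scheme != "https":
        problems.append(f"endpoint is not HTTPS: {endpoint!r}")
    return problems

# A card with a plain-HTTP endpoint is flagged:
print(check_agent_card('{"name": "demo", "description": "x", "url": "http://example.com/a2a"}'))
```

Pair this with a live request to the endpoint itself, confirming the response parses and includes an Access-Control-Allow-Origin header, to cover the CORS failure mode.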
Failure 8: Bing neglected
Symptom: You appear in ChatGPT and Perplexity but not in Copilot.
Cause: Your site has strong Google SEO but poor Bing coverage. Copilot relies on Bing's index.
Fix: Submit your sitemap to Bing Webmaster Tools. Implement IndexNow for instant updates. Verify your site is indexed on Bing (search site:yoursite.com on Bing).
The Future of Agent Discovery
Agent discovery is evolving rapidly. Several trends are shaping where this is heading in 2026 and beyond.
Standardisation: llms.txt, MCP, and A2A are all converging toward stable specifications. As more sites adopt these standards, AI tools will increasingly rely on them as primary discovery channels rather than fallbacks. Expect llms.txt support to become as universal as robots.txt support within 18 months.
Agent registries: Public registries for AI agents (like a2aregistry.org) and MCP servers (like Smithery and mcp.so) are growing rapidly. These registries enable AI agents to find relevant services without crawling the web — they query the registry directly. Registering your agent and MCP server in these directories is becoming as important as submitting your sitemap to Google.
Browser-native AI: Chrome 146+ with WebMCP, and similar features expected from other browsers, will make every website a potential tool provider. This shifts the discovery model from "agents search the web for tools" to "tools register themselves when users visit." The implications are profound: your website becomes an interactive participant in every AI conversation that happens while a user is on your page.
Multimodal discovery: Current discovery is primarily text-based, but AI agents are becoming multimodal. Future discovery mechanisms will include image analysis (understanding what a site does from its UI), video content indexing, and audio transcription. Sites with rich media content will have additional discovery surface area.
Trust and verification: As agent-to-agent communication scales, trust becomes critical. How does an agent know that the AgentCard at .well-known/agent.json is legitimate? Expect to see cryptographic verification, domain-linked trust chains, and reputation systems for AI agents — similar to how HTTPS and certificate authorities work for websites today.
The sites that implement the full discovery stack now are building a compounding advantage. Every month that passes, more AI tools adopt these standards, and early adopters accumulate more agent interactions, better registry rankings, and stronger discoverability signals.
Frequently Asked Questions
How does ChatGPT find and use information from websites?
ChatGPT discovers websites through two mechanisms. First, GPTBot crawls the web and indexes content for OpenAI's training and retrieval systems, respecting robots.txt directives. Second, when ChatGPT browses the web in real time (via the browsing tool), it follows links, reads page content, and extracts structured data like JSON-LD schema. ChatGPT also reads llms.txt files when present, and its SearchGPT feature indexes content similarly to a traditional search engine. Sites that explicitly allow GPTBot in robots.txt and provide structured discovery files get indexed more completely.
What is the difference between passive and active agent discovery?
Passive discovery means your website is found and indexed by AI crawlers without your site doing anything special — similar to how Google discovers pages by following links. Active discovery (also called agent handshakes) means your site explicitly advertises its capabilities through machine-readable files like mcp.json, .well-known/agent.json, and API endpoints. Passive discovery gets you cited in AI answers. Active discovery lets AI agents interact with your site programmatically — executing tools, sending tasks, and completing workflows.
Do I need to allow AI crawlers in my robots.txt?
If you want AI agents to discover and cite your website, yes. Most AI crawlers (GPTBot, ClaudeBot, PerplexityBot, Google-Extended) respect robots.txt. If you block them, your content won't appear in AI-generated answers. The recommended approach is to explicitly allow AI crawlers while still blocking sensitive paths like admin dashboards and API endpoints that shouldn't be publicly crawled.
What is an AgentCard and how does it help with discovery?
An AgentCard is a JSON file placed at .well-known/agent.json that describes your AI agent's capabilities, skills, authentication requirements, and communication endpoint. It's part of Google's A2A (Agent-to-Agent) protocol. When another AI agent wants to collaborate with your agent, it fetches the AgentCard to understand what your agent can do, how to authenticate, and which endpoint to send tasks to. Think of it as a business card for AI agents — it enables automated agent-to-agent discovery and interaction.
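An illustrative AgentCard might look like the sketch below. The field names follow the general A2A AgentCard shape but should be validated against the current A2A specification; the values are placeholders:

```json
{
  "name": "p0stman-agent",
  "description": "Answers questions about p0stman services and case studies",
  "url": "https://p0stman.com/api/a2a",
  "version": "1.0.0",
  "capabilities": { "streaming": false },
  "skills": [
    {
      "id": "get_case_studies",
      "name": "Get case studies",
      "description": "Return relevant case studies for a given industry or technology"
    }
  ]
}
```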
How does Perplexity discover and cite websites differently from ChatGPT?
Perplexity performs real-time web searches for every query, crawling and reading pages live rather than relying primarily on a pre-built index. PerplexityBot follows links aggressively and reads full page content in real time. It prioritises pages with direct answers near the top (answer capsules), clear structure (headings, lists, tables), and authoritative sourcing. Because Perplexity cites sources inline, having well-structured content with a clear answer at the top significantly increases your chances of being cited.
What is WebMCP and how does it enable browser-level discovery?
WebMCP is a browser API (navigator.modelContext) shipping in Chrome 146+ that lets websites register tools directly with the browser's built-in AI. When a user visits your site, your JavaScript can register lightweight tools — like "get pricing" or "search products" — that Chrome's AI assistant can invoke without any server-side infrastructure. This adds a third discovery channel, browser-level (WebMCP), alongside server-side MCP endpoints and protocol-level A2A. It's particularly powerful because it works without the user explicitly asking an external AI about your site.
How do I test if AI agents can discover my website?
Test each layer of the discovery stack systematically. For robots.txt, use Google's robots.txt tester and verify AI crawler user-agents are allowed. For llms.txt, fetch yoursite.com/llms.txt and confirm it returns valid content. For structured data, use Google's Rich Results Test or Schema.org validator. For MCP, send a GET request to your /api/mcp endpoint. For A2A, fetch .well-known/agent.json and validate the JSON. Then test end-to-end by asking ChatGPT, Perplexity, and Claude questions that your site should answer — if they cite you, discovery is working.
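The robots.txt layer in particular can be replayed locally with Python's standard library, simulating the check an AI crawler performs. The robots.txt content and URLs below are placeholders:

```python
from urllib.robotparser import RobotFileParser

AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]

# Sample rules; for a real test, call parser.set_url(...) and parser.read()
# against your live robots.txt instead of parsing an inline string.
robots_txt = """\
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Replay the permission check each crawler would make for a content page.
for agent in AI_CRAWLERS:
    ok = parser.can_fetch(agent, "https://example.com/guides/mcp")
    print(f"{agent}: {'allowed' if ok else 'blocked'}")
```

The same can_fetch call can be run against sensitive paths to confirm that the blocks you intend (admin dashboards, private APIs) actually apply.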
What order should I implement the agent discovery stack?
Implement in order of impact and effort: 1) robots.txt allowing AI crawlers (5 minutes, immediate impact), 2) llms.txt with a plain-text site summary (30 minutes), 3) JSON-LD schema on all content pages (1-2 hours), 4) answer capsules on key pages (1 hour), 5) context.md with deep site context (1 hour), 6) mcp.json manifest (30 minutes), 7) MCP server endpoint (2-4 hours), 8) .well-known/agent.json AgentCard (1 hour), 9) A2A endpoint (4-8 hours), 10) WebMCP registration (1-2 hours). Steps 1-5 cover passive discovery and should be done first. Steps 6-10 enable active agent interaction.
Make your website discoverable by AI agents
We implement the full agent discovery stack — from robots.txt and llms.txt to MCP servers and A2A endpoints. Get discovered, cited, and used by every major AI tool.