Methodology Specification · Versioned · Citeable

PTODA C01 Crawler Methodology v1.3

The deterministic robots.txt access scanner used for the Global Digital Authority Benchmark Series crawler-access studies. This document is the versioned, reproducible specification that those studies cite.

Version: 1.0

Published: 17 June 2026

Instrument: PTODA C01 Crawler, a deterministic robots.txt + llms.txt scanner

Maintained by: Periodic Table of Digital Authority

Cite as: PTODA C01 Crawler Methodology v1.3 (2026)

Scope

What this instrument measures

The C01 Crawler determines, for a defined sample of domains, the publicly declared access configuration for major AI crawler user-agents — by retrieving and parsing each domain's robots.txt, homepage robots meta tags, and relevant HTTP headers, then detecting CMS and CDN/host signals.

It measures declared crawler-access policy, not crawler behaviour and not citation outcomes. It is deterministic: the same domain in the same state returns the same classification, which makes every study reproducible by re-running the instrument against the published sample.

Reporting model

Two layers — policy and infrastructure

v1.2 separates observable crawler policy from infrastructure non-response. The two are never mixed, and policy rates are computed only on domains whose robots.txt could actually be read.

Policy layer

Open · Fully Blocked · Partially Blocked. Computed only on domains whose robots.txt was successfully retrieved and parsed. These are the headline crawler-policy outcomes.

Infrastructure layer

Access Denied (HTTP 401 / 403 / 429) · Unscannable (connection failure, timeout, 5xx). Reported separately and excluded from policy denominators — no crawler policy could be observed, so the domain is never recorded as open or blocked.

A block declared in robots.txt is a policy decision. A 401, 403 or 429 response, a timeout, a connection failure or a 5xx error is an access outcome, not evidence of crawler policy. Reporting the two layers separately prevents over-counting both blocks and openness: results are reported only where a declared policy was directly observed.
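The denominator rule can be sketched as follows. This is a minimal illustration, not the instrument's own code: the outcome labels and the `policy_rates` function are hypothetical names chosen for the example.

```python
from collections import Counter

# Hypothetical labels for the two reporting layers.
POLICY = ("open", "partially_blocked", "fully_blocked")
INFRASTRUCTURE = ("access_denied", "unscannable")


def policy_rates(outcomes):
    """Compute policy-layer rates over policy-observable domains only.

    Infrastructure outcomes are counted and returned separately and
    never enter the policy denominator.
    """
    counts = Counter(outcomes)
    denominator = sum(counts[label] for label in POLICY)
    rates = {
        label: (counts[label] / denominator if denominator else 0.0)
        for label in POLICY
    }
    infrastructure = {label: counts[label] for label in INFRASTRUCTURE}
    return rates, infrastructure, denominator
```

With 15 scanned domains of which 5 are access-denied or unscannable, the rates are computed over the remaining 10, not over 15.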

Layer 2 · New in v1.3

llms.txt adoption — presence and conformance

v1.3 adds a machine-readable-guidance layer to the v1.2 access instrument. On each domain, in the same scan pass, the crawler assesses whether the site has adopted llms.txt. The Layer 1 access methodology is unchanged and reproducible; this is an additive layer, not a revision to access measurement.

The crawler fetches /llms.txt and, immediately before it, a randomly named control path that should not exist on any real site. A file is recorded as present only when /llms.txt returns a genuine HTTP 200 and the control path does not return matching content. This prevents servers that return 200 for every path (soft-404s) from being miscounted as adopters, and it is the single most important control in the adoption measurement.
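A minimal sketch of the soft-404 control, under stated assumptions: the `Response` type, the `control_path` naming scheme and the exact-match comparison of bodies are all hypothetical. The specification says only that the control path must not return "matching content" and does not define how the match is computed.

```python
import secrets
from dataclasses import dataclass


@dataclass
class Response:
    """Hypothetical container for the two fields the check needs."""
    status: int
    body: bytes


def control_path() -> str:
    # 32 random hex characters: a path that should not exist on any site.
    return f"/{secrets.token_hex(16)}.txt"


def llms_txt_present(llms: Response, control: Response) -> bool:
    """Record presence only for a genuine 200 that the control does not mirror."""
    if llms.status != 200:
        return False
    # Assumption: "matching content" is approximated as byte-equal bodies
    # after trimming whitespace. A soft-404 server fails this check.
    if control.status == 200 and control.body.strip() == llms.body.strip():
        return False
    return True
```

Keeping the decision in a pure function, separate from the network fetch, makes the rule easy to validate against fixtures.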

Presence is separated from conformance. A present file is recorded as conformant only when it is non-empty, is not HTML (by body or content-type), is markdown-structured with at least one heading, and is not a placeholder. Adoption and conformance are reported as distinct rates, so a present-but-invalid file (empty, HTML, or placeholder) counts toward adoption awareness but not toward valid guidance.
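The four conformance conditions could be checked roughly as below. This is a sketch, not the instrument's classifier: the specification does not define what counts as a placeholder, so the `PLACEHOLDER_MARKERS` list is an assumption, as are the function name and the HTML sniffing heuristics.

```python
import re

# Assumption: illustrative placeholder phrases; the real rule is not published here.
PLACEHOLDER_MARKERS = ("coming soon", "lorem ipsum", "under construction")

# A markdown ATX heading: one to six '#' characters, a space, then text.
HEADING = re.compile(r"^#{1,6}[ \t]+\S", flags=re.MULTILINE)


def is_conformant(body: str, content_type: str = "") -> bool:
    text = body.strip()
    if not text:
        return False                      # empty file
    if "html" in content_type.lower():
        return False                      # HTML by content-type
    head = text[:512].lower()
    if head.startswith("<!doctype html") or "<html" in head:
        return False                      # HTML by body
    if not HEADING.search(text):
        return False                      # no markdown heading
    lowered = text.lower()
    if any(marker in lowered for marker in PLACEHOLDER_MARKERS):
        return False                      # placeholder content
    return True
```

A file that fails any check is still counted as present; only the conformance flag changes.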

Classification: present_conformant · present_nonconformant · absent · unscannable. As with the access layer, infrastructure non-response (timeout / 5xx) is excluded from adoption denominators. The paired analysis — llms.txt presence against AI-crawler access — uses only domains where both layers were observable. For every present file, the crawler also records byte size, markdown link count, content-type, and a SHA-256 content hash, retained to support later analysis of whether richer files correlate with AI citation.
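The capture-now fields for a present file might be derived as in this sketch. The field names and the link-matching regular expression are assumptions; the specification lists the four recorded values but not how the link count is computed.

```python
import hashlib
import re

# Assumption: an inline markdown link of the form [text](target).
MD_LINK = re.compile(r"\[[^\]]+\]\([^)\s]+\)")


def capture_fields(body: bytes, content_type: str) -> dict:
    """Return the four per-file values recorded for a present llms.txt."""
    text = body.decode("utf-8", errors="replace")
    return {
        "byte_size": len(body),                        # raw bytes, not characters
        "markdown_link_count": len(MD_LINK.findall(text)),
        "content_type": content_type,
        "sha256": hashlib.sha256(body).hexdigest(),    # hash of the raw bytes
    }
```

Hashing the raw bytes rather than decoded text keeps the hash stable regardless of how the body is later decoded.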

Version history

The instrument is versioned. Each version is dated; access-layer methodology is frozen across the series.

v1.2 — Access benchmark. robots.txt access measurement; the policy / infrastructure two-layer reporting model; access-denied vs unscannable separation. Frozen for the AU / US / GB / SG access studies.

v1.3 — Added llms.txt assessment layer. Adds llms.txt presence and conformance (soft-404 control, capture-now fields) in the same scan pass. No change to access-layer methodology — v1.3 is a strict superset of v1.2, and re-running v1.3 on a v1.2 frame reproduces the v1.2 access numbers exactly. Instrument for the AI Accessibility & Machine-Readable Guidance Benchmark.

User-agents

Crawler test list (frozen across the series)

Two groups, interpreted differently. The crawler list is frozen across all volumes so that findings are comparable across countries and dates. New crawlers are not added mid-series.

Group A — Retrieval & Citation (14 user-agents · the headline metric)

GPTBot OpenAI

OAI-SearchBot OpenAI

ChatGPT-User OpenAI

ClaudeBot Anthropic

anthropic-ai Anthropic

Claude-Web Anthropic

Claude-SearchBot Anthropic

Claude-User Anthropic

PerplexityBot Perplexity

Perplexity-User Perplexity

Bingbot Bing / Copilot

MistralAI-User Mistral

DuckAssistBot DuckDuckGo

Googlebot baseline

Group B — Training (7 user-agents · reported separately, never merged)

Google-Extended Gemini training

CCBot Common Crawl

Bytespider ByteDance

Applebot-Extended Apple

meta-externalagent Meta

FacebookBot Meta

Amazonbot Amazon

Googlebot is included as a baseline. If a site blocks Googlebot at the same rate it blocks AI crawlers, the block is a broad restriction rather than an AI-specific decision — the single most important interpretive control in the method.
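One way to apply the baseline at the level of a single domain is sketched below. The specification states the control in terms of block rates; this per-domain simplification, the `block_scope` function and its return labels are all assumptions made for illustration.

```python
def block_scope(per_bot_blocked: dict, baseline: str = "Googlebot") -> str:
    """Interpret a domain's AI-crawler blocks against the baseline crawler.

    per_bot_blocked maps a user-agent token to True (blocked) or False.
    """
    ai_blocked = [
        bot for bot, blocked in per_bot_blocked.items()
        if blocked and bot != baseline
    ]
    if not ai_blocked:
        return "no_ai_block"
    if per_bot_blocked.get(baseline, False):
        # The baseline is blocked too, so the restriction is not AI-specific.
        return "broad_restriction"
    return "ai_specific"
```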

On Grok (xAI): deliberately excluded. xAI publishes no crawler documentation and its retrieval traffic uses residential-IP rotation with spoofed browser user-agents, presenting no declared user-agent and honouring no robots.txt contract. It cannot be measured by a robots.txt instrument, and including a token for it would produce meaningless results. Microsoft Copilot is covered by Bingbot; Gemini retrieval by Googlebot and Google-Extended.

Accuracy

False-positive prevention (frozen)

The primary failure mode is misclassifying benign housekeeping directives as AI blocks. The following are never counted as blocks, in any volume:

WordPress Disallow: /wp-admin/ — standard admin housekeeping. Not an AI block.
Crawl-delay: N — a politeness directive at any value. Recorded, never counted as blocking.
Sitemap: declarations — informational, never a block.
Empty Disallow: — explicitly means allow all.
Path-specific disallows that leave the homepage crawlable (/cart, /search, /admin/, /login) — housekeeping, not content/AI blocks.

A user-agent is classified blocked only where the governing robots.txt group disallows / or the homepage/primary content path. The classifier is validated against fixture tests before any batch is run.
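The exclusions above can be expressed as a small rule check on the governing group. This is a sketch under assumptions: the function name is hypothetical, wildcard patterns (`*`, `$`) are not handled, and ties between Allow and Disallow are resolved in favour of Allow following the longest-match convention of RFC 9309, which the specification does not itself state.

```python
def blocks_homepage(rules, path="/"):
    """Return True only if the group's rules disallow the given path.

    rules is a list of (directive, value) pairs from one robots.txt group.
    """
    best_len, blocked = -1, False
    for directive, value in rules:
        name = directive.strip().lower()
        value = value.strip()
        if name not in ("allow", "disallow"):
            continue          # Crawl-delay, Sitemap: recorded elsewhere, never a block
        if value == "":
            continue          # empty Disallow means allow all
        if not path.startswith(value):
            continue          # path-specific rule that leaves the homepage crawlable
        is_disallow = name == "disallow"
        # Longest matching rule wins; on equal length, Allow wins.
        if len(value) > best_len or (len(value) == best_len and not is_disallow):
            best_len, blocked = len(value), is_disallow
    return blocked
```

Cases like these are the kind of fixtures the classifier would be validated against before a batch run.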

Protocol

Measurement steps (reproducible)

1 · Retrieve robots.txt

Fetch https://{domain}/robots.txt; record HTTP status. 404 = no robots.txt, treated as open by default. 401 / 403 / 429 = access denied (infrastructure layer). Timeout / 5xx / connection failure = unscannable.
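The status handling in this step amounts to a small lookup, sketched here. The function name and return labels are hypothetical, and statuses the specification does not mention (for example 410) are returned as `"unspecified"` rather than guessed.

```python
def robots_fetch_outcome(status):
    """Map a robots.txt fetch result to the next action or outcome.

    status is the HTTP status code, or None when the request failed
    with a timeout or connection error.
    """
    if status is None:
        return "unscannable"
    if status in (401, 403, 429):
        return "access_denied"
    if 500 <= status <= 599:
        return "unscannable"
    if status == 404:
        return "open_by_default"
    if 200 <= status <= 299:
        return "parse"            # proceed to per-user-agent parsing
    return "unspecified"          # not covered by the specification
```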

2 · Parse per user-agent

Identify the most-specific matching user-agent group (own-UA group overrides *); determine whether it path-blocks the homepage/root for that bot.
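Group selection can be sketched as a two-stage lookup: parse the file into groups, then prefer a bot's own group over the `*` group. The function names are hypothetical, and matching tokens by case-insensitive equality is an assumption about how "most-specific" is resolved.

```python
def parse_groups(text):
    """Parse robots.txt text into {lowercased user-agent: [(directive, value)]}."""
    groups = {}
    agents, reading_agents = [], False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()      # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not reading_agents:
                agents = []                      # a new group begins
            reading_agents = True
            agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        else:
            reading_agents = False
            if field in ("allow", "disallow"):
                for agent in agents:
                    groups[agent].append((field, value))
    return groups


def governing_group(groups, user_agent):
    """Return (matched token, rules); the bot's own group overrides '*'."""
    key = user_agent.lower()
    if key in groups:
        return key, groups[key]
    if "*" in groups:
        return "*", groups["*"]
    return None, []
```

Consecutive User-agent lines share one group, so a rule written once under several bot names applies to each of them.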

3 · Homepage signals

Fetch the homepage; check robots <meta> (noindex/nofollow/noai) and X-Robots-Tag headers. Recorded as secondary signals; robots.txt is primary.
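A minimal sketch of extracting the secondary signals using only the Python standard library. The function and class names are hypothetical, and only the three directives named in this step are reported.

```python
from html.parser import HTMLParser

WATCHED = {"noindex", "nofollow", "noai"}


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                if token.strip():
                    self.directives.add(token.strip().lower())


def homepage_signals(html, headers):
    parser = RobotsMetaParser()
    parser.feed(html)
    # Header names are case-insensitive, so compare in lowercase.
    header_value = next(
        (v for k, v in headers.items() if k.lower() == "x-robots-tag"), ""
    )
    header_directives = {t.strip().lower() for t in header_value.split(",") if t.strip()}
    return {
        "meta": sorted(parser.directives & WATCHED),
        "x_robots_tag": sorted(header_directives & WATCHED),
    }
```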

4 · CMS & CDN detection

Detect CMS (WordPress, Shopify, Drupal, etc.) and CDN/host (Cloudflare, Akamai, etc.) from homepage HTML and response headers — enabling the infrastructure-vs-explicit and CMS-correlation analyses.
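Detection of this kind is typically signature-based. The sketch below is illustrative only: the specification does not publish its signatures, so the markers used here (`/wp-content/`, `cdn.shopify.com`, the `cf-ray` header and a `cloudflare` Server value) are assumptions based on commonly observed fingerprints, and the function name is hypothetical.

```python
# Assumption: a small, illustrative subset of fingerprints, not the instrument's list.
CMS_MARKERS = {
    "WordPress": "/wp-content/",
    "Shopify": "cdn.shopify.com",
}


def detect_stack(html, headers):
    """Guess CMS from homepage HTML and CDN/host from response headers."""
    lowered_headers = {k.lower(): v.lower() for k, v in headers.items()}
    body = html.lower()
    cms = next((name for name, marker in CMS_MARKERS.items() if marker in body), None)
    cdn = None
    if "cf-ray" in lowered_headers or "cloudflare" in lowered_headers.get("server", ""):
        cdn = "Cloudflare"
    return {"cms": cms, "cdn": cdn}
```

Returning `None` when nothing matches keeps undetected stacks distinct from detected ones in the per-domain record.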

5 · Classify

Assign each domain to a policy-layer class (open / partial / fully blocked) or an infrastructure-layer outcome (access-denied / unscannable). Block origin classified as explicit / infrastructure-imposed (Cloudflare managed signature) / indeterminate.
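The layer assignment could look like the sketch below. Note the assumption: the specification does not define the boundary between partial and full blocking, so this example treats "fully blocked" as every tested user-agent blocked and "partially blocked" as at least one blocked. The function name and labels are hypothetical.

```python
def classify_domain(fetch_outcome, per_bot_blocked=None):
    """Return (layer, class) for one domain.

    fetch_outcome is a label for the robots.txt fetch; per_bot_blocked maps
    each tested user-agent to True (blocked) or False.
    """
    if fetch_outcome in ("access_denied", "unscannable"):
        # No policy could be observed, so never record open or blocked.
        return ("infrastructure", fetch_outcome)
    blocked = list((per_bot_blocked or {}).values())
    if blocked and all(blocked):
        return ("policy", "fully_blocked")       # assumption: all bots blocked
    if any(blocked):
        return ("policy", "partially_blocked")   # assumption: some bots blocked
    return ("policy", "open")
```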

6 · Date-stamp

The whole sample is scanned in a tight window — a point-in-time snapshot, date recorded.

Sampling

Sample construction

Samples must be named, public, reproducible and unbiased — never "sites we happened to scan."

Source. Domains drawn from named public directories or rankings, stated explicitly per study. No client sites; no selection by outcome.
Size. The actual number of domains scanned is published. Unscannable and access-denied domains are reported, not hidden.
Unit. Root-level domains. Where franchise, regional or affiliate domains maintain independent robots.txt control, each is treated as an independent observation; domains that redirect to or share a canonical robots.txt are consolidated to the canonical operating domain.
Record per domain. domain, sector, source, CMS, CDN/host, scan date, per-bot status, block origin, classification.
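The per-domain record maps naturally onto a fixed-field structure. The sketch below is one possible shape: the class name, field types and the example values are all assumptions, and only the nine field names come from the list above.

```python
from dataclasses import dataclass, asdict
from datetime import date
from typing import Dict, Optional


@dataclass(frozen=True)
class DomainRecord:
    domain: str
    sector: str
    source: str                      # named public directory or ranking
    cms: Optional[str]
    cdn_host: Optional[str]
    scan_date: date
    per_bot_status: Dict[str, str]   # user-agent token -> status label
    block_origin: str                # explicit / infrastructure-imposed / indeterminate
    classification: str


# Placeholder values for illustration only; not real study data.
record = DomainRecord(
    domain="example.com", sector="retail", source="hypothetical public ranking",
    cms="WordPress", cdn_host="Cloudflare", scan_date=date(2026, 6, 17),
    per_bot_status={"GPTBot": "blocked", "Googlebot": "allowed"},
    block_origin="explicit", classification="partially_blocked",
)
row = asdict(record)   # plain dict, ready to write out as one dataset row
```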

Limitations

Standing caveats

Directive, not enforcement. robots.txt is a directive; some crawlers may not honour it. The method measures declared configuration, not actual crawler behaviour.
Training vs retrieval. Blocking training crawlers can be a deliberate, legitimate content-protection choice. Group A (retrieval) and Group B (training) are reported separately and never merged.
Point-in-time. Date-stamped snapshot; robots files change.
Observational. Measures access configuration, not causal citation outcomes.

Version history

Changelog

v1.2 — 17 June 2026 (current)

Harmonised series release.
Expands the Group A retrieval list to 14 user-agents (adds Bingbot, Claude-SearchBot, Claude-User, MistralAI-User, DuckAssistBot) and documents Grok as unmeasurable by robots.txt.
Extends access-denied handling to HTTP 401 and 429 alongside 403.
Adds the series-wide entity-type rule (commercial operating entities only; portals, aggregators, government, industry bodies, research institutes and not-for-profits are excluded) so every country volume measures one comparable population.
Formalises the two-layer data architecture (classifications assigned pre-scan, held in a frozen metadata file, joined to observations on domain).
The authoritative dataset for the AU, US, GB and SG comparative analysis. Supersedes v1.0 figures.

v1.0 — 16 June 2026

Initial published version. Establishes the policy/infrastructure two-layer reporting model; defines HTTP 403 as access denied (infrastructure layer, excluded from policy denominators) and connection failure / timeout / 5xx as unscannable; freezes the Group A retrieval and Group B training crawler lists; codifies the false-positive prevention rules and the Googlebot baseline control. Basis for AU Vol 1 and US Vol 1.

Future versions will be published at distinct URLs (e.g. /methodology/c01-crawler-v2) so each study cites a fixed, immutable specification.

Attribution

Roles & disclosure

Periodic Table of Digital Authority (PTODA) owns and maintains this methodology. The PTODA C01 Crawler is the reference instrument. AUTHORITY44 provides technical infrastructure and execution support as commercial operator. Douglas Lord is founder of both PTODA and AUTHORITY44; this relationship is disclosed in full. Shared ownership is stated openly — the methodology's credibility rests on versioning, traceable datasets, disclosed limitations, and claims proportionate to evidence, all of which are public.

Cite as: PTODA C01 Crawler Methodology v1.3 (Periodic Table of Digital Authority, 2026), periodictableofdigitalauthority.com/methodology/c01-crawler-v1. The Periodic Table of Digital Authority™ (TM 2644497) and AUTHORITY44™ (TM 2643932) are trade marks pending. © Digital Dominator Pty Ltd ABN 28 616 931 116.