The deterministic robots.txt access scanner used for the Global Digital Authority Benchmark Series crawler-access studies. This document is the versioned, reproducible specification that studies cite.
The C01 Crawler determines, for a defined sample of domains, the publicly declared access configuration for major AI crawler user-agents — by retrieving and parsing each domain's robots.txt, homepage robots meta tags, and relevant HTTP headers, then detecting CMS and CDN/host signals.
It measures declared crawler-access policy, not crawler behaviour and not citation outcomes. It is deterministic: the same domain in the same state returns the same classification, which makes every study reproducible by re-running the instrument against the published sample.
v1.2 separates observable crawler policy from infrastructure non-response. The two are never mixed, and policy rates are computed only on domains whose robots.txt could actually be read.
A block declared in robots.txt is a policy decision. A 403, timeout or unscannable response is an access outcome, not evidence of crawler policy. Reporting the two layers separately prevents over-counting blocks and over-counting openness — results are reported only where behaviour was directly observed.
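The two-layer reporting rule can be illustrated with a minimal sketch. The outcome labels and the function name `layer_rates` are hypothetical, and the sketch assumes that a domain with no robots.txt (open by default) counts as an observed policy state; the specification itself does not publish this code.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

# Hypothetical labels: the specification names the two layers, not these identifiers.
POLICY_OUTCOMES = {"open", "blocked"}              # policy was directly observable
INFRA_OUTCOMES = {"access_denied", "unscannable"}  # 401/403/429, timeout, 5xx


def layer_rates(outcomes: Iterable[str]) -> Dict[str, Optional[float]]:
    """Compute the policy rate on observed domains only, and the
    infrastructure non-response rate on the whole sample."""
    counts = Counter(outcomes)
    observed = sum(counts[label] for label in POLICY_OUTCOMES)
    total = observed + sum(counts[label] for label in INFRA_OUTCOMES)
    return {
        "observed_domains": observed,
        # Denominator is observed domains, never the full sample.
        "block_rate": counts["blocked"] / observed if observed else None,
        "non_response_rate": (total - observed) / total if total else None,
    }


sample = ["open", "blocked", "blocked", "access_denied", "unscannable"]
rates = layer_rates(sample)
```

In this five-domain sample the block rate is 2 of 3 observed domains, not 2 of 5; the two non-responding domains are reported separately and never folded into either the blocked or the open count.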
v1.3 adds a machine-readable-guidance layer to the v1.2 access instrument. On each domain, in the same scan pass, the crawler assesses whether the site has adopted llms.txt. The Layer 1 access methodology is unchanged and reproducible; this is an additive layer, not a revision to access measurement.
The crawler fetches /llms.txt and, immediately before it, a randomly-named control path that cannot exist. A file is recorded as present only when /llms.txt returns a genuine HTTP 200 and the control path does not return matching content — this excludes servers that return 200 for every path (soft-404s) from being miscounted as adopters. It is the single most important control in the adoption measurement.
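A minimal sketch of the soft-404 control follows. It is illustrative only: the `fetch` callable, the `.txt` control-path shape, and the use of byte equality as the test for "matching content" are all assumptions, not details confirmed by the specification.

```python
import secrets
from typing import Callable, Tuple

# Hypothetical transport: takes a path, returns (HTTP status, response body).
Fetch = Callable[[str], Tuple[int, bytes]]


def llms_txt_present(fetch: Fetch) -> bool:
    """Return True only for a genuine /llms.txt, not a catch-all 200 page."""
    # Random control path that cannot exist on the site, fetched first.
    control_path = f"/{secrets.token_hex(16)}.txt"
    control_status, control_body = fetch(control_path)

    status, body = fetch("/llms.txt")
    if status != 200:
        return False

    # Soft-404: the server answers 200 with the same content for any path.
    # Byte equality is an assumed definition of "matching content".
    if control_status == 200 and control_body == body:
        return False
    return True
```

Passing the transport in as a parameter keeps the check testable against canned responses. Timeouts and 5xx responses are not handled here; under the specification they belong to the unscannable outcome rather than to presence or absence.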
Presence is separated from conformance. A present file is recorded as conformant only when it is non-empty, is not HTML (by body or content-type), is markdown-structured with at least one heading, and is not a placeholder. Adoption and conformance are reported as distinct rates, so a present-but-invalid file (empty, HTML, or placeholder) counts toward adoption awareness but not toward valid guidance.
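The conformance conditions above might be checked roughly as follows. This is a sketch under stated assumptions: the HTML sniffing heuristic, the ATX-heading pattern, and the `PLACEHOLDER_HINTS` phrases are hypothetical stand-ins, since the specification does not publish its exact rules.

```python
import re

# Hypothetical placeholder phrases; the real instrument's list is not published here.
PLACEHOLDER_HINTS = ("coming soon", "lorem ipsum")

# A markdown ATX heading: one to six '#' characters, a space, then text.
HEADING = re.compile(r"^#{1,6}\s+\S", re.MULTILINE)


def is_conformant(body: bytes, content_type: str = "") -> bool:
    """Apply the four conformance conditions to a file already recorded as present."""
    text = body.decode("utf-8", errors="replace").strip()
    if not text:
        return False                      # empty file
    lowered = text.lower()
    if "html" in content_type.lower():
        return False                      # HTML by content-type
    if lowered.startswith("<!doctype html") or "<html" in lowered[:1024]:
        return False                      # HTML by body
    if not HEADING.search(text):
        return False                      # no markdown heading
    if any(hint in lowered for hint in PLACEHOLDER_HINTS):
        return False                      # placeholder text
    return True
```

A file that fails any condition still counts as present, so it contributes to the adoption rate but not to the conformance rate.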
Classification: present_conformant · present_nonconformant · absent · unscannable. As with the access layer, infrastructure non-response (timeout / 5xx) is excluded from adoption denominators. The paired analysis — llms.txt presence against AI-crawler access — uses only domains where both layers were observable. For every present file, the crawler also records byte size, markdown link count, content-type, and a SHA-256 content hash, retained to support later analysis of whether richer files correlate with AI citation.
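As an illustration, the four-way classification and the captured fields could be produced like this. The function names, dictionary keys, and the inline-link pattern are assumptions; only the four class labels and the list of recorded fields come from the specification.

```python
import hashlib
import re
from typing import Dict, Union

# Matches inline markdown links such as [Docs](https://example.com/docs).
MARKDOWN_LINK = re.compile(r"\[[^\]]+\]\([^)\s]+\)")


def classify_llms_txt(scannable: bool, present: bool, conformant: bool) -> str:
    """Map the three observations onto the four published classes."""
    if not scannable:
        return "unscannable"   # timeout / 5xx: excluded from adoption denominators
    if not present:
        return "absent"
    return "present_conformant" if conformant else "present_nonconformant"


def capture_fields(body: bytes, content_type: str) -> Dict[str, Union[int, str]]:
    """Record the fields retained for every present file."""
    text = body.decode("utf-8", errors="replace")
    return {
        "byte_size": len(body),
        "markdown_link_count": len(MARKDOWN_LINK.findall(text)),
        "content_type": content_type,
        # Hash the raw bytes so the digest is independent of decoding choices.
        "sha256": hashlib.sha256(body).hexdigest(),
    }
```

Hashing the raw response bytes rather than decoded text means a re-run against an unchanged file yields the same digest, which fits the instrument's reproducibility goal.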
The instrument is versioned. Each version is dated; access-layer methodology is frozen across the series.
v1.2 — Access benchmark. robots.txt access measurement; the policy / infrastructure two-layer reporting model; access-denied vs unscannable separation. Frozen for the AU / US / GB / SG access studies.
v1.3 — Added llms.txt assessment layer. Adds llms.txt presence and conformance (soft-404 control, capture-now fields) in the same scan pass. No change to access-layer methodology — v1.3 is a strict superset of v1.2, and re-running v1.3 on a v1.2 frame reproduces the v1.2 access numbers exactly. Instrument for the AI Accessibility & Machine-Readable Guidance Benchmark.
Two groups, interpreted differently. The crawler list is frozen across all volumes so that findings are comparable across countries and dates. New crawlers are not added mid-series.
Googlebot is included as a baseline. If a site blocks Googlebot at the same rate it blocks AI crawlers, the block is a broad restriction rather than an AI-specific decision — the single most important interpretive control in the method.
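The baseline logic can be sketched per domain as below. The specification states the control in terms of rates across a sample; applying it to a single domain, the label strings, and the assumption that every non-baseline key is an AI crawler are simplifications made for illustration.

```python
from typing import Mapping


def restriction_kind(blocked: Mapping[str, bool], baseline: str = "Googlebot") -> str:
    """Interpret one domain's blocks against the search baseline.

    `blocked` maps user-agent token -> whether that token is blocked.
    Every key other than `baseline` is assumed to be an AI crawler.
    """
    ai_blocked = any(
        is_blocked for agent, is_blocked in blocked.items() if agent != baseline
    )
    if not ai_blocked:
        return "no_ai_block"
    # If the baseline is blocked too, the block is not specific to AI crawlers.
    return "broad_restriction" if blocked.get(baseline, False) else "ai_specific"
```

For example, a domain that blocks both Googlebot and Google-Extended would be read as a broad restriction, while one that blocks only Google-Extended would be read as an AI-specific decision.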
On Grok (xAI): deliberately excluded. xAI publishes no crawler documentation and its retrieval traffic uses residential-IP rotation with spoofed browser user-agents, presenting no declared user-agent and honouring no robots.txt contract. It cannot be measured by a robots.txt instrument, and including a token for it would produce meaningless results. Microsoft Copilot is covered by Bingbot; Gemini retrieval by Googlebot and Google-Extended.
The primary failure mode is misclassifying benign housekeeping directives as AI blocks. The following are never counted as blocks, in any volume:
- Disallow: /wp-admin/ is standard admin housekeeping, not an AI block.
- Crawl-delay: N is a politeness directive at any value. It is recorded, never counted as blocking.
- Sitemap: declarations are informational, never a block.
- Disallow: with an empty value explicitly means allow all.
- Path-specific disallows (/cart, /search, /admin/, /login) are housekeeping, not content/AI blocks.

A user-agent is classified blocked only where the governing robots.txt group disallows / or the homepage/primary content path. The classifier is validated against fixture tests before any batch is run.
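A minimal sketch of this rule, using Python's standard `urllib.robotparser` rather than the instrument's own classifier (which is not published here), shows how housekeeping directives fall out as non-blocks. The function name `is_blocked` and the fixture strings are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt: str, user_agent: str, path: str = "/") -> bool:
    """True only if the governing group denies `path` (the root by default)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # can_fetch picks the group naming this user-agent, else the '*' group.
    return not parser.can_fetch(user_agent, path)


# Fixture: housekeeping only. None of these directives deny the root path.
HOUSEKEEPING = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /cart
Crawl-delay: 10
Sitemap: https://example.com/sitemap.xml
"""

# Fixture: one crawler denied at the root, everyone else only at /login.
TARGETED = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Disallow: /login
"""

housekeeping_blocked = is_blocked(HOUSEKEEPING, "Googlebot")
targeted_blocked = is_blocked(TARGETED, "Google-Extended")
```

Here `housekeeping_blocked` is False and `targeted_blocked` is True. One caveat: the standard-library parser applies the first matching rule within a group, so it can differ from longest-match parsers on files that mix Allow and Disallow rules for overlapping paths.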
1. Fetch https://{domain}/robots.txt; record HTTP status. 404 = no robots = open by default. 401 / 403 / 429 = access denied (infrastructure layer). Timeout / 5xx / connection failure = unscannable.
2. For each crawler, find the governing group (the crawler's own user-agent group, otherwise *); determine whether it path-blocks the homepage/root for that bot.
3. Check the homepage for robots <meta> directives (noindex/nofollow/noai) and X-Robots-Tag headers. Recorded as secondary signals; robots.txt is primary.

Samples must be named, public, reproducible and unbiased, never "sites we happened to scan."
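The status handling in the robots.txt fetch step can be sketched as a small mapping. The outcome labels and the function name are hypothetical, and the treatment of statuses the specification does not mention (for example redirects) is left as an explicit unknown rather than guessed.

```python
from typing import Optional


def robots_fetch_outcome(status: Optional[int]) -> str:
    """Map a robots.txt fetch result onto a reporting layer.

    `status` is the HTTP status code, or None for a timeout or
    connection failure (no response received).
    """
    if status is None or 500 <= status <= 599:
        return "unscannable"        # infrastructure layer
    if status in (401, 403, 429):
        return "access_denied"      # infrastructure layer
    if status == 404:
        return "open_no_robots"     # no robots.txt: open by default
    if status == 200:
        return "readable"           # parse the file to determine policy
    return "unspecified"            # not covered by the specification
```

Only the `readable` and `open_no_robots` outcomes say anything about declared policy; the other outcomes describe whether the file could be reached at all.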
Future versions will be published at distinct URLs (e.g. /methodology/c01-crawler-v2) so each study cites a fixed, immutable specification.
Periodic Table of Digital Authority (PTODA) owns and maintains this methodology. The PTODA C01 Crawler is the reference instrument. AUTHORITY44 provides technical infrastructure and execution support as commercial operator. Douglas Lord is founder of both PTODA and AUTHORITY44; this relationship is disclosed in full. Shared ownership is stated openly — the methodology's credibility rests on versioning, traceable datasets, disclosed limitations, and claims proportionate to evidence, all of which are public.
Cite as: PTODA C01 Crawler Methodology v1.3 (Periodic Table of Digital Authority, 2026), periodictableofdigitalauthority.com/methodology/c01-crawler-v1. The Periodic Table of Digital Authority™ (TM 2644497) and AUTHORITY44™ (TM 2643932) are trade marks pending. © Digital Dominator Pty Ltd ABN 28 616 931 116.