Study 01 · Global Digital Authority Benchmark Series · Singapore 2026

One in three Singapore business websites with an observable crawler policy block the AI crawlers they need

A structured benchmark of 268 publicly identifiable Singapore business websites (commercial operating entities) across 9 industry groups, measuring which AI search crawlers can — and cannot — access them.

Author: Douglas Lord

Published by: Periodic Table of Digital Authority

Research instrument: PTODA C01 Crawler v1.2, a deterministic robots.txt scanner

Methodology: PTODA C01 Crawler Methodology v1.2

Commercial operator: AUTHORITY44™

Scan date: 17 June 2026, SGT

Sample: 268 domains · 9 sectors · 174 with readable robots.txt policy

33%

of Singapore business domains with a readable robots.txt file block at least one AI retrieval crawler

Of 174 domains whose crawler policy could be directly observed, 58 — 33.3% — block at least one crawler used by AI search systems to discover and cite content.

We separate observable AI-crawler policy from infrastructure non-response. A block declared in robots.txt is a policy decision; a 403, timeout or unscannable response is an access outcome, not evidence of crawler policy. These are reported separately below.

Of the 58 blocked sites, 50 (86.2%) apply broad access restrictions that catch AI crawlers incidentally, rather than AI-specific rules.

Key findings

The numbers at a glance

Policy-layer figures are based on 174 domains whose robots.txt was successfully retrieved and parsed. A further 94 domains returned no observable policy (13 access-denied, 81 unscannable) and are reported separately in the Infrastructure layer section. They are not counted as open or blocked.

66.7%

Open to all AI crawlers

116 of 174 policy-observed sites, with no robots.txt restriction preventing AI search discovery

28.2%

Fully blocked to all AI crawlers

49 sites block all tested retrieval crawlers

5.2%

Partially blocked

9 sites block some crawlers but not all

86.2%

Broad blocks — not targeted at AI

50 of 58 blocked sites also block Googlebot, so the block is a broad restriction rather than an AI-specific decision

13.8%

Deliberate AI-only blocks

8 sites specifically blocked AI crawlers while keeping Googlebot accessible

AI crawlers tested

14 retrieval (Group A) and 7 training (Group B) user-agents, the harmonised series crawler set

The central finding: Most blocked businesses are not actively choosing to exclude AI search. They have broad access restrictions set years ago that are now inadvertently catching AI crawlers. This is a configuration problem, not a strategic decision.
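The headline shares can be re-derived from the counts reported above. The following is a minimal arithmetic sketch, not the study's own tooling; the variable and function names are illustrative assumptions.

```python
# Counts as reported in this study; names are illustrative only.
POLICY_OBSERVED = 174
counts = {"open": 116, "fully_blocked": 49, "partially_blocked": 9}


def pct(part: int, whole: int) -> float:
    """Return `part` as a percentage of `whole`, rounded to one decimal."""
    return round(100 * part / whole, 1)


# Sites blocking at least one retrieval crawler (fully + partially blocked).
blocked = counts["fully_blocked"] + counts["partially_blocked"]

shares = {name: pct(n, POLICY_OBSERVED) for name, n in counts.items()}
blocked_share = pct(blocked, POLICY_OBSERVED)

# Among blocked sites: broad blocks (Googlebot also blocked) vs AI-only blocks.
broad, ai_only = 50, 8
broad_share = pct(broad, blocked)
ai_only_share = pct(ai_only, blocked)

print(shares, blocked_share, broad_share, ai_only_share)
```

The three policy-layer categories sum to the 174-domain denominator, and the broad and AI-only counts sum to the 58 blocked sites, so every percentage on this page traces back to one of two bases.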

Infrastructure layer

Access outcomes — not crawler policy

We separate observable AI-crawler policy from infrastructure non-response. A block in robots.txt is a policy decision. A 403, timeout or unscannable response is an access outcome, not evidence of crawler policy. These 94 domains are reported here and excluded from every policy-layer figure.

Access denied (HTTP 403/401/429)

13 domains, 4.9% of the 268 approached. The edge refused the robots.txt request, so no crawler policy could be observed. Not counted as open or blocked.

Unscannable

81 domains, 30.2% of the 268 approached. No readable response, owing to connection failure, timeout or 5xx error. No policy observable.

174

Policy observed

64.9% of domains approached returned a readable robots.txt. This is the denominator for all policy-layer findings on this page.

Why this matters: a domain that denies the crawler at the infrastructure layer has not expressed an AI-crawler policy — it has prevented one from being read. Treating such a response as “open” would overstate access; treating it as “blocked” would overstate restriction. Reporting it separately keeps the policy-layer figures based only on directly observed robots.txt behaviour.
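The layer split described above can be expressed as a small decision function. This is a hedged sketch under stated assumptions: the `Layer` enum and `classify_fetch` are hypothetical names, not the PTODA C01 Crawler's actual code, and statuses the study does not describe (for example 404 or redirects) are deliberately left unclassified.

```python
from enum import Enum
from typing import Optional


class Layer(Enum):
    POLICY_OBSERVED = "policy_observed"   # readable robots.txt
    ACCESS_DENIED = "access_denied"       # HTTP 401/403/429 at the edge
    UNSCANNABLE = "unscannable"           # connection failure, timeout or 5xx


ACCESS_DENIED_CODES = {401, 403, 429}


def classify_fetch(status: Optional[int], body: Optional[str]) -> Optional[Layer]:
    """Map a robots.txt fetch result to a reporting layer.

    `status` is None when the request failed or timed out.
    Returns None for outcomes the study does not describe (assumption).
    """
    if status is None or status >= 500:
        return Layer.UNSCANNABLE
    if status in ACCESS_DENIED_CODES:
        return Layer.ACCESS_DENIED
    if status == 200 and body is not None:
        return Layer.POLICY_OBSERVED
    return None  # e.g. 404 or 3xx: handling not stated in the source


print(classify_fetch(403, None))
print(classify_fetch(None, None))
```

Only domains that land in `POLICY_OBSERVED` would go on to be counted as open or blocked; the other two layers are tallied separately.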

The Singapore signature: non-response, not denial. Like Great Britain, Singapore restricts access mostly by not responding rather than by actively denying. Of the 94 domains that returned no observable policy, 81 (30.2% of those approached, the highest unscannable rate in the series) were unscannable through connection failure or timeout, against just 13 active access denials. This is passive non-response, the opposite of the United States pattern of managed-WAF edge denial. Singapore also has the lowest policy-layer block rate in the series at 33.3%, consistent with a market whose largest businesses lean toward global visibility.

Block origin

Intentional vs infrastructure-imposed

Of the 58 sites blocking AI retrieval crawlers, the source of the block was classified into three categories.

62.1%

Explicit — author-set

36 sites. Block is in the site's own robots.txt. May be intentional or legacy configuration.

29.3%

Indeterminate

17 Cloudflare-hosted sites without a managed-robots signature. Likely explicit blocks — cannot be confirmed by automated analysis alone.

8.6%

Infrastructure-imposed

5 sites. Block originates from Cloudflare's managed robots.txt feature — a platform default the owner may never have consciously set.

The infrastructure-imposed subset is the most commercially significant finding: these site owners may be blocking AI search discovery without ever having made that decision. The indeterminate category of 17 Cloudflare-hosted sites most likely represents explicit blocks, but the configuration path cannot be confirmed by automated means alone.

Sector analysis

Block rates by industry

Block rates vary across sectors. Healthcare, Retail and Hospitality are highest; Building & Trades lowest. Rates are computed on policy-observed domains per sector (readable robots.txt only). As a concentrated city-state market, several Singapore sectors are small; rates rest on modest bases and should be read accordingly.

% blocking ≥1 retrieval crawler (of policy-observed domains per sector)

Healthcare

50.0% (n=12)

Retail & Ecommerce

47.6% (n=21)

Hospitality & Tourism

47.4% (n=19)

Real Estate

31.6% (n=19)

Technology & SaaS

31.6% (n=38)

Education & Training

29.4% (n=17)

Accounting & Finance

28.6% (n=14)

Professional Services

25.0% (n=16)

Building & Trades

11.1% (n=18)

Healthcare at 50.0%, Retail at 47.6% and Hospitality at 47.4% are the highest-blocking Singapore sectors, though each rests on a small base. Building & Trades at 11.1% is by far the most open, the lowest single sector rate in the entire series. Singapore is a concentrated market and several of these sectors reflect close to the full population of citable named businesses rather than a sample of a larger pool.
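One way to see how much the small bases matter is to put an interval around each sector rate. The sketch below uses a Wilson score interval, which is a standard technique but is not part of the published methodology; the blocked counts (6 of 12 for Healthcare, 2 of 18 for Building & Trades) are back-computed here from the published rates and bases.

```python
import math
from typing import Tuple


def wilson_interval(blocked: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    p = blocked / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Counts back-computed from the published rate and base (assumption).
sectors = {"Healthcare": (6, 12), "Building & Trades": (2, 18)}

for name, (blocked, n) in sectors.items():
    lo, hi = wilson_interval(blocked, n)
    print(f"{name}: {blocked / n:.1%} (approx. 95% interval {lo:.1%} to {hi:.1%})")
```

For a base as small as n=12, the interval spans roughly a quarter to three quarters of the sector, which is why the ranking between neighbouring sectors should be treated as indicative only.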

Per-crawler analysis

Which crawlers are blocked most

Group A (retrieval/citation crawlers) drives the headline finding. Group B (training crawlers) is reported separately, because blocking training crawlers is often a deliberate and legitimate content-protection decision.

Group A — Retrieval & Citation Crawlers (14 tested)

GPTBot (OpenAI): 32.2%

ClaudeBot (Anthropic): 32.2%

anthropic-ai (Anthropic): 31.0%

Perplexity-User (Perplexity): 30.5%

MistralAI-User (Mistral): 30.5%

ChatGPT-User (OpenAI): 29.9%

Claude-User (Anthropic): 29.9%

Googlebot (baseline): 28.7%

Group B — Training Crawlers (7, separate)

Applebot-Extended (Apple): 33.3%

meta-externalagent (Meta): 33.3%

CCBot (Common Crawl): 32.8%

Bytespider (ByteDance): 32.8%

Amazonbot (Amazon): 32.8%

Google-Extended (Gemini training): 32.2%

FacebookBot (Meta): 31.0%

The Googlebot parity finding holds in Singapore. Googlebot is blocked at 28.7%, right alongside the AI retrieval crawlers (GPTBot 32.2%, ClaudeBot 32.2%). Most Singapore AI-crawler blocks are broad restrictions, not targeted AI decisions. The 14 retrieval crawlers cluster tightly (28.7%–32.2%), indicating that where AI is blocked, it is typically blocked uniformly across operators rather than selectively.
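The broad versus AI-only distinction comes down to a per-site check: is Googlebot blocked alongside the AI retrieval crawlers? A minimal sketch follows, assuming a hypothetical `classify_block` helper and an illustrative subset of the retrieval agents named above; the instrument's real classification logic is defined in the methodology, not here.

```python
from typing import Optional, Set

# Illustrative subset of the Group A agents listed in this section.
RETRIEVAL_AGENTS = {
    "GPTBot", "ClaudeBot", "anthropic-ai", "Perplexity-User",
    "MistralAI-User", "ChatGPT-User", "Claude-User",
}


def classify_block(blocked_agents: Set[str]) -> Optional[str]:
    """Label a site's block as 'broad' or 'ai_only'.

    Returns None when no retrieval crawler is blocked at all.
    """
    if not blocked_agents & RETRIEVAL_AGENTS:
        return None
    # Googlebot blocked too: a general restriction, not an AI-specific one.
    return "broad" if "Googlebot" in blocked_agents else "ai_only"


print(classify_block({"GPTBot", "ClaudeBot", "Googlebot"}))
print(classify_block({"GPTBot"}))
```

Under this rule the 50 sites that block Googlebot fall into the broad group and the remaining 8 into the AI-only group.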

Platform analysis

CMS correlation

Block rates by content management system among policy-observed domains. WordPress is the only platform with a usable base in the Singapore sample; the others rest on very small samples.

Block rate by detected CMS (policy-observed domains)

WordPress

28.3% (n=53)

Drupal

16.7% (n=6)

Most sites return no identifiable CMS signature, so platform-level rates are based on the minority that do. WordPress at 28.3% (n=53) sits below the overall Singapore block rate of 33.3% and is the only platform with enough domains for a meaningful rate. Drupal (n=6), Shopify (n=5), Webflow (n=3), Wix (n=3) and Squarespace (n=1) have too few domains in this sample to report reliably.

Methodology

How this study was conducted

Study specification

Methodology version

PTODA C01 Crawler Methodology (v1.2, June 2026). Citable, versioned specification covering sample criteria, the 21-user-agent crawler list, classification logic, and the policy/infrastructure layer split.

Research instrument

PTODA C01 Crawler v1.2 — a deterministic robots.txt scanner. Same input produces the same result; the study is reproducible by re-running the instrument against the published sample.

Sample

268 publicly identifiable Singapore business websites (commercial operating entities) across 9 industry groups, sourced from named public directories. Singapore was sampled across 9 of the series’ 10 sectors; the Legal sector did not yield enough independently citable named commercial practices in this market to form a sector cell, so it is omitted rather than reported on a tiny base. As a concentrated city-state market, several sectors reflect close to the full population of citable named businesses rather than a sample. Portals, aggregators, government, industry bodies, research institutes and not-for-profits are excluded under the harmonised series entity-type rule. No client sites. No sites selected by outcome.

Sectors

Retail/Ecommerce, Real Estate, Healthcare, Building/Trades, Accounting/Finance, Hospitality/Tourism, Education/Training, Technology/SaaS, Professional Services. The Legal sector is not included; see the sample note above.

Measurement

Public robots.txt parsed per user-agent across 21 AI crawlers (14 retrieval, 7 training). Homepage meta robots and X-Robots-Tag headers examined. CMS and CDN/host detected from homepage signals.
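Per-user-agent parsing of this kind can be approximated with Python's standard-library `urllib.robotparser`. The sketch below is not the PTODA instrument: the agent list is an illustrative subset, `example.com` is a placeholder, and treating "blocked" as "the site root is disallowed for that agent" is an assumption made for the example.

```python
from typing import List, Sequence
from urllib.robotparser import RobotFileParser

# Illustrative subset; the study tests 14 retrieval and 7 training agents.
AGENTS = ["GPTBot", "ClaudeBot", "Perplexity-User"]


def blocked_agents(robots_txt: str, agents: Sequence[str] = AGENTS) -> List[str]:
    """Return the agents that may not fetch the site root (assumed rule)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [ua for ua in agents
            if not parser.can_fetch(ua, "https://example.com/")]


def site_status(robots_txt: str, agents: Sequence[str] = AGENTS) -> str:
    """Classify a site as open, partially blocked or fully blocked."""
    blocked = blocked_agents(robots_txt, agents)
    if not blocked:
        return "open"
    return "fully_blocked" if len(blocked) == len(agents) else "partially_blocked"


sample = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nDisallow:\n"
print(blocked_agents(sample), site_status(sample))
```

Because the parser works on text already retrieved, the same input always yields the same classification, which matches the reproducibility property the study describes for its instrument.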

Bot identity

PTODA-C01-Crawler/1.2 — identified honestly in every request. robots.txt respected; polite rate limits applied.

Scan date

17 June 2026, SGT. Point-in-time snapshot.

False positive prevention

WordPress /wp-admin/ disallows, Crawl-delay directives and sitemap declarations are explicitly excluded from blocked classification. Validated against 14 fixture tests before the batch ran.
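A fixture of the kind described might look like the following. It is a hypothetical example rather than one of the 14 actual fixtures: the robots.txt content is a typical WordPress-style file written for illustration, and `blocks_homepage` assumes that a block means the site root is disallowed for the agent.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical fixture: admin-path disallow, crawl delay and sitemap only.
WORDPRESS_STYLE = """\
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Crawl-delay: 10
Sitemap: https://example.com/sitemap.xml
"""

SITE_WIDE_BLOCK = """\
User-agent: *
Disallow: /
"""


def blocks_homepage(robots_txt: str, agent: str) -> bool:
    """True if `agent` may not fetch the site root (assumed block rule)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, "https://example.com/")


# The admin-only disallow must not register as an AI-crawler block.
assert not blocks_homepage(WORDPRESS_STYLE, "GPTBot")
# A site-wide disallow must register as one.
assert blocks_homepage(SITE_WIDE_BLOCK, "GPTBot")
```

Checking only whether the root path is fetchable means path-scoped rules, delays and sitemap lines cannot inflate the block count.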

URL structure

Root-level domains only. Singapore is a regional headquarters hub, so a business qualifies where it is Singapore-headquartered or operates through a Singapore-rooted domain; global firms present only as a sub-path of an international domain are excluded.

Policy vs infrastructure layers

Of 268 domains approached, 174 returned a readable robots.txt policy (the policy-layer denominator). 94 returned no observable policy and are reported separately: 13 access-denied (HTTP 403/401/429 at the edge) and 81 unscannable (connection failure, timeout or 5xx). Access-denied and unscannable domains are never counted as open or blocked.

Series freeze reference

Dataset version

AI Crawler Access Study Series v1.2 — frozen 17 June 2026. The authoritative dataset for the Australia, United States, Great Britain and Singapore comparative analysis. All figures on this page derive from the v1.2 series master; figures published under v1.0 are superseded.

Limitations

Caveats

Point-in-time snapshot. Scan conducted 17 June 2026, SGT. robots.txt configurations change; findings reflect the sample state at time of scanning only.
Sample scope. A structured benchmark of 268 publicly identifiable Singapore business websites (commercial operating entities) across 9 industry groups. Findings generalise to that population, not all Singapore businesses.
Directive vs enforcement. robots.txt is a directive, not technical enforcement. Some crawlers may not honour directives. This study measures declared access configuration, not actual crawler behaviour.
Training vs retrieval. Blocking training crawlers (CCBot, Google-Extended, Bytespider) can be a deliberate, legitimate content-protection decision. Training and retrieval blocking are reported separately and the distinction is maintained throughout.
Root-domain limitation. Global firms present in Singapore only as a sub-path of an international domain are not included in this edition. Future benchmark editions will address sub-path and franchise location scoring.
Observational. This study measures access configuration, not causal AI citation outcomes. Study 03 in this series will measure correlation between authority signals and actual AI citation frequency.

Disclosure & Intellectual Property

Roles. This study is published by the Periodic Table of Digital Authority (PTODA), the methodology owner. It was conducted using the PTODA C01 Crawler v1.2, a deterministic robots.txt reference instrument, under PTODA C01 Crawler Methodology v1.2. AUTHORITY44 provided technical infrastructure and execution support as commercial operator. Douglas Lord is the founder of both PTODA and AUTHORITY44; this relationship is disclosed in full. The sample was constructed from named public directories with no reference to commercial relationships. The methodology is fully documented and reproducible. This study publishes aggregate, anonymised findings only. No named individual site results are published.

Attribution chain: Douglas Lord (researcher, author) · Periodic Table of Digital Authority (publisher & methodology owner) · PTODA C01 Crawler v1.2 (research instrument) · AUTHORITY44™ (commercial operator) · Digital Dominator Pty Ltd ABN 28 616 931 116 (operating entity).

Intellectual property notice: This study, its methodology, findings, data, and all associated content are the original work of Douglas Lord and the property of Digital Dominator Pty Ltd (ABN 28 616 931 116). The Periodic Table of Digital Authority™ is a coined framework and trade mark pending (TM 2644497). AUTHORITY44™ is a trade mark pending (TM 2643932). All rights reserved.

You may cite findings from this study with appropriate attribution identifying the author (Douglas Lord), the publisher (Periodic Table of Digital Authority — periodictableofdigitalauthority.com), and the research instrument (PTODA C01 Crawler v1.2). You may not reproduce this study in full, present these findings as your own research, or use the framework name or trade marks without prior written consent. Use of this research is subject to the Terms of Use.