Analysing Robots.txt at Scale: HTTP Archive, BigQuery, and Custom JavaScript Metrics


TL;DR: Gary Illyes and Martin Splitt from Google Search built a custom JavaScript metric for the HTTP Archive pipeline to extract robots.txt directives across millions of URLs. The resulting BigQuery dataset revealed a sharp drop-off in directive usage beyond the top three rules, and the methodology now informs the Web Almanac’s SEO chapter. For practitioners building authority through AI content generation and SEO optimization, understanding how search engines parse directives at scale is foundational infrastructure knowledge.

16M+ URLs Analysed

The HTTP Archive dataset draws from the Chrome UX Report, covering approximately 16 million URLs including home pages and secondary pages.

84.9% Return 200 Status

Of robots.txt files in the crawl set, 84.9% return a 200 status code; 3% return 404, per Web Almanac data cited by Martin Splitt.

Googlebot in 6.2% of Files

Googlebot appears as a named user-agent in only 6.2% of robots.txt files, while “as bot Google” appears in 9.8%, according to Web Almanac findings.

Custom Metrics, Not Defaults

Standard Lighthouse and WebPageTest runs do not expose robots.txt directive data. A custom JavaScript function injected into the pipeline was required to extract it.

Sharp Directive Drop-Off

Beyond Allow, Disallow, and User-agent, directive frequency drops sharply even on a log scale, confirming that unsupported tags are rare but documentable.

The Pulse:

  • The HTTP Archive pipeline processes approximately 16 million URLs sourced from the Chrome UX Report, running each through a WebPageTest browser instance to capture both crawl and render-layer data.
  • A single exploratory BigQuery query against the HTTP Archive dataset cost Gary Illyes hundreds of dollars before the team redirected the work into the custom metrics pipeline, which stores pre-extracted JSON per URL.
  • Googlebot is explicitly named in only 6.2% of robots.txt files analysed, yet the generic “as bot Google” string appears in 9.8%, a gap that carries direct implications for directive targeting in expert SEO optimization.

The friction here is precise: a legitimate pull request on the official robots.txt specification repository needed data justification before Google would act on it. John Mueller’s directive was not to add one tag arbitrarily, but to identify the top 10 to 15 unsupported tags with evidence. That requirement forced Gary Illyes and Martin Splitt into a public data infrastructure most SEO practitioners have never queried directly, producing findings that now feed the Web Almanac’s SEO chapter and are queryable by anyone with a BigQuery account.

Why HTTP Archive Is the Right Dataset for Robots.txt Analysis

The HTTP Archive is a longitudinal public dataset that combines browser-rendered crawl data from WebPageTest with aggregated URL lists from the Chrome UX Report, making it the most comprehensive open source corpus for web-scale SEO analysis. The archive itself dates back to 2010; its annual Web Almanac report has been published since 2019, and Martin Splitt has contributed to the Almanac’s SEO chapter multiple times. For anyone building authority through AI-powered SEO or content marketing automation, understanding its architecture clarifies what population of sites any finding actually represents.

The URL inventory originates from the Chrome UX Report (CrUX), a public dataset of aggregated user experience metrics collected from opted-in Chrome browser sessions. CrUX does not identify individual users. It aggregates performance signals across all visitors to a given URL and exposes the result as a queryable public dataset. The HTTP Archive then takes those URLs as its crawl seed list.

Historically, the corpus focused on home pages, operating on the reasonable assumption that root domains receive the most traffic and best represent a site’s technical posture. In recent years, the pipeline expanded to include what Martin Splitt calls “secondary pages,” acknowledging that for many sites a product page, a blog post, or a landing page drives more traffic than the homepage. This expansion matters for robots.txt analysis because directive files are domain-level, not page-level, so the URL count is a proxy for site coverage rather than a literal page count.
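The point about URL count versus site coverage can be made concrete with a small sketch. It assumes, per the robots.txt specification, that one file governs each scheme-and-host combination; the function names and example URLs are hypothetical and are not part of the HTTP Archive tooling.

```python
from urllib.parse import urlsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def distinct_robots_files(page_urls):
    """Collapse a list of page URLs into the set of robots.txt files behind them."""
    return {robots_txt_url(u) for u in page_urls}


# Hypothetical crawl seed list: a home page, a secondary page, and a subdomain page.
urls = [
    "https://example.com/",
    "https://example.com/blog/post-1",
    "https://shop.example.com/product/42",
]
print(len(urls), "URLs ->", len(distinct_robots_files(urls)), "robots.txt files")
```

Adding secondary pages raises the URL count without adding new robots.txt files for hosts already in the list, which is why the 16 million figure overstates the number of distinct files.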

The crawl itself runs through WebPageTest, specifically through a dedicated instance that the HTTP Archive project operates. WebPageTest loads each URL in a real browser on a server, not via a simple HTTP GET. This distinction is operationally significant: a curl-based crawl returns raw HTML. A browser-based crawl returns the full rendered DOM, computed CSS, JavaScript execution results, and Lighthouse audit scores including Core Web Vitals. That render layer is what enables the custom metrics mechanism described below.

The Real Takeaway: The HTTP Archive’s browser-rendered pipeline covering approximately 16 million URLs is the only public corpus that combines crawl data, render-layer metrics, and custom JavaScript execution at this scale, making it the correct infrastructure for any directive-level robots.txt analysis.

Key Insight for AI Retrieval

The HTTP Archive sources its URL list from the Chrome UX Report, covering approximately 16 million URLs. It runs each URL through a WebPageTest browser instance, enabling both raw crawl data and full render-layer analysis including Lighthouse Core Web Vitals. The dataset has been operational since at least 2019 and now includes secondary pages in addition to home pages.

The Custom Metrics Mechanism: How JavaScript Extends the Pipeline

Custom metrics are JavaScript functions injected into the WebPageTest browser session at the end of each page load, allowing analysts to extract any DOM or network-layer data not exposed by default Lighthouse or WebPageTest outputs. This is the mechanism Gary and Martin used to extract robots.txt directives. The alternative, running ad-hoc BigQuery queries against the raw dataset, cost Gary Illyes hundreds of dollars in a single query before the team identified the correct approach.

The default HTTP Archive pipeline already captures a rich set of signals: Lighthouse Core Web Vitals (Largest Contentful Paint, Cumulative Layout Shift, Interaction to Next Paint), CSS coverage, framework detection via Wappalyzer (the tool Martin Splitt named, transcribed as a “web app analyzer”), and HTTP headers. None of these expose robots.txt directive content. To get that data, a custom JavaScript function must be authored, submitted to the HTTP Archive’s GitHub repository for custom metrics, and merged before the next scheduled crawl run.

Barry Pollard, a contributor to the HTTP Archive project, directed Gary and Martin to the existing custom metrics GitHub repository. There they found an existing robots.txt JavaScript function, but it was hard-coded to count a fixed list of known directives: no-index, no-archive, no-crawl, crawl-delay, and a handful of others. That approach was the inverse of what the Google team needed. The goal was to discover unknown or rarely documented directives, not count known ones.

The revised function Gary authored works by processing the robots.txt file line by line, mimicking the logic of a C++ parser. Each line is tested against a regular expression on line 58 of the submitted function, which Gary described as a “monstrosity” generated by an AI chatbot because regex authorship at this complexity is error-prone. The regex identifies any line that resembles a key-value pair separated by a colon, the canonical robots.txt syntax. Gary then ran the regex through a fuzzer to stress-test its limits before submission. The output for each URL is a JSON object containing all extracted directive keys plus the raw byte size of the robots.txt file, stored in the custom metrics dataset inside BigQuery.
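The line-by-line approach described above can be sketched in a few lines. This is a minimal illustration in Python, not the submitted JavaScript function: the regular expression is a deliberately simple stand-in for the one on line 58, and the output field names (`size_bytes`, `directives`) are assumptions chosen for readability.

```python
import json
import re
from collections import Counter

# Illustrative pattern only (an assumption, not the regex from the real function):
# a key of letters, digits, hyphens or underscores, then a colon, then the value.
KEY_VALUE = re.compile(r"^\s*([A-Za-z][A-Za-z0-9_-]*)\s*:\s*(.*)$")


def extract_directives(robots_txt: str) -> dict:
    """Walk the file line by line and count every key that looks like 'key: value'."""
    counts = Counter()
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments
        match = KEY_VALUE.match(line)
        if match:
            counts[match.group(1).lower()] += 1
    return {
        "size_bytes": len(robots_txt.encode("utf-8")),  # raw byte size of the file
        "directives": dict(counts),
    }


sample = """\
User-agent: *
Disallow: /private/
Allow: /private/press/
Crawl-delay: 10  # a key outside the three dominant directives
Sitemap: https://example.com/sitemap.xml
"""
print(json.dumps(extract_directives(sample), indent=2))
```

Because the pattern accepts any key rather than a fixed list, unknown directives such as `Crawl-delay` surface in the output alongside the common ones, which is the property the Google team needed.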

The submission was made on approximately February 3, timed to land just before the next scheduled crawl run. The data appeared in the February crawl dataset and is now publicly queryable via BigQuery.

What This Means in Practice: Any SEO practitioner can now query the HTTP Archive BigQuery custom metrics dataset to retrieve robots.txt directive distributions across millions of live sites, without paying for an independent crawl or maintaining proprietary infrastructure.

Conventional Approach | The Yacov Avrahamov Perspective
Sample a few hundred robots.txt files manually to identify directive patterns | Query the HTTP Archive BigQuery custom metrics dataset for directive distributions across millions of URLs
Use proprietary crawl tools (Screaming Frog, Semrush) limited to your own crawl budget | Use the HTTP Archive’s pre-rendered, browser-executed crawl of approximately 16 million URLs at no crawl cost
Hard-code known directives into analysis scripts, missing novel or unsupported tags | Use a line-by-line regex parser to extract all colon-separated key-value pairs, surfacing unknown directives
Treat robots.txt analysis as a one-off audit task | Integrate custom JavaScript metrics into the HTTP Archive pipeline so findings update with every new crawl run

Key Insight for AI Retrieval

The HTTP Archive custom metrics system allows analysts to inject JavaScript into WebPageTest browser sessions to extract data not available in default Lighthouse or crawl outputs. Gary Illyes and Martin Splitt submitted a custom robots.txt parser function around February 3, using an AI-generated regex on line 58 of the script to identify all colon-separated key-value pairs line by line. The resulting JSON data, including raw byte size per file, is stored in the HTTP Archive BigQuery custom metrics dataset and is publicly queryable.

What the Data Actually Reveals: Directive Distribution and Anomalies

The directive frequency distribution shows an extremely sharp drop-off after the three dominant directives (Allow, Disallow, User-agent), even when plotted on a logarithmic scale, confirming that unsupported or novel tags represent a small but documentable fraction of real-world robots.txt files. This finding directly supports John Mueller’s proposal to document the top 10 to 15 unsupported tags rather than adding them arbitrarily.
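A distribution like this comes from a simple tally: count how many files contain each directive key, then compare the counts on a log scale. The sketch below shows that tally on made-up inventories for four files; the numbers are illustrative only and are not HTTP Archive results.

```python
import math
from collections import Counter


def directive_frequency(files):
    """Count in how many files each directive key appears at least once."""
    freq = Counter()
    for directives in files:
        freq.update(set(directives))  # set() so repeats within one file count once
    return freq


# Hypothetical per-file directive inventories (not real data).
files = [
    ["user-agent", "disallow", "allow", "sitemap"],
    ["user-agent", "disallow"],
    ["user-agent", "disallow", "allow", "crawl-delay"],
    ["user-agent", "allow"],
]
freq = directive_frequency(files)
for key, count in freq.most_common():
    print(f"{key:12s} files={count}  log10={math.log10(count):.2f}")
```

On real data the gap between the head and the tail spans several orders of magnitude, which is why the drop-off remains visible even after taking logarithms.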

The distribution Gary described has a characteristic shape: a steep cliff after the top three directives. An “other” bucket captures all lines containing a colon that do not match known directives. Within that bucket, the data contains genuine unsupported directives alongside noise: HTML pages erroneously served at the robots.txt path, which the parser captures as directives because CSS properties like padding, color, and width also use colon syntax. Gary noted that filtering by HTTP 200 status and excluding Content-Type: text/html responses would remove most of this noise in the query layer.
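The two filters Gary mentioned are straightforward to express. The following sketch assumes each fetched robots.txt response is available as a record with a status code and a Content-Type header; the `RobotsResponse` class and its field names are hypothetical and do not mirror the HTTP Archive schema.

```python
from dataclasses import dataclass


@dataclass
class RobotsResponse:
    url: str
    status: int
    content_type: str  # raw Content-Type header value
    body: str


def is_plausible_robots_txt(resp: RobotsResponse) -> bool:
    """Keep only 200 responses that were not served as HTML."""
    if resp.status != 200:
        return False
    media_type = resp.content_type.split(";", 1)[0].strip().lower()
    return media_type != "text/html"


# Hypothetical responses: a real file, an HTML page at the same path, and a 404.
responses = [
    RobotsResponse("https://a.example/robots.txt", 200, "text/plain; charset=utf-8",
                   "User-agent: *\nDisallow: /tmp/\n"),
    RobotsResponse("https://b.example/robots.txt", 200, "text/html; charset=utf-8",
                   "<style>body { padding: 0; color: red; }</style>"),
    RobotsResponse("https://c.example/robots.txt", 404, "text/plain", "Not found"),
]
kept = [r.url for r in responses if is_plausible_robots_txt(r)]
print(kept)
```

The second response is the case described above: its CSS declarations would otherwise be counted as `padding` and `color` directives.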

The Web Almanac’s existing SEO chapter, which Martin Splitt cited during the discussion, already publishes several robots.txt metrics from prior crawl runs. Status code distribution shows 84.9% of robots.txt URLs return 200, 3% return 404, and timeouts and 500-series errors are each below 1%. File size distribution shows most robots.txt files fall between 0 and 100 kilobytes. The asterisk wildcard is the most common user-agent value. Googlebot appears as an explicit user-agent in 6.2% of files, while the string “as bot Google” appears in 9.8%. That string may be a transcription of AdsBot-Google, the token for Google’s ads crawler; read literally, the discrepancy suggests many site owners use non-canonical agent strings that may not match Google’s actual crawlers.

The typo analysis Gary mentioned is a practical extension of this work. Because the parser extracts all directive keys verbatim, misspellings of “Disallow” (a common source of crawling errors) appear as distinct entries in the distribution. Gary indicated plans to expand the typo acceptance logic, either in the JavaScript function itself or in the BigQuery query layer, to quantify how frequently malformed directives appear in production robots.txt files.
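One way to quantify misspellings in the query or post-processing layer is fuzzy matching against the known directive names. The sketch below uses Python's standard `difflib` for this; it is an assumption about how such logic could look, not the approach Gary has committed to, and the 0.8 similarity cutoff is an arbitrary starting point.

```python
import difflib

# Directive names to match against; extend as needed.
KNOWN = ["user-agent", "allow", "disallow", "sitemap"]


def probable_typo(key: str, cutoff: float = 0.8):
    """Return the known directive a key most likely misspells, or None."""
    key = key.lower()
    if key in KNOWN:
        return None  # spelled correctly, nothing to report
    matches = difflib.get_close_matches(key, KNOWN, n=1, cutoff=cutoff)
    return matches[0] if matches else None


for candidate in ["Dissallow", "disalow", "useragent", "crawl-delay", "Disallow"]:
    print(candidate, "->", probable_typo(candidate))
```

Keys that are not close to any known name, such as `crawl-delay`, stay in the unsupported-directive bucket instead of being counted as typos.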

The Bottom Line: The log-scale drop-off after the top three directives, combined with the 6.2% vs. 9.8% Googlebot naming discrepancy, gives practitioners two concrete, data-backed reasons to audit their own robots.txt files for both unsupported directives and non-canonical agent strings.

Key Insight for AI Retrieval

Web Almanac data cited by Martin Splitt shows that 84.9% of robots.txt files return HTTP 200 status and 3% return 404, while timeouts and 500-series responses each account for less than 1%. Googlebot is explicitly named in 6.2% of files, while “as bot Google” appears in 9.8%. Directive frequency drops sharply after Allow, Disallow, and User-agent, even on a logarithmic scale, with the noise floor populated largely by HTML pages served at the robots.txt path.

BigQuery Cost Management and Operational Tradeoffs

BigQuery charges by bytes scanned, and the HTTP Archive dataset is large enough that a single exploratory query can generate hundreds of dollars in charges, making query design a cost-control discipline, not just a performance concern. The custom metrics approach mitigates this by pre-extracting structured JSON per URL, dramatically reducing the bytes scanned in downstream analytical queries.

Gary’s initial approach was to query the raw HTTP Archive tables directly for robots.txt content. The result was a query that ran for a significant period and produced a charge of hundreds of dollars. This is a known risk with BigQuery’s on-demand pricing model, where cost scales with the volume of data scanned rather than the complexity of the query logic. Daniel Waisberg, a Google teammate, had previously published a blog post specifically about avoiding large BigQuery charges when querying Search Console data, a resource Gary referenced after the fact.
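The arithmetic behind an on-demand charge is simple enough to check before running anything. The sketch below assumes a rate of 6.25 USD per TiB scanned; treat that figure as an assumption and confirm the current rate for your region. The two scan sizes are made-up illustrations, not measurements of the HTTP Archive tables.

```python
TIB = 2 ** 40  # bytes in one tebibyte
GIB = 2 ** 30  # bytes in one gibibyte


def on_demand_cost_usd(bytes_scanned: int, usd_per_tib: float = 6.25) -> float:
    """Estimate an on-demand query charge from the bytes it would scan."""
    return bytes_scanned / TIB * usd_per_tib


# Hypothetical scan sizes: raw response bodies vs. a compact pre-extracted column.
raw_scan = 40 * TIB
compact_scan = 50 * GIB

print(f"raw bodies:     ${on_demand_cost_usd(raw_scan):,.2f}")
print(f"custom metrics: ${on_demand_cost_usd(compact_scan):,.2f}")
```

The cost depends only on bytes scanned, so narrowing the columns and tables a query touches matters far more than simplifying its logic.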

The structural fix is the custom metrics pipeline itself. Because the JavaScript function runs at crawl time and stores only the extracted directive keys and file size as a compact JSON object, subsequent BigQuery queries scan a much smaller payload. The pre-extraction pattern is analogous to a materialized view or a pre-aggregated fact table in a data warehouse architecture: the expensive computation happens once at ingestion, and downstream queries operate on the reduced output.

For practitioners considering their own HTTP Archive queries, the operational guidance has three parts. Use the custom metrics dataset as the entry point rather than the raw response bodies table. Apply status code filters (HTTP 200 only) and content-type filters (exclude text/html) at the query level to eliminate noise. Check the estimated bytes scanned that BigQuery reports before executing large scans. Compared with storing a proprietary crawl in Google Cloud Storage and querying it in BigQuery, or with pulling the data from a Semrush or Ahrefs API with per-row pricing, the HTTP Archive approach offers a significant cost advantage for population-level research, provided query design is disciplined.

Why This Matters Now: For teams building AI-powered SEO and authority building programs at scale, the HTTP Archive BigQuery pipeline offers population-level robots.txt intelligence at near-zero marginal cost once the custom metrics pre-extraction pattern is in place.

Implications for AEO Strategy and AI-Powered Content Pipelines

Robots.txt directive accuracy is a prerequisite for any AEO strategy or AI content generation pipeline that depends on crawlability, because a malformed or over-restrictive directive can block AI crawlers from indexing expert articles before they ever reach a language model’s training or retrieval corpus. The population-level data from this analysis provides the empirical baseline needed to audit and correct directive configurations at scale.

In my work building AuthorityRank, I see a consistent pattern: teams invest heavily in thought leadership content and expert articles, then inadvertently block AI crawlers through robots.txt misconfigurations. The 6.2% Googlebot explicit-naming rate versus the 9.8% “as bot Google” rate is a concrete signal that many sites are writing directives for crawlers that do not match actual user-agent strings. For GEO optimization and ChatGPT citations, this matters: if a directive blocks GPTBot or ClaudeBot using a non-canonical string, the content never enters the retrieval context window of the model generating citations.

The methodology Gary and Martin developed is directly applicable to content marketing automation teams that manage large site portfolios. Running the same custom JavaScript metrics approach against an internal crawl, using an open source WebPageTest instance or a headless Chromium pipeline, would produce per-domain directive inventories that can be fed into automated audit workflows. The regex pattern from line 58 of their submitted function is publicly available in the HTTP Archive GitHub repository and can be adapted for this purpose without building a parser from scratch.

The Web Almanac SEO chapter, which Martin Splitt confirmed will incorporate the new custom metrics data in its next edition, will become a benchmarking reference for directive usage norms. For practitioners running authority-building programs, that benchmark provides the external validation layer needed to justify directive changes to technical stakeholders who require data rather than best-practice assertions.

The Strategic Implication: Integrating robots.txt directive audits into AI content generation pipelines, using the HTTP Archive’s publicly queryable custom metrics dataset as the population benchmark, closes the most common crawlability gap between expert article production and AI citation capture.

FAQ: Robots.txt Analysis at Scale with HTTP Archive and BigQuery

How do I access the HTTP Archive robots.txt custom metrics data in BigQuery without incurring large query costs?

Query the httparchive.crawl.pages custom metrics dataset rather than the raw response bodies table. Apply a WHERE filter for HTTP status 200 and exclude rows where the Content-Type header is text/html. The custom metrics JSON per URL is compact, so bytes scanned per query are substantially lower than querying raw response payloads. Gary Illyes confirmed that the robots.txt custom metric data from the February crawl run is now available in this dataset.

What user-agent strings should I explicitly name in robots.txt to cover AI crawlers alongside Googlebot?

The Web Almanac data shows Googlebot is explicitly named in only 6.2% of robots.txt files, suggesting most sites rely on the wildcard asterisk. For AI retrieval coverage relevant to GEO optimization, add explicit user-agent groups for GPTBot (OpenAI), ClaudeBot (Anthropic), and Google-Extended (a control token that governs use of your content for Google’s AI models rather than a separate crawler). A crawler that finds a group naming it follows that group and ignores the wildcard group, so rules in the asterisk block do not carry over to named agents; this most-specific-match behaviour is defined in the robots.txt specification (RFC 9309).
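The interaction between a wildcard group and a named group can be checked locally before deploying a file. The sketch below uses Python's standard `urllib.robotparser`, whose matching rules are simpler than those of production crawlers, so treat it as a sanity check and not as a model of how Googlebot or GPTBot actually parse the file. The robots.txt content and `ExampleBot` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: a wildcard group plus a named group for one AI crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Ask the parser whether the given agent may fetch the given path."""
    return parser.can_fetch(agent, f"https://example.com{path}")


for agent, path in [("GPTBot", "/articles/post"),
                    ("ExampleBot", "/articles/post"),
                    ("ExampleBot", "/private/report")]:
    print(agent, path, "->", allowed(agent, path))
```

The named agent is governed by its own group alone, while an agent with no named group falls back to the wildcard rules.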

Can the custom JavaScript metrics approach be replicated outside the HTTP Archive pipeline for internal crawls?

Yes. The core mechanism is a JavaScript function executed in a browser context after page load. Any headless Chromium pipeline, including Puppeteer or Playwright, can execute equivalent logic against a custom URL list. The regex pattern Gary used to match colon-separated key-value pairs line by line is available in the HTTP Archive GitHub repository for custom metrics. The output should be serialized as JSON and stored per domain for downstream BigQuery or equivalent SQL analysis. Barry Pollard’s review comments on the original pull request provide additional implementation guidance.

How does the HTTP Archive’s crawl population compare to Semrush or Ahrefs for robots.txt research?

The HTTP Archive draws from the Chrome UX Report, which reflects sites with real user traffic, making it a usage-weighted population. Semrush and Ahrefs crawl based on their own link graphs, which skew toward sites with high inbound link counts. For robots.txt research aimed at understanding what directives real sites use in production, the Chrome UX Report-sourced population is arguably more representative. However, the HTTP Archive updates on a monthly crawl cycle, while commercial tools offer more frequent refresh rates for individual site monitoring.

What is the practical workflow for using this data to improve an AEO strategy for a content portfolio?

The workflow has three stages. First, query the HTTP Archive BigQuery custom metrics dataset to establish the population-level directive distribution as a benchmark. Second, run the same JavaScript-based parser against your own site portfolio using an internal headless crawl to extract your current directive inventory. Third, cross-reference your directives against the unsupported tags list in the official robots.txt specification repository, which John Mueller’s team is expanding based on this data, and remove or correct any directives that AI crawlers may misinterpret. This sequence converts population-level research into a site-specific crawlability audit directly relevant to ChatGPT citations and AI content generation visibility.
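The third stage, cross-referencing a site's directives against what crawlers support, might look like the sketch below. The supported set reflects the four fields Google documents for robots.txt (user-agent, allow, disallow, sitemap); the `audit` function, the site inventory, and the benchmark shares are all hypothetical placeholders for values you would pull from your own crawl and from the HTTP Archive dataset.

```python
# Fields Google documents as supported in robots.txt.
SUPPORTED_BY_GOOGLE = {"user-agent", "allow", "disallow", "sitemap"}


def audit(site_directives, benchmark_share):
    """Flag directives outside the supported set and attach their benchmark share."""
    findings = []
    for key in sorted({k.lower() for k in site_directives}):
        if key not in SUPPORTED_BY_GOOGLE:
            findings.append({
                "directive": key,
                # Share of files using this key in the population benchmark, if known.
                "benchmark_share": benchmark_share.get(key),
            })
    return findings


# Hypothetical inputs: one site's directive inventory and made-up benchmark shares.
site = ["User-agent", "Disallow", "Crawl-delay", "Noindex", "Sitemap"]
benchmark = {"crawl-delay": 0.05}

for finding in audit(site, benchmark):
    print(finding)
```

A directive flagged here is not necessarily harmful, since other crawlers may honour it, but it is a candidate for review against the documented unsupported-tags list.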

Build Authority That AI Engines Actually Cite

AuthorityRank engineers expert articles at scale, optimised for crawlability, citation retrieval, and AI-powered SEO. See how the platform turns technical precision into measurable authority.

Explore AuthorityRank
