TL;DR: Gary Illyes and Martin Splitt from Google Search used HTTP Archive’s custom metrics pipeline, WebPageTest-rendered crawls, and BigQuery to analyse robots.txt directives across millions of URLs. The project, born from a simple pull request on the official robots.txt repository, produced a distribution of real-world directive usage that now informs Google’s documentation and the Web Almanac’s SEO chapter. Understanding this infrastructure is directly relevant to any practitioner building authority-grade, citation-worthy content at scale.
The Pulse:
- HTTP Archive’s custom metrics pipeline, running through WebPageTest browser instances, enables JavaScript-level extraction of robots.txt directives across millions of live URLs: data unavailable from a simple HTML crawl.
- A single miscalibrated BigQuery query against the HTTP Archive dataset cost Gary Illyes hundreds of dollars, underscoring the operational cost tradeoff between exploratory querying and purpose-built custom metrics.
- The Web Almanac’s SEO chapter reports that 84.9% of URLs return a valid robots.txt (HTTP 200), while only 6.2% explicitly reference the Googlebot user-agent string.
The friction here is fundamental: robots.txt is one of the oldest crawl-control mechanisms on the web, yet no public, large-scale dataset of its real-world directive usage existed. Gary Illyes from Google Search needed empirical data to justify adding new directives to the official robots.txt unsupported-tags list, and the path from a GitHub pull request to a BigQuery-backed answer required navigating three distinct infrastructure layers. That journey reveals exactly how authoritative, citation-worthy analysis is built at the intersection of open data, browser-based rendering, and SQL-scale querying.
Why a GitHub Pull Request Triggered a Data Infrastructure Project
The project began when a GitHub user with the handle “3×10 raised to 8” submitted a pull request to the official robots.txt repository, proposing two new directives be added to the unsupported-tags list. Rather than accept or reject the change arbitrarily, John Mueller, Gary’s manager at Google, proposed a data-driven baseline: identify the top 10 to 15 unsupported directives in actual use and document them all at once. That decision transformed a one-line code review into a full-scale data extraction exercise.
Google’s internal standard, as Gary described it, is to avoid arbitrary documentation decisions. The team needed a public repository of robots.txt files at scale. After two days of searching, Gary had found nothing suitable. Martin Splitt pointed him toward HTTP Archive, a dataset Gary had previously encountered only through its published Web Almanac reports, never as a queryable data source.
This distinction matters for anyone building authority-grade content or SEO optimization workflows. The difference between consuming a published report and querying the underlying dataset is the difference between citing a secondary source and producing primary analysis. Primary analysis is what gets cited by AI engines like ChatGPT and referenced in publications like the Web Almanac itself.
| Conventional Approach | The Yacov Avrahamov Perspective |
|---|---|
| Accept a pull request based on team intuition or prior knowledge | Collect empirical data from millions of live URLs before making a documentation decision |
| Use published Web Almanac reports as a citation source | Query the underlying HTTP Archive BigQuery dataset directly to produce primary, citable analysis |
| Run a command-line HTML crawl (curl, wget) for robots.txt analysis | Use WebPageTest browser instances to capture JavaScript-rendered data and custom metrics not available from raw HTML |
| Write regex manually for directive extraction | Use an AI chatbot to generate the regex, then validate it through a fuzzer before deployment |
| Query BigQuery without cost controls, absorbing unpredictable charges | Route data collection through the custom metrics pipeline to avoid runaway query costs |
The Real Takeaway: A single GitHub pull request, handled with data discipline rather than editorial instinct, produced a robots.txt directive distribution dataset now embedded in the HTTP Archive’s February crawl: a reusable public asset that did not exist before this project.
Google’s decision to document unsupported robots.txt directives was triggered by a GitHub pull request from user “3×10 raised to 8”. Rather than act arbitrarily, John Mueller proposed identifying the top 10-15 unsupported directives in real-world use. This required querying HTTP Archive’s BigQuery dataset: a public repository of crawl data covering approximately 16 million URLs sourced from the Chrome UX Report.
How HTTP Archive Actually Works: The Three-Layer Architecture
HTTP Archive operates through three sequential layers: URL sourcing from the Chrome UX Report, browser-based rendering via WebPageTest, and data storage in BigQuery for SQL-scale querying. Each layer adds a distinct capability that the previous one cannot provide alone. Understanding this architecture is prerequisite knowledge for anyone who wants to use the dataset for SEO optimization or authority-building research rather than simply reading its outputs.
Layer One: URL Sourcing from Chrome UX Report
The Chrome UX Report (CrUX) is a public dataset of aggregated, anonymised user-experience metrics collected from Chrome browsers that have opted into usage reporting. It does not identify individual users; it aggregates performance data across all visitors to a given URL. Martin Splitt confirmed the dataset contains approximately 16 million URLs. Historically these were home pages, on the reasoning that home pages attract the most traffic and represent the most optimised surface of any given site. In recent years, HTTP Archive expanded to “secondary pages”: deeper URLs that, for some sites, receive more traffic than the home page itself.
Layer Two: Browser Rendering via WebPageTest
Raw HTML crawling, the kind performed by curl or wget on the command line, cannot capture JavaScript-rendered content, Core Web Vitals, CSS usage ratios, or Lighthouse scores. HTTP Archive addresses this by running URLs through WebPageTest (webpagetest.org), a browser-automation service that loads each URL in an actual browser instance hosted on WebPageTest’s infrastructure. This produces the full rendered DOM, performance timing data, and, critically, the ability to execute custom JavaScript against each loaded page.
The custom JavaScript execution capability is what made Gary and Martin’s robots.txt project possible. When WebPageTest signals that a page has finished loading, the pipeline executes any registered custom metric scripts against the browser context. These scripts can interrogate the DOM, read response headers, or in this case, fetch and parse the robots.txt file associated with each URL’s domain.
Layer Three: BigQuery Storage and SQL Querying
All crawl outputs, rendered metrics, and custom metric results are stored in BigQuery datasets maintained by HTTP Archive. Analysts can write standard SQL queries against these datasets. The operational tradeoff, as Gary discovered directly, is cost: BigQuery charges by data scanned, and the HTTP Archive tables are large. Gary’s first exploratory query, run before the custom metrics approach was established, cost him hundreds of dollars in a single execution. Teammate Daniel Waisberg had previously documented cost-avoidance strategies for BigQuery queries against Search Console data, but Gary had not applied those patterns to HTTP Archive queries.
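The pay-per-byte-scanned model described above reduces to simple arithmetic, sketched below. The price per TiB and the free allowance are assumptions for illustration only and should be checked against current BigQuery pricing; `estimate_query_cost` is a hypothetical helper, not part of any Google library. The byte count would typically come from a dry run of the query, which reports how much data it would process without executing it.

```python
# Assumed on-demand price in USD per TiB scanned (illustrative; verify current pricing).
ASSUMED_USD_PER_TIB = 6.25
# Assumed free monthly allowance in TiB (the article cites a 1 TB free tier).
ASSUMED_FREE_TIB = 1.0

TIB = 1024 ** 4  # bytes in one tebibyte


def estimate_query_cost(bytes_scanned: int,
                        free_tib_remaining: float = ASSUMED_FREE_TIB,
                        usd_per_tib: float = ASSUMED_USD_PER_TIB) -> float:
    """Estimate the USD charge for a query that scans `bytes_scanned` bytes."""
    tib_scanned = bytes_scanned / TIB
    # Only the portion above the remaining free allowance is billed.
    billable_tib = max(0.0, tib_scanned - free_tib_remaining)
    return round(billable_tib * usd_per_tib, 2)


# Compare a query that stays inside the free allowance with one that does not.
for label, size in [("500 GiB scan", 500 * 1024 ** 3), ("41 TiB scan", 41 * TIB)]:
    print(f"{label}: estimated ${estimate_query_cost(size):.2f}")
```

Under these assumed numbers, a query only reaches the hundreds of dollars once it scans tens of tebibytes, which is why narrowing a query to the smaller custom metrics output matters.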
Compared to alternatives like running your own crawler (Screaming Frog at the desktop level, or a custom Scrapy pipeline at scale) or purchasing a commercial dataset from providers like Semrush or Ahrefs, HTTP Archive offers a unique combination: browser-rendered data, custom JavaScript execution, public accessibility, and BigQuery’s SQL interface. The cost-per-query model is the primary operational risk, and the custom metrics pipeline is the architectural answer to that risk.
What This Means in Practice: Any practitioner querying HTTP Archive’s BigQuery tables without first routing data collection through the custom metrics pipeline risks multi-hundred-dollar query charges for datasets that may not even contain the specific file type they need, as Gary found when initial queries returned no robots.txt files at all.
HTTP Archive’s data pipeline has three layers: URL sourcing from Chrome UX Report (~16 million URLs), browser-based rendering via WebPageTest instances, and BigQuery storage for SQL querying. Custom JavaScript metrics are executed post-render via WebPageTest, enabling extraction of data not available from raw HTML crawls. A single unoptimised BigQuery query against this dataset cost Gary Illyes from Google Search hundreds of dollars before the custom metrics approach was adopted.
Building the Custom JavaScript Metric for Robots.txt Extraction
The custom metric Gary and Martin wrote approximates the behaviour of a C++ robots.txt parser by processing the file line by line and extracting any string that matches a key-value pair separated by a colon. This approach intentionally avoids hard-coding known directives, instead capturing the full distribution of whatever directives are present, including unknown, malformed, and novel ones. Barry Pollard, a contributor to the HTTP Archive project, pointed the team to the existing custom metrics GitHub repository, where a prior robots.txt script existed but only counted a fixed list of known directives.
The core parsing logic uses a regular expression Gary described as a “monstrosity” on line 58 of the script. He generated this regex using an AI chatbot, which he noted is “really good at writing regex for some reason”, likely due to the volume of regex training data available to large language models. Before submitting the script, Gary ran it through a fuzzer to test its limits under adversarial inputs. The regex did not break under fuzzing, giving the team sufficient confidence to submit it to the HTTP Archive GitHub repository.
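The fuzzing step can be illustrated with a minimal sketch. The pattern below is a hypothetical stand-in, not the regex from the submitted script, and the hand-rolled random-input loop stands in for whatever fuzzer Gary actually used. It shows the idea only: throw arbitrary lines at the pattern and check that it never raises and that every match satisfies a basic invariant.

```python
import random
import re
import string

# Hypothetical stand-in pattern: a key, a colon, then a value on a single line.
DIRECTIVE_RE = re.compile(r"^\s*([^\s:#]+)\s*:\s*(.*?)\s*$")


def fuzz_pattern(pattern: re.Pattern, rounds: int = 2000, seed: int = 0) -> int:
    """Feed random lines to `pattern` and return how many of them matched."""
    rng = random.Random(seed)  # seeded so a failing input can be reproduced
    alphabet = string.printable + "\u00e9\u4e2d\x00"
    matched = 0
    for _ in range(rounds):
        length = rng.randint(0, 200)
        line = "".join(rng.choice(alphabet) for _ in range(length))
        result = pattern.match(line)  # should never raise, whatever the input
        if result is not None:
            key = result.group(1)
            # Invariant: an extracted key is non-empty and contains no colon.
            assert key and ":" not in key
            matched += 1
    return matched


print("random lines that looked like directives:", fuzz_pattern(DIRECTIVE_RE))
```

A dedicated fuzzer would explore inputs far more systematically, but even a loop like this surfaces crashes and obviously wrong captures before a pattern runs against millions of files.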
The extraction logic processes each line of a robots.txt file, identifies lines that contain a colon separator, and extracts the key portion as the directive name. This produces a broad distribution that includes valid directives (allow, disallow, user-agent), unsupported proprietary directives, and noise from malformed robots.txt files that are actually HTML error pages. Gary noted that many URLs in the dataset return HTML pages with CSS properties like “padding”, “img”, “color”, and “width” appearing as apparent directive names: artifacts of sites returning 404 or 500 HTML pages at the /robots.txt path.
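The line-by-line extraction described above can be sketched in a few lines of Python. This is an illustration of the approach, not the submitted custom metric (which is JavaScript and regex-based); the function name and both sample inputs are hypothetical. It also shows how an HTML error page served at /robots.txt yields CSS property names as apparent directives.

```python
from collections import Counter

# Hypothetical sample: a well-formed robots.txt file.
VALID_ROBOTS = """\
User-agent: *
Disallow: /admin/
Allow: /admin/public/
Crawl-delay: 10  # not honoured by every crawler
Sitemap: https://example.com/sitemap.xml
"""

# Hypothetical sample: an HTML error page returned at the /robots.txt path.
HTML_ERROR_PAGE = """\
<html><head><style>
body {
  padding: 0;
  color: red;
  width: 100%;
}
</style></head>
<body><h1>Not Found</h1></body></html>
"""


def extract_directives(robots_txt: str) -> Counter:
    """Count directive names: whatever precedes the first colon on each line."""
    counts: Counter = Counter()
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        key = line.split(":", 1)[0].strip().lower()
        if key:
            counts[key] += 1
    return counts


print(extract_directives(VALID_ROBOTS))
print(extract_directives(HTML_ERROR_PAGE))
```

Because no directive list is hard-coded, unknown keys such as crawl-delay are counted alongside the standard ones, and the same permissiveness is what lets CSS noise through.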
The team submitted the custom metric on approximately February 3, just before the next HTTP Archive crawl run. It was merged and included in the February dataset. The resulting directive distribution, which Gary mentioned sharing on Bluesky, shows an extremely sharp drop-off after the three dominant directives (allow, disallow, user-agent): a pattern that holds even when plotted on a logarithmic scale. The output is stored as a JSON object per URL in the custom metrics dataset, alongside the raw byte size of each robots.txt file.
The Strategic Implication: Using an AI-generated regex validated through fuzzing, then submitting it to an open-source pipeline that runs against 16 million URLs, is a replicable model for any SEO practitioner who needs primary data on web-scale directive or markup patterns without maintaining their own crawl infrastructure.
The custom JavaScript metric for robots.txt extraction was built by Gary Illyes and Martin Splitt, guided by Barry Pollard of HTTP Archive. It uses an AI-generated regular expression to extract all key-value pairs separated by a colon from robots.txt files, rather than counting only pre-known directives. Submitted around February 3, the metric was included in HTTP Archive’s February crawl dataset and is now publicly queryable via BigQuery.
What the Data Actually Shows: Directive Distribution and Web Almanac Findings
The robots.txt directive distribution extracted from HTTP Archive confirms that the vast majority of real-world robots.txt usage concentrates in three directives, with everything else representing a long tail that drops sharply even on a logarithmic scale. The Web Almanac’s SEO chapter, which Martin Splitt referenced directly, provides additional quantitative context that frames the significance of this distribution.
According to the Web Almanac SEO chapter cited by Martin, 84.9% of crawled URLs return a robots.txt file with an HTTP 200 status code. 13% return a 404, and the remaining responses (timeouts, 500-series errors) are each below 1%. File sizes are predominantly between 0 and 100 kilobytes, which Martin noted is consistent with the practical limits of what can be usefully encoded in a robots.txt file.
On user-agent targeting: the asterisk wildcard user-agent, which applies rules to all crawlers, is the most common. The Googlebot string appears in only 6.2% of files, while the broader “Google” string (which would match variations like Googlebot-Image or Googlebot-News) appears in 9.8% of files. This gap suggests that many site operators who intend to target Google specifically are using an overly broad string, or conversely, that many Google-specific rules are being written under the asterisk wildcard rather than the Googlebot user-agent.
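The difference between counting an exact user-agent token and counting a broader substring can be sketched as follows. How the Web Almanac actually computes its figures is not described in the source, so the exact-versus-substring logic, the function names, and the four sample files are all assumptions for illustration.

```python
from typing import List, Set


def user_agents(robots_txt: str) -> Set[str]:
    """Return the lower-cased user-agent values declared in one file."""
    agents: Set[str] = set()
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0]  # ignore comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "user-agent":
            agents.add(value.strip().lower())
    return agents


def share_matching(files: List[str], needle: str, exact: bool) -> float:
    """Percentage of files with a user-agent equal to, or containing, `needle`."""
    needle = needle.lower()
    hits = 0
    for text in files:
        agents = user_agents(text)
        if exact:
            found = needle in agents
        else:
            found = any(needle in agent for agent in agents)
        if found:
            hits += 1
    return 100.0 * hits / len(files) if files else 0.0


# Hypothetical four-file corpus.
SAMPLE_FILES = [
    "User-agent: *\nDisallow: /",
    "User-agent: Googlebot\nDisallow: /x",
    "User-agent: Googlebot-Image\nDisallow: /img",
    "User-agent: bingbot\nDisallow:",
]

print("exact 'googlebot':", share_matching(SAMPLE_FILES, "googlebot", exact=True))
print("contains 'google':", share_matching(SAMPLE_FILES, "google", exact=False))
```

On this toy corpus the substring count is larger than the exact count because Googlebot-Image is only caught by the broader match, which is the same direction as the 6.2% versus 9.8% gap.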
The noise in the distribution is itself informative. HTML pages served at /robots.txt paths produce directive-like strings from CSS properties, confirming that a non-trivial portion of the web’s robots.txt “files” are actually error pages. Gary noted that filtering by HTTP status code (excluding non-200 responses) and by Content-Type header (excluding text/html) would significantly clean the dataset for future analysis iterations.
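The two filters Gary mentioned can be expressed as a single predicate. The record shape below (`RobotsFetch`) is hypothetical, since the source does not describe how status codes and headers are stored alongside the metric output; the sketch only shows the filtering rule itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RobotsFetch:
    """Hypothetical record for one fetched /robots.txt response."""
    url: str
    status: int
    content_type: str
    body: str


def is_plausible_robots(fetch: RobotsFetch) -> bool:
    """Keep only HTTP 200 responses that are not served as text/html."""
    # Strip parameters such as "; charset=utf-8" before comparing.
    media_type = fetch.content_type.split(";", 1)[0].strip().lower()
    return fetch.status == 200 and media_type != "text/html"


# Hypothetical sample records.
SAMPLES: List[RobotsFetch] = [
    RobotsFetch("https://a.example/robots.txt", 200, "text/plain; charset=utf-8", "User-agent: *"),
    RobotsFetch("https://b.example/robots.txt", 200, "Text/HTML; charset=utf-8", "<html>"),
    RobotsFetch("https://c.example/robots.txt", 404, "text/plain", ""),
]

kept = [fetch.url for fetch in SAMPLES if is_plausible_robots(fetch)]
print(kept)
```

Applying the filter before counting directives, rather than changing the extraction logic, keeps the raw data intact for anyone who wants to study the error pages themselves.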
Why This Matters Now: With the February HTTP Archive dataset now containing this directive distribution, the Web Almanac’s SEO chapter has richer robots.txt data available for its next edition, and any practitioner can reproduce or extend this analysis by querying the custom metrics table in BigQuery directly.
Applying This Methodology to Authority Building and AI Content Strategy
The methodology Gary and Martin used (sourcing a public dataset, building a custom extraction layer, and producing a statistically grounded distribution) is precisely the kind of primary research that AI engines like ChatGPT, Perplexity, and Google’s AI Overviews cite as authoritative sources. In my work building AuthorityRank and analysing what content gets cited versus what gets ignored, the pattern is consistent: AI models retrieve content that contains specific, verifiable, numeric claims tied to named entities and documented methodologies.
The robots.txt project is a concrete example of how authority building at scale works in practice. The team did not publish an opinion piece about robots.txt best practices. They produced a dataset, documented their methodology, and derived findings that are reproducible by anyone with BigQuery access. That reproducibility is what makes content citation-worthy rather than merely readable. For AI content generation workflows, this distinction is the difference between producing filler and producing assets that accumulate citations over time.
From an AEO strategy and GEO optimization perspective, the mechanism is straightforward: AI retrieval systems prioritise content that answers specific questions with specific numbers, names, and verifiable processes. The HTTP Archive robots.txt analysis answers “what directives do real websites use” with a distribution backed by millions of data points. That is the structure of a citation-worthy answer unit, regardless of whether it appears in a podcast transcript, a Web Almanac chapter, or a purpose-built expert article.
The operational lesson for content marketing automation is equally clear. Building the custom metric took Gary and Martin days of iteration, GitHub review, and fuzzer testing. But once submitted, the metric runs automatically on every subsequent HTTP Archive crawl. The marginal cost of each additional data point is near zero. That is the architecture of scalable thought leadership content: invest heavily in the extraction mechanism, then harvest the outputs continuously.
The Bottom Line: Primary data analysis structured around named entities, specific metrics, and reproducible methodology is the content format that AI engines extract and cite, making it the highest-leverage investment in any authority-building or AI-powered SEO program.
Frequently Asked Questions
Can I query the HTTP Archive robots.txt custom metrics data without incurring large BigQuery costs?
Yes, but cost control requires querying the custom metrics table specifically rather than the raw crawl tables. Gary Illyes confirmed that querying the full HTTP Archive tables without targeting the custom metrics subset produced a charge of hundreds of dollars from a single query. The custom metrics table is significantly smaller because it stores pre-extracted JSON objects rather than raw page content. Additionally, Google Cloud’s BigQuery offers a free tier of 1 TB of query processing per month; structuring queries to stay within that threshold is the standard cost-avoidance approach documented by Daniel Waisberg of Google for Search Console data queries.
How does HTTP Archive’s crawl differ from what Googlebot or commercial crawlers like Screaming Frog do?
HTTP Archive uses WebPageTest browser instances to fully render each URL, capturing JavaScript-executed content, Core Web Vitals, and Lighthouse scores that a raw HTTP crawler cannot access. Screaming Frog’s desktop crawler offers JavaScript rendering as an option but is constrained by local machine resources. Googlebot crawls at a far larger scale than HTTP Archive’s roughly 16 million URLs but does not expose its raw crawl data publicly. The key differentiator for HTTP Archive is the combination of browser rendering, custom JavaScript execution, and public BigQuery access: a combination that neither Googlebot nor commercial crawlers provide to external analysts.
What happens to robots.txt files that return HTML error pages instead of plain text?
The custom metric Gary and Martin built will extract CSS property names and HTML attribute strings as apparent directive names, because the regex matches any colon-separated key-value pattern. In the February dataset, this produces noise entries like “padding”, “img”, “color”, and “width” in the directive distribution. Gary noted two practical filters: checking that the HTTP status code is 200, and checking that the Content-Type header is not text/html. Applying both filters in the BigQuery query would remove the majority of HTML error page contamination from the directive distribution without modifying the custom metric itself.
Why does the Googlebot user-agent appear in only 6.2% of robots.txt files if Google is the most commonly referenced crawler?
Martin Splitt’s citation from the Web Almanac SEO chapter shows that “Googlebot” appears in 6.2% of files while the broader “Google” string appears in 9.8%. The gap reflects two patterns: some operators use “Google” as a substring match that catches multiple Google crawlers (Googlebot-Image, Googlebot-News, Google-InspectionTool), and many Google-specific rules are written under the asterisk wildcard user-agent rather than under a named Googlebot directive. The practical implication is that a significant portion of intended Google-specific crawl rules may be applying to all crawlers simultaneously.
How can I use this same custom metrics approach for my own SEO research questions beyond robots.txt?
Barry Pollard, who reviewed Gary and Martin’s submission, maintains the HTTP Archive custom metrics GitHub repository. Any researcher can submit a JavaScript function that runs post-render in the WebPageTest browser context. The function has access to the full DOM and response headers, and it can make additional network requests (such as fetching /robots.txt from the page’s domain). Submissions that are merged before a crawl run are included in that crawl’s dataset. Martin Splitt noted that prior custom metrics cover head element analysis, charset detection, and framework identification, meaning the pattern is well-established for extracting any data point not available from default Lighthouse or browser tooling. The February 2024 robots.txt dataset is now publicly available in BigQuery under the HTTP Archive custom metrics tables.
