Speculations
by Joanne McNeil
One Trillion Possibilities: The Internet Archive and the Vanishing Open Web
Twelve slashed zeros and a one, made of wood and latex paint, stand tall on the roof of the former church where the Internet Archive is headquartered in San Francisco. The organization commissioned this artwork by Jesse Walton to mark an unusual achievement this autumn. For the past 30 years, the digital library has preserved websites, and its collection now exceeds a trillion pages. It’s a mind-boggling milestone that speaks to both the deluge of websites created and edited over the decades and the organization’s concurrent endeavor to catalog digital content in its vastness.
The way we access web pages has largely remained the same since the 1990s, when the Internet Archive launched: enter a URL in a browser and the content of the page—typically, text and images—will load through hypertext protocols. The web itself has persevered as a noncommercial application, interoperable and public since it launched in 1991. What’s changed in recent years is a retreat from the open web. Nowadays, websites are regularly locked up in software experiences or proprietary codecs, from social media platforms to movies streaming on demand. You often need to subscribe or log in to view what’s there. If the web should devolve into something more like a delivery channel rather than a destination itself for information freely accessible to the public, what will be left for the organization to add to its Wayback Machine digital archive?
“The internet and the web were built on trust,” Brewster Kahle, the founder of the Internet Archive, told me via video chat in November, citing in contrast the recent erosion of that trust by way of “surveillance capitalism,” Big Tech monopoly power and the “ads and bugs and paywalls” strewn across modern websites. The closing down of the open web is “what really threatens us,” Kahle said. “If you’d only go to the web the way you would FTP or use fax machines—if it wasn’t a pleasure, not where fun things happened—then this whole edifice collapses.”
Back in 1996, when the Internet Archive was founded, Kahle estimated there were only about 50 million pages on the web. These pages turned over rapidly, revised or even disappearing in weeks. “While the changing nature of the internet brings a freshness and vitality, it also creates problems for historians and users alike,” wrote Kahle as he announced the organization’s mission in a special internet-themed edition of Scientific American the following year. Applying the core functions of a library to the digital world, the organization began preserving web content as snapshots from all “publicly accessible World Wide Web pages” (the one trillion–page total includes duplicates of the same page over time). Crawlers that scan, index and download web content were used to gather the collection.
In his 1997 article, Kahle explained that crawlers were best known as the automated scripts powering “search engines, such as AltaVista,” but a contemporary user might be familiar with them as the scraping agents that harvest training data for LLMs. That relationship has led to precipitous new enclosures of the web to the detriment of the Internet Archive’s collection of pages.
Over the summer, Reddit began blocking the Wayback Machine from archiving its content, citing how AI companies have included captured pages in their data sets. To be clear, this was no act of corporate benevolence—the company was not protecting its users from being sold for parts. Reddit has licensed its data to companies like Google and OpenAI; the Wayback Machine threatened the exclusivity it offers in these deals.
“I don’t want to think of the future of the Wayback Machine as this really cool thing that ran for 35 years [or] a time capsule of an infinitely small period of human history,” Chris Freeland, director of open libraries at the Internet Archive, told me. Paywalls and other barriers to web archiving will have a “tremendous negative effect on our ability to remember and our ability to hold power brokers to account,” he said.
The team is smaller than the number on the roof would suggest, with a staff of about 200 people. Operating with a budget of around $30 million—about a sixth of the budget of the San Francisco Public Library, as Kahle is quick to point out—and “only 100 racks” of servers, the organization also opts to remain small when it comes to the data it gathers on its patrons. While the Internet Archive logs the rough geographic areas of users, they collect no granular analytics or metrics on how people use the site. It’s a library, after all, their thinking goes, and libraries don’t surveil you.
“I’m not the best at knowing how people are using it because often we don’t even know!” Kahle said, laughing, when I asked for stories of unexpected uses of the Wayback Machine. One example he provided was its role this year, celebrated by journalists and researchers, in providing access to web pages purged by the Trump administration. Among the pages and documents flagged for removal in the federal government’s anti-DEI purge and censorship of issues like climate change was an image of the Enola Gay, which likely made it to the chopping block because of its file name.
This archiving was possible because the pages were accessible to the public on the open web. You can’t use the Internet Archive to track changes to a journalist’s reporting on Substack, or to any reporting that sits behind a pop-up banner announcing, “This post is for paying subscribers.”
A library made of locked diaries wouldn’t be much to explore, and likewise, an infinite collection of books generated automatically defies institutional care and stewardship. The flooding of the web through generative AI in recent years is the subject of a clever new project by the artist Tega Brain called “Slop Evader.” This browser extension limits search engine results to content published before Nov. 30, 2022, the day that ChatGPT was released. “It pushes back against false narratives of progress and assumes that the quality of the internet as an information retrieval tool has been in rapid decline since the public uptake of generative AI,” she explains in the product notes.
Slop Evader is both a present-day provocation and a time capsule of the open web’s recent past. The artist told 404 Media that she hoped to “give people examples of how you can refuse this stuff, to furnish one’s imaginary for what a politics of refusal could look like.” The enthusiastic public response to this project suggests that internet users aren’t about to give up on the web just yet. After all, it comprises more than three decades and a trillion pages of people communicating with each other.
“There’s probably a billion voices” among the Internet Archive’s trillion pages, Kahle said. “These are people that chose to share what they know to anybody, for free.” Somewhere in those trillion pages are stories, art and other examples that remind us what the web was and can continue to be.