The Affair of the Vanishing Content

By: Sam Vaknin, Ph.D.

"Digitized information, especially on the Internet, has such rapid turnover these days that total loss is the norm. Civilization is developing severe amnesia as a result; indeed it may have become too amnesiac already to notice the problem properly."
(Stewart Brand, President, The Long Now Foundation)

“(The design of the Internet) naturally create gaps of responsibility for maintaining valuable content that others rely on. Links work seamlessly until they don’t. And as tangible counterparts to online work fade, these gaps represent actual holes in humanity’s knowledge …

It turns out that link rot and content drift are endemic to the web, which is both unsurprising and shockingly risky for a library that has ‘billions of books and no central filing system.’ …

The first study, with Kendra Albert and Larry Lessig, focused on documents meant to endure indefinitely: links within scholarly papers, as found in the Harvard Law Review, and judicial opinions of the Supreme Court. We found that 50 percent of the links embedded in Court opinions since 1996, when the first hyperlink was used, no longer worked. And 75 percent of the links in the Harvard Law Review no longer worked …

People tend to overlook the decay of the modern web, when in fact these numbers are extraordinary—they represent a comprehensive breakdown in the chain of custody for facts …

With John Bowers and Clare Stanton, and the kind cooperation of The New York Times, I was able to analyze approximately 2 million externally facing links found in articles at since its inception in 1996. We found that 25 percent of deep links have rotted. (Deep links are links to specific content …) The older the article, the less likely it is that the links work. If you go back to 1998, 72 percent of the links are dead. Overall, more than half of all articles in The New York Times that contain deep links have at least one rotted link …

As far back as 2001, a team at Princeton University studied the persistence of web references in scientific articles, finding that the raw number of URLs contained in academic articles was increasing but that many of the links were broken, including 53 percent of those in the articles they had collected from 1994. Thirteen years later, six researchers created a data set of more than 3.5 million scholarly articles about science, technology, and medicine, and determined that one in five no longer points to its originally intended source. In 2016, an analysis with the same data set found that 75 percent of all references had drifted …

Because information is so readily placed online, the incentives for creating paper counterparts, and storing them in the traditional ways, declined slowly at first and have since plummeted. Paper copies were once considered originals, with any digital complement being seen as a bonus. But now, both publisher and consumer—and libraries that act in the long term on behalf of their consumer patrons—see digital as the primary vehicle for access, and paper copies are deprecated …

A complementary approach to “save everything” through independent scraping is for whoever is creating a link to make sure that a copy is saved at the time the link is made. Researchers at the Berkman Klein Center for Internet & Society, which I co-founded, designed such a system with an open-source package called Amberlink …

I worked with researchers at Harvard’s Library Innovation Lab to start Perma. Perma is an alliance of more than 150 libraries. Authors of enduring documents—including scholarly papers, newspaper articles, and judicial opinions—can ask Perma to convert the links included within them into permanent ones archived at; participating libraries treat snapshots of what’s found at those links as accessions to their collections, and undertake to preserve them indefinitely …

In turn, the researchers Martin Klein, Shawn Jones, Herbert Van de Sompel, and Michael Nelson have honed a service called Robustify to allow archives of links from whatever source, including Perma, to be incorporated into new “dual-purpose” links so that they can point to a page that works in the moment, while also offering an archived alternative if the original page fails. That could allow for a rolling directory of snapshots of links from a variety of archives—a networked history that is both prudently distributed, internet-style, while shepherded by the long-standing institutions that have existed for this vital public-interest purpose: libraries.”

(“The Internet is Rotting” by Jonathan Zittrain in The Atlantic, June 30, 2021)
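
The "dual-purpose" link described in the passage above can be illustrated with a short sketch. It is not the Robustify or Perma implementation: the `RobustLink` type, the `resolve` function, and the injected `is_alive` check are hypothetical names chosen for this illustration, and both URLs are placeholders.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RobustLink:
    """A link that carries a live target and an archived snapshot (hypothetical type)."""
    original_url: str
    archived_url: str
    snapshot_date: str  # ISO date on which the snapshot was taken


def resolve(link: RobustLink, is_alive: Callable[[str], bool]) -> str:
    """Return the original URL while it still works, otherwise the archived copy.

    `is_alive` is injected so that the caller decides how to probe a URL
    (for example with an HTTP HEAD request); it is an assumption of this sketch.
    """
    if is_alive(link.original_url):
        return link.original_url
    return link.archived_url


# Placeholder data: neither URL refers to a real page.
link = RobustLink(
    original_url="https://example.org/report",
    archived_url="https://archive.example.net/2021/report",
    snapshot_date="2021-06-30",
)

# Simulate link rot by declaring the original URL dead.
rotted = {"https://example.org/report"}
print(resolve(link, lambda url: url not in rotted))
```

A liveness check of this kind only catches link rot. Content drift, where the page still loads but no longer says what it said, would require comparing the live page with the snapshot.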

Thousands of articles and essays posted by hundreds of authors were lost forever when surprisingly shut its virtual gates. A sizable portion of the 1960 census, recorded on UNIVAC II-A tapes, is now inaccessible. Web hosts crash daily, erasing valuable content in the process. Access to web sites is often suspended - or blocked altogether - because of a real (or imagined) violation of the host's Terms of Service (TOS) by the webmaster. Millions of other web sites - the results of collective, multi-year, transcontinental efforts - contain unique stores of information in the form of databases, articles, discussion threads, and links to other web sites. Consider "Central Europe Review". Its archives comprise more than 2,500 articles and essays about every conceivable aspect of Central and Eastern Europe and the Balkans. It is one of countless such collections.

Similar and much larger treasures have perished since the dawn of the digital age in the 1920s. Very few early radio and TV programs have survived, for instance. The current "digital dark age" can be compared only to the one that followed the torching of the Library of Alexandria. The more accessible and abundant the information available to us, the more devalued and common it becomes, and the less institutional and cultural memory we seem to possess. In the battle between paper and screen, the former has won decisively. Newspaper archives dating back to the 1700s are now being digitized - testifying to the endurance, resilience, and longevity of paper.

Enter the "Internet Libraries", or Digital Archival Repositories (DAR). These are libraries that provide free access to digital materials replicated across multiple servers ("safety in redundancy"). They contain Web pages, television programming, films, e-books, archives of discussion lists, etc. Such materials can help linguists trace the development of language, journalists conduct research, scholars compare notes, students learn, and teachers teach. The Internet's evolution closely mirrors the social and cultural history of North America at the end of the 20th century. If this record is not preserved, our understanding of who we are and where we are going will be severely hampered. The clues to our future lie ensconced in our past. Knowing that past is the only guarantee against repeating the mistakes of our predecessors. Long-gone Web pages cached by the likes of Google and Alexa constitute the first tier of such archival undertaking.

The Stanford Archival Vault (SAV) at Stanford University assigns a numerical handle to every digital "object" (record) in a repository. The handle is the numerical result of a mathematical formula applied to the information bits of the original object being deposited. This makes it possible to track and uniquely identify records across multiple repositories. It also helps to detect tampering, because an altered object no longer matches its handle. SAV also offers application layers. These allow programmers to develop digital archive software and permit users to change the "view" (the interface) of an archive and thus to mine data. Its "reliability layer" verifies the completeness and accuracy of digital repositories.
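
A handle derived from an object's content, together with a simple integrity audit, might look like the sketch below. It does not reproduce SAV's actual formula, which is not given here: SHA-256 is a stand-in chosen for illustration, and `make_handle`, `verify`, and `Repository` are hypothetical names.

```python
import hashlib


def make_handle(data: bytes) -> int:
    """Derive a numerical handle from the object's bits (SHA-256 is an assumption)."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def verify(data: bytes, handle: int) -> bool:
    """Recompute the handle and compare it with the recorded one."""
    return make_handle(data) == handle


class Repository:
    """A toy in-memory repository keyed by content-derived handles."""

    def __init__(self) -> None:
        self._objects: dict[int, bytes] = {}

    def deposit(self, data: bytes) -> int:
        """Store the object and return its handle."""
        handle = make_handle(data)
        self._objects[handle] = data
        return handle

    def audit(self) -> list[int]:
        """Return the handles of stored objects that no longer match their content."""
        return [h for h, d in self._objects.items() if not verify(d, h)]


# The same bytes deposited in two repositories receive the same handle.
first, second = Repository(), Repository()
record = b"census record"
print(first.deposit(record) == second.deposit(record))
```

Because the handle depends only on the content, independent repositories can agree on an identifier without coordinating, and a periodic `audit` exposes records that have been altered or corrupted since deposit.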

The Internet Archive, a leading digital depository, in its own words:

"… working to prevent the Internet - a new medium with major historical significance - and other 'born-digital' materials from disappearing into the past. Collaborating with institutions including the Library of Congress and the Smithsonian, we are working to permanently preserve a record of public material."

Data storage is the first phase. It is not as simple as it sounds. The proliferation of formats of digital content has made it necessary to develop a standard for archiving Internet objects. The size of the digitized collections poses a serious challenge as far as timely retrieval is concerned. Interoperability issues (numerous formats and readers) probably require software and hardware plug-ins to render a smooth and transparent user interface.

Moreover, as time passes, digital data stored on magnetic media tend to deteriorate. They must be copied to newer media every 10 years or so ("migration"). Advances in hardware and software applications render many digital records indecipherable (try reading your word processing files from 1981, stored on 5.25" floppies!). Special emulators of older hardware and software must be used to decode ancient data files. And, to ameliorate the impact of inevitable natural disasters, accidents, bankruptcies of publishers, and politically motivated destruction of data, multiple copies and redundant systems and archives must be maintained. As time passes, data formatting "dictionaries" will be needed. Data preservation is hardly useful if the data cannot be searched, retrieved, extracted, and researched. And, as "The Economist" put it ("The Economist Technology Quarterly", September 22nd, 2001), without a "Rosetta Stone" of data formats, future deciphering of the stored data might prove to be an insurmountable obstacle.
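
A data formatting "dictionary" can be pictured as a lookup table from the opening bytes of a file to the name of its format. The sketch below is a minimal illustration, not a real format registry: `SIGNATURES` and `identify_format` are hypothetical names, and the table lists only a handful of widely documented file signatures.

```python
from typing import Optional

# A tiny "dictionary" of file signatures (magic numbers) for well-known formats.
SIGNATURES = {
    b"%PDF-": "PDF document",
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"GIF87a": "GIF image",
    b"GIF89a": "GIF image",
    b"PK\x03\x04": "ZIP container",
}


def identify_format(data: bytes) -> Optional[str]:
    """Return the format whose signature opens the data, or None if it is unknown."""
    for signature, name in SIGNATURES.items():
        if data.startswith(signature):
            return name
    return None


print(identify_format(b"%PDF-1.4 sample"))
print(identify_format(b"an undocumented word processing file"))
```

The second call returns `None`, which is the predicament described above: once a format has dropped out of the dictionary, the bits may survive while their meaning is lost.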

Last, but by no means least, Internet libraries are Internet-based. They themselves are as ephemeral as the historical record they aim to preserve. This tenuous cyber existence goes a long way towards explaining why our paperless offices consume much more paper than ever before.

Copyright Notice

This material is copyrighted. Free, unrestricted use is allowed on a non-commercial basis.
The author's name and a link to this Website must be incorporated in any reproduction of the material for any use and by any means.

