Archiving the internet

It’s mid-afternoon on a weekday and I’m at work. It’s a busy day: I’ve got a couple of research tasks on the go, and then my internet slows to a crawl.
Trying to stay calm, I ask around the team. Everyone’s experiencing the same issue. Not a good sign.
I call IT and the worst is confirmed: the internet is down. ETA to fix is three hours.
…Bugger

More than 90 percent of my role depends on a computer with an internet connection. Laws, commentary and cases are all online. Not only are they online, they are just as authorised as the print versions, so from a business (and a librarian’s) perspective it makes sense to maintain an online collection rather than a hardcopy one: an online collection doesn’t get lost as easily, and more than one person can access it at a time.

So much of our work and daily lives is online. All my work correspondence is online. Like a lot of other industry and professional publications, this post is only published online. But when I put together my Throwback Thursday posts for the Australian Law Librarians Association NSW blog, I don’t look to the internet for the relevant issues of 10, 15 or 20 years ago; I look to the shelves. In 20 years’ time, where am I going to look for the Throwbacks? We can easily work to ensure the future accessibility of a physical collection, but what do we, as a profession, do for the online collection? Not the subscription content, but all the stuff that we create: the blogs, the tweets, the journal articles, the slideshares, the videos.
This post is about the permanence of internet content: current projects to create historical snapshots of the web, and the ways we, as online content creators and administrators, can work to ensure our content is as stable and as permanent as possible.

Wayback Machine – The Internet Archive

The most well-known project is the Internet Archive’s Wayback Machine. I know this sounds more than a bit nerdy, but I think the Wayback Machine is a project of absolute brilliance. I use this free service on a weekly, if not daily, basis. The Wayback Machine is beautiful in its simplicity: insert the URL of a website, click go, and it will show you how many point-in-time snapshots have been made of the site and let you access them. The comprehensiveness of the archive varies, but you can often get to historical versions of individual pages as well as sub-sites and documents.
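For the curious, the Wayback Machine also exposes a simple public availability API for this kind of lookup. Here is a minimal sketch in Python (not an official client; the domain and date are placeholders chosen for illustration) that asks for the archived snapshot closest to a given date:

```python
# Minimal sketch: query the Wayback Machine's public availability API
# (documented at https://archive.org/help/wayback_api.php) for the snapshot
# closest to a date. The domain and timestamp are placeholders for illustration.
import requests

def closest_snapshot(url, timestamp="20050101"):
    """Return the archived URL closest to the YYYYMMDD timestamp, or None."""
    resp = requests.get(
        "https://archive.org/wayback/available",
        params={"url": url, "timestamp": timestamp},
        timeout=30,
    )
    resp.raise_for_status()
    closest = resp.json().get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest and closest.get("available") else None

if __name__ == "__main__":
    print(closest_snapshot("example.com", "20050101"))
```

The same lookup works for any URL you would paste into the Wayback Machine’s search box; the API simply returns the closest capture it knows about, if there is one.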

Whilst the Wayback Machine, a project begun in 1996, is the flagship of the Internet Archive, it is not the team’s only project. The not-for-profit organisation also has collections of texts, audio, moving images, and software.

The Wayback Machine is the most well-known of these initiatives, but it is not the only one out there. Other free initiatives include:

Pandora http://pandora.nla.gov.au/overview.html – developed by the National Library of Australia and partners
The Internet Memory Foundation http://internetmemory.org/ – a not-for-profit foundation.
A range of other web archive initiatives is listed on that great example of crowdsourcing, Wikipedia.

One thing I would like to highlight is that these types of projects are not limited to information enthusiasts, libraries, not-for-profits and information professionals. A number of companies have recognised the economic opportunity that capturing the internet holds. As an example, Web Preserver (http://webpreserver.com/) provides authoritative point-in-time captures of online content for use in litigation.

How we can help

Think to the future – produce and maintain stable websites with links compliant with archiving standards.
Say no to robots.txt
The Internet Archive and other like projects employ crawler software to capture pages. A robots exclusion protocol file (robots.txt) tells those crawlers not to capture your pages, so don’t use one to block them. Whilst robots.txt is sometimes regarded as a security technique, its effectiveness as such is questionable.
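If you want to check what your robots.txt is actually telling archive crawlers, Python’s standard library can parse it for you. A minimal sketch, where the “ia_archiver” user agent and example.com are assumptions for illustration:

```python
# Minimal sketch: ask a site's robots.txt whether an archive crawler may fetch a page.
# "ia_archiver" (a user agent historically associated with the Internet Archive)
# and the example.com URL are placeholders for illustration.
from urllib.robotparser import RobotFileParser

def archivable(site, user_agent="ia_archiver", path="/"):
    """Return True if robots.txt allows the given crawler to fetch the path."""
    parser = RobotFileParser()
    parser.set_url(site.rstrip("/") + "/robots.txt")
    parser.read()  # fetch and parse the live robots.txt
    return parser.can_fetch(user_agent, site.rstrip("/") + path)

if __name__ == "__main__":
    print(archivable("https://example.com"))
```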
Version control.
Consider how you are capturing and storing versions of your online services, and whether you should be capturing them at all. Versioning has many advantages: it lets you easily show the progress and development of your online services to key stakeholders and decision makers, and it can save your bacon in the event of a server crash. Losing all of your online content is much less of a headache when you can quickly reinstate a previous version.
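Versioning doesn’t have to mean an enterprise system. As a minimal sketch of the idea (the URLs and folder name are placeholders, not a recommendation of any particular tool), a few lines of Python can write dated copies of your pages that you could reinstate later:

```python
# Minimal sketch: fetch pages and write them to timestamped folders so that an
# earlier version can be reinstated later. URLs and folder name are placeholders.
import datetime
import pathlib
import requests

PAGES = ["https://example.com/", "https://example.com/about"]  # placeholder URLs
ARCHIVE_DIR = pathlib.Path("site-versions")

def snapshot(urls, out_dir=ARCHIVE_DIR):
    """Save a dated copy of each page under out_dir/<timestamp>/."""
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    for url in urls:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        name = url.replace("https://", "").replace("/", "_") or "index"
        dest = out_dir / stamp / (name + ".html")
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_text(resp.text, encoding="utf-8")

if __name__ == "__main__":
    snapshot(PAGES)
```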
Don’t over-invest in a particular product, or if you do, have a plan B.
Nothing is permanent, so don’t over-invest in a single information product; it will only end in tears.

 
