All you need to know about Web Archiving at The National Archives
by Clare Brown on February 24, 2021
Notes from a Webinar: "Web archiving services at the National Archives" (Wednesday 3 February 2021) with Tom Storrar, Web Archiving Service Owner at The National Archives. The chair was Fiona Laing (currently Chair of SCOOP – Standing Committee on Official Publications).
When I read that Tom, Web Archiving Service Owner at the UK's National Archives (TNA), was presenting a webinar, I immediately signed up and was excited to join CILIP GIG colleagues online. We weren’t disappointed. Tom and his colleagues (seven full-time and one part-time) have the important role of officially preserving the UK government’s online material.
Technology has often run ahead of government, which has left researchers stranded. The issue of missing or inaccessible online government information has caused problems in the past, and was raised in the UK parliament, for instance in 2006 and 2009.
What’s the story of the UK Government Web Archive?
Even before 2006, the National Archives were on the case: in 2003 they were already preserving a small selection of UK central government websites. By 2008, the scope had been expanded, with 2012 seeing the addition of social media. In 2017 they switched their service provider to MirrorWeb Ltd, which gives the archiving team a technical edge; for instance, they have been able to improve the site’s search functionality.
The UK Government Web Archive comprises more than 40,000 crawls/snapshots of over 5,000 websites and over 500 social media accounts. It is approximately 160 TB in size, holds 6 billion resources, and is an important tool for contextualising records for past, present and future research. After all, in the future someone might write a PhD exploring the influence of government advertising on the sales of beef and lamb!
What is web archiving, and why is it important?
Tom agrees with the Wikipedia definition of web archiving, which says,
Web archiving is the process of collecting portions of the World Wide Web to ensure the information is preserved in an archive for future researchers, historians, and the public. Web archivists typically employ web crawlers for automated capture due to the massive size and amount of information on the Web.
One of the first experiences that made me question my librarian abilities occurred in May 1997, just after the general election. The change in government meant that the embryonic government websites on which we were starting to rely vanished overnight. I instantly regretted my lack of foresight and wished we’d printed out all that online material.
As a consequence of this, our law library created paper files of government press releases, guidance notes, manuals, reports, white papers, etc., which had to be updated daily - the role of the junior team member. When I see those government website iterations from the late 1990s, it brings back many filing memories!
What do the National Archives capture?
We have come a long way since those classic static sites, and the advent of government broadcast on social media - Twitter and Flickr - means that even more information needs to be captured and archived. Web archiving operates within certain technical constraints. To be archived, content must be:
- Publicly available
- Reachable by robots/crawlers
If the web archiving team is informed about web resources that don’t meet the above criteria, they can intervene and capture them using state-of-the-art tools such as Conifer. The captured pages then undergo quality assurance checks before they are published. Occasionally rogue code causes readability issues, but this is inevitable when you are dealing with this amount of data and such a wide variety of web sources.
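As a rough illustration of the "reachable by robots/crawlers" criterion, the sketch below uses Python's standard `urllib.robotparser` to test whether a site's robots.txt rules would let a crawler fetch a given page. The robots.txt content, the `example-archive-bot` user agent, and the example.org URLs are all invented for illustration; this is not a description of how TNA's crawlers actually decide what to capture.

```python
from urllib.robotparser import RobotFileParser


def is_crawlable(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    # Parse the rules from a string so the example needs no network access.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that hides one directory from all crawlers.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for page in (
        "https://www.example.org/publications/report.pdf",
        "https://www.example.org/private/draft.html",
    ):
        print(page, is_crawlable(EXAMPLE_ROBOTS, "example-archive-bot", page))
```

A page under a disallowed path, or one that needs a login, would fail this kind of check and would need the manual intervention described above.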
Currently more than 800 distinct websites and social media accounts are regularly archived: those of central government, departments, and other public bodies, hub sites (e.g. GOV.UK, NHS), public inquiries, and some inquests. They take as much as possible from the target website:
- Publications, datasets, documentation
- Video, animations
Their approach is to take “deep” and complete captures of every website they archive, with an emphasis on quality, completeness, and fidelity. Obviously there are circumstances when pages need to be taken down: when there are errors, non-governmental material, or something subject to data protection. Otherwise everything is conserved.
Special event-based archiving projects
It would be impossible - and unnecessary - to capture everything every day, so they have a schedule for regular material. However, some captures are triggered when a website is about to be retired, refreshed, or redesigned. Events of national importance such as Brexit (EEWA) have become web archive projects in their own right, and have required daily web archiving.
Tom outlined their three-pronged approach:
- They increased the frequency of captures of key sites and resources
- They supplemented the frequency with (weekly and fortnightly) keyword-generated broader crawls across the government web estate
- They captured daily snapshots of complex, interactive (forms etc), or very fast-changing content using web crawlers and/or Conifer
The search function is vital because the archive is huge and these projects are important. You have several options: a specific social media search, the general search, a fascinating A-Z, a URL search, and the Discovery catalogue-style search. Discovery holds more than 32 million descriptions of records held by TNA and more than 2,500 archives across the country.
How people are coming together to help the National Archives
Web archiving is the responsibility of every government department and it can only be done with the assistance and cooperation of other people. TNA ask departments to:
- Make sure that their content is “crawlable” - there is technological guidance
- Provide XML sitemap(s), especially for content behind inaccessible functionality
- Ensure that the website’s copyright and reuse statement is clear
- Review the takedown policy, and check that existing archived content is within the rules
- Consider archiving timescales and remember that it isn’t an instantaneous process
- And check the capture before retiring, pruning, taking down or deleting a website!
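For the XML sitemap request in the list above, here is a minimal sketch of what producing one could look like, using Python's standard `xml.etree.ElementTree` and the namespace from the public sitemaps.org protocol. The `build_sitemap` helper and the example.org URLs are hypothetical, and the sketch leaves out optional fields such as `lastmod`; a department's own publishing platform may well generate sitemaps differently.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls: list[str]) -> str:
    """Build a minimal XML sitemap listing each URL in a <loc> element."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical pages that a crawler could not discover by following links,
    # for example results that only appear after submitting a search form.
    print(build_sitemap([
        "https://www.example.org/reports/2020-annual-report",
        "https://www.example.org/reports/2021-annual-report",
    ]))
```

Listing such pages explicitly gives a crawler a direct route to content it would otherwise never reach.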
The official status of TNA makes inter-departmental relationship building easier, and to raise their profile they hold events and host webinars. Should people need advice on tech, copyright, or new (or old!) websites, they are invited to get in touch with the National Archives team.