The Non-Print Legal Deposit Regulation stimulated a transformation of web archiving practice in the UK, and initiated the development of new workflows for selecting, acquiring, and storing web content. At this time, the UK Legal Deposit Libraries — The British Library, National Library of Scotland, National Library of Wales, Bodleian Libraries, Cambridge University Libraries and Trinity College, Dublin — came together as the UK Web Archive.
The UK Web Archive is a collection of millions of websites, captured by the British Library on behalf of the six UK Legal Deposit Libraries. Each year, the British Library makes one broad crawl of the UK Domain to capture all websites with a UK top-level suffix (i.e. .uk, .scot, .cymru and .wales), plus any others which have been identified as hosted or based in the UK. Additionally, hundreds of regional and national online news publications are crawled on either a daily or weekly basis.
The UK Web Archive uses the Annotation Curation Tool (ACT) to facilitate curatorial tasks, including adding metadata, specifying how often a website should be ‘crawled’ (or captured) for the archive, checking the quality of captures, and seeking publishers’ permission to provide open access to selected sites.
Anyone can nominate a website for archiving by completing the nomination form. New collections of websites can be initiated at any time by an individual or an organisation, on any theme or area of interest. The UK Legal Deposit Libraries seek to work in partnership with external organisations to develop collections in line with the partner’s area of expertise or collection scope. This is a good way to harness specialist knowledge, whilst supporting organisations that are unable to carry out web archiving themselves.
The Posters Network have been collaborating with the British Library in this spirit. Together, we have trialled an alternative to our usual autonomous model of collecting, whereby archived content remains part of the British Library/UK Web Archive collection rather than being acquired by Network institutions. Technical infrastructure and know-how are exchanged for curatorial input. We see this mode of archiving as a way to capture online context around physical objects in our collections, with the aim of enlivening our acquisitions and better understanding how objects oscillate between online/offline spheres.
The Annotation Curation Tool is accessible online. However, new users must request an invitation before they can begin using it. Get in touch with a member of the UK Web Archive team to register and get started.
Users can search and browse ACT for content that has been captured as part of the UK Web Archive’s Domain crawl. Search by URL, Website Name, Subject, Collecting Curator, or Nominating Organisation.
If a URL has been crawled, ACT retrieves the associated record.
If a URL has not been crawled, a user can identify it as a new ‘target’, and create a new record to define and capture the website.
Adding a New Record
Click the ‘Add Target’ button. New records require some basic descriptive metadata:
Title: the title, generally expressed as it is on the live website.
URL: the URL of the website.
Description: a brief factual description of the website (and/or the publishing organisation).
Subject: a broad subject heading selected from the drop-down menu.
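In code terms, a new target record could be sketched as a simple data structure. The field names below mirror the metadata listed above, but they are illustrative assumptions, not the actual ACT schema.

```python
from dataclasses import dataclass

# Hypothetical sketch of the metadata an ACT target record holds.
# Field names follow the list above; they are not the real ACT schema.
@dataclass
class TargetRecord:
    title: str        # title as expressed on the live website
    url: str          # seed URL of the website
    description: str  # brief factual description
    subject: str      # broad subject heading from the drop-down menu

# Example record for an imaginary nominated site.
record = TargetRecord(
    title="Example Posters Collection",
    url="https://example.org.uk/",
    description="An illustrative poster archive site.",
    subject="Arts",
)
```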
Scope, Depth and Frequency
Curators can specify the crawl ‘depth’, ‘scope’ and ‘frequency’ of the websites they target for capture. Crawlers are programmed to archive every page on a website, systematically following internal links until a whole site is captured, or until the package of collected webpages reaches 500MB (a default cap, which can be overridden). The internal distance travelled by the crawler from the start or ‘seed’ URL (usually the homepage) into the inner pages of a website is termed the crawl ‘depth’. Traditional web crawlers can also be automated to follow external links, moving away from a ‘seed’ URL for a specified number of link-clicks or ‘link-hops’ across website boundaries. The external distance travelled by the crawler is described as the crawl ‘scope’. In ACT, curators can indicate how frequently a website should be crawled: anything from ‘daily’ to a ‘one-off crawl’.
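The depth and size-cap behaviour described above can be sketched as a breadth-first traversal over an in-memory link graph, standing in for a real crawler. The graph, the page sizes, and the way the cap cuts off collection are simplified assumptions for illustration only.

```python
from collections import deque

SIZE_CAP = 500 * 1024 * 1024  # the default 500MB cap mentioned above

def crawl(seed, links, sizes, max_depth, size_cap=SIZE_CAP):
    """Breadth-first crawl from `seed`, following internal links up to
    `max_depth` link-hops, stopping once `size_cap` bytes are collected."""
    visited, total = [], 0
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        if total + sizes.get(url, 0) > size_cap:
            break  # stop once the package of captured pages would exceed the cap
        total += sizes.get(url, 0)
        visited.append(url)
        if depth < max_depth:  # only follow links within the chosen depth
            for nxt in links.get(url, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))
    return visited

# Tiny example site: the homepage links to two pages, one of which links deeper.
links = {"/": ["/about", "/posters"], "/posters": ["/posters/1945"]}
sizes = {"/": 10_000, "/about": 5_000, "/posters": 8_000, "/posters/1945": 12_000}

crawl("/", links, sizes, max_depth=1)  # stops one hop from the seed
crawl("/", links, sizes, max_depth=2)  # also reaches '/posters/1945'
```

A lower `size_cap` shows the other stopping condition: collection halts as soon as the next page would push the total over the cap, regardless of remaining depth.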
Legal Deposit Criteria
Legal Deposit Libraries are only permitted to capture UK websites. The system can run two automated checks to determine whether or not a site is hosted in the UK: one which checks whether its domain suffix is .uk, .scot, .cymru or .wales, and a second which geolocates its IP address. Curators can also perform manual checks, for example seeking out evidence of a UK postal contact address on a website.
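The first automated check, the domain-suffix test, is straightforward to sketch. The function name below is hypothetical; the IP geolocation check is omitted because it depends on an external IP database.

```python
from urllib.parse import urlparse

# The UK top-level suffixes named in the Regulation's first automated check.
UK_SUFFIXES = (".uk", ".scot", ".cymru", ".wales")

def has_uk_suffix(url):
    """Return True if the URL's hostname ends in a UK top-level suffix.
    (Illustrative sketch only; the second check, IP geolocation, would
    require an external lookup service and is not modelled here.)"""
    host = urlparse(url).hostname or ""
    return host.lower().endswith(UK_SUFFIXES)

has_uk_suffix("https://www.example.ac.uk/collection")  # True
has_uk_suffix("https://example.com/")                  # False
```

A site failing this check is not necessarily out of scope, which is why the geolocation and manual checks described above exist.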
ACT allows curators to perform basic quality assurance. When viewing a website within the system’s integrated Playback software, users can flag missing or incomplete content to the web archiving team for attention.
Licences for Open Access
A stipulation of the Non-Print Legal Deposit Regulation is that websites captured as part of the annual Domain crawl can only be made accessible within reading rooms at one of the Legal Deposit Libraries, unless a publisher agrees that an archived version of their website can be made publicly accessible through the Open UK Web Archive.
The Annotation Curation Tool allows curators to send permission forms to website publishers to seek this agreement. The permissions process is handled within the Tool, so that content is automatically ‘whitelisted’ for open access if a permission form is completed. However, colleagues at the British Library explained that a higher response rate is garnered if external partners themselves communicate with website publishers by way of a pre-permissions or introductory email.