A contribution to the first international “Digital Preservation Day”
Have you ever tried to access a website only to find that it is not available anymore?
The first international Digital Preservation Day aims to raise awareness of new processes and solutions in long-term preservation of cultural heritage in the digital age. ETH Library is also active in this area. In order to preserve ETH Zurich’s web presence (and thus one of the most important sources on the history of our university), the ETH Zurich University Archives has embarked on a web archiving initiative. Since the end of 2017, the new ETH Zurich Web Archive has been available to the public.
Added value compared to other web archives
While the mass of archived websites in the Internet Archive is truly impressive, the University Archives purposefully invests in the quality of its web archive in the following ways:
- Systematic selection of websites
- Quality assurance (e.g. comprehensiveness of the content)
- Professional description and metadata
- Long-term accessibility
- Persistent identifier for scientific citation
The University Archives cooperates closely with ETH IT Services, which operates the web crawler, and with the ETH Data Archive (ETH Library), which provides long-term data storage and access.
At ETH Zurich, there is a tradition of web archiving. Thanks to an earlier initiative (available at http://www.archiv.ethz.ch/), some important websites are still accessible for historical purposes. The new ETH Web Archive regularly captures the most important parts of the university’s web presence: its main page, the portals for ETH members and students as well as special interest websites, i.e. institute and research group websites.
ETH websites are harvested with the web crawler Heritrix, which creates a container in WARC format holding all elements of the website. The illustration shows a section of the header of a WARC file.
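To make the header structure concrete, here is a minimal Python sketch that reads the header of a single WARC record. It relies only on the general layout of the format: a version line followed by "Name: value" fields and a blank line. The sample record, its URL, date and identifier are invented for illustration and are not taken from the ETH Zurich Web Archive.

```python
def parse_warc_header(raw: bytes) -> tuple[str, dict[str, str]]:
    """Split the header block of one WARC record into version and named fields."""
    # The header block ends at the first blank line (CRLF CRLF).
    header_block, _, _ = raw.partition(b"\r\n\r\n")
    lines = header_block.decode("utf-8").split("\r\n")
    version = lines[0]
    fields: dict[str, str] = {}
    for line in lines[1:]:
        # Split on the first colon only, since values such as dates contain colons.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return version, fields


# Hypothetical sample record header (all values are assumptions for illustration).
SAMPLE = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>\r\n"
    b"WARC-Date: 2017-11-30T10:00:00Z\r\n"
    b"WARC-Target-URI: https://www.example.org/\r\n"
    b"Content-Length: 0\r\n"
    b"\r\n"
)

version, fields = parse_warc_header(SAMPLE)
print(version)
for name, value in fields.items():
    print(f"{name}: {value}")
```

A production workflow would normally rely on an established WARC library rather than hand-written parsing; the sketch only shows what kind of information the header carries.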
Challenges for digital preservation
How can we ensure that our archived websites remain available in the long term? Reliable and geo-redundant storage is certainly a basic requirement and is provided by ETH IT Services. At ETH Library, our Digital Preservation Managers provide additional expertise by monitoring the evolution of file formats in order to ensure long-term preservation. It is essential to detect as early as possible whether the current standard format for web archiving (WARC) is being replaced by a new format and whether the web viewers in use are still supported. Of course, this task also applies to objects embedded in the websites, for example PDFs, images, and videos. The responsibility for this lies with the ETH Data Archive.
Where we would like to see innovation
Web archiving is a labour-intensive process, especially with regard to quality assurance and cataloguing. It would be useful to develop a tool which compares the pixels of the archived website with the pixels of the original website, thus automatically grading the quality of the archived version. It would also be useful if a website’s title and date of last activity were automatically read out to the metadata fields.
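The two ideas above could look roughly like the following Python sketch. It is not an existing tool: the function names, the tolerance parameter and the representation of screenshots as nested lists of RGB tuples are all assumptions made for illustration. A real tool would first have to render both pages to images of identical size. Only the title is extracted here; the date of last activity is left out because it has no single standard location in a page.

```python
from html.parser import HTMLParser

Pixel = tuple[int, int, int]  # assumed RGB representation of one pixel


def pixel_similarity(original: list[list[Pixel]],
                     archived: list[list[Pixel]],
                     tolerance: int = 0) -> float:
    """Return the fraction of pixels that match within a per-channel tolerance."""
    same_shape = len(original) == len(archived) and all(
        len(a) == len(b) for a, b in zip(original, archived)
    )
    if not same_shape:
        raise ValueError("screenshots must have identical dimensions")
    total = sum(len(row) for row in original)
    if total == 0:
        raise ValueError("screenshots are empty")
    matching = 0
    for row_o, row_a in zip(original, archived):
        for p, q in zip(row_o, row_a):
            if all(abs(c1 - c2) <= tolerance for c1, c2 in zip(p, q)):
                matching += 1
    return matching / total


class TitleExtractor(HTMLParser):
    """Collect the text found inside the <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html: str) -> str:
    """Return the page title with surrounding whitespace removed."""
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


# Hypothetical 2x2 screenshots: one pixel differs in the archived version.
live = [[(255, 255, 255), (0, 0, 0)], [(10, 10, 10), (200, 200, 200)]]
snap = [[(255, 255, 255), (0, 0, 0)], [(10, 10, 10), (0, 0, 0)]]
print(pixel_similarity(live, snap))
print(extract_title("<html><head><title> Example page </title></head></html>"))
```

A plain pixel ratio is a deliberately simple quality grade. Dynamic content such as news tickers or rotating banners would lower the score even for a complete capture, so a threshold would have to be chosen with care.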
How to search and quote
The ETH Zurich Web Archive and its metadata are accessible via various online search portals. The illustration shows a list of search results from the University Archives Information System.
Every snapshot, i.e. each version of a website archived at a particular point in time, is assigned a Digital Object Identifier (DOI). Thus, users can cite these snapshots as sources in their scientific publications.
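For citation purposes, a DOI is usually written as a resolvable link through the public resolver at https://doi.org/. The short Python sketch below shows that conversion. The function name is an assumption, and the DOI used in the example is the one of this article, not of a web archive snapshot.

```python
from urllib.parse import quote


def doi_to_url(doi: str) -> str:
    """Build a resolver link for a DOI, percent-encoding unsafe characters."""
    # Keep the slash between prefix and suffix; encode anything else unsafe in a URL.
    return "https://doi.org/" + quote(doi.strip(), safe="/")


print(doi_to_url("10.16911/ethz-ib-2993-en"))
```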
Would you like to save your ETH website in the ETH Zurich Web Archive? To register, please email email@example.com.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International Public License.
DOI Link: 10.16911/ethz-ib-2993-en