Software and Processes
The Digital Projects Unit’s web archiving techniques.
Software
There are two main pieces to our web archiving process: harvesting the live web content and providing access to the resulting archived files.
Harvesting
For broad crawls we use the Internet Archive's open-source web crawler Heritrix, which begins at URLs we specify and then recursively visits the links on the pages it encounters. It downloads the files (HTML, CSS, JavaScript, videos, etc.) that are within our scope and appends them as records to WARC files, the standard container format for storing web archives.
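Heritrix handles the WARC writing itself, but the record structure is easy to illustrate. Below is a minimal sketch, using Webrecorder's warcio library, of fetching a single page and appending it to a gzipped WARC file as a response record; the URL and output file name are placeholders.

```python
# Minimal sketch (not Heritrix itself): write one fetched page to a WARC file
# as a "response" record using the warcio library.
import requests
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

with open("example.warc.gz", "wb") as output:
    writer = WARCWriter(output, gzip=True)

    resp = requests.get("https://example.com/",
                        headers={"Accept-Encoding": "identity"},
                        stream=True)

    # Reconstruct the HTTP status line and headers for the record's HTTP block.
    http_headers = StatusAndHeaders(f"{resp.status_code} {resp.reason}",
                                    resp.raw.headers.items(),
                                    protocol="HTTP/1.1")

    record = writer.create_warc_record("https://example.com/", "response",
                                       payload=resp.raw,
                                       http_headers=http_headers)
    writer.write_record(record)
```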
For smaller-scale, more targeted crawling we use Browsertrix Crawler, Browsertrix Cloud, or pywb's recording mode. These tools, developed by Webrecorder, can produce higher-fidelity captures of complex websites because, unlike Heritrix, they render each page in a browser and execute its JavaScript, writing a more complete result to the output WARC files.
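For illustration, Browsertrix Crawler runs as a Docker container; the sketch below launches a small browser-based crawl from Python. The image name and flags follow the project's documented examples but may differ between versions, and the seed URL and collection name are placeholders.

```python
# Illustrative only: invoke the Browsertrix Crawler container from Python.
import subprocess
from pathlib import Path

crawl_dir = Path("crawls")          # host directory that will hold the output
crawl_dir.mkdir(exist_ok=True)

subprocess.run([
    "docker", "run",
    "-v", f"{crawl_dir.resolve()}:/crawls/",    # mount the output directory
    "webrecorder/browsertrix-crawler", "crawl",
    "--url", "https://example.com/",            # placeholder seed URL
    "--generateWACZ",                           # package the WARCs as a WACZ
    "--collection", "example-site",             # placeholder collection name
], check=True)
```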
Access
To display the archived content on the Web, we are migrating from OpenWayback, the replay tool maintained by the International Internet Preservation Consortium (IIPC) that is no longer in development, to Webrecorder's pywb. These applications rely on client- and server-side rewriting of links so that requests are served from documents in the archive's WARC files rather than pulled from the live Web.
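The core rewriting idea can be shown in a few lines. The sketch below is a deliberate simplification (real pywb rewriting also handles relative URLs, replay modifiers, and JavaScript-generated requests), and the collection name and timestamp are placeholders.

```python
# Simplified illustration of wayback-style link rewriting: a live URL is
# mapped to a replay URL that resolves against the archive instead of the web.
def rewrite_link(live_url: str, collection: str = "our-archive",
                 timestamp: str = "20240101000000") -> str:
    """Map a live URL to a replay URL served from the archive's WARC content."""
    return f"/{collection}/{timestamp}/{live_url}"

print(rewrite_link("https://example.com/about.html"))
# /our-archive/20240101000000/https://example.com/about.html
```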
Archiving Process
We begin by examining the site(s) that we will be archiving, looking for areas we do or do not want to capture and identifying potential crawler traps. We browse the site manually as a user might and also look at source code when crawler access appears questionable. If necessary, we write scripts to extract elusive URIs that may then be added to the crawl’s seed list.
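The sketch below shows the kind of helper script this might involve: pulling hard-to-reach URIs out of a page's source (for example, links assembled by JavaScript or buried in embedded JSON) and appending them to a seed list. The URL, host filter, and output file name are placeholders.

```python
# Illustrative seed-extraction helper: collect absolute URLs from a page's
# source, keep the in-scope ones, and append them to a seed list.
import re
import requests

SEED_FILE = "seeds.txt"
page = requests.get("https://example.com/gallery").text

# Grab absolute URLs anywhere in the source, including inline scripts,
# then keep only those on the target host.
candidates = re.findall(r'https?://[^\s"\'<>]+', page)
in_scope = sorted({u for u in candidates if "example.com" in u})

with open(SEED_FILE, "a", encoding="utf-8") as f:
    for uri in in_scope:
        f.write(uri + "\n")
```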
With the knowledge gained from that examination, we configure the crawler with rules that limit its harvesting to the content we have deemed in scope.
Next we may run a test crawl to verify that the crawler is configured to download all of the URIs needed to render the archived site true to its live version. A test capture also indicates how long the final crawl should take, which depends on factors such as the amount of content, how it is organized, and any delays needed to avoid overwhelming the target server.
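A back-of-the-envelope estimate illustrates why the politeness delay dominates the schedule; all of the numbers below are hypothetical.

```python
# Hypothetical crawl-duration estimate based on a test crawl.
uri_count = 50_000          # URIs discovered in the test crawl
politeness_delay = 2.0      # seconds between requests to the target server
threads = 1                 # a polite, single-threaded crawl of one host

hours = uri_count * politeness_delay / threads / 3600
print(f"Estimated crawl time: about {hours:.0f} hours")   # about 28 hours
```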
Once a crawl is complete, we create a CDX file, an index of the downloaded items stored in the WARC files, and a second index that maps the WARC file names to their accessible locations. We then configure an instance of pywb, running at localhost, to use these indexes so that we can browse the archived website and review its quality. With the help of browser developer tools, we can discover files that were not archived in our crawl.
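The indexing step itself is conceptually simple: for each archived response, record enough information (URL, capture timestamp, byte offset, WARC file name) for the replay tool to seek straight to the record. The sketch below uses warcio to build a simplified index; in practice we rely on the indexing tools that ship with pywb, and the WARC file name here is a placeholder.

```python
# Simplified CDX-style indexing sketch using warcio.
from warcio.archiveiterator import ArchiveIterator

warc_path = "crawl-00000.warc.gz"

with open(warc_path, "rb") as stream, open("index.cdx", "w") as out:
    records = ArchiveIterator(stream)
    for record in records:
        if record.rec_type != "response":
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        timestamp = record.rec_headers.get_header("WARC-Date")
        offset = records.get_record_offset()   # where this record starts
        out.write(f"{url} {timestamp} {offset} {warc_path}\n")
```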
If we missed desired areas of content, we modify our crawl configuration and execute another crawl to obtain the documents we lack.
Standards
Downloaded content is stored in WARC files. We do not manipulate these files once they are written, allowing us to keep a true record of a site at the time of its capture.
Current Challenges and Limitations
Although there is an active community developing and improving the tools and methods used in web archiving, a common set of problems continues to arise during the process.
External links and externally hosted media, such as video, may be problematic to harvest since we must rely on third parties to supply the files. Even when we are able to download the media content files, some embedded media players do not function properly when replaying a site. Sites that embed media but also provide a link to a direct download of the media file help to ensure that users will be able to access these files from the archive.
Because the content served from the archive is static, server-side scripting from the original site will not work. PHP, JSP, ASP, etc. will not execute, so actions such as querying databases, performing searches, and processing form submissions will no longer function. We do not have a copy of any databases that drove the original site, so there is nothing for us to query. CSS should continue to function properly. Most JavaScript also works (since it is executed client side) as long as the crawler was able to capture, and the replay software can rewrite, any URLs requested by the script.
Crawl Configuration
When configuring a crawler for a harvest, we consider settings such as:
- How many threads (processes) should be run at once
- How long the crawler should wait between retrievals
- How many times URIs should be retried
- Whether or not the crawler should comply with robots.txt
- In what format downloaded content should be written
The settings we apply change from crawl to crawl based on factors such as the resources and hardware currently available to us, the permission we have gained to harvest a site, any time limitations in place, and our goals for a specific capture. Two settings that remain consistent for every crawl specify our crawl operator information. Our crawler informs the web servers it visits of a URL where a webmaster noticing traffic from us can read about our crawling activity. We also provide an e-mail address so that a webmaster who finds our crawler causing trouble for their servers, such as by making too many requests too quickly, can contact us about the issue.
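The sketch below is not our Heritrix configuration; it only illustrates the settings described above in miniature: honoring robots.txt, waiting between retrievals, retrying failures, and identifying the crawl operator. The operator URL, contact address, and target site are placeholders.

```python
# Illustrative polite fetch with operator identification (not Heritrix).
import time
import urllib.robotparser
import requests

OPERATOR_UA = ("ExampleArchiveBot/1.0 "
               "(+https://library.example.edu/webarchiving; webarchive@example.edu)")
DELAY_SECONDS = 2.0        # politeness delay between retrievals
MAX_RETRIES = 3            # how many times a failed URI is retried

robots = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
robots.read()

def polite_fetch(url: str):
    if not robots.can_fetch(OPERATOR_UA, url):   # comply with robots.txt
        return None
    for _ in range(MAX_RETRIES):
        try:
            return requests.get(url, headers={"User-Agent": OPERATOR_UA},
                                timeout=30)
        except requests.RequestException:
            time.sleep(DELAY_SECONDS)            # back off before retrying
    return None
```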
Crawl Scope
During configuration, we also define scope rules for the crawler to follow. Some of our most commonly applied rules are:
- Accept URIs based on SURT prefix
- Accept and reject URIs based on regular expressions
- Reject URIs based on too many path segments (potential crawler trap)
- Accept URIs based on number of hops from seed
- Use a transclusion rule that accepts embedded content hosted by otherwise out-of-scope domains
- Accept a URI based on an in-scope page linking to it
- Use a prerequisite rule that accepts otherwise out-of-scope URIs that are required to get something that is in scope
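In Heritrix these rules are expressed as DecideRules in the crawl configuration. The sketch below is not Heritrix code; it only illustrates how several of the rules above combine to accept or reject a candidate URI. The SURT prefix, regular expression, and limits are placeholder values, and the SURT conversion is deliberately simplified.

```python
# Illustrative scope decision combining SURT prefix, regex, path-segment,
# hop-count, and transclusion checks (placeholder values throughout).
import re
from urllib.parse import urlsplit

SURT_PREFIX = "(edu,example,library,)/collections/"   # in-scope SURT prefix
REJECT_RE = re.compile(r"/calendar/\d{4}/")           # e.g. an endless calendar
MAX_PATH_SEGMENTS = 15                                # crawler-trap guard
MAX_HOPS = 20

def to_surt(url: str) -> str:
    """Very simplified SURT form: reverse the host labels, keep the path."""
    parts = urlsplit(url)
    host = ",".join(reversed(parts.hostname.split("."))) + ","
    return f"({host}){parts.path}"

def in_scope(url: str, hops: int, is_embed: bool = False) -> bool:
    if is_embed:                      # transclusion: embedded content from
        return True                   # otherwise out-of-scope hosts is kept
    if hops > MAX_HOPS:
        return False
    if REJECT_RE.search(url):
        return False
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > MAX_PATH_SEGMENTS:
        return False
    return to_surt(url).startswith(SURT_PREFIX)

print(in_scope("https://library.example.edu/collections/maps/", hops=3))  # True
```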