Fig 1.
Verifying the fixity of a file in a GitHub repository.
A third party verifying the fixity of a file in a GitHub repository can directly download the file and access the associated server-generated hash to compare with a locally generated hash.
Fig 2.
Verifying the fixity of a composite memento.
A third party verifying the fixity of an archived web page cannot directly download the WARC or a server-generated hash, and must instead access the resource via replay software. Multiple locally generated hashes can be compared to each other, but there is no single server-generated hash available for comparison.
Table 1.
Our set of 17 public web archives.
Table 2.
URI-Rs and URI-Ms of example web page.
The URI-Rs of the original resources and the URI-Ms of their corresponding mementos for the web page https://maturban.github.io/playground/index.html.
Fig 3.
The representation of the memento https://web.archive.org/web/20190725212938/https://maturban.github.io/playground/index.html.
M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.
Fig 4.
The rewritten HTML of the memento https://web.archive.org/web/20190725212938/https://maturban.github.io/playground/index.html.
The code marked in red was added by the archive. The archive also modifies the names of original headers by adding x-archive-orig at the beginning of these headers. M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.
Fig 5.
The raw HTML from requesting the memento https://web.archive.org/web/20190725212938id_/https://maturban.github.io/playground/index.html.
M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.
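The URI-Ms in Figs 4 and 5 differ only by the `id_` modifier appended to the 14-digit timestamp. The following is a minimal sketch of that transformation, assuming Wayback-style URI-Ms; the function name `to_raw_urim` is hypothetical, and not every archive honors the modifier.

```python
import re

# Assumption: Wayback-style URI-Ms contain a 14-digit timestamp path segment.
_TIMESTAMP_SEGMENT = re.compile(r"/(\d{14})/")


def to_raw_urim(urim: str) -> str:
    """Insert the id_ modifier after the first 14-digit timestamp segment."""
    raw, replaced = _TIMESTAMP_SEGMENT.subn(r"/\1id_/", urim, count=1)
    if replaced == 0:
        # Covers URI-Ms without a plain timestamp segment, including
        # URI-Ms that already carry a modifier such as id_ or im_.
        raise ValueError(f"no plain 14-digit timestamp segment in {urim!r}")
    return raw
```

Applied to the rewritten URI-M of Fig 4, this sketch yields the raw URI-M shown in Fig 5.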
Fig 6.
Overview of our methodology for evaluating the use of hashes for determining fixity on replayed mementos.
Fig 7.
Note that we collected mementos on November 15, 2017, so there are fewer mementos from 2017 than from other years. M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.
Table 3.
Final URI-Rs and URI-Ms selected per archive.
The 16,627 total URI-Ms come from 3,698 unique URI-Rs. We include some of the same URI-Rs in multiple archives because they produce different URI-Ms.
Fig 8.
Median number of embedded resources per memento per year.
M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.
Table 4.
URI-Ms per archive per year.
Fig 9.
Diagram illustrating how the root hash of a memento using Merkle trees is generated.
The output of a Merkle tree becomes input to another Merkle tree. The brown Merkle tree is for generating a hash on HTTP response headers of each resource. The blue Merkle tree generates an overall hash for each resource. The red Merkle tree generates a hash that represents rewritten.warc and another hash for raw.warc. The root hash is generated by the green Merkle tree. M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.
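The nesting described in Fig 9 can be sketched as follows. This is an illustration under stated assumptions, not the dissertation's implementation: SHA-256, duplicating the last node on odd levels, the sorting of headers and resources, and all function and field names (`merkle_root`, `resource_hash`, `warc_hash`, `memento_root_hash`, `urim`, `status`, `headers`, `body`) are hypothetical choices made for this example.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def merkle_root(leaves: list[bytes]) -> str:
    """Hash each leaf, then hash adjacent pairs until one node remains."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    level = [sha256_hex(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumption: duplicate the odd node
        level = [
            sha256_hex((level[i] + level[i + 1]).encode("ascii"))
            for i in range(0, len(level), 2)
        ]
    return level[0]


def resource_hash(resource: dict) -> str:
    """Headers tree (brown) feeds the per-resource tree (blue)."""
    header_leaves = [
        f"{name}: {value}".encode("utf-8")
        for name, value in sorted(resource["headers"].items())
    ]
    headers_root = merkle_root(header_leaves)
    return merkle_root([
        headers_root.encode("ascii"),
        resource["body"],
        str(resource["status"]).encode("ascii"),
        resource["urim"].encode("utf-8"),
    ])


def warc_hash(resources: list[dict]) -> str:
    """One hash per WARC (red), over its resources in URI-M order."""
    ordered = sorted(resources, key=lambda r: r["urim"])
    return merkle_root([resource_hash(r).encode("ascii") for r in ordered])


def memento_root_hash(rewritten: list[dict], raw: list[dict]) -> str:
    """Root hash (green) over the rewritten and raw WARC hashes."""
    return merkle_root([
        warc_hash(rewritten).encode("ascii"),
        warc_hash(raw).encode("ascii"),
    ])
```

Because each tree's output is a leaf of the next tree, a change in any header, entity body, status code, or URI-M propagates to the root hash.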
Fig 10.
Different images shown on each replay.
Replaying the memento https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/ at three different times produced three different images. M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.
Fig 11.
JavaScript code on https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/.
Because of the function Math.random(), each time the JavaScript code is executed, an image will be selected randomly. M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.
Fig 12.
Replaying the memento https://web.archive.org/web/20141209193553/http://noisecreep.com/aaron-harris-of-isis-talks-twitter/ at two different times. We observe different HTTP status codes for the embedded image https://web.archive.org/web/20141209193553im_/http://wac.450F.edgecastcdn.net/80450F/noisecreep.com/files/2009/06/aaron_a042209eb_200.jpg. M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.
Fig 13.
Different URI-M, but identical HTTP entities.
Requesting the image https://web.archive.org/web/20171114170029im_/https://sos.tn.gov/sites/default/files/styles/large/public/15259.jpg?itok=BgNjlAZj, which is embedded in the memento https://web.archive.org/web/20171114170029/https://sos.tn.gov/tsla, at two different times. Each time, the request redirects to a different URI-M with a different Memento-Datetime, but the returned HTTP entities are identical.
Fig 14.
archive.is refers to itself differently in the HTML upon multiple downloads.
Downloading the ZIP file http://archive.is/download/BRWpm.zip of the memento http://archive.is/BRWpm at three different times. Each time the archive refers to itself differently in the index.html in the ZIP file. M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.
Fig 15.
Requesting the image http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg, which is embedded in the memento http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/, at two different times. The HTTP entity changed between requests because of a transient error. M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.
Fig 16.
Three archives react differently to requests for raw mementos.
The archive vefsafn.is returns a custom HTML page with 200 OK, which might produce different hashes. The archive webharvest.gov issues a 302 Redirect to the live web, while archive.org returns a 302 Redirect (with the original raw HTML page, marked in blue) to the closest raw memento that satisfies the request. The way that webharvest.gov and archive.org react to requests for raw mementos does not affect the hash calculation. M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.
Fig 17.
Redirect to a memento that has a different HTTP entity.
Requesting the base HTML file https://web.archive.org/web/20080828005922id_/http://www.evangelcogdayton.org/ at two different times. The second request on December 28, 2017 redirects to a memento that has a different HTTP entity. M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.
Fig 18.
Redirect to a memento with the same URI-R, but different HTTP entity.
Requesting the image https://web.archive.org/web/20110116134258id_/http://1.gravatar.com/avatar/117a6cc4203b951f11fc43f946106657?s=33&d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D33&r=G, which is embedded in the memento https://web.archive.org/web/20110114074814/http://www.copyblogger.com:80/popular-blogger/, at two different times. The first HTTP request returns 200 OK, but the second request redirects to a URI-M (with the Memento-Datetime January 21, 2012 09:05:32 GMT) that has the same URI-R but a different HTTP entity. M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.
Fig 19.
Different HTTP entity, but image looks the same.
Requesting the image https://perma-archives.org/warc/20170101182814id_/http://umich.edu/includes/image/type/gallery/id/113/name/ResearchDIL-19Aug14_DM%28136%29.jpg/width/152/height/152/mode/minfit, which is embedded in the memento https://perma-archives.org/warc/20170101182813/http://umich.edu/, at two different times. The first HTTP request returns 200 OK, while the second request redirects to a URI-M (with the Memento-Datetime June 19, 2017 14:54:58 GMT) that has a different HTTP entity but looks exactly the same. The two images were compared using Resemble [105] (mismatched pixels are marked in pink). M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.
Fig 20.
Distribution of the number of distinct hash values over all 16,627 mementos.
The blue bar represents mementos with a single hash value for all downloads (1,920 mementos, 11.55%), and the red bar represents mementos with a different hash value on each download (2,670 mementos, 16.06%). M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.
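The distribution in Fig 20 amounts to counting, for each memento, how many distinct hash values its downloads produced. The sketch below shows one way to compute it; the input layout (a mapping from URI-M to the list of hashes from successive downloads) and the function names are assumptions made for this example.

```python
from collections import Counter


def distinct_hash_distribution(hashes_by_memento: dict[str, list[str]]) -> Counter:
    """Map 'number of distinct hash values' to 'number of mementos'."""
    return Counter(len(set(h)) for h in hashes_by_memento.values())


def summarize(hashes_by_memento: dict[str, list[str]]) -> dict[str, int]:
    """Count the two extremes shown as the blue and red bars."""
    single = sum(1 for h in hashes_by_memento.values() if len(set(h)) == 1)
    always_different = sum(
        1 for h in hashes_by_memento.values()
        if len(h) > 1 and len(set(h)) == len(h)
    )
    return {
        "total": len(hashes_by_memento),
        "single_hash": single,
        "different_every_download": always_different,
    }


def percentage(count: int, total: int) -> float:
    return round(100 * count / total, 2)
```

As a consistency check on the caption, `percentage(1920, 16627)` gives 11.55 and `percentage(2670, 16627)` gives 16.06.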
Table 5.
Mementos per archive that produced at least two different hashes, the same hash, or always different hashes.
Fig 21.
Number of mementos that have at least two different hashes increases over time.
This shows that the chance of getting different hashes for the same memento increases over time. M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.
Fig 22.
Types of changes affecting all mementos for each download.
There were a total of 16,627 mementos replayed in each download. M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.
Fig 23.
Percentage of mementos with each type of change in each download by archive.
M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.
Fig 24.
Percentage of mementos in each archive showing changes from the previous download.
Light blue = fewer mementos with changes compared to the previous download; dark blue = more mementos with one or more changes. M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.
Fig 25.
Hash calculations for the resources from 1,566 composite mementos from the Internet Archive in download 1.
Each point = hash(HTTP response headers, HTTP entity body, HTTP status code, URI-M). M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.
Fig 26.
Hash calculations for the resources from 1,566 composite mementos from the Internet Archive in downloads 1–5 and download 39.
Each point = hash(HTTP response headers, HTTP entity body, HTTP status code, URI-M). Red = the hash value was observed in this download, Gray = the previously seen hash value was not observed in this download. M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.
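The red and gray classification in Fig 26 can be expressed as set operations over the hash values seen so far. The following is a hedged sketch, assuming each download is available as a set of hash values; `track_hashes` and the dictionary keys are hypothetical names.

```python
def track_hashes(downloads: list[set[str]]) -> list[dict[str, set[str]]]:
    """For each download, split hash values into observed, new, and missing."""
    seen: set[str] = set()
    report = []
    for observed in downloads:
        report.append({
            "observed": set(observed),        # red points
            "new": observed - seen,           # first seen in this download
            "not_observed": seen - observed,  # gray points
        })
        seen |= observed
    return report
```

Taking `len(entry["new"])` for each entry gives a per-download count of hash values not seen in any earlier download.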
Fig 27.
Number of new hash values calculated per download from the Internet Archive.
M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.
Fig 28.
Number of new entities observed per download from the Internet Archive.
M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.