Fig 1.

Verifying the fixity of a file in a GitHub repository.

A third party verifying the fixity of a file in a GitHub repository can download the file directly, retrieve the associated server-generated hash, and compare it with a locally generated hash.
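
The comparison described in this caption can be sketched as follows. This is a minimal illustration that assumes the published hash is a SHA-256 hex digest; the function name `verify_fixity` and the sample bytes are hypothetical and are not taken from the article or from any GitHub API.

```python
import hashlib
import hmac


def verify_fixity(content: bytes, published_hex: str, algorithm: str = "sha256") -> bool:
    """Return True if the locally generated hash matches the published one.

    Assumption: the server publishes a hex digest computed over the raw file bytes.
    """
    local_hex = hashlib.new(algorithm, content).hexdigest()
    # compare_digest avoids leaking where the two strings first differ.
    return hmac.compare_digest(local_hex, published_hex.strip().lower())


# Hypothetical data standing in for a downloaded file and its published hash.
data = b"example file contents\n"
published = hashlib.sha256(data).hexdigest()
```

Any change to the downloaded bytes, even a single appended character, makes the check fail.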

Fig 2.

Verifying the fixity of a composite memento.

A third party verifying the fixity of an archived web page cannot directly download the WARC or the server-generated hash, and must instead access the resource through replay software. We can compare multiple locally generated hashes to each other, but there is no single server-generated hash available for comparison.

Table 1.

Our set of 17 public web archives.

Table 2.

URI-Rs and URI-Ms of example web page.

The URI-Rs of the original resources and the URI-Ms of their corresponding mementos for the web page https://maturban.github.io/playground/index.html.

Fig 3.

The representation of the memento https://web.archive.org/web/20190725212938/https://maturban.github.io/playground/index.html.

M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.

Fig 4.

The rewritten HTML of the memento https://web.archive.org/web/20190725212938/https://maturban.github.io/playground/index.html.

The code marked in red was added by the archive. The archive also modifies the names of original headers by adding x-archive-orig at the beginning of these headers. M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.

Fig 5.

The raw HTML from requesting the memento https://web.archive.org/web/20190725212938id_/https://maturban.github.io/playground/index.html.

M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.
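
The raw URI-M in this caption differs from the rewritten one only by the `id_` modifier after the 14-digit timestamp. The sketch below derives one from the other, assuming Wayback-style URI-Ms; `to_raw_uri_m` is a hypothetical helper, and not every archive honors `id_`.

```python
import re

# A 14-digit timestamp, optionally followed by a two-letter modifier such as im_ or id_.
_TIMESTAMP = re.compile(r"(/\d{14})(?:[a-z]{2}_)?/")


def to_raw_uri_m(uri_m: str) -> str:
    """Insert (or normalize to) the id_ modifier in a Wayback-style URI-M."""
    raw, count = _TIMESTAMP.subn(r"\1id_/", uri_m, count=1)
    if count == 0:
        raise ValueError(f"no 14-digit timestamp found in {uri_m!r}")
    return raw


rewritten = "https://web.archive.org/web/20190725212938/https://maturban.github.io/playground/index.html"
raw = to_raw_uri_m(rewritten)
```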

Fig 6.

Overview of our methodology for evaluating the use of hashes for determining fixity on replayed mementos.

Fig 7.

URI-Ms collected per year.

Note that we collected mementos on November 15, 2017, and thus, the number of mementos from 2017 is fewer than the number of mementos in other years. M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.

Table 3.

Final URI-Rs and URI-Ms selected per archive.

The 16,627 total URI-Ms come from 3,698 unique URI-Rs. We include some of the same URI-Rs in multiple archives because they produce different URI-Ms.

Fig 8.

Median number of embedded resources per memento per year.

M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.

Table 4.

URI-Ms per archive per year.

Fig 9.

Diagram illustrating how the root hash of a memento using Merkle trees is generated.

The output of a Merkle tree becomes input to another Merkle tree. The brown Merkle tree is for generating a hash on HTTP response headers of each resource. The blue Merkle tree generates an overall hash for each resource. The red Merkle tree generates a hash that represents rewritten.warc and another hash for raw.warc. The root hash is generated by the green Merkle tree. M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.
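
The nesting described in this caption can be sketched as follows. This is an illustration only: the use of SHA-256, the duplication of the last node on odd levels, the sorting of headers, and the function names (`merkle_root`, `resource_hash`, `memento_root`) are all assumptions, not the construction used in the dissertation.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaves: list[bytes]) -> bytes:
    """Hash each leaf, then combine pairwise until a single root remains."""
    if not leaves:
        raise ValueError("a Merkle tree needs at least one leaf")
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumption: duplicate the last node on odd levels
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def resource_hash(headers: dict[str, str], body: bytes, status: int, uri_m: str) -> bytes:
    """One tree over the response headers feeds a second tree for the whole resource."""
    header_leaves = [f"{k}: {v}".encode() for k, v in sorted(headers.items())]
    headers_root = merkle_root(header_leaves)
    return merkle_root([headers_root, body, str(status).encode(), uri_m.encode()])


def memento_root(rewritten: list[tuple], raw: list[tuple]) -> bytes:
    """Per-resource hashes feed one tree per capture; the two results feed the root tree."""
    rewritten_hash = merkle_root([resource_hash(*r) for r in rewritten])
    raw_hash = merkle_root([resource_hash(*r) for r in raw])
    return merkle_root([rewritten_hash, raw_hash])
```

Because each tree's output is a leaf of the next, a change in any header, entity body, status code, or URI-M propagates up to the root hash.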

Fig 10.

Different images shown on each replay.

Replaying the memento https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/ at three different times produced three different images. M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.

Fig 11.

JavaScript code on https://www.webharvest.gov/congress112th/20130119060624/http://www.fws.gov/.

Because of the function Math.random(), each time the JavaScript code is executed, an image will be selected randomly. M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.

Fig 12.

Different HTTP status codes.

Replaying the memento https://web.archive.org/web/20141209193553/http://noisecreep.com/aaron-harris-of-isis-talks-twitter/ at two different times. We observe a different HTTP status code for the embedded image https://web.archive.org/web/20141209193553im_/http://wac.450F.edgecastcdn.net/80450F/noisecreep.com/files/2009/06/aaron_a042209eb_200.jpg on each replay. M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.

Fig 13.

Different URI-M, but identical HTTP entities.

Requesting the image https://web.archive.org/web/20171114170029im_/https://sos.tn.gov/sites/default/files/styles/large/public/15259.jpg?itok=BgNjlAZj which is embedded in the memento https://web.archive.org/web/20171114170029/https://sos.tn.gov/tsla at two different times. Each time, it redirects to a different URI-M with different Memento-Datetime, but the returned HTTP entities are identical.

Fig 14.

archive.is refers to itself differently in the HTML upon multiple downloads.

Downloading the ZIP file http://archive.is/download/BRWpm.zip of the memento http://archive.is/BRWpm at three different times. Each time the archive refers to itself differently in the index.html in the ZIP file. M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.

Fig 15.

HTTP entity change.

Requesting the image http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/media/1157842/corporate-strategy-1.jpg which is embedded in the memento http://webarchive.nationalarchives.gov.uk/20170303010736id_/https://cereals.ahdb.org.uk/ at two different times. We noticed a change in the HTTP entity caused by a transient error. M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.

Fig 16.

Three archives react differently to requests for raw mementos.

The archive vesafn.is returns a custom HTML page with 200 OK which might cause different hashes. The archive webharvest.gov issues 302 Redirect to the live web, while archive.org returns 302 Redirect (with the original, raw HTML page—marked in blue) to the closest raw memento that satisfies the request. The way that webharvest.gov and archive.org react to requests for raw mementos does not affect the hash calculation. M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.

Fig 17.

Redirect to a memento that has different HTTP entity.

Requesting the base HTML file https://web.archive.org/web/20080828005922id_/http://www.evangelcogdayton.org/ at two different times. The second request on December 28, 2017 redirects to a memento that has a different HTTP entity. M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.

Fig 18.

Redirect to a memento with the same URI-R, but different HTTP entity.

Requesting the image https://web.archive.org/web/20110116134258id_/http://1.gravatar.com/avatar/117a6cc4203b951f11fc43f946106657?s=33&d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D33&r=G, which is embedded in the memento https://web.archive.org/web/20110114074814/http://www.copyblogger.com:80/popular-blogger/, at two different times. The first HTTP request returns 200 OK, but the second request redirects to a URI-M (with the Memento-Datetime January 21, 2012 09:05:32 GMT) that has the same URI-R but a different HTTP entity. M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.

Fig 19.

Different HTTP entity, but image looks the same.

Requesting the image https://perma-archives.org/warc/20170101182814id_/http://umich.edu/includes/image/type/gallery/id/113/name/ResearchDIL-19Aug14_DM%28136%29.jpg/width/152/height/152/mode/minfit, which is embedded in the memento https://perma-archives.org/warc/20170101182813/http://umich.edu/ at two different times. The first HTTP request returns 200 OK, while the second HTTP request of the image redirects to a URI-M (with the Memento-Datetime June 19, 2017 14:54:58 GMT) which has a different HTTP entity that looks exactly the same. The two images were compared using Resemble [105] (mismatched pixels are marked in pink). M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.

Fig 20.

Distribution of the number of distinct hash values over all 16,627 mementos.

The blue bar represents mementos with a single hash value for all downloads (1,920 mementos, 11.55%), and the red bar represents mementos with a different hash value on each download (2,670 mementos, 16.06%). M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.
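
The two extremes in this caption (one hash across all downloads, or a new hash on every download) can be identified by counting distinct hash values per memento. The sketch below assumes the per-download hashes are already available as strings; `classify_memento`, the category labels, and the sample data are hypothetical.

```python
from collections import Counter


def classify_memento(hashes: list[str]) -> str:
    """Classify a memento by how many distinct hashes its downloads produced."""
    if not hashes:
        raise ValueError("at least one download is required")
    distinct = len(set(hashes))
    if distinct == 1:
        return "single hash"
    if distinct == len(hashes):
        return "always different"
    return "mixed"


# Hypothetical hash values for three mementos over four downloads each.
downloads = {
    "memento-a": ["h1", "h1", "h1", "h1"],
    "memento-b": ["h1", "h2", "h3", "h4"],
    "memento-c": ["h1", "h1", "h2", "h1"],
}
summary = Counter(classify_memento(hs) for hs in downloads.values())
```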

Table 5.

Mementos per archive that produced at least two different hashes, the same hash, or always different hashes.

Fig 21.

Number of mementos that have at least two different hashes increases over time.

This shows that the chance of getting different hashes for the same memento increases over time. M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.

Fig 22.

Types of changes affecting all mementos for each download.

There were a total of 16,627 mementos replayed in each download. M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.

Fig 23.

Percentage of mementos with each type of change in each download by archive.

M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.

Fig 24.

Percentage of mementos in each archive showing changes from the previous download.

Light blue = fewer mementos with changes compared to the previous download; dark blue = more mementos with one or more changes. M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.

Fig 25.

Hash calculations for the resources from 1,566 composite mementos from the Internet Archive in download 1.

Each point = hash(HTTP response headers, HTTP entity body, HTTP status code, URI-M). M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.

Fig 26.

Hash calculations for the resources from 1,566 composite mementos from the Internet Archive in downloads 1–5 and download 39.

Each point = hash(HTTP response headers, HTTP entity body, HTTP status code, URI-M). Red = the hash value was observed in this download, Gray = the previously seen hash value was not observed in this download. M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.

Fig 27.

Number of new hash values calculated per download from the Internet Archive.

M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.

Fig 28.

Number of new entities observed per download from the Internet Archive.

M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.
