6.1. Hash Filter
- Plugin Key
mediatype_filter_factory, where mediatype is a media type like text/html
- Value Type
- Plugin Value Type
The string is the fully-qualified name of a Java class implementing the org.lockss.plugin.FilterFactory interface.
- Sample
<entry>
  <string>text/html_filter_factory</string>
  <string>edu.example.plugin.publisherx.PublisherXHtmlHashFilterFactory</string>
</entry>
- Description
To canonicalize content before it is compared between nodes in the LOCKSS audit and repair protocol, a plugin can define a hash filter for each affected media type. The goal is to pre-process content so that nodes can compare it logically even when they do not hold byte-identical versions. Such differences are common in HTML, where pages surround the main content with personalization ("You are logged in as..."), advertising, and other variable content ("You may also be interested in...", "Top 10 viewed articles this week...", "Recently added articles..."). A hash filter can also be needed for other media types, such as PDF and RIS, because of timestamping, watermarking, and other dynamic server behaviors.
The org.lockss.plugin.FilterFactory interface defines a createFilteredInputStream method that accepts an org.lockss.plugin.ArchivalUnit object, an InputStream of the URL's raw content, and a string representing the encoding, and returns an InputStream of the canonicalized byte stream. The returned stream does not need to be a valid object of that media type, because it is only used to compute a checksum.
As part of its general content filtering framework, the LOCKSS plugin framework offers a variety of utility classes specifically for HTML Filters and PDF Filters.
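The canonicalization idea can be illustrated with a minimal, standalone Java sketch. It does not use the LOCKSS classes or the framework's HTML utility classes. The class name HashFilterSketch, the div class names userinfo and ads, and the regex-based stripping are all assumptions made for illustration. In a real plugin, logic of this kind would sit behind createFilteredInputStream in a class implementing org.lockss.plugin.FilterFactory, and would more likely rely on the framework's HTML filtering utilities than on regular expressions.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.regex.Pattern;

// Standalone illustration only; not part of the LOCKSS API.
class HashFilterSketch {

  // Hypothetical markup: assumes variable content lives in
  // <div class="userinfo"> or <div class="ads"> with no nested divs.
  private static final Pattern VARIABLE_BLOCK = Pattern.compile(
      "<div\\s+class=\"(?:userinfo|ads)\">.*?</div>",
      Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

  private static final Pattern WHITESPACE = Pattern.compile("\\s+");

  // Mirrors the shape of the stream-in, stream-out contract described above:
  // raw content and its encoding go in, a canonicalized byte stream comes out.
  static InputStream filter(InputStream in, String encoding) throws IOException {
    Charset cs = (encoding == null) ? StandardCharsets.UTF_8 : Charset.forName(encoding);
    String html = new String(in.readAllBytes(), cs);
    String stripped = VARIABLE_BLOCK.matcher(html).replaceAll("");
    // Collapse whitespace so that layout-only differences do not affect the hash.
    String normalized = WHITESPACE.matcher(stripped).replaceAll(" ").trim();
    return new ByteArrayInputStream(normalized.getBytes(cs));
  }

  // Hex-encoded SHA-256 of everything remaining in the stream.
  static String sha256Hex(InputStream in) throws IOException {
    try {
      MessageDigest md = MessageDigest.getInstance("SHA-256");
      StringBuilder sb = new StringBuilder();
      for (byte b : md.digest(in.readAllBytes())) {
        sb.append(String.format("%02x", b));
      }
      return sb.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);
    }
  }

  // Convenience wrapper: filter a UTF-8 HTML string, then hash the result.
  static String filteredHash(String html) {
    try {
      InputStream raw = new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
      return sha256Hex(filter(raw, "UTF-8"));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // Two nodes hold the same article with different personalization.
    String nodeA = "<html><body><div class=\"userinfo\">You are logged in as Alice</div>\n"
        + "<p>Main article text.</p></body></html>";
    String nodeB = "<html><body><div class=\"userinfo\">You are logged in as Bob</div>\n\n"
        + "<p>Main article text.</p></body></html>";
    System.out.println("Filtered hashes match: "
        + filteredHash(nodeA).equals(filteredHash(nodeB)));
  }
}
```

The filtered output here is no longer guaranteed to be well-formed HTML, which is acceptable because the stream only feeds a checksum. Regular expressions are fragile against nested or reordered markup, so this sketch should be read as a demonstration of the goal rather than a recommended filtering technique.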