6.1. Hash Filter
- Plugin Key
mediatype_filter_factory, where mediatype is a media type like text/html
- Value Type
- Plugin Value Type
The string is the fully-qualified name of a Java class implementing the org.lockss.plugin.FilterFactory interface.
- Sample
<entry>
  <string>text/html_filter_factory</string>
  <string>edu.example.plugin.publisherx.PublisherXHtmlHashFilterFactory</string>
</entry>
- Description
To canonicalize content before it is compared between nodes in the LOCKSS audit and repair protocol, a plugin can define a hash filter for each affected media type. The goal is to pre-process content so that nodes can be compared logically even when they do not hold byte-identical versions. Byte-level differences are frequent in HTML, where pages carry personalization ("You are logged in as..."), advertising, and other variable content ("You may also be interested in...", "Top 10 viewed articles this week...", "Recently added articles...") alongside the main content. Hash filters can also be needed for other media types such as PDF and RIS because of timestamping, watermarking, and other dynamic server behaviors.
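The effect of canonicalizing before hashing can be illustrated with a small standalone sketch. It does not use any LOCKSS classes: the class name HashFilterMotivation, the login-banner markup, and the regex-based canonicalization are all assumptions made for illustration, and SHA-256 merely stands in for whatever checksum the protocol computes.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical demo class; not part of the LOCKSS API.
class HashFilterMotivation {

    // Illustrative canonicalization (an assumption, not the LOCKSS method):
    // drop a personalization banner, then collapse runs of whitespace.
    static String canonicalize(String html) {
        return html
            .replaceAll("(?s)<div class=\"login-banner\">.*?</div>", "")
            .replaceAll("\\s+", " ")
            .trim();
    }

    // SHA-256 stands in for the checksum used when nodes compare content.
    static byte[] sha256(String s) {
        try {
            return MessageDigest.getInstance("SHA-256")
                .digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    static boolean sameHash(String a, String b) {
        return Arrays.equals(sha256(a), sha256(b));
    }

    public static void main(String[] args) {
        // Two nodes hold the same article with different personalization.
        String nodeA = "<div class=\"login-banner\">You are logged in as Library A</div>\n"
            + "<p>Main article text.</p>";
        String nodeB = "<div class=\"login-banner\">You are logged in as Library B</div>\n"
            + "<p>Main  article text.</p>";

        System.out.println("raw equal: " + sameHash(nodeA, nodeB));
        System.out.println("filtered equal: "
            + sameHash(canonicalize(nodeA), canonicalize(nodeB)));
    }
}
```

The raw copies hash differently because of the banner and a stray extra space, while the canonicalized copies hash identically, which is the property a hash filter is meant to provide.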
The org.lockss.plugin.FilterFactory interface defines a createFilteredInputStream method that accepts an org.lockss.plugin.ArchivalUnit object, an InputStream of the URL's raw content, and a string representing the encoding, and returns an InputStream of the canonicalized byte stream, which does not need to be a valid object of that media type (it is only used to compute a checksum).

As part of its general content filtering framework, the LOCKSS plugin framework offers a variety of utility classes specifically for HTML Filters and PDF Filters.
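A minimal sketch of the createFilteredInputStream contract follows. To keep it compilable on its own, it declares stand-in ArchivalUnit and FilterFactory interfaces; a real plugin would import the org.lockss.plugin types and live in the package named in the plugin definition. The checked exception on the stand-in, the recommendations markup, the regex approach, and the demo helper are assumptions made for illustration; a production filter would more likely build on the framework's HTML filter utilities.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Stand-ins so the sketch compiles alone. A real plugin would import
// org.lockss.plugin.ArchivalUnit and org.lockss.plugin.FilterFactory.
interface ArchivalUnit {}

interface FilterFactory {
    // Assumption: the real interface may declare a different checked exception.
    InputStream createFilteredInputStream(ArchivalUnit au, InputStream in, String encoding)
        throws IOException;
}

// Hypothetical hash filter factory for an imaginary publisher.
class PublisherXHtmlHashFilterFactory implements FilterFactory {

    @Override
    public InputStream createFilteredInputStream(ArchivalUnit au, InputStream in, String encoding)
            throws IOException {
        // Decode using the supplied encoding, falling back to UTF-8 if none is given.
        Charset cs = (encoding == null) ? StandardCharsets.UTF_8 : Charset.forName(encoding);
        // Buffering the whole document keeps the sketch short; it is not streaming.
        String html = new String(in.readAllBytes(), cs);

        // Remove variable content and normalize whitespace. The result need not
        // be valid HTML, because it is only fed to a checksum.
        String canonical = html
            .replaceAll("(?s)<aside class=\"recommendations\">.*?</aside>", "")
            .replaceAll("\\s+", " ")
            .trim();

        return new ByteArrayInputStream(canonical.getBytes(StandardCharsets.UTF_8));
    }

    // Hypothetical convenience for trying the filter on a string.
    static String demo(String html) {
        InputStream raw = new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
        try (InputStream out =
                 new PublisherXHtmlHashFilterFactory().createFilteredInputStream(null, raw, "UTF-8")) {
            return new String(out.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For example, PublisherXHtmlHashFilterFactory.demo applied to a page containing an article paragraph followed by an aside of class recommendations returns only the paragraph. The ArchivalUnit argument is unused here, but a real filter could consult it for per-AU configuration.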