7.3. File Metadata Extractor
- Plugin Key
mediatype is a media type like
- Plugin Value Type
The values are the fully-qualified name of a Java class implementing the
<entry>
  <string>text/html_metadata_extractor_factory_map</string>
  <map>
    <entry>
      <string>*</string>
      <string>edu.example.plugin.publisherx.PublisherXHtmlMetadataExtractorFactory</string>
    </entry>
  </map>
</entry>
If the media type is represented under multiple guises in the plugin's AUs, for example XML represented as both
application/xml, you will need multiple entries in the plugin.
File metadata extractors are part of the metadata extraction pipeline. Their function is to parse the contents of a particular URL based on its media type and file format, and emit any number of
ArticleMetadata metadata records, and they are invoked as part of the execution of an Article Metadata Extractor.
The org.lockss.extractor.SimpleFileMetadataExtractor utility class is used as a base class for the common case where a file metadata extractor produces a single metadata record (or
null), rather than an arbitrary number of metadata records. It defines one abstract method:
public abstract ArticleMetadata extract(MetadataTarget target, CachedUrl cu) throws IOException, PluginException;
The extract(MetadataTarget target, CachedUrl cu, Emitter emitter) method simply calls
extract(MetadataTarget target, CachedUrl cu) and emits the returned
ArticleMetadata if it is not
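The delegation described above can be sketched in plain Java. This is only an illustration of the pattern, not the LOCKSS implementation: the Metadata and Emitter types below are hypothetical stand-ins for ArticleMetadata and the real emitter, the URL and content are passed as strings instead of a CachedUrl, and the sketch assumes that a null result means "nothing to emit".

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the single-record pattern; every type here is a hypothetical stand-in. */
final class SingleRecordSketch {

  /** Stand-in for a metadata record (not the LOCKSS ArticleMetadata class). */
  static final class Metadata {
    final String title;

    Metadata(String title) {
      this.title = title;
    }
  }

  /** Stand-in for the emitter callback. */
  interface Emitter {
    void emit(String url, Metadata md);
  }

  /** Base class: subclasses only implement the one-record method. */
  abstract static class SingleRecordExtractor {
    /** Returns one record, or null when nothing can be extracted. */
    abstract Metadata extract(String url, String content);

    /** Calls the one-record method and emits the result unless it is null. */
    final void extract(String url, String content, Emitter emitter) {
      Metadata md = extract(url, content);
      if (md != null) {
        emitter.emit(url, md);
      }
    }
  }

  /** Toy subclass: treats the first non-blank line as the title. */
  static final class FirstLineExtractor extends SingleRecordExtractor {
    @Override
    Metadata extract(String url, String content) {
      return content.lines()
          .map(String::strip)
          .filter(line -> !line.isEmpty())
          .findFirst()
          .map(Metadata::new)
          .orElse(null);
    }
  }

  /** Runs the toy extractor and collects the titles that were emitted. */
  static List<String> emittedTitles(String url, String content) {
    List<String> titles = new ArrayList<>();
    new FirstLineExtractor().extract(url, content, (u, md) -> titles.add(md.title));
    return titles;
  }

  public static void main(String[] args) {
    System.out.println(emittedTitles("http://example.com/a", "A Title\nBody text"));
    System.out.println(emittedTitles("http://example.com/b", "   \n"));
  }
}
```

The benefit of this shape is that a subclass never has to touch the emitter: it returns a record or null, and the base class decides whether anything is emitted.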
Utility classes based on
The org.lockss.extractor.JsoupTagExtractor utility class can be used to build HTML or XML file metadata extractors that use the jsoup parser.
By default, it maps the value of the
name attribute of HTML
<meta> tags to the value of their
content attribute in the
ArticleMetadata object's raw multi-map.
However, if the media type is
application/xhtml+xml, or if the extractor is created with selector strings, then for each selector string and each element that the selector string matches, it maps the selector string to the selector value in the raw multi-map. The selector strings are those understood by the
select(...) method of jsoup's
Subclasses provide the recipe multi-map (cook map) to process raw data into metadata.
The org.lockss.extractor.JsoupXmlTagExtractor class exists, but its functionality has been absorbed into
org.lockss.extractor.JsoupTagExtractor, which is capable of handling HTML without selector strings as well as HTML and XML with selector strings. It may be removed in a future version of the LOCKSS system and should not be used for new plugin implementations -- use
The org.lockss.extractor.SimpleHtmlMetaTagMetadataExtractor class also exists and scrapes HTML
<meta> tags using a regular expression-based approach. It is at risk of being deprecated in a future version of the LOCKSS system, and is not recommended for new plugin implementations -- use
The org.lockss.extractor.RisMetadataExtractor utility class parses RIS metadata files (media type
By default, it maps RIS tags to their values in the
ArticleMetadata object's raw multi-map, and its recipe map (cook map) maps the following raw keys (RIS tags) to the following
T1 to the article title (
AU to an author (
JF to the journal title (
DO to the DOI (
PB to the publisher name (
VL to the journal volume (
IS to the journal issue (
SP to the start page (
EP to the end page (
DA to the publication date (
SN to the ISSN (
MetadataField.FIELD_ISSN) for a journal (
TY tag equal to
JOUR) or ISBN (
MetadataField.FIELD_ISBN) for a book (
TY tag equal to
but the behavior is customizable.
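To make the raw-map and cook-map steps concrete, here is a minimal, self-contained sketch in plain Java. It does not use or reproduce RisMetadataExtractor: the class name, the method names and the cooked field labels (such as "article title" and "issn") are hypothetical, and only the journal case (TY equal to JOUR) is handled for SN. It relies on the usual RIS line layout of a two-character tag, a separator containing a hyphen, and a value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of RIS tag handling; not the LOCKSS RisMetadataExtractor. */
final class RisSketch {

  // A RIS line looks like "T1  - Some title": tag, whitespace, hyphen, value.
  private static final Pattern RIS_LINE =
      Pattern.compile("^([A-Z][A-Z0-9])\\s+-\\s*(.*)$");

  // Hypothetical cooked labels standing in for the real metadata fields.
  private static final Map<String, String> COOK_MAP = Map.ofEntries(
      Map.entry("T1", "article title"),
      Map.entry("AU", "author"),
      Map.entry("JF", "journal title"),
      Map.entry("DO", "doi"),
      Map.entry("PB", "publisher"),
      Map.entry("VL", "volume"),
      Map.entry("IS", "issue"),
      Map.entry("SP", "start page"),
      Map.entry("EP", "end page"),
      Map.entry("DA", "publication date"));

  /** Builds the raw multi-map: each RIS tag maps to all of its values, in order. */
  static Map<String, List<String>> parseRaw(String ris) {
    Map<String, List<String>> raw = new LinkedHashMap<>();
    for (String line : ris.split("\\R")) {
      Matcher m = RIS_LINE.matcher(line.strip());
      if (m.matches()) {
        raw.computeIfAbsent(m.group(1), k -> new ArrayList<>()).add(m.group(2).strip());
      }
    }
    return raw;
  }

  /** Applies the cook map to the raw multi-map. */
  static Map<String, List<String>> cook(Map<String, List<String>> raw) {
    Map<String, List<String>> cooked = new LinkedHashMap<>();
    raw.forEach((tag, values) -> {
      String field = COOK_MAP.get(tag);
      if (field != null) {
        cooked.computeIfAbsent(field, k -> new ArrayList<>()).addAll(values);
      }
    });
    // SN is treated as an ISSN only for journals; the book case is left out of this sketch.
    if (raw.getOrDefault("TY", List.of()).contains("JOUR") && raw.containsKey("SN")) {
      cooked.put("issn", new ArrayList<>(raw.get("SN")));
    }
    return cooked;
  }

  public static void main(String[] args) {
    String ris = "TY  - JOUR\nT1  - A Title\nAU  - Doe, Jane\nSN  - 1234-5678\nER  - \n";
    System.out.println(cook(parseRaw(ris)));
  }
}
```

Keeping the raw map separate from the cooked map mirrors the two-stage design described above: the parser records everything it sees, and the cook map alone decides which tags become metadata.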
Because the LOCKSS Program processes large amounts of bulk content on behalf of the CLOCKSS Archive, often in the form of bundles of content with multi-article metadata in XML (for example JATS format), there are utility classes in the
org.lockss.plugin.clockss package of the plugins tree of the lockss-daemon project to generalize this kind of data processing.
Plugins can only reference classes found in the plugin JAR itself, in lockss-core and in its dependencies (if using the re-architected LOCKSS system), or in the main tree of
lockss-daemon and in its dependencies (if using the classic LOCKSS system), so these classes in the plugins tree of
lockss-daemon are not directly accessible to arbitrary plugins (without some manipulation, like injecting additional classes in plugin JARs). However, there is growing interest in re-using these utility classes in the broader LOCKSS community, so some of these classes will be "promoted" to
lockss-core so they can be used by third-party plugins in a future version of the LOCKSS system.
org.lockss.plugin.clockss.SourceXmlSchemaHelper classes define a framework for processing XML metadata in some format, and mapping from XPath expressions to text values in the
ArticleMetadata object's raw multi-map. The format-specific logic is confined in the
SourceXmlSchemaHelper class consists of a global map and an article map. Both map XPath strings to the corresponding values. The article map, aided by the
getArticleNode() method which gives an XPath for the top-level node of each article in the XML file, is used to designate XPaths for each emitted article from the file. The optional global map is used to designate XPaths that apply to all emitted articles from the file, and can be used for XML formats that hoist some data above the level of each article (for instance publication-level or issue-level data).
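The interplay of the article node, the article map and the global map can be sketched with the JDK's own javax.xml.xpath API. This is a simplified illustration and not the org.lockss.plugin.clockss code: the class and method names are hypothetical, the two "maps" are reduced to lists of XPath strings, and every value is taken as plain text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Sketch of global and per-article XPath evaluation; names are hypothetical. */
final class XPathMapsSketch {

  /** Returns one raw map (XPath string to text value) per article node. */
  static List<Map<String, String>> extract(String xml, String articleNodeXPath,
      List<String> globalXPaths, List<String> articleXPaths) {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      XPath xpath = XPathFactory.newInstance().newXPath();

      // Global XPaths are evaluated once, against the whole document.
      Map<String, String> global = new LinkedHashMap<>();
      for (String expr : globalXPaths) {
        String value = xpath.evaluate(expr, doc).strip();
        if (!value.isEmpty()) {
          global.put(expr, value);
        }
      }

      // Article XPaths are evaluated relative to each article node.
      NodeList articles =
          (NodeList) xpath.evaluate(articleNodeXPath, doc, XPathConstants.NODESET);
      List<Map<String, String>> records = new ArrayList<>();
      for (int i = 0; i < articles.getLength(); i++) {
        Node article = articles.item(i);
        Map<String, String> raw = new LinkedHashMap<>(global);
        for (String expr : articleXPaths) {
          String value = xpath.evaluate(expr, article).strip();
          if (!value.isEmpty()) {
            raw.put(expr, value);
          }
        }
        records.add(raw);
      }
      return records;
    } catch (Exception e) {
      // Parsing and XPath errors are wrapped to keep the sketch's signature simple.
      throw new IllegalArgumentException("Could not process XML", e);
    }
  }

  public static void main(String[] args) {
    String xml = "<issue><journal>J. Examples</journal>"
        + "<article><title>One</title></article>"
        + "<article><title>Two</title></article></issue>";
    System.out.println(
        extract(xml, "/issue/article", List.of("/issue/journal"), List.of("title")));
  }
}
```

In this sketch the issue-level journal name is hoisted above the articles, so it is listed as a global XPath and copied into every emitted record, while the title is resolved separately under each article node.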
This framework also offers some features to perform deduplication or recombination, verify some URLs or file paths, and
getCookMap() method provides the recipe multi-map to produce metadata from the raw multi-map.
There is also an effort underway to define an equivalent framework for similarly structured metadata in JSON, using the Jayway JsonPath library.