
3.15. Link Extractor

Note

This page is under construction.

Plugin Key

mediatype_link_extractor_factory, where mediatype is a media type such as text/html.

Plugin Value Type

String

Plugin Value Format

The value is the fully qualified name of a Java class implementing the org.lockss.plugin.LinkExtractorFactory interface.

Sample

<entry>
  <string>text/html_link_extractor_factory</string>
  <string>edu.example.plugin.publisherx.PublisherXHtmlLinkExtractorFactory</string>
</entry>

Description

The LOCKSS software comes with built-in code to extract URLs from HTML and CSS files encountered during the crawl of an AU. Each URL extracted this way is first passed through the URL Normalizer, and the Crawl Rules then determine whether it should in turn be included in the AU. If URLs need to be extracted from other file types, or if the extraction behavior for built-in types like HTML and CSS needs to be extended or customized, use this plugin feature to point the plugin at custom link extraction code.
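
The flow described above (a per-media-type factory supplies an extractor, each extracted URL is normalized, and a crawl rule then decides whether it is included) can be illustrated with a minimal, self-contained sketch. It does not use the LOCKSS libraries: the nested interfaces SketchLinkExtractor and SketchLinkExtractorFactory are simplified, hypothetical stand-ins rather than the real LOCKSS interfaces, and the JSON "url" field, the fragment-stripping normalizer, and the base URL prefix rule are all assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Self-contained sketch; the nested types are hypothetical stand-ins, not the LOCKSS API. */
class LinkExtractorSketch {

  /** Hypothetical stand-in for a link extractor: finds URLs in one document. */
  interface SketchLinkExtractor {
    List<String> extractUrls(String content);
  }

  /** Hypothetical stand-in for a factory that supplies an extractor for a media type. */
  interface SketchLinkExtractorFactory {
    SketchLinkExtractor createLinkExtractor(String mediaType);
  }

  /** Extracts absolute http(s) URLs from a made-up "url" field in JSON-like text. */
  static class JsonUrlFieldExtractor implements SketchLinkExtractor {
    private static final Pattern URL_FIELD =
        Pattern.compile("\"url\"\\s*:\\s*\"(https?://[^\"]+)\"");

    @Override
    public List<String> extractUrls(String content) {
      List<String> urls = new ArrayList<>();
      Matcher m = URL_FIELD.matcher(content);
      while (m.find()) {
        urls.add(m.group(1));
      }
      return urls;
    }
  }

  /** Factory that a plugin might register for a non-built-in media type (assumed name). */
  static class PublisherXJsonLinkExtractorFactory implements SketchLinkExtractorFactory {
    @Override
    public SketchLinkExtractor createLinkExtractor(String mediaType) {
      return new JsonUrlFieldExtractor();
    }
  }

  /** Stand-in for URL normalization: drops any "#fragment" suffix. */
  static String normalize(String url) {
    int hash = url.indexOf('#');
    return hash < 0 ? url : url.substring(0, hash);
  }

  /** Stand-in for a crawl rule: include only URLs under the AU's base URL. */
  static boolean isIncluded(String url, String baseUrl) {
    return url.startsWith(baseUrl);
  }

  /** Extract, then normalize, then apply the crawl rule, in that order. */
  static List<String> discover(SketchLinkExtractorFactory factory, String mediaType,
                               String content, String baseUrl) {
    List<String> included = new ArrayList<>();
    for (String raw : factory.createLinkExtractor(mediaType).extractUrls(content)) {
      String url = normalize(raw);
      if (isIncluded(url, baseUrl)) {
        included.add(url);
      }
    }
    return included;
  }

  public static void main(String[] args) {
    String json = "{\"items\":[{\"url\":\"http://www.example.com/a.pdf#page=2\"},"
        + "{\"url\":\"http://other.example.org/b\"}]}";
    System.out.println(discover(new PublisherXJsonLinkExtractorFactory(),
        "application/json", json, "http://www.example.com/"));
    // prints [http://www.example.com/a.pdf]
  }
}
```

In a real plugin, the factory class would implement the LinkExtractorFactory interface named under Plugin Value Format, and normalization and crawl rule evaluation would be performed by the LOCKSS software rather than by the plugin's extractor.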


© Copyright 2000-2026, LOCKSS Program.
