3.15. Link Extractor

Note

This page is under construction.

Plugin Key

mediatype_link_extractor_factory, where mediatype is a media type like text/html

Plugin Value Type

String

Plugin Value Format

The value is the fully qualified name of a Java class implementing the org.lockss.plugin.LinkExtractorFactory interface.

Sample

<entry>
  <string>text/html_link_extractor_factory</string>
  <string>edu.example.plugin.publisherx.PublisherXHtmlLinkExtractorFactory</string>
</entry>

Description

The LOCKSS software includes built-in code to extract URLs from HTML and CSS files encountered during the crawl of an archival unit (AU). Each URL extracted this way is first passed through the URL Normalizer, and the Crawl Rules then determine whether it should in turn be included in the AU. If URLs need to be extracted from other file types, or if the extraction behavior for built-in types such as HTML and CSS needs to be extended or customized, use this plugin feature to point the plugin at custom link extraction code.
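
To make the factory idea concrete, the following is a minimal, self-contained sketch of the general pattern: a factory returns an extractor for a media type, and the extractor reports each link it finds through a callback. The interfaces below (LinkCallback, SimpleLinkExtractor, SimpleLinkExtractorFactory) are simplified stand-ins invented for this illustration and are not the actual LOCKSS interfaces; the class names, the application/xml media type, and the href-based matching are likewise assumptions. URL normalization and crawl rules are not modeled here.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractorSketch {

  /** Simplified stand-in for a "found a link" callback (NOT the LOCKSS API). */
  interface LinkCallback {
    void foundLink(String url);
  }

  /** Simplified stand-in for a link extractor (NOT the LOCKSS API). */
  interface SimpleLinkExtractor {
    void extractUrls(InputStream in, String encoding, String srcUrl, LinkCallback cb)
        throws IOException;
  }

  /** Simplified stand-in for a factory that picks an extractor per media type. */
  interface SimpleLinkExtractorFactory {
    SimpleLinkExtractor createLinkExtractor(String mediaType);
  }

  /** Reports every double-quoted href="..." value, resolved against the source URL. */
  static class HrefLinkExtractor implements SimpleLinkExtractor {
    private static final Pattern HREF = Pattern.compile("href\\s*=\\s*\"([^\"]+)\"");

    @Override
    public void extractUrls(InputStream in, String encoding, String srcUrl, LinkCallback cb)
        throws IOException {
      URI base = URI.create(srcUrl);
      BufferedReader reader = new BufferedReader(new InputStreamReader(in, encoding));
      String line;
      while ((line = reader.readLine()) != null) {
        Matcher m = HREF.matcher(line);
        while (m.find()) {
          try {
            // Resolve relative references against the URL of the file being parsed.
            cb.foundLink(base.resolve(m.group(1)).toString());
          } catch (IllegalArgumentException e) {
            // Skip values that are not syntactically valid URI references.
          }
        }
      }
    }
  }

  /** Hypothetical factory: returns the same extractor for any media type. */
  static class HrefLinkExtractorFactory implements SimpleLinkExtractorFactory {
    @Override
    public SimpleLinkExtractor createLinkExtractor(String mediaType) {
      return new HrefLinkExtractor();
    }
  }

  /** Convenience helper: runs the extractor over a string and collects the links. */
  static List<String> extractAll(String content, String srcUrl) {
    List<String> links = new ArrayList<>();
    SimpleLinkExtractor extractor =
        new HrefLinkExtractorFactory().createLinkExtractor("application/xml");
    try {
      InputStream in = new ByteArrayInputStream(content.getBytes("UTF-8"));
      extractor.extractUrls(in, "UTF-8", srcUrl, links::add);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return links;
  }

  public static void main(String[] args) {
    String xml = "<toc>\n  <item href=\"article1.pdf\"/>\n  <item href=\"/img/a.png\"/>\n</toc>";
    for (String url : extractAll(xml, "http://www.example.com/journal/toc.xml")) {
      System.out.println(url);
    }
  }
}
```

In an actual plugin, the factory would instead implement the interface named under Plugin Value Format above, and its fully qualified class name would be registered under the relevant mediatype_link_extractor_factory key, as in the Sample. A regular expression is used here only to keep the sketch short; a real extractor for a structured format would more likely use a proper parser.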

© Copyright 2000-2023, LOCKSS Program. Revision df5b25c2.