3. Crawl Control

This chapter introduces the plugin features that define and control how content is crawled.

Chapter Table of Contents

  • 3.1. Start URLs
  • 3.2. Crawl Seed
  • 3.3. Permission URLs
  • 3.4. Per-Host Permission Path
  • 3.5. Permitted Host Pattern
  • 3.6. Crawl Rules
  • 3.7. Crawl Window
  • 3.8. Recrawl Interval
  • 3.9. Refetch Depth
  • 3.10. Fetch Pause Time
  • 3.11. Crawl Rate Limiter
  • 3.12. Crawl Pool
  • 3.13. Response Handler
  • 3.14. URL Normalizer
  • 3.15. Link Extractor
  • 3.16. Crawl Filter
  • 3.17. URL Fetcher
  • 3.18. URL Consumer

© Copyright 2000-2023, LOCKSS Program. Revision df5b25c2.
