Interface DocumentParser

All Known Implementing Classes:
JWeaverDocumentParser

public interface DocumentParser
The DocumentParser interface defines methods for extracting the title, the main content body, and the links from the HTML of a web page. Each method takes the page's HTML together with the URI of the page it came from.
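
The following is a minimal sketch of how calling code might consume a DocumentParser, not code taken from this library. The interface is redeclared from the signatures documented on this page so that the example compiles on its own; ParsedPage and PageIndexer are hypothetical names introduced only for illustration.

```java
import java.util.Set;

// Redeclared from the signatures documented on this page so the sketch is self-contained.
interface DocumentParser {
    String parseTitle(String htmlBody, String pageUri);
    String parseBody(String htmlBody, String pageUri);
    Set<String> parseLinks(String htmlBody, String pageUri);
}

// Hypothetical value holder for the three pieces of information a parser extracts.
final class ParsedPage {
    final String uri;
    final String title;
    final String body;
    final Set<String> links;

    ParsedPage(String uri, String title, String body, Set<String> links) {
        this.uri = uri;
        this.title = title;
        this.body = body;
        this.links = links;
    }
}

// Hypothetical caller: runs all three extraction methods on one fetched page.
final class PageIndexer {
    static ParsedPage parse(DocumentParser parser, String html, String pageUri) {
        String title = parser.parseTitle(html, pageUri);
        String body = parser.parseBody(html, pageUri);
        Set<String> links = parser.parseLinks(html, pageUri);
        return new ParsedPage(pageUri, title, body, links);
    }
}
```

Because PageIndexer depends only on the interface, any implementation (for example the JWeaverDocumentParser listed above) could be passed in without changing the caller.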
  • Method Summary

    Modifier and Type   Method                                        Description
    String              parseBody(String htmlBody, String pageUri)    Parses the HTML body of a web page and extracts the main content body.
    Set<String>         parseLinks(String htmlBody, String pageUri)   Parses the HTML body of a web page and extracts the links contained within it.
    String              parseTitle(String htmlBody, String pageUri)   Parses the HTML body of a web page and extracts the title.
  • Method Details

    • parseTitle

      String parseTitle(String htmlBody, String pageUri)
      Parses the HTML body of a web page and extracts the title.
      Parameters:
      htmlBody - The HTML body of the web page.
      pageUri - The URI of the web page.
      Returns:
      The title of the web page.
    • parseBody

      String parseBody(String htmlBody, String pageUri)
      Parses the HTML body of a web page and extracts the main content body.
      Parameters:
      htmlBody - The HTML body of the web page.
      pageUri - The URI of the web page.
      Returns:
      The main content body of the web page.
    • parseLinks

      Set<String> parseLinks(String htmlBody, String pageUri)
      Parses the HTML body of a web page and extracts the links contained within it.
      Parameters:
      htmlBody - The HTML body of the web page.
      pageUri - The URI of the web page.
      Returns:
      A set of URIs representing the links found in the web page.
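
The three methods above could be implemented in many ways. The sketch below is one hypothetical implementation based on regular expressions, written only to illustrate the contract; it is not the JWeaverDocumentParser listed above, and the class name RegexDocumentParser is an assumption. It also makes choices this page does not specify: a missing title yields an empty string, the "main content body" is taken to be the visible text inside the body element, and relative links are resolved against pageUri. Regular expressions are not a robust way to parse arbitrary HTML, so a real implementation would more likely rely on an HTML parsing library.

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Redeclared from the signatures documented on this page so the sketch is self-contained.
interface DocumentParser {
    String parseTitle(String htmlBody, String pageUri);
    String parseBody(String htmlBody, String pageUri);
    Set<String> parseLinks(String htmlBody, String pageUri);
}

// Hypothetical regex-based implementation, for illustration only.
class RegexDocumentParser implements DocumentParser {
    private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;
    private static final Pattern TITLE = Pattern.compile("<title[^>]*>(.*?)</title>", FLAGS);
    private static final Pattern BODY = Pattern.compile("<body[^>]*>(.*?)</body>", FLAGS);
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("<(script|style)[^>]*>.*?</\\1>", FLAGS);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern HREF =
            Pattern.compile("<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']*)[\"']", FLAGS);

    @Override
    public String parseTitle(String htmlBody, String pageUri) {
        Matcher m = TITLE.matcher(htmlBody);
        // Assumption: an empty string stands in for a missing title.
        return m.find() ? m.group(1).trim() : "";
    }

    @Override
    public String parseBody(String htmlBody, String pageUri) {
        Matcher m = BODY.matcher(htmlBody);
        // Fall back to the whole input when there is no <body> element.
        String content = m.find() ? m.group(1) : htmlBody;
        content = SCRIPT_OR_STYLE.matcher(content).replaceAll(" ");
        content = TAG.matcher(content).replaceAll(" ");
        // Collapse runs of whitespace left behind by the removed markup.
        return content.replaceAll("\\s+", " ").trim();
    }

    @Override
    public Set<String> parseLinks(String htmlBody, String pageUri) {
        Set<String> links = new LinkedHashSet<>();
        URI base = URI.create(pageUri);
        Matcher m = HREF.matcher(htmlBody);
        while (m.find()) {
            String href = m.group(1).trim();
            // Assumption: empty and same-page fragment links are not reported.
            if (href.isEmpty() || href.startsWith("#")) {
                continue;
            }
            try {
                // Resolve relative references against the URI of the page.
                links.add(base.resolve(href).toString());
            } catch (IllegalArgumentException e) {
                // Skip hrefs that are not syntactically valid URIs.
            }
        }
        return links;
    }
}
```

With this sketch, a page at https://example.com/docs/index.html containing the links /about and page2.html would yield https://example.com/about and https://example.com/docs/page2.html from parseLinks.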