Class JWeaverDocumentParser

java.lang.Object
org.jweaver.crawler.internal.parse.JWeaverDocumentParser
All Implemented Interfaces:
DocumentParser

public final class JWeaverDocumentParser extends Object implements DocumentParser
Parses HTML documents and extracts their title, main content body, and links. This class implements the DocumentParser interface and provides its three parsing methods: parseTitle, parseBody, and parseLinks.
  • Constructor Details

    • JWeaverDocumentParser

      public JWeaverDocumentParser()
      Constructs a new JWeaverDocumentParser instance.
  • Method Details

    • parseTitle

      public String parseTitle(String htmlBody, String pageUri)
      Description copied from interface: DocumentParser
      Parses the HTML body of a web page and extracts the title.
      Specified by:
      parseTitle in interface DocumentParser
      Parameters:
      htmlBody - The HTML body of the web page.
      pageUri - The URI of the web page.
      Returns:
      The title of the web page.
    • parseBody

      public String parseBody(String htmlBody, String pageUri)
      Description copied from interface: DocumentParser
      Parses the HTML body of a web page and extracts the main content body.
      Specified by:
      parseBody in interface DocumentParser
      Parameters:
      htmlBody - The HTML body of the web page.
      pageUri - The URI of the web page.
      Returns:
      The main content body of the web page.
    • parseLinks

      public Set<String> parseLinks(String htmlBody, String pageUri)
      Description copied from interface: DocumentParser
      Parses the HTML body of a web page and extracts the links contained within it.
      Specified by:
      parseLinks in interface DocumentParser
      Parameters:
      htmlBody - The HTML body of the web page.
      pageUri - The URI of the web page.
      Returns:
      A set of URIs representing the links found in the web page.
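
The three methods above share one calling convention: the raw HTML goes in as the first argument and the page URI as the second. The sketch below illustrates that contract only. Because the JWeaver classes are not reproduced here, it declares a local DocumentParser interface that mirrors the documented signatures and a hypothetical regex-based NaiveParser stand-in; neither is the library's actual implementation. Resolving relative links against pageUri is likewise an assumption about what the second parameter is for, not documented behavior.

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParserSketch {

    // Local mirror of the three documented method signatures (assumption:
    // the real interface lives in the JWeaver library and may declare more).
    interface DocumentParser {
        String parseTitle(String htmlBody, String pageUri);
        String parseBody(String htmlBody, String pageUri);
        Set<String> parseLinks(String htmlBody, String pageUri);
    }

    // Hypothetical stand-in, NOT JWeaverDocumentParser. Regexes are enough
    // for this small input but are not a robust way to parse real-world HTML.
    static final class NaiveParser implements DocumentParser {
        private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        private static final Pattern BODY = Pattern.compile(
            "<body[^>]*>(.*?)</body>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
        private static final Pattern TAG = Pattern.compile("<[^>]+>");

        @Override
        public String parseTitle(String htmlBody, String pageUri) {
            Matcher m = TITLE.matcher(htmlBody);
            // Empty string when no <title> element is present.
            return m.find() ? m.group(1).trim() : "";
        }

        @Override
        public String parseBody(String htmlBody, String pageUri) {
            Matcher m = BODY.matcher(htmlBody);
            String inner = m.find() ? m.group(1) : htmlBody;
            // Replace tags with spaces, then collapse runs of whitespace.
            return TAG.matcher(inner).replaceAll(" ").replaceAll("\\s+", " ").trim();
        }

        @Override
        public Set<String> parseLinks(String htmlBody, String pageUri) {
            // Assumption: pageUri serves as the base for relative links.
            // URI.create and resolve throw IllegalArgumentException on malformed input.
            URI base = URI.create(pageUri);
            Set<String> links = new LinkedHashSet<>();
            Matcher m = HREF.matcher(htmlBody);
            while (m.find()) {
                links.add(base.resolve(m.group(1).trim()).toString());
            }
            return links;
        }
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Example Page</title></head>"
            + "<body><h1>Hello</h1><p>See <a href=\"/docs\">docs</a> and "
            + "<a href=\"https://example.org/about\">about</a>.</p></body></html>";
        String pageUri = "https://example.com/index.html";

        DocumentParser parser = new NaiveParser();
        System.out.println(parser.parseTitle(html, pageUri));
        System.out.println(parser.parseBody(html, pageUri));
        System.out.println(parser.parseLinks(html, pageUri));
    }
}
```

With the real library on the classpath, the same three calls would presumably be made on an instance created with the public no-argument constructor, new JWeaverDocumentParser(). How it selects the main content or normalizes links is not specified on this page.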