org.jweaver.crawler.internal.parse.JWeaverDocumentParser

All Implemented Interfaces: DocumentParser

public final class JWeaverDocumentParser extends Object implements DocumentParser

The JWeaverDocumentParser class parses HTML documents and extracts their title, main content body, and links. It implements the DocumentParser interface.
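Because the class is used through the DocumentParser interface, calling code can depend on the interface alone. The following is a minimal sketch of that calling pattern, not code from the library: the `DocumentParser` interface shown here is a hypothetical mirror of the three documented signatures, and `ParsedPage`, `ParserUsageSketch`, `parsePage`, and `cannedParser` are illustrative names that do not exist in JWeaver. A canned stub stands in for the real parser so the example runs without the library on the classpath.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical mirror of the three documented signatures; the real
// DocumentParser interface ships with the JWeaver library.
interface DocumentParser {
    String parseTitle(String htmlBody, String pageUri);
    String parseBody(String htmlBody, String pageUri);
    Set<String> parseLinks(String htmlBody, String pageUri);
}

// Illustrative holder for the results of parsing one page (not part of JWeaver).
final class ParsedPage {
    final String uri;
    final String title;
    final String body;
    final Set<String> links;

    ParsedPage(String uri, String title, String body, Set<String> links) {
        this.uri = uri;
        this.title = title;
        this.body = body;
        this.links = links;
    }
}

final class ParserUsageSketch {
    // Calls all three methods with the same (htmlBody, pageUri) pair.
    static ParsedPage parsePage(DocumentParser parser, String htmlBody, String pageUri) {
        String title = parser.parseTitle(htmlBody, pageUri);
        String body = parser.parseBody(htmlBody, pageUri);
        Set<String> links = parser.parseLinks(htmlBody, pageUri);
        return new ParsedPage(pageUri, title, body, links);
    }

    // Canned stub returning fixed values; assumption for demonstration only.
    static DocumentParser cannedParser() {
        return new DocumentParser() {
            @Override
            public String parseTitle(String htmlBody, String pageUri) {
                return "Example";
            }

            @Override
            public String parseBody(String htmlBody, String pageUri) {
                return "Hello";
            }

            @Override
            public Set<String> parseLinks(String htmlBody, String pageUri) {
                return Collections.singleton(pageUri + "next");
            }
        };
    }

    public static void main(String[] args) {
        ParsedPage page = parsePage(cannedParser(), "<html></html>", "https://example.com/");
        System.out.println(page.title + " -> " + page.links);
    }
}
```

With the library available, an instance created through the documented no-argument constructor would presumably be passed to `parsePage` in place of the stub.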

Constructor Summary

Constructors

JWeaverDocumentParser() - Constructs a new JWeaverDocumentParser instance.

Method Summary

String parseBody(String htmlBody, String pageUri) - Parses the HTML body of a web page and extracts the main content body.

Set<String> parseLinks(String htmlBody, String pageUri) - Parses the HTML body of a web page and extracts the links contained within it.

String parseTitle(String htmlBody, String pageUri) - Parses the HTML body of a web page and extracts the title.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- JWeaverDocumentParser
  
  public JWeaverDocumentParser()
  
  Constructs a new JWeaverDocumentParser instance.

Method Details
- parseTitle
  
  public String parseTitle(String htmlBody, String pageUri)
  
  Description copied from interface: DocumentParser
  
  Parses the HTML body of a web page and extracts the title.
  
  Specified by:
  
  parseTitle in interface DocumentParser
  
  Parameters:
  
  htmlBody - The HTML body of the web page.
  
  pageUri - The URI of the web page.
  
  Returns:
  
  The title of the web page.
- parseBody
  
  public String parseBody(String htmlBody, String pageUri)
  
  Description copied from interface: DocumentParser
  
  Parses the HTML body of a web page and extracts the main content body.
  
  Specified by:
  
  parseBody in interface DocumentParser
  
  Parameters:
  
  htmlBody - The HTML body of the web page.
  
  pageUri - The URI of the web page.
  
  Returns:
  
  The main content body of the web page.
- parseLinks
  
  public Set<String> parseLinks(String htmlBody, String pageUri)
  
  Description copied from interface: DocumentParser
  
  Parses the HTML body of a web page and extracts the links contained within it.
  
  Specified by:
  
  parseLinks in interface DocumentParser
  
  Parameters:
  
  htmlBody - The HTML body of the web page.
  
  pageUri - The URI of the web page.
  
  Returns:
  
  A set of URIs representing the links found in the web page.
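
The three method contracts above can be illustrated with a small stand-in. This is a rough sketch under stated assumptions, not the JWeaver implementation: the class name `RegexParserSketch` is hypothetical, the regex-based extraction is a deliberate simplification (a real HTML parser copes with malformed markup far better), and resolving relative links against `pageUri` is an assumption about why that parameter exists, which this page does not confirm.

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in with the same three signatures; NOT the library's code.
final class RegexParserSketch {
    private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;
    private static final Pattern TITLE = Pattern.compile("<title[^>]*>(.*?)</title>", FLAGS);
    private static final Pattern BODY = Pattern.compile("<body[^>]*>(.*?)</body>", FLAGS);
    private static final Pattern HREF =
            Pattern.compile("<a\\s[^>]*?href\\s*=\\s*\"([^\"]*)\"", FLAGS);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Returns the trimmed text of the first <title> element, or "" if absent.
    public String parseTitle(String htmlBody, String pageUri) {
        Matcher m = TITLE.matcher(htmlBody);
        return m.find() ? m.group(1).trim() : "";
    }

    // Strips tags from the <body> element and collapses runs of whitespace.
    public String parseBody(String htmlBody, String pageUri) {
        Matcher m = BODY.matcher(htmlBody);
        String inner = m.find() ? m.group(1) : htmlBody;
        return TAG.matcher(inner).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    // Collects double-quoted href values of <a> tags, resolved against pageUri
    // (assumed behaviour), in document order and without duplicates.
    public Set<String> parseLinks(String htmlBody, String pageUri) {
        Set<String> links = new LinkedHashSet<>();
        URI base = URI.create(pageUri);
        Matcher m = HREF.matcher(htmlBody);
        while (m.find()) {
            try {
                links.add(base.resolve(m.group(1).trim()).toString());
            } catch (IllegalArgumentException e) {
                // Skip href values that are not valid URI references.
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<html><head><title> Home </title></head><body><h1>Hello</h1>"
                + "<p>World <a href=\"guide.html\">guide</a> "
                + "<a class=\"x\" href=\"/about\">about</a></p></body></html>";
        String uri = "https://example.com/docs/index.html";
        RegexParserSketch parser = new RegexParserSketch();
        System.out.println(parser.parseTitle(html, uri));
        System.out.println(parser.parseBody(html, uri));
        System.out.println(parser.parseLinks(html, uri));
    }
}
```

Returning a set rather than a list matches the documented `Set<String>` return type and means a page that links to the same target several times contributes that URI only once.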
