Class JWeaverDocumentParser
java.lang.Object
org.jweaver.crawler.internal.parse.JWeaverDocumentParser
- All Implemented Interfaces:
DocumentParser
The JWeaverDocumentParser class is responsible for parsing HTML documents to extract relevant
information. It implements the DocumentParser interface and provides implementations to parse the
title, body, and links from HTML content.
-
Constructor Summary
JWeaverDocumentParser() - Constructs a new JWeaverDocumentParser instance.
-
Method Summary
Method - Description
parseBody(String htmlBody, String pageUri) - Parses the HTML body of a web page and extracts the main content body.
parseLinks(String htmlBody, String pageUri) - Parses the HTML body of a web page and extracts the links contained within it.
parseTitle(String htmlBody, String pageUri) - Parses the HTML body of a web page and extracts the title.
-
Constructor Details
-
JWeaverDocumentParser
public JWeaverDocumentParser()
Constructs a new JWeaverDocumentParser instance.
-
-
Method Details
-
parseTitle
Description copied from interface: DocumentParser
Parses the HTML body of a web page and extracts the title.
- Specified by:
parseTitle in interface DocumentParser
- Parameters:
htmlBody - The HTML body of the web page.
pageUri - The URI of the web page.
- Returns:
The title of the web page.
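
To make the parseTitle contract concrete, here is a minimal sketch of title extraction. It is not JWeaver's implementation: the class name TitleSketch, the regex approach, the String return type, and the fallback to pageUri when no title is present are all assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JWeaver: illustrates title extraction only.
class TitleSketch {
    // Matches the first <title>...</title> pair, ignoring case and spanning newlines.
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the whitespace-normalized title text.
    // Assumption: falls back to pageUri when the page has no usable title.
    static String parseTitle(String htmlBody, String pageUri) {
        Matcher m = TITLE.matcher(htmlBody);
        if (m.find()) {
            String title = m.group(1).replaceAll("\\s+", " ").trim();
            if (!title.isEmpty()) {
                return title;
            }
        }
        return pageUri;
    }

    public static void main(String[] args) {
        String html = "<html><head><title> Example\n Page </title></head><body></body></html>";
        System.out.println(parseTitle(html, "https://example.com/")); // prints "Example Page"
    }
}
```

A regex is enough for a sketch, but it is fragile on malformed markup; a real parser would more likely rely on an HTML parsing library.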
-
parseBody
Description copied from interface: DocumentParser
Parses the HTML body of a web page and extracts the main content body.
- Specified by:
parseBody in interface DocumentParser
- Parameters:
htmlBody - The HTML body of the web page.
pageUri - The URI of the web page.
- Returns:
The main content body of the web page.
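
One possible reading of "main content body" is the visible text inside the body element. The sketch below illustrates that reading only; the class name BodySketch, the String return type, and the choice to drop script and style blocks are assumptions, and the actual JWeaver behavior may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JWeaver: reduces a page to its visible body text.
class BodySketch {
    private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;
    private static final Pattern BODY = Pattern.compile("<body[^>]*>(.*?)</body>", FLAGS);
    // Script and style blocks carry no readable content, so they are removed whole.
    private static final Pattern NON_CONTENT =
            Pattern.compile("<(script|style)[^>]*>.*?</\\1>", FLAGS);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // pageUri is accepted to mirror the documented signature; this sketch does not use it.
    static String parseBody(String htmlBody, String pageUri) {
        Matcher m = BODY.matcher(htmlBody);
        // Fall back to the whole input when there is no <body> element.
        String content = m.find() ? m.group(1) : htmlBody;
        content = NON_CONTENT.matcher(content).replaceAll(" ");
        content = TAG.matcher(content).replaceAll(" ");
        return content.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Hello</h1><script>var x = 1;</script>"
                + "<p>World</p></body></html>";
        System.out.println(parseBody(html, "https://example.com/")); // prints "Hello World"
    }
}
```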
-
parseLinks
Description copied from interface: DocumentParser
Parses the HTML body of a web page and extracts the links contained within it.
- Specified by:
parseLinks in interface DocumentParser
- Parameters:
htmlBody - The HTML body of the web page.
pageUri - The URI of the web page.
- Returns:
A set of URIs representing the links found in the web page.
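
A plausible reason parseLinks takes pageUri is to resolve relative links against the page's own address. The sketch below shows that idea with java.net.URI. The class name LinkSketch, the Set<URI> return type, and the regex-based href matching are assumptions for illustration, not JWeaver's actual code.

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JWeaver: collects anchor targets as absolute URIs.
class LinkSketch {
    // Captures the quoted href value of each <a ...> tag.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    static Set<URI> parseLinks(String htmlBody, String pageUri) {
        URI base = URI.create(pageUri);
        // A set removes duplicate links; LinkedHashSet keeps first-seen order.
        Set<URI> links = new LinkedHashSet<>();
        Matcher m = HREF.matcher(htmlBody);
        while (m.find()) {
            try {
                // Relative references such as "/about" become absolute against the page URI.
                links.add(base.resolve(m.group(1).trim()));
            } catch (IllegalArgumentException e) {
                // Skip href values that are not valid URI syntax.
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/about\">About</a> <a href='guide.html'>Guide</a>"
                + " <a href=\"/about\">About again</a>";
        System.out.println(parseLinks(html, "https://example.com/docs/index.html"));
    }
}
```

Returning a set rather than a list means a page that links to the same target several times yields that target once, which matches the "set of URIs" wording in the Returns description.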
-