Interface DocumentParser
All Known Implementing Classes:
JWeaverDocumentParser
public interface DocumentParser
The DocumentParser interface defines methods for extracting information from HTML
documents. Implementations parse the HTML content of a web page to extract its title,
main content body, and links.
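As a rough illustration of how calling code might depend on this interface, the sketch below declares a local stand-in for it and a canned implementation. The return types (String for the title and body, Set<URI> for the links) are assumptions, since this page does not show them, and FixedParser and summarize are hypothetical names introduced only for the example.

```java
import java.net.URI;
import java.util.Set;

public class DocumentParserSketch {

    // Local stand-in for the documented interface. The return types are assumed;
    // this page only names the methods and their two String parameters.
    interface DocumentParser {
        String parseTitle(String htmlBody, String pageUri);
        String parseBody(String htmlBody, String pageUri);
        Set<URI> parseLinks(String htmlBody, String pageUri);
    }

    // Hypothetical stub that returns canned values, e.g. for testing calling code.
    static class FixedParser implements DocumentParser {
        @Override public String parseTitle(String htmlBody, String pageUri) { return "Example"; }
        @Override public String parseBody(String htmlBody, String pageUri) { return "Hello"; }
        @Override public Set<URI> parseLinks(String htmlBody, String pageUri) {
            return Set.of(URI.create("https://example.com/next"));
        }
    }

    // The caller depends only on the interface, not on a concrete parser.
    static String summarize(DocumentParser parser, String html, String uri) {
        return parser.parseTitle(html, uri) + " (" + parser.parseLinks(html, uri).size() + " links)";
    }

    public static void main(String[] args) {
        System.out.println(summarize(new FixedParser(), "<html></html>", "https://example.com/"));
    }
}
```

Coding against the interface in this way lets a concrete parser such as JWeaverDocumentParser be swapped in without changing the caller.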
Method Summary
Method                                         Description
parseBody(String htmlBody, String pageUri)     Parses the HTML body of a web page and extracts the main content body.
parseLinks(String htmlBody, String pageUri)    Parses the HTML body of a web page and extracts the links contained within it.
parseTitle(String htmlBody, String pageUri)    Parses the HTML body of a web page and extracts the title.
Method Details
parseTitle
Parses the HTML body of a web page and extracts the title.
Parameters:
htmlBody - The HTML body of the web page.
pageUri - The URI of the web page.
Returns:
The title of the web page.
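One possible way to implement title extraction is sketched below with a regular expression. This is not the JWeaverDocumentParser implementation, which this page does not show. The String return type and the fallback to pageUri when no title element exists are both assumptions made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleSketch {

    // Matches the first <title>...</title> pair, case-insensitively, across line breaks.
    private static final Pattern TITLE =
            Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the title with whitespace collapsed. Falling back to the page URI
    // when no title element is found is an assumption, not documented behavior.
    static String parseTitle(String htmlBody, String pageUri) {
        Matcher m = TITLE.matcher(htmlBody);
        if (!m.find()) {
            return pageUri;
        }
        return m.group(1).replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><TITLE>\n  Hello   World </TITLE></head></html>";
        System.out.println(parseTitle(html, "https://example.com/")); // prints "Hello World"
    }
}
```

A regular expression keeps the sketch dependency-free, but it does not decode character entities or handle malformed markup; a real HTML parser would be more robust.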
parseBody
Parses the HTML body of a web page and extracts the main content body.
Parameters:
htmlBody - The HTML body of the web page.
pageUri - The URI of the web page.
Returns:
The main content body of the web page.
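What counts as the "main content body" is left to the implementation. The sketch below takes the simplest reading, assuming it means the visible text inside the body element: it drops script and style blocks, strips the remaining tags, and collapses whitespace. The String return type is an assumption, and pageUri is accepted only to mirror the documented signature.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BodySketch {

    // Captures everything between <body ...> and </body>.
    private static final Pattern BODY =
            Pattern.compile("<body[^>]*>(.*?)</body>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // pageUri is unused in this sketch; it is kept to match the documented parameters.
    static String parseBody(String htmlBody, String pageUri) {
        Matcher m = BODY.matcher(htmlBody);
        // Fall back to the whole input when there is no <body> element.
        String content = m.find() ? m.group(1) : htmlBody;
        return content
                .replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ") // drop non-visible blocks
                .replaceAll("<[^>]+>", " ")                              // strip remaining tags
                .replaceAll("\\s+", " ")                                 // collapse whitespace
                .trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>News</h1><script>var x = 1;</script>"
                + "<p>First  item.</p></body></html>";
        System.out.println(parseBody(html, "https://example.com/")); // prints "News First item."
    }
}
```

An implementation aimed at real pages would likely also remove navigation and footer markup and decode character entities, which this sketch does not attempt.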
parseLinks
Parses the HTML body of a web page and extracts the links contained within it.
Parameters:
htmlBody - The HTML body of the web page.
pageUri - The URI of the web page.
Returns:
A set of URIs representing the links found in the web page.
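The pageUri parameter is most plausibly needed here to turn relative links into absolute ones. The sketch below shows that idea using java.net.URI.resolve. The Set<URI> return type is an assumption based on the "set of URIs" wording, and the choices to skip same-page fragment links and invalid hrefs are illustrative rather than documented behavior.

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinksSketch {

    // Captures the quoted href value of anchor tags.
    private static final Pattern HREF =
            Pattern.compile("<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    static Set<URI> parseLinks(String htmlBody, String pageUri) {
        URI base = URI.create(pageUri);
        Set<URI> links = new LinkedHashSet<>(); // de-duplicates, keeps first-seen order
        Matcher m = HREF.matcher(htmlBody);
        while (m.find()) {
            String href = m.group(1).trim();
            if (href.isEmpty() || href.startsWith("#")) {
                continue; // skip empty hrefs and same-page fragment links
            }
            try {
                links.add(base.resolve(href)); // relative hrefs become absolute
            } catch (IllegalArgumentException e) {
                // skip hrefs that are not syntactically valid URIs
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/about\">A</a> <a class=\"x\" href='guide.html'>B</a> "
                + "<a href=\"https://other.org/\">C</a> <a href=\"#top\">D</a> "
                + "<a href=\"/about\">again</a>";
        parseLinks(html, "https://example.com/docs/index.html").forEach(System.out::println);
    }
}
```

Returning a set means a link that appears several times on the page is reported once, which matches the documented return description.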