# JWeaver Crawler
JWeaver Crawler is an open-source Java library for efficiently extracting text content from websites. It provides essential functionality for tasks such as search engine development, data mining, and content aggregation.
## Getting started
Add dependency: Begin by adding the JWeaver Crawler library as a dependency in your project. You can find the latest version on Maven Central.
Maven:

```xml
<dependency>
    <groupId>org.jweaver</groupId>
    <artifactId>crawler</artifactId>
    <version>1.0.2</version>
</dependency>
```

Gradle:

```groovy
implementation group: 'org.jweaver', name: 'crawler', version: '1.0.2'
```
## Usage
### Quickstart
Create a new crawler with the default configuration by providing a set of URIs, then start crawling. The generated files are written to the /output directory under the current working directory.
```java
var uris = Set.of(
        "https://en.wikipedia.org/wiki/Computer_science",
        "https://crawler-test.com/");

var crawler = JWeaverCrawler.builder().build(uris);
crawler.runParallel();
```
### Configuration
The library provides a wide range of configuration options, allowing fine-tuning of parameters such as the maximum depth, the politeness delay between requests, the export configuration, and the choice of HTTP client. This makes it possible to optimize the crawling process for efficient resource usage and adherence to web server policies.
```java
// var metadataEnabled = true;
// var exportConfig = ExportConfig.exportJson("/output", metadataEnabled);
var exportConfig = ExportConfig.exportMarkdown("/tmp/jweaver/output");

// Customize the HTTP client
var httpClient =
        HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.ALWAYS)
                .version(HttpClient.Version.HTTP_1_1)
                .build();

var crawler =
        JWeaverCrawler.builder()
                .exportConfiguration(exportConfig)
                .httpClient(httpClient)
                .maxDepth(3)
                .politenessDelay(Duration.ofSeconds(2))
                .build(uris);
```
### Supported Types
| Export Type | Metadata | Extension |
|-------------|----------|-----------|
| Markdown    | False    | .md       |
| JSON        | True     | .json     |
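Both export types are selected through `ExportConfig`, as in the configuration example above. As a brief reminder (the output path below is only an illustrative placeholder):

```java
// Markdown export: .md files, no metadata
var markdownConfig = ExportConfig.exportMarkdown("/tmp/jweaver/output");

// JSON export: .json files with metadata enabled
var jsonConfig = ExportConfig.exportJson("/tmp/jweaver/output", true);
```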
### Execution
- Parallel with Java Virtual Threads
- Sequentially
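The quickstart above uses the parallel entry point, `runParallel()`. A minimal sketch of the two modes follows; note that the name of the sequential entry point shown here (`run()`) is an assumption, so verify it against the library's API documentation.

```java
// Parallel execution on Java virtual threads (as in the quickstart)
crawler.runParallel();

// Sequential execution: the method name below is assumed for illustration only;
// check the JWeaverCrawler API for the actual sequential entry point.
// crawler.run();
```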
### Extensibility
The library offers customization options for the extraction process and for file writing.
By implementing the DocumentParser interface, you can replace the internal document parser with a custom one that extracts the content required for internal processing from HTML pages.
```java
public interface DocumentParser {
    String parseTitle(String htmlBody, String pageUri);
    String parseBody(String htmlBody, String pageUri);
    Set<String> parseLinks(String htmlBody, String pageUri);
}
```
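For illustration, a minimal custom parser sketch is shown below. It uses Jsoup, which is an assumption of this example rather than a requirement of the library; any HTML parsing approach that fills in the three methods will do. The class name matches the hypothetical `CustomParserImpl` used in the configuration snippet further down.

```java
import java.util.Set;
import java.util.stream.Collectors;
import org.jsoup.Jsoup;

// Illustrative DocumentParser implementation based on Jsoup (not part of the library).
public class CustomParserImpl implements DocumentParser {

    @Override
    public String parseTitle(String htmlBody, String pageUri) {
        // Use the <title> element as the page title
        return Jsoup.parse(htmlBody, pageUri).title();
    }

    @Override
    public String parseBody(String htmlBody, String pageUri) {
        // Extract the visible text of the page body
        return Jsoup.parse(htmlBody, pageUri).body().text();
    }

    @Override
    public Set<String> parseLinks(String htmlBody, String pageUri) {
        // Collect absolute URLs from all anchor tags
        return Jsoup.parse(htmlBody, pageUri).select("a[href]").stream()
                .map(a -> a.absUrl("href"))
                .filter(href -> !href.isBlank())
                .collect(Collectors.toSet());
    }
}
```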
You can also implement the JWeaverWriter interface, which defines methods for processing and writing the results of the web crawling process. Implementations of this interface are responsible for handling successfully crawled pages, error information, and the connection maps generated during crawling.
```java
public interface JWeaverWriter {

    // Processes a successfully crawled page and writes the result
    void processSuccess(SuccessResultPage successResultPage, ExportConfig exportConfiguration);

    // Processes errors encountered during crawling and writes error information
    void processErrors(
            String baseUri, List<NodeError> nodeErrorList, ExportConfig exportConfiguration);

    // Processes connection map information generated during crawling and writes it
    void processConnectionMap(
            String baseUri, List<Connection> connections, ExportConfig exportConfiguration);
}
```
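As a sketch, a writer that simply logs each callback is shown below. It deliberately assumes nothing about the fields of `SuccessResultPage`, `NodeError`, or `Connection` beyond what the interface above exposes; a real implementation would replace the log statements with actual file writing. The class name matches the hypothetical `MyFileWriter` used in the configuration snippet below.

```java
import java.util.List;

// Illustrative JWeaverWriter implementation that only logs results (not part of the library).
public class MyFileWriter implements JWeaverWriter {

    @Override
    public void processSuccess(SuccessResultPage successResultPage, ExportConfig exportConfiguration) {
        // A real implementation would write the page content to a file here
        System.out.println("Crawled page: " + successResultPage);
    }

    @Override
    public void processErrors(
            String baseUri, List<NodeError> nodeErrorList, ExportConfig exportConfiguration) {
        // A real implementation would persist the error details
        System.out.println("Errors for " + baseUri + ": " + nodeErrorList.size());
    }

    @Override
    public void processConnectionMap(
            String baseUri, List<Connection> connections, ExportConfig exportConfiguration) {
        // A real implementation would export the link graph
        System.out.println("Connections from " + baseUri + ": " + connections.size());
    }
}
```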
Provide the implementations when configuring the crawler:
```java
// class CustomParserImpl implements DocumentParser {}
var myParserImpl = new CustomParserImpl();

// class MyFileWriter implements JWeaverWriter {}
var myFileWriter = new MyFileWriter();

var crawler = JWeaverCrawler.builder()
        .parser(myParserImpl)
        .writer(myFileWriter)
        .build(uris);
```
## License
This project is licensed under the GNU General Public License v3.0 (GPL-3.0).