# javadocs-scraper

A TypeScript library to scrape information about Java objects from a Javadocs website.
Specifically, it scrapes data (name, description, URL, etc.) about, and links together:

- Packages
- Classes
- Interfaces
- Enums
- Annotations
- Methods and fields

Some extra data is also calculated post-scraping, like method and field inheritance.
Tested with Javadocs generated from Java 7 to Java 21. I cannot guarantee this will work with older or newer versions.
## Installation and Usage

```bash
npm install javadocs-scraper
# or
yarn add javadocs-scraper
# or
pnpm add javadocs-scraper
```

First, instantiate a `Scraper`:

```ts
import { Scraper } from 'javadocs-scraper';

const scraper = Scraper.fromURL('https://...');
```
This package uses constructor dependency injection for every class. You can also instantiate `Scraper` with the `new` keyword, but you'll need to specify every dependency manually. The easiest way is the `Scraper.fromURL()` method, which uses the default implementations.
Alternatively, you can provide your own `Fetcher` to fetch data from the Javadocs:

```ts
import type { Fetcher } from 'javadocs-scraper';

class MyFetcher implements Fetcher {
  /** ... */
}

const myFetcher = new MyFetcher('https://...');
const scraper = Scraper.with({ fetcher: myFetcher });
```
Then, use the `Scraper` to scrape information:

```ts
const javadocs: Javadocs = await scraper.scrape();

/** for example */
const myInterface = javadocs.getInterface('org.example.Interface');
```
The `Javadocs` object uses discord.js' `Collection` class to store all the scraped data. This is an extension of `Map` with utility methods, like `find()`, `reduce()`, etc.

These collections are also typed as mutable, so any modification will be reflected in the backing `Javadocs`. This is by design: the library no longer uses this object once it's given to you, and doesn't care what you then do with it.

Check the discord.js guide or the `Collection` docs for more info.
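For instance, here's a minimal sketch of querying those collections. The `getInterfaces()` accessor, the `name` field, and the string keys used below are assumptions for illustration, not confirmed API; check the library's typings for the actual names:

```ts
// Hypothetical accessor returning a Collection of scraped interfaces.
const interfaces = javadocs.getInterfaces();

// Collection extends Map, so the usual Map API works:
console.log(interfaces.size);

// ...plus discord.js utility methods like filter() and find():
const listeners = interfaces.filter((i) => i.name.endsWith('Listener'));

// The collections are mutable: this also mutates the backing Javadocs.
interfaces.delete('org.example.Interface');
```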
## Warnings

The `scrape()` method will take a while to scrape the entire website. Make sure to only run it when necessary, ideally only once in the entire program's lifecycle.
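One way to follow this advice, as a minimal sketch assuming a long-lived process and that `Javadocs` is exported as a type (as the usage example above suggests): scrape lazily, once, and reuse the cached result.

```ts
import { Scraper, type Javadocs } from 'javadocs-scraper';

let cached: Javadocs | null = null;

/** Scrapes on first call, then returns the same Javadocs instance. */
async function getJavadocs(): Promise<Javadocs> {
  if (!cached) {
    const scraper = Scraper.fromURL('https://...');
    cached = await scraper.scrape();
  }
  return cached;
}
```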
## Specifics

There are distinct types of objects that hold the library together:

- `Fetcher`¹, which makes requests to the Javadocs website.
- `Entities`², which represent a scraped object.
- `QueryStrategies`¹, which query the website through cheerio. Needed since HTML classes and ids change between Javadoc versions.
- `Scrapers`¹, which scrape information from a given URL or cheerio object, to a partial object.
- `Partials`², which represent a partially scraped object, that is, an object without circular references to other objects.
- `ScraperCache`, which caches partial objects in memory.
- `Patchers`¹, which patch partials to make them full entities, by linking them together.
- `Javadocs`, which is the final result of the scraping process.

¹ Replaceable via constructor injection.
² Only a type, not available at runtime.
The scraping process occurs in the following steps:

1. A `QueryStrategy` is chosen by the `QueryStrategyFactory`.
2. The `RootScraper` iterates through every package in the Javadocs root and passes each one to the `PackageScraper`.
3. The `PackageScraper` iterates through every class, interface, enum and annotation in the package and passes them to the appropriate `Scraper`.
4. Each `Scraper` scrapes its object into a partial, which is stored in the `ScraperCache`.
5. The `Scraper` uses the `Patchers` to patch the partial objects together, by passing the cache to each patcher.
6. The `Scraper` returns the patched objects in a `Javadocs` object.
You can provide your own `QueryStrategyFactory` to change the way the `QueryStrategy` is chosen:

```ts
import { OnlineFetcher } from 'javadocs-scraper';

// MyQueryStrategyFactory is your own QueryStrategyFactory implementation,
// defined similarly to MyFetcher above.
const myFetcher = new OnlineFetcher('https://...');
const factory = new MyQueryStrategyFactory();

const scraper = Scraper.with({
  fetcher: myFetcher,
  queryStrategyFactory: factory
});
```