Usage

Download WARC files

Crawled websites in the Common Crawl dataset are stored in the WARC (Web ARChive) format. Common Crawl’s storage is currently divided into 64,700 parts, each containing on average 65,400 WARC records.

Hostnames of universities

The file

vendor/world-universities-csv/world-universities.csv

contains a list of about 9,300 university websites. It is a clone of endSly/world-universities-csv from GitHub.
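Extracting the bare hostnames from that CSV can be sketched as below. This is an illustrative snippet, not code from the repository; it assumes the common three-column layout (country code, university name, website URL) with no header row.

```python
import csv
from urllib.parse import urlparse

def read_hostnames(rows):
    """Extract bare hostnames from CSV rows of (country, name, url).

    Assumes the third column holds the website URL; entries without
    a scheme are given one so urlparse can find the netloc.
    """
    hostnames = []
    for row in csv.reader(rows):
        if len(row) < 3:
            continue
        url = row[2].strip()
        if "://" not in url:
            url = "http://" + url
        host = urlparse(url).netloc
        if host:
            hostnames.append(host)
    return hostnames
```

With the real file you would pass an open file handle, e.g. `read_hostnames(open("vendor/world-universities-csv/world-universities.csv"))`.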

Get WARC file locations in CC-archive for hostnames

To download the WARC files for a given hostname from the CC-archive, one first has to look up their locations in the archive.

Run:

$ src/index_fetcher.py

This looks up the archive locations of all crawled HTML pages for each host in the world-universities.csv file.
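The lookup described above can be done against Common Crawl's public CDX index server, which answers with one JSON object per capture. The sketch below is an assumption about how such a lookup might work, not the repository's actual implementation; the crawl name `CC-MAIN-2024-10` is a placeholder for any crawl listed at index.commoncrawl.org.

```python
import json
from urllib.parse import urlencode

# Placeholder crawl; substitute any crawl listed at https://index.commoncrawl.org/
INDEX = "https://index.commoncrawl.org/CC-MAIN-2024-10-index"

def build_query(hostname):
    """Build a CDX query URL covering every capture under a hostname."""
    params = {"url": hostname + "/*", "output": "json"}
    return INDEX + "?" + urlencode(params)

def parse_index_response(body):
    """Parse the JSON-lines response into (warc_filename, offset, length) tuples."""
    locations = []
    for line in body.splitlines():
        if not line.strip():
            continue
        rec = json.loads(line)
        locations.append((rec["filename"], int(rec["offset"]), int(rec["length"])))
    return locations
```

Fetching `build_query("www.example.edu")` with any HTTP client yields the response body that `parse_index_response` turns into download locations.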

Download WARC files from CC-archive

After fetching the locations use:

$ src/download_warc.py

to download the WARC files from the locations fetched in the previous step. Set NUM_PARALLEL_JOBS to adjust how many download jobs run in parallel.
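Each index entry points at a gzip-compressed WARC record inside a large archive file, so a single record can be fetched with an HTTP Range request against Common Crawl's data endpoint. This is a minimal sketch of that idea, not the repository's `download_warc.py`:

```python
import gzip
from urllib.request import Request, urlopen

DATA_PREFIX = "https://data.commoncrawl.org/"

def range_header(offset, length):
    """HTTP Range header covering exactly one compressed WARC record."""
    return {"Range": "bytes={}-{}".format(offset, offset + length - 1)}

def fetch_record(filename, offset, length):
    """Download and decompress a single WARC record from Common Crawl."""
    req = Request(DATA_PREFIX + filename, headers=range_header(offset, length))
    with urlopen(req) as resp:
        return gzip.decompress(resp.read())
```

To parallelize, the calls to `fetch_record` could be submitted to a `concurrent.futures.ThreadPoolExecutor` whose `max_workers` plays the role of NUM_PARALLEL_JOBS.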

Generate Training set