Extract and Analyze Scientists’ Homepages utilizing Common Crawl

This page provides information and documentation about a bachelor project (6 ECTS) carried out in the summer semester of 2017 at the Chair for Algorithms and Data Structures (headed by Hannah Bast), Department of Computer Science, University of Freiburg.

Project description

The goal of this project is to use the open web crawl archive of Common Crawl to find scientists’ personal web pages, and then to extract structured data from those pages, such as each scientist’s name, profession, affiliation, and gender.

What is covered in this project page?

This documentation page describes the approach, results, and code of the project. It should enable you to reproduce the results, use them as a starting point for your own work, or simply get inspiration from how certain parts work.

For an overview, read the Experiments section, which summarizes all steps and results of this project and links to more detailed documentation for each step.

Contents