Which software is behind dblp?

dblp has grown from a small-scale server intended to test web technology and to serve only a small local community to a web site used by thousands of people worldwide. Nevertheless, we still run it with a minimal amount of software.

There is no database management system behind dblp. The information is stored in several million files in the file system. The programs used to maintain dblp are custom tools written in C, Perl, and Java, which are glued together by shell scripts. The "production system" of dblp runs on a small number of UNIX machines (Ubuntu Linux).

The early days of dblp

The initial dblp service was a small collection of tables of contents (tocs) of proceedings and journal volumes. The tocs were written directly in static HTML and linked to a few introduction pages by manually edited hyperlinks.

The next stage was to generate "author pages". An author page lists all publications (co)authored by a person. For an example, see the page of Stefano Ceri. Author pages were statically compiled in two steps. In the first step, all tocs were parsed. The tocs' HTML source code used a standardized format, which made parsing quite simple. The HTML parser from an early version of xmosaic was combined with a simple finite automaton that identified the volume, number, author, title, and page fields within the HTML source. The parser printed all bibliographic information into a single huge text file ("TOC_OUT"), using a line-oriented format similar to the refer format. In the second step, once all parsing had been done, a second program (mkauthors) was started. It read TOC_OUT into a compact main-memory data structure and produced the author page HTML files, an index of all author pages, and the file AUTHORS, which contains all author names.
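
The two-step compilation can be sketched roughly as follows. This is a minimal illustration in Python, not the original C code. The %A/%T/%P field tags follow the general refer convention, but the exact layout of TOC_OUT is an assumption, as are the function names parse_toc_out and group_by_author and the placeholder records.

```python
from collections import defaultdict

# Hypothetical TOC_OUT excerpt: one field per line, blank line between records.
# The real TOC_OUT layout is not documented; this only mimics the refer style.
SAMPLE_TOC_OUT = """\
%A Alice Example
%A Bob Example
%T First placeholder title.
%P 1-10

%A Alice Example
%T Second placeholder title.
%P 11-20
"""


def parse_toc_out(text):
    """Read the line-oriented file into a list of record dicts."""
    records, current = [], None
    for line in text.splitlines():
        if not line.strip():
            current = None  # a blank line closes the current record
            continue
        tag, _, value = line.partition(" ")
        if current is None:
            current = {"authors": [], "title": "", "pages": ""}
            records.append(current)
        if tag == "%A":
            current["authors"].append(value)
        elif tag == "%T":
            current["title"] = value
        elif tag == "%P":
            current["pages"] = value
    return records


def group_by_author(records):
    """Build the in-memory index: author name -> list of titles."""
    index = defaultdict(list)
    for record in records:
        for author in record["authors"]:
            index[author].append(record["title"])
    return dict(index)


index = group_by_author(parse_toc_out(SAMPLE_TOC_OUT))
authors_file = sorted(index)  # plays the role of the AUTHORS file
```

Each entry of index would correspond to one author page, and authors_file to the list of names that the search programs read.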

The files AUTHORS and TOC_OUT were the inputs of author and title, two CGI programs to search dblp. The programs were written in C and worked in a "brute force" manner: they did a simple sequential search for each query.
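
A sequential search of this kind needs no index at all. The sketch below is in Python rather than C, and the case-insensitive substring matching and the function name search_authors are assumptions; the original programs may have matched queries differently.

```python
def search_authors(lines, query):
    """Return every name containing the query, ignoring case (linear scan)."""
    needle = query.lower()
    return [name for name in lines if needle in name.lower()]


# Stand-in for the contents of the AUTHORS file, one name per line.
names = ["Georg Gottlob", "Michael Schrefl", "Stefano Ceri"]
print(search_authors(names, "ceri"))  # prints ['Stefano Ceri']
```

Every query touches every line, so the cost grows linearly with the size of the file. That is acceptable for a small collection.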

The mkauthors program and the search engine produced static HTML pages in which all mentions of an author name were hyperlinked to the corresponding author page. To obtain such links into the tocs, we used a modified version of the toc parser. This program added the links to the original toc page if the corresponding author pages were available. The program was started after each run of mkauthors.

Data as file system records

Originally, we intended to include annotated bibliographies and reading lists for seminars and courses in dblp. This meant that bibliographic meta-data had to be usable from many different locations. To make this feasible, a simple HTML preprocessor, mkhtml, was written. The task of mkhtml is to replace each occurrence of a tag

<cite key="...">

in a pseudo-HTML file by the HTML snippet of bibliographic meta-data information. This mechanism was very similar to the \cite{...} functionality of LaTeX.
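
The substitution step can be illustrated with a short Python sketch. It is not the original mkhtml: the in-memory RECORDS table, the layout produced by render, and the choice to leave unknown keys untouched are all assumptions made for the example.

```python
import html
import re

# Hypothetical in-memory record store, used here only for illustration.
RECORDS = {
    "GottlobSR96": {
        "authors": ["Georg Gottlob", "Michael Schrefl", "Brigitte R\u00f6ck"],
        "title": "Extending Object-Oriented Systems with Roles.",
        "journal": "TOIS", "volume": "14", "number": "3",
        "pages": "268-296", "year": "1996",
    },
}

CITE_TAG = re.compile(r'<cite\s+key="([^"]*)"\s*>')


def render(record):
    """Format one record as an HTML snippet (the layout is an assumption)."""
    authors = ", ".join(html.escape(a) for a in record["authors"])
    return (f'{authors}: {html.escape(record["title"])} '
            f'{record["journal"]} {record["volume"]}({record["number"]}): '
            f'{record["pages"]} ({record["year"]})')


def expand_cites(source):
    """Replace each <cite key="..."> tag by the rendered record."""
    def replace(match):
        record = RECORDS.get(match.group(1))
        # Unknown keys are left untouched so missing records stay visible.
        return render(record) if record else match.group(0)
    return CITE_TAG.sub(replace, source)


print(expand_cites('<li><cite key="GottlobSR96">'))
```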

When implementing mkhtml, we decided to separate the bibliographic records from the tocs. For each paper, a small file containing the bibliographic meta-data was stored in a file system subtree (/dblp/publ/*). BibTeX would have been an obvious format for these files, but parsing BibTeX is hard, and we had no ready-to-use BibTeX parser at hand. Hence, we reused our HTML parser and defined custom tags for the BibTeX record types and field names. Our bibliographic records are still in use today and look like this:

<article key="GottlobSR96">
<author>Georg Gottlob</author>
<author>Michael Schrefl</author>
<author>Brigitte R&#246;ck</author>
<title>Extending Object-Oriented Systems with Roles.</title>
<pages>268-296</pages>
<year>1996</year>
<volume>14</volume>
<journal>TOIS</journal>
<number>3</number>
<url>db/journals/tois/tois14.html#GottlobSR96</url>
</article>

The dawn of XML

A few years later, the XML standard appeared. It turned out that our bibliographic records fit perfectly into the XML framework. In a software lab, undergraduate students configured a Java XML parser available on the internet to read in all records. After we corrected some minor typos that our own parser had not caught, the experiment ran successfully.
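
That the records are well-formed XML is easy to check with any standard parser. The sketch below uses Python's xml.etree.ElementTree instead of the Java parser from the software lab; the function name read_record and the selection of fields are assumptions.

```python
import xml.etree.ElementTree as ET

# The sample record shown above, unchanged.
RECORD = """<article key="GottlobSR96">
<author>Georg Gottlob</author>
<author>Michael Schrefl</author>
<author>Brigitte R&#246;ck</author>
<title>Extending Object-Oriented Systems with Roles.</title>
<pages>268-296</pages>
<year>1996</year>
<volume>14</volume>
<journal>TOIS</journal>
<number>3</number>
<url>db/journals/tois/tois14.html#GottlobSR96</url>
</article>"""


def read_record(xml_text):
    """Parse one bibliographic record into a plain dictionary."""
    root = ET.fromstring(xml_text)
    return {
        "type": root.tag,                 # BibTeX-like record type
        "key": root.get("key"),
        "authors": [a.text for a in root.findall("author")],
        "title": root.findtext("title"),
        "year": root.findtext("year"),
    }


record = read_record(RECORD)
print(record["type"], record["key"], record["authors"])
```

Numeric character references such as &#246; are resolved by the parser without a DTD. Named HTML entities would need to be declared first.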

To avoid redundancy (and maintenance problems), we started using the mkhtml preprocessor for the tocs, too. A table of contents was just another text citing papers. The citation tags now had an optional style attribute to control the HTML appearance of the citations. The mkhtml program generated links for all author names if the corresponding author pages existed. It also knew a few additional tags to produce footers and logos. Local hyperlinks within the dblp web structure were marked by a special reference tag, and mkhtml checked the availability of the destination URL. The source of a toc file looked much like this, although arbitrary HTML markup was possible:

<html><head><title>IEEE Database Engineering Bulletin,
Volume 4</title></head> <body bgcolor="#ffffff" text="#000000" link="#000000"> <logo>
<h1><ref href="db/journals/debu/index.html">IEEE
Database Engineering Bulletin</ref>,
Volume 4</h1><hr>
In 1981 the IEEE-CS Technical Committee on
Database Engineering decided to turn Database
Engineering from a short newsletter into a
theme-driven magazine.
<h2>Volume 4, Number 2, December 1981</h2>
Special Issue on Database Machines
<ul>
<li><cite key="journals/debu/Kim81">
<li><cite key="journals/debu/Song81">
<li><cite key="journals/debu/Hsiao81">
<li><cite key="journals/debu/BoralD81">
<li><cite key="journals/debu/Ubell81">
<li><cite key="journals/debu/Hawthorn81">
<li><cite key="journals/debu/ShawSIHWA81">
<li><cite key="journals/debu/YaoTS81">
<li><cite key="journals/debu/AroraD81">
</ul>
<footer>
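
The availability check for local reference tags might look like the following Python sketch. Here a destination counts as available if a file with that relative path exists under a site root directory. This criterion, the site_root parameter, and the function name missing_refs are assumptions, not a description of what mkhtml actually did.

```python
import os
import re

REF_TAG = re.compile(r'<ref\s+href="([^"]*)"\s*>')


def missing_refs(source, site_root):
    """Return the local hrefs in source that have no file under site_root."""
    missing = []
    for href in REF_TAG.findall(source):
        path = href.split("#", 1)[0]  # drop a fragment, if any
        if not os.path.isfile(os.path.join(site_root, path)):
            missing.append(href)
    return missing
```

Called on the toc source above, missing_refs(source, site_root) would return an empty list if db/journals/debu/index.html exists below site_root, and a list containing that href otherwise.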

For most tocs, we "reconstructed" the source files using simple Perl scripts. The parser to produce the TOC_OUT file was replaced by another parser which collected the information from the bibliographic records.

This description is still incomplete.

Dynamically generated views

TODO
a service of Schloss Dagstuhl - Leibniz Center for Informatics