Crawling

From 33C3_Public_Wiki
Revision as of 17:55, 10 December 2015 by Nexus (Talk)


Over the last few years, the number of people crawling the wiki and the Fahrplan to provide mirrors has grown rapidly, so static dumps of the freely accessible content are now available for direct download. Crawling a MediaWiki is not easy, as it exposes a deep structure of elements of which only a limited number of pages are actually worth crawling. Crawling the dynamic pages for editing and modifying pages causes a lot of traffic and CPU usage on the host, whereas the packages contain only the necessary information; dead links in them are replaced with links back to the Main Page.

All information is gathered directly via the HTTP server of the wiki, which keeps the network traffic low. So, if you decide to provide a local mirror, please do not crawl our wiki and the Fahrplan; download the provided packages instead.

Downloading a Fahrplan-Dump

A file stating the current version and the download link for the Fahrplan is available at http://events.ccc.de/congress/2016/wiki/../Fahrplan/version. While a crawl of the Fahrplan might yield incomplete data, this package always contains the raw data provided by frab.

The content of the file looks like this:

VER: 2014-12-08 15:58 - en: Version 0.91b 2014: A Congress Odyssee
URL: http://events.ccc.de/congress/2015/Fahrplan/b5e0dab9-72ed-4295-bb0b-855c89efc01b.tar.gz

The first line, always beginning with "VER: ", gives the time of the last export, the locales of the export, and the current version. The last line, always beginning with "URL: ", gives the full URI from which the Fahrplan dump can be downloaded as a tar.gz archive.
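As an illustration of the format described above, here is a minimal Python sketch that splits a version file into its two values. The names `VersionInfo` and `parse_version_file` are made up for this example and are not part of any provided tooling; the sketch only assumes the "VER: " and "URL: " prefixes shown in the sample.

```python
from typing import NamedTuple


class VersionInfo(NamedTuple):
    version: str  # everything after "VER: "
    url: str      # everything after "URL: "


def parse_version_file(text: str) -> VersionInfo:
    """Parse the version file into its VER and URL values."""
    version = None
    url = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("VER:"):
            version = line[len("VER:"):].strip()
        elif line.startswith("URL:"):
            url = line[len("URL:"):].strip()
    if version is None or url is None:
        raise ValueError("version file must contain a VER: and a URL: line")
    return VersionInfo(version, url)


# The sample content quoted on this page.
sample = (
    "VER: 2014-12-08 15:58 - en: Version 0.91b 2014: A Congress Odyssee\n"
    "URL: http://events.ccc.de/congress/2015/Fahrplan/"
    "b5e0dab9-72ed-4295-bb0b-855c89efc01b.tar.gz\n"
)
info = parse_version_file(sample)
print(info.version)
print(info.url)
```

The VER value is kept as one opaque string here, since comparing it as a whole is enough to detect a new export.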

This allows you to write a script that automatically downloads the Fahrplan dump and extracts it to a different location. It also ensures that you always retrieve a clean version of the Fahrplan, whereas a crawl might run during an update of the Fahrplan and pick up inconsistent data and broken links.
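Such a script could look roughly like the following Python sketch, which uses only the standard library. The function names and the target directory `fahrplan-mirror` are assumptions made for this example. The check on member paths is an extra precaution added here and is not something the dump requires.

```python
import shutil
import tarfile
import tempfile
import urllib.request
from pathlib import Path


def read_dump_url(version_url: str) -> str:
    """Fetch the version file and return the value of its "URL: " line."""
    with urllib.request.urlopen(version_url) as response:
        text = response.read().decode("utf-8")
    for line in text.splitlines():
        if line.startswith("URL:"):
            return line[len("URL:"):].strip()
    raise ValueError("no URL: line found in version file")


def fetch_and_extract(version_url: str, target_dir: str) -> Path:
    """Download the dump named in the version file and unpack it."""
    target = Path(target_dir).resolve()
    target.mkdir(parents=True, exist_ok=True)
    dump_url = read_dump_url(version_url)
    with tempfile.TemporaryFile() as tmp:
        with urllib.request.urlopen(dump_url) as response:
            shutil.copyfileobj(response, tmp)
        tmp.seek(0)
        with tarfile.open(fileobj=tmp, mode="r:*") as tar:
            # Refuse archive members that would land outside target_dir.
            for member in tar.getmembers():
                dest = (target / member.name).resolve()
                if dest != target and target not in dest.parents:
                    raise ValueError(f"unsafe path in archive: {member.name}")
            tar.extractall(target)
    return target


if __name__ == "__main__":
    # Version file location as given on this page; the target is hypothetical.
    fetch_and_extract(
        "http://events.ccc.de/congress/2016/wiki/../Fahrplan/version",
        "fahrplan-mirror",
    )
```

Run from cron or a similar scheduler, this costs one small request for the version file plus one archive download per run.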

It might also be a good idea for apps using the JSON, XML or iCal format of the Fahrplan to pull this file in order to know whether anything has changed. There will not be any changes without the file being updated, as the structure itself is just a static export of the Fahrplan and the information within the file is used as the reference for a fresh download.
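One possible way to implement such a check is to remember the last seen "VER: " line and compare it on each pull. The sketch below is only an illustration: `has_changed` and the state file name `fahrplan.ver` are invented for this example, and fetching the version file is left to the caller.

```python
import tempfile
from pathlib import Path


def has_changed(version_text: str, state_file: Path) -> bool:
    """Return True if the VER line differs from the one stored in state_file.

    The new VER line is written to state_file whenever it has changed.
    """
    current = None
    for line in version_text.splitlines():
        if line.startswith("VER:"):
            current = line.strip()
            break
    if current is None:
        raise ValueError("no VER: line found in version file")
    previous = None
    if state_file.exists():
        previous = state_file.read_text(encoding="utf-8").strip()
    if current == previous:
        return False
    state_file.write_text(current + "\n", encoding="utf-8")
    return True


# Demo with the sample content from this page and a throwaway state file.
state = Path(tempfile.mkdtemp()) / "fahrplan.ver"
sample = (
    "VER: 2014-12-08 15:58 - en: Version 0.91b 2014: A Congress Odyssee\n"
    "URL: http://events.ccc.de/congress/2015/Fahrplan/"
    "b5e0dab9-72ed-4295-bb0b-855c89efc01b.tar.gz\n"
)
print(has_changed(sample, state))  # True: nothing was stored yet
print(has_changed(sample, state))  # False: same VER line as before
```

An app would refresh its JSON, XML or iCal data only when the function returns True.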

Downloading a Wiki-Dump

Downloading a dump of the wiki works the same way. You will find a version file at http://events.ccc.de/congress/2016/wiki/version. As the wiki does not have a version number, a UUID is used as version information instead.

The version file looks like this:

VER: b656e7f2-d50d-4c6c-9b0c-02ce8e7dcc70
URL: http://events.ccc.de/congress/2015/wiki/download.b656e7f2-d50d-4c6c-9b0c-02ce8e7dcc70.tgz

A full dump of the wiki is provided every three hours. If the version has changed, you can download the file from the URL given in the last line of the version file. The archive contains the /wiki directory, which provides a structure of hashed files. It also contains a /wiki/index.php that hashes the requested URL, opens the appropriate file and presents it to the browser. It should be easy to replace this script with a Python, Ruby or other implementation, and it should even be easy to load the files into a database system.
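To give an idea of what such a replacement might look like, here is a rough Python sketch of the lookup step. The hash function and the file naming are not documented on this page: SHA-1 over the request path, with the hex digest used as the file name, is purely a placeholder assumption, and the real scheme has to be taken from the /wiki/index.php shipped in the archive. The names `hashed_path` and `load_page` are likewise made up for this example.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Optional


def hashed_path(dump_root: Path, url_path: str) -> Path:
    """Map a request path to a file inside the extracted dump.

    ASSUMPTION: SHA-1 hex digest of the path as file name. This is a
    placeholder; check index.php in the archive for the actual scheme.
    """
    digest = hashlib.sha1(url_path.encode("utf-8")).hexdigest()
    return dump_root / digest


def load_page(dump_root: Path, url_path: str) -> Optional[bytes]:
    """Return the stored page content, or None if the path is not in the dump."""
    path = hashed_path(dump_root, url_path)
    if not path.is_file():
        return None
    return path.read_bytes()


# Demo against a throwaway directory standing in for the extracted /wiki.
root = Path(tempfile.mkdtemp())
hashed_path(root, "/wiki/Main_Page").write_bytes(b"<html>Main Page</html>")
print(load_page(root, "/wiki/Main_Page"))
print(load_page(root, "/wiki/Missing"))
```

The same mapping could be used to fill a database table keyed by the digest instead of reading files on each request.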