Difference between revisions of "Static:Crawling"
Revision as of 14:55, 25 October 2017
Over the last few years, the number of people crawling the wiki and the Fahrplan to provide mirrors has grown rapidly, so static dumps of the freely accessible content are now available for direct download. Crawling a MediaWiki is not easy, as it exposes a deep structure of elements in which only a limited number of pages are actually worth crawling. Crawling the dynamic pages for editing and modifying pages causes a lot of traffic and CPU usage on the host, whereas the packages contain only the necessary information, with dead links replaced by links back to the Main Page.
All information is gathered directly via the HTTP server of the wiki, which keeps the network traffic low. So, if you decide to provide a local mirror, please do not crawl our wiki and the Fahrplan; download the provided packages instead.
Downloading a Fahrplan-Dump
You can find a file telling you the current version and the link to download the Fahrplan at https://fahrplan.events.ccc.de/congress/2016/Fahrplan/version. While crawling the Fahrplan might provide incomplete data, this package always contains the raw data provided by the frab.
The content of the file looks like this:
VER: 2014-12-08 15:58 - en: Version 0.91b 2014: A Congress Odyssee
URL: http://events.ccc.de/congress/2015/Fahrplan/b5e0dab9-72ed-4295-bb0b-855c89efc01b.tar.gz
The first line, beginning with "VER: ", always tells you the time of the last export, the locales of the export and the current version. The last line, always beginning with "URL: ", tells you where to download the dump of the Fahrplan as tar.gz and provides a full URI.
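As a minimal sketch of reading this file, the following Python function picks out the two lines by their prefixes. It assumes the "VER: " and "URL: " entries sit on separate lines, as described above; the function name parse_version_file is made up for illustration and is not part of any provided tooling.

```python
def parse_version_file(text):
    """Return (ver, url) taken from the text of the version file.

    Assumes one "VER: " line and one "URL: " line, each on its own line.
    """
    ver = None
    url = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("VER: "):
            ver = line[len("VER: "):]
        elif line.startswith("URL: "):
            url = line[len("URL: "):]
    if ver is None or url is None:
        raise ValueError("version file is missing a VER: or URL: line")
    return ver, url
```

The VER value is returned as one opaque string. That is enough to tell whether it differs from an earlier one; splitting it further into timestamp, locale and version would rely on a layout that is only shown by example here.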
This allows you to write a script that automatically downloads the Fahrplan dump and extracts it to a different location. It also ensures that you always retrieve a clean version of the Fahrplan, whereas a crawl might happen during an update of the Fahrplan and might contain inconsistent data and broken links.
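Such a script could look roughly like the sketch below, using only the Python standard library. The function names download_dump and extract_dump, the staging-directory approach and the path check are illustrative choices, not something the dump format prescribes; the URL passed to download_dump would be the one read from the "URL: " line.

```python
import os
import shutil
import tarfile
import tempfile
import urllib.request


def download_dump(url, dest_path):
    """Fetch the tar.gz named on the URL: line and store it at dest_path."""
    with urllib.request.urlopen(url, timeout=60) as response:
        with open(dest_path, "wb") as out:
            shutil.copyfileobj(response, out)


def extract_dump(archive_path, target_dir):
    """Unpack the archive into a staging directory, then replace target_dir."""
    parent = os.path.dirname(os.path.abspath(target_dir))
    staging = tempfile.mkdtemp(prefix="fahrplan-", dir=parent)
    try:
        root = os.path.realpath(staging)
        with tarfile.open(archive_path, "r:gz") as archive:
            for member in archive.getmembers():
                # Refuse entries that would land outside the staging directory.
                resolved = os.path.realpath(os.path.join(staging, member.name))
                if os.path.commonpath([root, resolved]) != root:
                    raise ValueError("unsafe path in archive: " + member.name)
            archive.extractall(staging)
        # mkdtemp creates a private directory; open it up for a web server.
        os.chmod(staging, 0o755)
        if os.path.exists(target_dir):
            shutil.rmtree(target_dir)
        os.rename(staging, target_dir)
    except Exception:
        shutil.rmtree(staging, ignore_errors=True)
        raise
```

Extracting into a staging directory first means that a failed download or a broken archive leaves the previous mirror untouched. There is still a short moment between removing the old directory and renaming the new one in which the target does not exist.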
It might also be a good idea to pull this file from apps that use the JSON, XML or iCal format of the Fahrplan in order to know whether anything might have changed. There won't be any changes without the file being updated, as the structure itself is just a static export of the Fahrplan and the information within the file itself is used as a reference for a fresh download.
As the Fahrplan is hosted inside the congress network this year, you might consider using the internal DNS servers or adding the following line to your /etc/hosts file.
This year there is also a GitHub mirror where you can get the Fahrplan. It should be as up to date as the dump on events.ccc.de, as it is generated by the same scripts, but using it will lighten the load on events.ccc.de. You'll find the mirror at
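One way an app could do this is to keep a copy of the version file from its last check and compare it with a freshly fetched one. The sketch below assumes the version URL given earlier on this page; the helper names fetch_version_text and changed_since_last_check and the idea of a local state file are hypothetical.

```python
import os
import urllib.request

# Version file location as given on this page; adjust for the current year.
VERSION_URL = "https://fahrplan.events.ccc.de/congress/2016/Fahrplan/version"


def fetch_version_text(url=VERSION_URL):
    """Download the small version file and return it as text."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")


def changed_since_last_check(version_text, state_path):
    """Return True if version_text differs from the copy saved at state_path.

    The new text is saved, so the next call compares against it.
    """
    previous = None
    if os.path.exists(state_path):
        # newline="" keeps line endings exactly as they were written.
        with open(state_path, encoding="utf-8", newline="") as handle:
            previous = handle.read()
    if version_text == previous:
        return False
    with open(state_path, "w", encoding="utf-8", newline="") as handle:
        handle.write(version_text)
    return True
```

An app would call changed_since_last_check(fetch_version_text(), state_path) at a modest interval and reload its JSON, XML or iCal data only when the result is True.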
Fahrplan of four main rooms + Sendezentrum + WIKI/self-organized Sessions in Rooms
Only WIKI/self-organized Sessions in Rooms
Merged raw dump from wiki data (event+session)
(updated about every 10 minutes via https://github.com/voc/schedule/blob/master/wiki2schedule_33C3.py )