21C3 Schedule Release 1.1.7

21st Chaos Communication Congress
Lectures and workshops

Speakers
Steven James Murdoch
Maximillian Dornseif
Schedule
Day 1
Location Saal 4
Start Time 11:00 h
Duration 01:00
INFO
ID 271
Type Lecture
Track Hacking
Language english

Hidden Data in Internet Published Documents

Many files are being published on the Internet which hold unexpected (and potentially embarrassing) data. We examine different cases of hidden data in file formats (including Word, PDF and JPEG) and show examples of these from a crawl of the Internet.

There is a growing trend to publish information on the Internet rather than through more conventional paper-based distribution. While this brings many benefits, complex document formats increase the risk of unintended disclosure of information. A reasonably well-known example is hidden information in Microsoft Office documents, in particular Word. These contain several items of potentially compromising hidden data. For example, the GUID (Globally Unique IDentifier) allows different documents to be linked together and allows the Ethernet address of the author's computer to be derived. The revision history shows previous edits and links them to a name. Even if revision tracking is turned off, the undo history can provide similar data.
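
As a rough illustration of how an Ethernet address can be read out of a GUID, the sketch below assumes the GUID is a time-based (version 1) identifier whose 48-bit node field holds the network card address. That assumption does not hold for every GUID, and the function name mac_from_guid is hypothetical, chosen only for this example.

```python
import uuid


def mac_from_guid(guid_text):
    """Return the MAC address embedded in a version-1 GUID, or None.

    Assumption: only time-based (version 1) GUIDs carry a node field
    that may be derived from a network card address.
    """
    guid = uuid.UUID(guid_text)  # accepts plain and {braced} forms
    if guid.version != 1:
        return None
    node = guid.node  # 48-bit node field
    # A set multicast bit in the first octet marks a randomly chosen
    # node value rather than a real hardware address.
    if (node >> 40) & 0x01:
        return None
    octets = [(node >> shift) & 0xFF for shift in range(40, -1, -8)]
    return ":".join(f"{octet:02x}" for octet in octets)


if __name__ == "__main__":
    # Build a sample GUID with a made-up node value for demonstration.
    sample = str(uuid.uuid1(node=0x001122334455))
    print(sample, "->", mac_from_guid(sample))
```

Two documents whose GUIDs yield the same address were plausibly produced on the same machine, which is what makes this field useful for linking documents.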

Likewise, PDF documents contain metadata on the author and the software used. Also, since PDF can contain vector-based graphics, information that is not shown on screen because it is obscured by a different object may still exist in the file. This is a particular problem with redaction, where confidential information is covered with a black rectangle. If the redaction is performed in the PDF producer rather than by editing the original image or text, the redacted material remains in the file. While it will not be shown, the PDF file can be modified to reveal the data, or tools could be written to extract the data directly.
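
To give a feel for how easily such metadata can be read, here is a deliberately naive sketch that scans raw PDF bytes for a few common document information entries. It assumes the entries are stored as unencrypted, uncompressed literal strings without nested or escaped parentheses; a real tool would need a proper PDF parser. The function name scan_pdf_info and the sample data are made up for this example.

```python
import re

# Keys commonly found in a PDF document information dictionary.
INFO_KEYS = (b"Author", b"Creator", b"Producer", b"Title")


def scan_pdf_info(data):
    """Naively scan raw PDF bytes for literal-string metadata entries.

    Matches text such as "/Author (Jane Doe)". Hex strings, escaped or
    nested parentheses, compressed object streams and encrypted files
    are not handled.
    """
    found = {}
    for key in INFO_KEYS:
        match = re.search(rb"/" + key + rb"\s*\(([^()\\]*)\)", data)
        if match:
            found[key.decode("ascii")] = match.group(1).decode("latin-1")
    return found


if __name__ == "__main__":
    # Hand-written fragment standing in for a real file's contents.
    sample = (b"%PDF-1.4\n1 0 obj\n"
              b"<< /Author (Jane Doe) /Producer (ExampleWriter 1.0) >>\n"
              b"endobj\n")
    print(scan_pdf_info(sample))
```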

Another potential leak of data is EXIF thumbnails in JPEG images. These are typically created by digital cameras or image manipulation software, but not all graphics programs update them along with the main image. As a result, edited images can retain the original version of the image in the thumbnail. In some cases this may only be inconvenient, such as rotated images showing the unrotated preview, but in other cases it could be a significant information leak. We crawled the Internet for JPEG images and automated the process of identifying those whose thumbnail was significantly different from the main image. We found that almost 1% of JPEG images had an incorrect EXIF thumbnail. While many were simple cropping, some were considerably more embarrassing. For example, image manipulation was exposed because the thumbnail showed the original unmodified version; the source of an image could be seen despite the copyright notice being cropped out; a supposedly anonymised image showed the identity of the subject; and in some cases the thumbnail of a partially nude photograph revealed more of the subject than originally intended.
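
The following sketch shows one way the embedded thumbnail could be pulled out of a JPEG file using only the Python standard library. It is not the tool used for the crawl described above. It assumes a well-formed file whose EXIF data sits in an APP1 segment and whose thumbnail directory (IFD1) stores the thumbnail offset and length in the standard tags 0x0201 and 0x0202; malformed input may raise struct.error. Comparing the thumbnail with the main image would additionally require decoding both images, which is left out here. The function names are hypothetical.

```python
import struct


def exif_thumbnail(jpeg):
    """Return the embedded EXIF thumbnail bytes from a JPEG, or None."""
    if jpeg[:2] != b"\xff\xd8":  # every JPEG starts with the SOI marker
        return None
    pos = 2
    while pos + 4 <= len(jpeg) and jpeg[pos] == 0xFF:
        marker = jpeg[pos + 1]
        # The segment length is big-endian and includes its own two bytes.
        (length,) = struct.unpack(">H", jpeg[pos + 2:pos + 4])
        payload = jpeg[pos + 4:pos + 2 + length]
        if marker == 0xE1 and payload[:6] == b"Exif\x00\x00":
            return _thumbnail_from_tiff(payload[6:])
        if marker == 0xDA:  # start of scan: metadata segments are over
            return None
        pos += 2 + length
    return None


def _thumbnail_from_tiff(tiff):
    """Locate the thumbnail through IFD1 of the EXIF TIFF structure."""
    order = {b"II": "<", b"MM": ">"}.get(tiff[:2])  # byte order mark
    if order is None:
        return None
    (ifd0,) = struct.unpack(order + "I", tiff[4:8])
    (count,) = struct.unpack(order + "H", tiff[ifd0:ifd0 + 2])
    # The offset of IFD1 is stored right after the entries of IFD0.
    next_at = ifd0 + 2 + 12 * count
    (ifd1,) = struct.unpack(order + "I", tiff[next_at:next_at + 4])
    if ifd1 == 0:
        return None  # no thumbnail directory present
    (count,) = struct.unpack(order + "H", tiff[ifd1:ifd1 + 2])
    tags = {}
    for i in range(count):
        start = ifd1 + 2 + 12 * i
        tag, _type, _n, value = struct.unpack(order + "HHII",
                                              tiff[start:start + 12])
        tags[tag] = value
    offset, size = tags.get(0x0201), tags.get(0x0202)
    if offset is None or size is None:
        return None
    return tiff[offset:offset + size]
```

The returned bytes are themselves a small JPEG, so they can be written to a file and opened in any image viewer for a manual comparison with the main image.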

Our presentation will cover the issues with these formats and show real-world incidents of compromising information leakage.