Version 1.5b Castle in the Sky

lecture: Gibberish Detection 102


DGAs (Domain Generation Algorithms) have become a trusty fallback mechanism for malware. They are a headache to deal with, but they have one big drawback – they draw a lot of attention to themselves with their many DNS requests for gibberish domains.

When basic entropy-based Machine Learning methods rose to the challenge of automatically detecting DGAs, DGAs responded by subtly changing their output to be /just/ plausible enough to fool those methods. In this talk we’ll harness the might of the English dictionary, cut corners to achieve sane running times for insane computations, and use fancy Machine Learning® methods – all in order to build a classifier with a higher standard for gibberish plausibility.

In recent years, malware has increasingly used Domain Generation Algorithms (DGAs) as a fallback mechanism in case the campaign is shut down at the DNS level. DGAs are a headache to deal with, but they have one big drawback – they make a lot of noise. To be more precise, they generate a very large number of DNS requests, and the requested domains are often complete gibberish.

This situation looks ripe to be exploited with your favorite Cyber™ Machine Learning® Big Data© solution; and indeed, advances were made by basic language processing methods that could detect and stop the outright complete gibberish. These worked well until DGAs mutated and started producing more reasonable gibberish. A milestone in this regard was the introduction of KWYJIBO, a DGA that generates gibberish in which every other letter is a vowel (e.g. "garolimoja"), which stumps the old methods completely.
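To make the problem concrete, here is a minimal Python sketch of why a plain character-entropy score struggles with this kind of output. It is not KWYJIBO's actual algorithm, only a toy generator that alternates consonants and vowels; the names `alternating_gibberish`, `is_alternating` and `shannon_entropy` are made up for this illustration.

```python
import math
import random
from collections import Counter

VOWELS = "aeiou"
CONSONANTS = "bcdfghjklmnpqrstvwxyz"

def shannon_entropy(s):
    """Shannon entropy (bits per character) of the letter distribution in s."""
    counts = Counter(s)
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def alternating_gibberish(length, rng):
    """Toy stand-in for a 'plausible gibberish' DGA (assumption, not the real
    KWYJIBO): consonant, vowel, consonant, vowel, ..."""
    return "".join(
        rng.choice(CONSONANTS if i % 2 == 0 else VOWELS) for i in range(length)
    )

def is_alternating(s):
    """True if every odd position holds a vowel and every even one a consonant."""
    return all((ch in VOWELS) == (i % 2 == 1) for i, ch in enumerate(s))

if __name__ == "__main__":
    rng = random.Random(0)
    for name in ["wikipedia", "garolimoja", alternating_gibberish(10, rng)]:
        print(f"{name:12s} entropy={shannon_entropy(name):.2f}")
```

Running it shows the entropy of "garolimoja" landing in the same range as that of an ordinary word, so a threshold on this score alone has little to work with.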

How do you thwart KWYJIBO and other DGAs of its sophistication? How do you look for meaninglessness in string-space? In this talk we’ll harness the might of the English dictionary; cheat mathematics to cut running times from impossible to reasonable; and demonstrate a fancy Cyber™ Machine Learning® Big Data© tool based on all the above to tell apart meaningful domain names from nonsense. Where is this arms race going, anyway? Is there such a thing as undetectable gibberish?
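As a rough illustration of what "harnessing the dictionary" could look like, the sketch below scores a domain label by the fraction of its characters that can be covered by non-overlapping dictionary words, using a small dynamic program. This is an assumption about one possible feature, not the method presented in the talk; `dictionary_coverage` and the tiny `TOY_WORDS` set are hypothetical, and a real word list would be far larger.

```python
# Hypothetical miniature word list, for illustration only.
TOY_WORDS = {"castle", "sky", "secure", "mail", "login"}

def dictionary_coverage(name, words, min_len=3):
    """Fraction of characters in `name` coverable by non-overlapping words.

    best[i] holds the maximum number of covered characters in name[:i].
    Words shorter than min_len are ignored, since very short words match
    almost any string by accident.
    """
    n = len(name)
    if n == 0:
        return 0.0
    max_len = max((len(w) for w in words), default=0)
    best = [0] * (n + 1)
    for i in range(1, n + 1):
        best[i] = best[i - 1]  # option: leave character i-1 uncovered
        for j in range(max(0, i - max_len), i - min_len + 1):
            if name[j:i] in words:
                best[i] = max(best[i], best[j] + (i - j))
    return best[n] / n

if __name__ == "__main__":
    for name in ["securemail", "skycastle9", "garolimoja"]:
        print(name, dictionary_coverage(name, TOY_WORDS))
```

Limiting the inner loop to the longest word length keeps the work roughly linear in the label length, which is the kind of corner-cutting that becomes necessary once the dictionary and the volume of DNS traffic grow.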