28C3 - Version 2.3.5

28th Chaos Communication Congress
Behind Enemy Lines

Speakers
Fabian Mihailowitsch
Schedule
Day Day 2 - 2011-12-28
Room Saal 2
Start time 13:15
Duration 00:30
Info
ID 4770
Event type Lecture
Track Hacking
Language used for presentation English

Don't scan, just ask

A new approach of identifying vulnerable web applications

For years, we have tried to identify vulnerable systems in company networks by collecting all of the company's netblocks / IP addresses and scanning them for vulnerable services. Then, with the growing importance of web applications and, of course, search engines, a new way of identifying vulnerable systems was introduced: "Google hacking".

However, this approach of identifying and scanning a company's IP addresses and doing some Google hacking for the (known) URLs of the company doesn't take all aspects into account and has some limitations. First, we only check the systems that are obvious: the ones in the company's netblocks, the IP addresses provided by the company, and the URLs that are known or can be resolved using reverse DNS. But what about URLs and systems that aren't obvious, systems that maybe even the company in focus forgot? Second, the current techniques are pretty technical. They don't take the business view into account at any point.

Therefore we developed a new technique as well as a framework to identify a company's web pages based on a scored keyword list. In other words: From zero to owning all of a company’s existing web pages, even the pages not hosted by the company itself, with just a scored keyword list as input.

What about systems that are hosted by third parties, or web pages that were just released for a marketing campaign, maybe even by a third-party marketing company but in the name of the company we want to check? Possibly not even the company itself remembers all the web applications and domains that are running under its name. These systems/applications won’t be detected using traditional techniques and thus pose a potential security risk for the company. Second, the current techniques are pretty technical. They don't take the business view into account. That means we try to identify certain applications using technical information like version banners or the company's IP addresses in order to identify its systems. But how about the other way around: trying to identify applications and systems by using the company’s business data (e.g. product names, company names, tax identification numbers, contact persons, …) and then testing the identified systems and applications for vulnerabilities?

That is what we did. The idea is to build up a scored keyword list for the company in focus. This list contains general keywords like the company name and product names, more detailed keywords like an address contained in imprints, and very specific keywords like the company's tax number. Every keyword in that list is then rated by a human, which means specific keywords get a higher score than general keywords. In the next step a spider uses these keywords to query search engines like Bing, Google, etc. and stores the URLs of all web sites identified in a database, along with their score. If a web site that is already in the database is found for another keyword, the score of that entry is simply increased. At the end, we get a list of web sites that contained one or more of the keywords, along with a score for each web site. Then each URL is checked for whether it contains one of the keywords (e.g. the company name). If it does, the score of the page is increased again. Then, for each entry, the FQDN and the IP address are resolved and a whois query is executed. If the whois record contains the company name, the score is increased again. Furthermore, the country codes are used to remove results that are not in the target country.
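The scoring step can be illustrated with a minimal sketch. It is not the framework's actual code: the function name `score_sites`, the choice to key results by hostname, the `url_bonus` value, and the example keywords and hosts are all assumptions made for illustration. Search engine results are passed in as plain data, and the whois and country-code steps are left out.

```python
from collections import defaultdict
from urllib.parse import urlparse


def score_sites(keyword_scores, search_hits, name_keywords, url_bonus=5):
    """Accumulate a score per web site from scored keyword hits.

    keyword_scores: {keyword: human-assigned score}
    search_hits:    {keyword: [URLs a search engine returned for it]}
    name_keywords:  strings to look for in the hostname (assumed bonus rule)
    """
    scores = defaultdict(int)

    # A site found for several keywords gets the sum of their scores.
    for keyword, urls in search_hits.items():
        weight = keyword_scores[keyword]
        hosts = {urlparse(url).hostname for url in urls}
        for host in hosts:
            if host:
                scores[host] += weight

    # Increase the score again if the hostname itself contains a keyword.
    for host in scores:
        if any(name.lower() in host for name in name_keywords):
            scores[host] += url_bonus

    return dict(scores)


# Hypothetical input: a general keyword (low score) and a tax number (high score).
keyword_scores = {"Example Corp": 1, "DE123456789": 10}
search_hits = {
    "Example Corp": ["http://www.example-corp.test/a", "http://news.test/x"],
    "DE123456789": [
        "http://www.example-corp.test/imprint",
        "http://promo.test/imprint",
    ],
}
ranking = score_sites(keyword_scores, search_hits, ["example-corp"])
```

In this toy input the third-party host `promo.test` ends up well above the news site because it matched the very specific tax number, which is the effect the scored list is meant to produce.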

At the end of that process, we have a scored list of URLs and FQDNs that could be found using company-specific keywords. Since that process yields (depending on your keyword list) hundreds of thousands of unique hits, the list has to be reduced. Therefore we did some research on the generated results and found a decent way to reduce them to an amount that can be checked manually by a human. The identified company web pages are then passed to a crawler that extracts external links from those pages, the idea being that genuine company pages might link to other company pages, and integrates them into the results list. Using this technique in practice, it is possible to identify a lot of web sites hosted (even by third parties) for one company.
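The external-link step of the crawler could look roughly like the following sketch, which uses only the Python standard library. The names `LinkCollector` and `external_hosts` are hypothetical, the page content is passed in as a string rather than fetched, and "external" is assumed here to mean "different hostname than the page itself".

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href values of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def external_hosts(page_url, html):
    """Return the hostnames a page links to, excluding its own host."""
    own_host = urlparse(page_url).hostname
    collector = LinkCollector()
    collector.feed(html)

    hosts = set()
    for href in collector.hrefs:
        # Resolve relative links against the page URL before comparing hosts.
        target = urlparse(urljoin(page_url, href))
        if target.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, etc.
        if target.hostname and target.hostname != own_host:
            hosts.add(target.hostname)
    return hosts


# Hypothetical page with one internal, one external and one mailto link.
page = (
    '<a href="/about">About</a>'
    '<a href="http://promo.test/win">Campaign</a>'
    '<a href="mailto:info@example-corp.test">Mail</a>'
)
found = external_hosts("http://www.example-corp.test/", page)
```

The returned hostnames would then be candidates to add to the results list.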

During the crawling process, not just external links are extracted: all forms, HTTP parameters, and certain parts of the web content are stored as well. Thus, besides a list, we have a "mirror" of the web page as well as the forms and dynamic functions that make up its attack surface.
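As a rough illustration of what storing forms and HTTP parameters might involve, the sketch below extracts the query parameters of a page URL and the forms found in its HTML. The names `FormCollector` and `attack_surface` and the shape of the returned dictionary are assumptions; the talk abstract does not describe the framework's actual storage format.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urljoin, urlparse


class FormCollector(HTMLParser):
    """Collect each form's action, method and named fields."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.forms = []
        self._current = None  # the form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "form":
            self._current = {
                "action": urljoin(self.base_url, attributes.get("action") or ""),
                "method": (attributes.get("method") or "get").upper(),
                "fields": [],
            }
            self.forms.append(self._current)
        elif tag in ("input", "select", "textarea"):
            # Only named fields inside a form are sent to the server.
            if self._current is not None and attributes.get("name"):
                self._current["fields"].append(attributes["name"])

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def attack_surface(page_url, html):
    """Return the query parameter names and forms found for one page."""
    collector = FormCollector(page_url)
    collector.feed(html)
    query = parse_qs(urlparse(page_url).query, keep_blank_values=True)
    return {"query_params": sorted(query), "forms": collector.forms}


# Hypothetical campaign page with one POST form.
page = (
    '<form action="/submit" method="post">'
    '<input name="email"><input type="submit">'
    "</form>"
)
surface = attack_surface("http://promo.test/win?id=7&lang=de", page)
```

A record like this per page would give later analysis modules the parameters and form fields to test.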

The information collected can then be used as input to special analysis modules. For some of our projects we integrated WAFP (Web Application Finger Printer), SQLMap and other well-known tools, as well as some self-written fuzzers and fingerprinters, into that process. This way the whole process, from identifying web pages belonging to a certain company up to analyzing them for vulnerabilities, can be fully automated.

In other words: From zero to owning all of a company’s existing web pages, even the pages not hosted by the company itself, with just a scored keyword list as input.

During our talk we will present our idea as well as our approach to identifying vulnerable web applications that belong to a certain company, based on business data. Furthermore, we will explain how our framework is structured and how it performs the searching as well as the vulnerability assessment in an automated way, so everybody who is interested will be able to implement their own version or adapt certain ideas for their projects. Besides just telling you how it could work, we will also demo our framework, which performs all of the steps described above automatically.