DNS Name Prediction With Google

Johnny Long

johnny@ihackstuff.com

http://johnny.ihackstuff.com

 

Introduction

 

As discussed in “Google Hacking for Penetration Testers” from Syngress Publishing[1], there are many different ways to perform network reconnaissance using Google. Since the publication of that text, many new ideas and techniques have come to light. This document addresses one interesting technique, which we’ll call DNS name[2] prediction. This document assumes you have some knowledge of basic network recon, and is not intended as a hand-holding approach to hacking. If you’re evil, stop reading this and go work out some aggression on a sack-o-potatoes or something.

 

Why Google?

 

If an attacker is willing to throw tons of packets at a target network, he can develop a fairly decent list of targets in very short order. In most cases the quality of the target list depends on the sheer volume of traffic and the amount of time and effort the attacker is willing to expend on the exercise. From the defender’s standpoint, a flood of mapping traffic is often easy to detect, and most intrusion monitoring tools are well equipped to detect these common signatures. If an attacker is willing to offload some of these tasks to Google, fewer packets are passed to the target, and the attacker can rely on some of Google’s strengths: in this particular case, the ability to group hosts by domain name and the ability to perform intelligent language analysis. In an extreme case, it’s even possible for an attacker to get a decent list of targets without communicating with those targets at all, instead relying only on information from Google. If the attacker is willing to combine Google queries with additional techniques such as standard DNS lookups, he can maintain a relatively low profile while building a very decent target footprint. Security professionals can also use this technique to help protect their clients from undue exposure.

 


Grabbing Hosts

 

The first step in compiling a target list is to decide on a top-level target. For a security professional, this is almost certainly dictated by the customer in the form of network holdings, a domain name, an IP range, or some combination of these. If the target is an IP range, the security tester is very limited in what can be discovered, and in most cases the tester will skip the target discovery phase, instead beginning the assessment by feeding that IP range into some type of industrial-grade vulnerability scanner. If the client has supplied a domain name or has authorized a “carte blanche” scan of all its network holdings, the tester has more discovery leverage and will most likely begin the testing with a fairly robust target discovery phase, in which the techniques discussed here can be used fully.

 

Several techniques for host detection have already been discussed in other places, so for the purposes of this document we’ll assume an attacker has collected hosts with Google queries. For example, an attacker might begin a footprint for sdsu.edu with a query like site:sdsu.edu. Although host and subdomain names can be gathered with simple Google queries (using the site: operator, for example), in many cases automation is needed to quickly parse through the results. One tool that can be used for this purpose is google_miner.pl[3] by Roelof Temmingh of sensepost.com. This tool automates various Google queries (based on the site: operator) aimed at a domain name. Figure 1 shows the output of this tool when aimed at sdsu.edu.

 

Figure 1: google_miner.pl results for sdsu.edu

 

This type of scan produces a decent list of what appear to be host names, although further investigation reveals they are actually subdomains.
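The parsing step that this kind of automation performs can be approximated with standard text tools. The sketch below is not taken from google_miner.pl; the function name extract_names is hypothetical, and it simply assumes that the names of interest appear somewhere in the text of a results page, whatever the page layout happens to be.

```shell
#!/bin/sh
# Hypothetical helper (not part of google_miner.pl): print the unique names
# ending in the given domain that appear in the text on standard input.
extract_names() {
    domain=$1
    # Escape the dots so that "sdsu.edu" is matched literally by the regex.
    pattern=$(printf '%s' "$domain" | sed 's/\./\\./g')
    # Lowercase the input first, since DNS names are case-insensitive.
    tr 'A-Z' 'a-z' |
        grep -oE "([a-z0-9-]+\.)+${pattern}" |
        sort -u
}

# Example use (the page format returned by Google is an assumption here):
#   lynx -dump "http://www.google.com/search?q=site:sdsu.edu" | extract_names sdsu.edu
```

Because the function only filters text, it can be pointed at saved result pages as easily as at live queries, which keeps the number of requests down.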

 

Expansion: Using Google To Manually Determine Word Relationships

 

Next, we will try to intelligently expand the list of targets. When reviewing this list, it becomes fairly obvious that some of the subdomains (specifically math, geology, chemistry and physics) are named for college disciplines. An attacker could easily expand this search by performing WHOIS or forward DNS queries for other similarly named subdomains or hosts like history.sdsu.edu or geography.sdsu.edu. This technique works well as long as the word pattern is recognized, but this requires human intervention or some relatively heavy lifting by a program. In cases where a pattern is not evident, or where word relationships need to be determined programmatically, our old friend Google once again comes to the rescue. Google provides a relatively easy mechanism for determining whether or not words are related. Let’s take a look at a few examples. First, consider the words cat, dog, and pickle. Anyone can plainly see that only two of the three words are related (cat and dog, both domesticated house pets), and even a poorly contrived computer program would have little trouble figuring this out, but for the sake of example, let’s have Google give us an idea of the relative association of these words. We can do this in three queries, as shown in Table 1.

 

Table 1: Google comparison of cat, dog, and pickle.

Google Query    Google Results
cat dog         11,000,000
pickle dog      341,000
pickle cat      255,000

 

According to Google, the words cat and dog are referenced together on the web much more than either pickle and cat or pickle and dog. This gives cat and dog a stronger relative association. This is a simple example, so let’s take a look at a slightly more complex one. Consider the three words cat, dog and horse. As anyone can tell you, these three words are all related (they all refer to animals, specifically mammals, and to some extent domesticated animals), but a computer program would have some degree of difficulty determining that only two of the three words refer to domesticated house pets, thus making those two words more closely related. Even a child would describe cats and dogs as more related than cats and horses or dogs and horses, even if the child couldn’t accurately describe why those two words are more related. Once again, let’s turn to Google to determine which pair of words floats to the top.

 


Table 2: Google comparison of cat, dog, and horse.

Google Query    Google Results
cat dog         13,600,000
horse dog       9,520,000
horse cat       8,760,000

 

As shown in Table 2, the terms cat and dog again float to the top even though all the words are fairly closely associated. The point here is that Google offers a relatively simple mechanism for an attacker to determine relationships between entire lists of words, and as we’ll see in the next step, word relationships can play a key role in DNS name prediction. Let’s focus again on our example scan of sdsu.edu, which as an aside is an excellent institution, and my use of their domain is in no way a statement about their overall security posture. Have I said enough to keep myself in SDSU’s good graces? =)
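The comparison made in Tables 1 and 2 boils down to sorting word pairs by hit count. As a minimal sketch, assuming the counts have already been collected into tab-separated "pair, hits" lines (the function name rank_pairs is made up for illustration), the ranking can be done like this:

```shell
#!/bin/sh
# Rank "pair<TAB>hits" lines from standard input, strongest association first.
# Commas in the hit counts (as Google prints them) are stripped before sorting.
rank_pairs() {
    tr -d ',' | sort -t "$(printf '\t')" -k2,2nr
}

# Hit counts copied from Table 2, deliberately fed in out of order.
printf 'horse cat\t8,760,000\ncat dog\t13,600,000\nhorse dog\t9,520,000\n' | rank_pairs
```

With the Table 2 figures, the cat dog line sorts to the top, followed by horse dog and horse cat.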

 

Expansion: Automating Google Word Relationship Determination

When we last left sdsu.edu, we had a list of subdomain names and had determined that there was some sort of pattern to the naming convention they were using. Using the Google query techniques we just explored, we could manually create Google query pairs from the DNS names we’ve already gathered. Examining a part of the name list, we would see names like:

 

chemistry.sdsu.edu 

csrc.sdsu.edu 

ces.sdsu.edu 

physics.sdsu.edu 

ivcampus.sdsu.edu 

borderecoweb.sdsu.edu

 

Since our goal is to discover more DNS names, we could start querying Google with every combination of the words chemistry, csrc, ces, physics, ivcampus and borderecoweb, first to locate a pattern in their naming convention, and then ultimately to expand that pattern to locate more names, which may be subdomains or actual hosts. Instead of relying on manual Google queries, we’ll use the program hostsieve[4] by Jimmy Neutron to do the work for us. As shown in Figure 2, we feed hostsieve the hosts we’ve discovered, and an optional proxy server to bounce the Google queries off of. Hostsieve will chop off the relevant part of the host names, create a list of words, pair those words into every possible combination, and submit Google queries for each word pair, printing the top three result pairs.

 


Figure 2: Hostsieve determines related words from DNS names using Google

 

Hostsieve rightly displays the words chemistry and physics as being the most closely related by a spread of nearly 12 million hits.
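The steps hostsieve is described as performing (strip the domain, pair the words, score each pair, keep the top three) can be sketched in a few small functions. This is not hostsieve’s actual code. The function names are invented for illustration, and the scoring command is left to the caller because the way a hit count is obtained from Google is an assumption that changes over time.

```shell
#!/bin/sh
# Strip the domain suffix from each name on stdin, e.g. with domain "sdsu.edu"
# the name "chemistry.sdsu.edu" becomes "chemistry". Other names are dropped.
strip_domain() {
    domain=$1
    awk -v suffix=".$domain" '
        { name = $1; cut = length(name) - length(suffix) }
        cut > 0 && substr(name, cut + 1) == suffix { print substr(name, 1, cut) }
    '
}

# Emit every unordered pair of the words on stdin, one "word1 word2" per line.
word_pairs() {
    awk '{ w[NR] = $1 }
         END { for (i = 1; i <= NR; i++)
                   for (j = i + 1; j <= NR; j++)
                       print w[i], w[j] }'
}

# Score each pair with a caller-supplied command and print the three best.
# $1 names a command (hypothetical, supplied by the reader) that prints a
# hit count for the two words passed as its arguments.
top_pairs() {
    counter=$1
    while read -r a b; do
        printf '%s\t%s %s\n' "$("$counter" "$a" "$b" </dev/null)" "$a" "$b"
    done | sort -k1,1nr | head -n 3
}

# Example: strip_domain sdsu.edu < names.txt | word_pairs | top_pairs my_hit_counter
```

Note that six words already produce fifteen pairs, so the number of queries grows quickly with the size of the name list.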

 

Manual DNS Name Prediction With Google Sets

Once an attacker discovers a pattern in the naming convention, he would logically expand that pattern in an attempt to predict more host names. Google again comes to the rescue through the use of the Google Sets program, which “automatically create sets of items from a few examples.”[5] For example, Google Sets would expand the words dog and cat to the next most relevant words, as shown in Figure 3.

 


Figure 3: Basic set expansion with Google Sets

 

This small set of 15 predicted terms could be expanded to a large set very easily, although the strength of the word associations suffers once more results are presented.

 

Applied to our SDSU names, Google Sets expands the words chemistry and physics to the following small set:

 

 

Armed with this list, forward DNS queries could be made for each of these words, followed by the sdsu.edu domain name.
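A minimal sketch of that lookup step follows. It assumes a system that provides the getent utility for name resolution (host or nslookup would serve equally well), and the function names are made up for illustration.

```shell
#!/bin/sh
# Build candidate names: one "<word>.<domain>" per word read from stdin.
candidates() {
    domain=$1
    while read -r word; do
        if [ -n "$word" ]; then
            printf '%s.%s\n' "$word" "$domain"
        fi
    done
}

# Resolve each name on stdin and print "name address" for those that resolve.
# Assumption: getent is available; names that do not resolve are skipped.
resolve_names() {
    while read -r name; do
        addr=$(getent hosts "$name" </dev/null | awk '{ print $1; exit }')
        if [ -n "$addr" ]; then
            printf '%s %s\n' "$name" "$addr"
        fi
    done
}

# Example: printf 'history\ngeography\n' | candidates sdsu.edu | resolve_names
```

Unlike the Google queries, these lookups do generate DNS traffic that the target’s name servers may see.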

 

Automated DNS Name Prediction With Google Sets

Although we could certainly construct and execute these nslookups manually with a simple shell script, a much more elegant approach would include a program written to perform both the Google Sets expansion as well as the forward lookup function, given only a pair of terms to expand and a domain name to use for queries. Dnspredict[6] by Jimmy Neutron performs these functions.

 

Figure 4: dnspredict uses Google Sets to predict DNS names

 

When provided with two similar words and a domain name, dnspredict will (as shown in Figure 4) first expand the two provided items with Google Sets, append the domain name to each of the returned words, and perform a forward DNS lookup for each constructed DNS name, returning the address for each valid host. In this way, dnspredict can take some of the guesswork out of DNS name prediction. Fed only with the words chemistry and physics, dnspredict located thirteen more DNS names at sdsu.edu, as shown in Table 3.


 

Table 3: New Predicted DNS names

Original DNS names          New, Predicted DNS names
bio.sdsu.edu                accounting.sdsu.edu
biologylessons.sdsu.edu     anthropology.sdsu.edu
borderecoweb.sdsu.edu       art.sdsu.edu
ces.sdsu.edu                astronomy.sdsu.edu
chemistry.sdsu.edu          biology.sdsu.edu
cs.sdsu.edu                 education.sdsu.edu
csrc.sdsu.edu               engineering.sdsu.edu
drjamessallis.sdsu.edu      geography.sdsu.edu
edcenter.sdsu.edu           history.sdsu.edu
eli.sdsu.edu                linguistics.sdsu.edu
foundation.sdsu.edu         nursing.sdsu.edu
geology.sdsu.edu            philosophy.sdsu.edu
ivcampus.sdsu.edu           psychology.sdsu.edu
math.sdsu.edu
music.sdsu.edu
physics.sdsu.edu
sa.sdsu.edu
scec.sdsu.edu
sci.sdsu.edu
serg.sdsu.edu

 

 

Just because a DNS name resolves doesn’t mean it’s a live target. To confirm that these targets actually exist, an attacker would take steps to verify the vitality of each one. There are several ways to do this, of course, but since I personally love using Google to do goofy stuff, let’s get Google to do a first-pass vitality check of all the new hosts. This can be as simple as performing site: queries against each of the new DNS names. Although it’s a bit ugly, a simple bash script can perform this task. If we first populate a file (po2) with a list of our newfound targets, we can execute this simple script to query Google for the sites:

 

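# Ask Google about each DNS name listed (one per line) in the file po2
for site in $(cat po2)
do
    printf '%s: ' "$site"
    # Show Google's result-count line, or a blank line if the site is unknown
    lynx -dump "http://www.google.com/search?q=site:$site" | grep "of about" || echo " "
done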

 

The output of this script (shown in Figure 5) indicates that Google has information about several of the hosts, including art, engineering, geography, nursing, philosophy, and psychology. This is a nice technique for mapping purposes since a Google site: query will return a hit even if the supplied DNS name is actually a subdomain rather than a hostname.


 

Figure 5: Querying Google For DNS Name Information

 

 

At this point, we are left with several hosts which have valid addresses, but which Google knows nothing about. Stepping outside of Google to use more traditional vitality checks such as TCP/ICMP ‘pings’ or port scans, we discover that some of the Google-invisible hosts such as anthropology, astronomy, biology and education are actually alive, functional web servers, as shown in Table 4. The interesting thing here is that Google helped discover hosts that even Google didn’t know about.

 

Table 4: Google Results and Server Status of Found Hosts

Predicted DNS Name        Results from site: query    Up / Down?
engineering.sdsu.edu      6,370                       Up
geography.sdsu.edu        3,500                       Up
art.sdsu.edu              2,600                       Up
psychology.sdsu.edu       416                         Up
nursing.sdsu.edu          38                          Up
philosophy.sdsu.edu       19                          Up
accounting.sdsu.edu       0                           Down
anthropology.sdsu.edu     0                           Up
astronomy.sdsu.edu        0                           Up
biology.sdsu.edu          0                           Up
education.sdsu.edu        0                           Up
history.sdsu.edu          0                           Down
linguistics.sdsu.edu      0                           Down
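The cross-check behind Table 4 (names with no Google results that nonetheless answer on the network) can be sketched as follows. The sketch assumes a netcat binary that supports the -z and -w options, treats "up" as "accepts a TCP connection on port 80", and uses made-up function names. It is one possible vitality test, not necessarily the one used to build the table.

```shell
#!/bin/sh
# Succeed if a TCP connection to the host can be opened (port 80 by default).
# Assumption: nc supports -z (connect without sending data) and -w (timeout).
is_up() {
    nc -z -w 3 "$1" "${2:-80}" </dev/null >/dev/null 2>&1
}

# Read "name google_hits" lines from stdin and print the names that Google
# returned no results for but that still respond to the vitality check.
invisible_but_alive() {
    while read -r name hits; do
        if [ "$hits" = "0" ] && is_up "$name"; then
            printf '%s\n' "$name"
        fi
    done
}

# Example: invisible_but_alive < site_counts.txt
```

Unlike the site: queries, this check does send packets to the target, so it belongs at the point where staying quiet matters less than accuracy.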

 


Conclusion

To an attacker, each new DNS name serves as a new target. By scratching the surface of the nice folks over at SDSU, we’ve seen that DNS prediction with Google helped locate thirteen additional DNS names, including four live web servers that Google didn’t even know about. This process is fairly simple, and begins with some sort of hostname gathering using any method, including the google_miner.pl tool from Sensepost, which relies on Google searching. Once a list of hostnames is established, Google can be used, either manually or with the help of a program like hostsieve, to determine patterns in the way the names were created. Once a pattern is observed, Google Sets can be used, again either manually or with the dnspredict tool, to predict and verify additional hosts named through extension of that naming pattern. Finally, Google can be used alongside other techniques to test the vitality (the existence or connectedness) of those discovered hosts. As with most other Google Hacking techniques, most of this discovery (when performed properly) can be completed by querying Google, without sending a single packet to the target host.

 

Afterword

All this begs the question “So what can be done about this?” The simple fact is, if an attacker isn’t sending you packets, there’s not much you can do, at least not directly. You could take steps to ensure that your public footprint is exactly what you want it to be, and that means footprinting yourself. Armed with the insider knowledge you possess, you can cut to the chase and figure out what will be detected without actually doing a footprint. As I’ve always said, if you don’t want it public, don’t put it on the Internet. It’s just that simple. Now, if an attacker is willing to send you some packets to get the job done, there’s quite a bit you can do actively and reactively to keep things from getting too out of hand. One alternative is outright deception and mild strikeback (see When the Tables Turn by Sensepost research[7] and Aggressive Network Self-Defense by Syngress Publishing[8]) although these steps are generally too radical for most folks. If you’re getting DNS queries, you should be monitoring them, and if the logs get too insane, you should parse them for patterns. If you’re getting Google referrers, you should be parsing them as well to figure out if someone is Google Hacking away at your server. These topics, however, will be the subject of another paper. Stay tuned.
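As a small taste of the referrer parsing mentioned above, the sketch below picks out web server log entries whose referrer is a Google search that used the site: operator. It assumes a log format in which the referrer appears as a double-quoted URL (as in the common "combined" format), and the function name is hypothetical.

```shell
#!/bin/sh
# Print log lines whose quoted referrer is a Google search containing a
# site: operator (either literal or URL-encoded as %3A) in the q= parameter.
google_site_referrers() {
    grep -iE '"https?://([a-z0-9-]+\.)*google\.[a-z.]+/search\?[^"]*q=[^"&]*site(:|%3a)'
}

# Example: google_site_referrers < access_log
```

A sudden run of such referrers aimed at many different pages is a reasonable hint that someone is mapping the server through Google.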

 

Thanks

Thanks to God for the gift of life, and my family for the gift of love. Thanks to Jimmy Neutron and Roelof Temmingh for the great code used in this paper. Thanks to all the moderators at http://johnny.ihackstuff.com: Murfie, JN, thePsyko, JBrashars, Wasabi. Thanks to all the contributors to the GHDB and the forums, especially the Google Masters.



[1] http://www.syngress.com/catalog/?pid=3150

[2] It’s worth mentioning that there is a distinct difference between the term “hostname” or “DNS name” and “subdomain” although these terms are often used interchangeably. We use the term ‘DNS name’ to refer to any resolvable address, so if a subdomain resolves… we call it the wrong thing. Sorry.

 

[3] google_miner is available from the sensepost.com website, perhaps under a different name: http://www.sensepost.com/garage_portal.html#Miscellaneous

 

[4] Hostsieve is available as part of the dnspredict package which can be downloaded from http://johnny.ihackstuff.com

[5] http://labs.google.com/sets

[6] dnspredict is available as part of the dnspredict package which can be downloaded from http://johnny.ihackstuff.com

 

[7] http://www.sensepost.com/garage_portal.html

[8] http://www.syngress.com/catalog/?pid=3190