My site has had search pretty much since its very inception. However, it is used relatively rarely - most visitors arrive at my site via a search engine, view whatever article they clicked on, perhaps follow related internal links, but otherwise don't feel the need to do manual searches (analysis in the past showed that use of the search function dropped dramatically when article tags were introduced).
But, search does get used. I originally thought it'd be interesting to look at whether searches were being placed for things I could (but don't currently) provide.
Search term analysis is interesting and beneficial, because search terms represent things that users are actively looking for. Page views might be accidental (users clicked your result in Google but the result wasn't what they needed), but search terms indicate exactly what they're trying to get you to provide.
As an aside to that though, I thought it'd be far more interesting to look at what category search terms fall under, and how the distribution across those categories varies depending on whether the search was placed against the Tor onion, or the clearnet site.
This post details some of those findings, some of which were fairly unexpected (all images are clicky)
If you've unexpectedly found this in my site results, then congratulations, you've probably searched a term surprising enough that I included it in this post.
The aim was to look solely at search terms used - no relevance is given to whether the term returns any results.
For each category (see below) a regex was built containing the relevant terms. This was then simply run with egrep against lists containing all search terms recorded in the last 12 months.
One list refers to searches received via the hidden service, the other via the Clearnet CDN.
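As a rough illustration of that approach, here is a minimal sketch in Python rather than egrep. The category patterns, the sample terms and the `count_matches` helper are all assumptions made up for the example; the real regexes and logs aren't reproduced in this post.

```python
import re
from collections import Counter

# Hypothetical category regexes; the real ones were built from the logged terms.
CATEGORY_PATTERNS = {
    "Anonymous Overlay Networks": re.compile(r"\b(tor|i2p|onion)\b", re.IGNORECASE),
    "Crypto Currencies": re.compile(r"\b(bitcoin|monero)\b", re.IGNORECASE),
}


def count_matches(terms, patterns=CATEGORY_PATTERNS):
    """Count how many search terms match each category regex.

    Like running egrep once per category, a term is counted in every
    category whose regex matches it.
    """
    counts = Counter()
    for term in terms:
        for category, pattern in patterns.items():
            if pattern.search(term):
                counts[category] += 1
    return counts


# Made-up sample data standing in for the two lists of logged terms.
tor_terms = ["tor bridge setup", "monero wallet", "nginx reverse proxy"]
clearnet_terms = ["i2p vs tor", "bitcoin node", "bitcoin fees"]

tor_counts = count_matches(tor_terms)
clearnet_counts = count_matches(clearnet_terms)
print("Tor:", dict(tor_counts))
print("Clearnet:", dict(clearnet_counts))
```

Keeping the two lists separate from the start means the same per-category counts can be reused both for the aggregate view and for the Tor vs Clearnet comparison.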
By going over the logs a few times, I sorted search terms into various categories (and in some cases, subcategories). The naming of these categories is fairly arbitrary, but terms were divided into:
- Anonymous Overlay Networks
- Crypto Currencies
- Fucked up searches (Sickos)
- Geographic Names
- Internet Technologies
- Video Delivery
- Bombs (seriously, wtf are people searching this site for bomb?)
- Me/My sites
- Downloads etc
- Mobiles (Cell Phones)
- Obvious Exploit Attempts
- People's Names
- Specific Companies
- System Administration
- Specific Software
- Backups etc
- Other/All Else
In the stats/graphs below, I generally only refer to the top level categories unless there's something interesting about the distribution of their subcomponents.
Hopefully most categories are fairly self-explanatory, but some probably deserve some extra notes.
What constitutes porn is obviously very subjective, and it's very hard to accurately infer context from search terms, particularly given there should be no matches on this site for many of the results I observed.
When putting keywords into this category, I've used a fairly broad brush though - if you're searching for "bikini" on this site, I think it's reasonable to assume you're not hoping to find my take on whether bikinis will become Bluetooth enabled.
Some of the terms seen are almost certainly driven by my Photography section, but as the related images would all fall under the NSFW tag, I've considered those NSFW too. The vast majority though, are not.
Whilst there are some searches, it seems none relate to that content, with the highlights instead being searches like "how to make GHB" and "how to rise PH in meth".
Fucked Up Searches
Fucked up searches encompasses searches for content that is, and should be, illegal.
Seriously, if you're going to random sites and searching for terms like "loli", seek help now before you ruin an innocent life.
I won't detail the search terms used for this category: not only do I not want to match against them in Google, but I've no intention of giving anyone ideas to search elsewhere. Again, if you're searching for this kind of content - seek professional help.
Place names basically - I was surprised to find them in there (beyond a few obvious examples). Some of the locations searched related to districts within cities, so having spotted these I was curious to see whether they were being searched by Tor users (potentially giving away information about their own location), or via clearnet (in which case we could probably tell anyway).
Obvious exploit attempts
This category is almost certainly under-represented, as I run a Web Application Firewall (WAF) with quite restrictive settings on the edge. The stats for this category represent the few that made it past that WAF before the backend's WAF (hopefully) blocked them.
This category consists of any terms left over once we exclude the terms used to build the rest of the categories.
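The leftover logic can be sketched as an inverted match, much like `grep -v -E` with every other category's regex combined. The patterns and the `other_terms` helper below are illustrative assumptions, not the regexes actually used.

```python
import re

# Hypothetical patterns standing in for every other category's regex.
CATEGORY_PATTERNS = [
    re.compile(r"\b(tor|i2p)\b", re.IGNORECASE),
    re.compile(r"\b(bitcoin|monero)\b", re.IGNORECASE),
]


def other_terms(terms, patterns=CATEGORY_PATTERNS):
    """Return the terms that no category regex matches."""
    return [t for t in terms if not any(p.search(t) for p in patterns)]


# Small sample list; the last term and "Cocteau Twins" appear later in this post.
sample = ["tor bridge", "Cocteau Twins", "bitcoin node", "Nuke the moon"]
leftover = other_terms(sample)
print(leftover)
```

One consequence of building the category this way is that it depends entirely on how complete the other regexes are: any term they miss ends up here.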
Although not always successful, I try to avoid making political posts on this site. However, political content is sometimes being searched for.
And so start the graphs
It probably makes most sense to start by looking at searches in aggregate (i.e. Tor and Clearnet combined) to get an idea of how searches are distributed.
The Adult category is extremely well represented in searches. Unfortunately, so is Fucked Up (Sickos).
Breaking some of the categories down into their subcategories gives some mildly interesting results.
The vast, vast majority of searches are for
I suspect that Virya (a company most won't have heard of) makes the cut here, because I used to work for Virya Technologies some years ago.
We can see that the majority of searches within the category Hacking/Cracking relate to vulnerability exploitation rather than to (tangentially) related technologies.
Bad news for advertisers: the majority of users searching for Privacy-related terms were searching for those categorised as being about adblocking.
The majority of searches within the Misc category related to people trying to find information about me, my sites, or downloads that I offer.
A staggering 2.1% of searches were for the term "bomb". I... I have no words for that. Nearly 15% of searches were either looking for information to improve their drug making or to blow shit up (maybe, like in Breaking Bad, some want to do both?).
Although there is some bias towards searches from the clearnet (world wide web), the two sources aren't that far off being equal in distribution.
There's a very good chance you've come into this with the same expectations as me. My expectations were:
- Anonymous Overlay related searches will mostly be from Tor
- Privacy related searches will probably split evenly across both (maybe slightly more for Tor)
- Censorship/Filtering related searches will probably split evenly across both (maybe slightly more for Tor)
- Fucked up searches will be Tor users (taking advantage of their anonymity)
- Adult/Porn/NSFW will probably split across both fairly evenly
- Geographic names will (hopefully) not be on Tor at all
- Exploit attempts will be from both
Rather than generating comparative pie charts for each category, I decided instead to create column charts showing the percentage of searches that came from Tor for that category (i.e. the Clearnet percentage is whatever's left to reach 100).
So, most of those initial expectations were wrong:
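The calculation behind each column is simple enough to sketch. The counts below are made up for the example and are not the figures behind the charts; `tor_share` is a hypothetical helper name.

```python
def tor_share(tor_count, clearnet_count):
    """Percentage of a category's searches that arrived via Tor."""
    total = tor_count + clearnet_count
    if total == 0:
        return None  # no searches in this category at all
    return 100.0 * tor_count / total


# Made-up (tor, clearnet) counts per category.
counts = {
    "Category A": (1, 3),
    "Category B": (5, 0),
}

shares = {name: tor_share(t, c) for name, (t, c) in counts.items()}
for name, pct in shares.items():
    # The Clearnet share is simply whatever is left to reach 100.
    print(f"{name}: {pct:.1f}% Tor, {100 - pct:.1f}% Clearnet")
```

Note that this measures the split within a category, so it is affected by how many searches each source produces overall.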
- Less than 25% of searches about Anonymous Overlay Networks came from Tor
- Only about 30% of searches about Privacy came from Tor
- No Censorship/Filtering related terms originated from Tor
- Fucked Up searches mainly originate from Tor, but a surprising percentage came via Clearnet
- Most Adult/Porn/NSFW searches came via Tor
- Nearly all Geographic name searches originated on Tor - hopefully that's not people searching their own locations
- All exploit attempts came in via Tor. That may reflect higher activity there, or it may be that my Clearnet filters are better at catching attempts (impossible to tell from these stats alone)
This becomes more apparent when looking at the category distribution per source
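For clarity, the per-source distribution is each category's count as a percentage of that source's own total. A minimal sketch, assuming each search term is counted in exactly one category and using made-up counts (the `distribution` helper is hypothetical):

```python
def distribution(category_counts):
    """Convert raw per-category counts into percentages of the source's total."""
    total = sum(category_counts.values())
    if total == 0:
        return {}
    return {cat: 100.0 * n / total for cat, n in category_counts.items()}


# Made-up counts, one dict per source.
tor_counts = {"Category A": 6, "Category B": 2, "Other": 2}
clearnet_counts = {"Category A": 1, "Category B": 3, "Other": 4}

tor_dist = distribution(tor_counts)
clearnet_dist = distribution(clearnet_counts)

for cat in tor_counts:
    print(f"{cat}: Tor {tor_dist[cat]:.1f}%, Clearnet {clearnet_dist[cat]:.1f}%")
```

Normalising within each source like this means the comparison isn't skewed by one source simply producing more searches than the other.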
We can see, then, that Fucked Up accounts for a disturbing proportion (5%) of clearnet searches. These are people who are connecting out, over the open internet, to a random site and then searching for Child Sexual Abuse material.
Because the percentage was so much higher than expected, I ran a quick grep over the raw logs to check the terms (in case the stat collection had produced some false positives), but unfortunately not: they genuinely all are pedo terms.
Whatever the source, I find the habit of searching people's names quite odd. It looks, though, like a good chunk of those are people trying to find porn of that person: the names are primarily female. Although there are a couple of male names mixed in, they tend to be tech CEOs and the like.
Odd Search Terms
It'd probably be remiss of me not to note some of the more surprising terms that were observed in the logs. I've already mentioned (a couple of times) that people seem to view me as a potential source of bomb making tips.
Unsurprisingly, as we've seen, some of the terms were people searching for the unforgivable.
I think more or less everyone in tech gets emails along the lines of "can you help me hack a facebook account", so it comes as no surprise to find similar searches in the resultset.
Less expected though were terms like:
- How to DJ right (sorry chap, I can't hold a beat for toffee)
- How to make GHB
- How to correct PH balance in Meth
- watchmygirlfriend (seems to be a revenge porn site - you won't find any support for that here. Seek professional help)
- Cocteau Twins (no idea why you'd search that here, but they're a band)
- Nuke the moon
I guess, ultimately, people see a search box and type whatever they want into it - it's just not really clear how you end up on my site in the first place if you're looking for Cocteau Twins.
The first thing that really stands out in all of this is how the traffic is distributed across the two sources in comparison to expectations.
It would have been reasonable to assume that as Tor users have more anonymity and privacy, as a group they'll be both more privacy and censorship focused, and more likely to "exploit" that anonymity.
This seems to actually only be partly true.
It's true that Tor users are more likely to search for Sexual Content (though they're far from exclusive in this regard), but despite Tor's use in circumventing censorship and filtering, they're much less likely to search for that content (perhaps because they feel they've already circumvented it?).
Surprisingly, search terms focused on Programming and mobile devices are far more likely to originate from Tor than the clearnet. Conversely, searches relating to hacking and vulnerability exploitation are much less likely to originate from Tor than from the clearnet.
Queries relating to (sometimes quite granular) geographic locations were observed in the logs - the majority of them came from Tor users. This carries the implication that users may complacently be putting details of their location into unknown sites.
As surprising as it might have been to find people searching for bomb making instructions, it's probably less surprising that every one of these queries was received via Tor.
Users on the traditional world wide web were more likely to search for information relating to System Administration than anything else. This makes sense in many ways, considering that the majority of posts on this site probably relate to systems management in one way or another.
Where more extreme searches (such as category Fucked Up) are made, it's clear that the source is statistically more likely to be a Tor user.
But, it's also clear that the clearnet has its own share of problematic users, with some truly disturbing search terms being received through that channel.
However, it's also extremely clear that objectionable content is not the sole purpose for which Tor is used, with the vast majority of searches relating to much more acceptable content.
To use an analogy - you don't have to be racist to have voted for Brexit, but if you are racist then you probably did vote Leave as only a minority of racists voted Remain.
I must admit, though, to a certain temptation to redirect the results page for some of those terms to https://fbi.gov