• A comparative analysis of search terms used on bentasker.co.uk and its Onion

    My site has had search pretty much since its inception. However, it is used relatively rarely: most visitors arrive via a search engine, view whatever article they clicked on, perhaps follow related internal links, but otherwise don't feel the need to search manually (past analysis showed that use of the search function dropped dramatically when article tags were introduced).

    But, search does get used. I originally thought it'd be interesting to look at whether searches were being placed for things I could (but don't currently) provide.

    Analysing search terms is useful because they represent things that users are actively looking for. Page views might be accidental (users clicked your result in Google but the result wasn't what they needed), but search terms indicate exactly what they're trying to get you to provide.

    As an aside to that though, I thought it'd be far more interesting to look at which categories search terms fall under, and how the distribution across those categories varies depending on whether the search was placed against the Tor onion or the clearnet site.
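
    As a rough illustration of that kind of comparison, here is a minimal sketch. The category names, keyword lists and sample search terms are all made up for the example (they are not taken from the real data), and the real analysis may well have categorised terms differently.

```python
from collections import Counter

# Hypothetical keyword -> category mapping, invented for this sketch.
CATEGORY_KEYWORDS = {
    "web server": {"nginx", "apache"},
    "database": {"mysql", "innodb"},
    "tor": {"tor", "onion"},
}


def categorise(term):
    """Return the first category whose keywords share a whole word with the term."""
    words = set(term.lower().split())
    for category, keywords in CATEGORY_KEYWORDS.items():
        if words & keywords:
            return category
    return "other"


def distribution(terms):
    """Return each category's share of the given search terms."""
    counts = Counter(categorise(term) for term in terms)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: count / total for category, count in counts.items()}


# Made-up sample terms for each site.
clearnet_terms = ["nginx rewrite", "mysql innodb", "tor relay"]
onion_terms = ["tor hidden service", "nginx"]

clearnet_share = distribution(clearnet_terms)
onion_share = distribution(onion_terms)
```

    Comparing `clearnet_share` and `onion_share` category by category then shows where the two audiences differ.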

     

    This post details some of those findings, some of which were fairly unexpected (all images are clicky)

    If you've unexpectedly found this in my site results, then congratulations, you've probably searched a surprising enough term that I included in this post.

     

  • Building a Tor Hidden Service CDN

    Last year I started experimenting with the idea of building a Hidden Service CDN.

    People often complain that Tor is slow, though my domain sharding adjustments to the bentasker.co.uk onion have proven fairly effective in addressing page load times.

    On the clearnet, the traditional aim is to direct the user to an edge node close to them. That's obviously not possible for a Tor Hidden Service (and even if it were, the user's circuit might still take packets half-way across the globe). So, the primary aim is instead to spread load and introduce some redundancy.

    One option for spreading load is to have a load balancer run Tor and then spread requests across the back-end. That, however, does nothing for redundancy if the load balancer (or its link) fails.

    The main aim was to see what could be achieved in terms of scaling out a high-traffic service. Raw data and more detailed analysis of the results can be seen here. Honestly speaking, it's not the most disciplined or structured research I've ever done, but the necessary information should all be there.

    This document is essentially a high-level write-up along with some additional observations.

  • NGinx: Accidentally DoS'ing yourself

    It turned out to be entirely self-inflicted, but I had a minor security panic recently. Whilst checking access logs I noticed (a lot of) entries similar to this:

    127.0.0.1 [01/Jun/2014:13:04:12 +0100] "GET /myadmin/scripts/setup.php HTTP/1.0" 500 193 "-" "ZmEu" "-" "127.0.0.1"
    

    There were roughly 50 requests in the same second, although there were many more in later instances.

    Generally an entry like that wouldn't be too big a concern, as automated scans aren't exactly a rare occurrence. But note the source IP, 127.0.0.1: the requests were originating from my server!

    I noticed the entries after receiving an HTTP 500 from my site (so I looked at the logs to try and find the cause). There were also (again, a lot of) corresponding entries in the error log:

    2014/06/01 13:04:08 [alert] 19693#0: accept4() failed (24: Too many open files)
    

    After investigation, it turned out not to be a compromise. This post details the cause of these entries.
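
    To get a feel for how many of these requests arrive per second, the access log can be tallied with a short script. This is a minimal sketch that assumes the log layout shown above (client IP, bracketed timestamp, quoted request, status). The function name is my own, and the third sample line is made up purely for contrast.

```python
import re
from collections import Counter

# Matches the layout shown above: client IP, [timestamp], "request", status.
LINE_RE = re.compile(r'^(\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) ')


def requests_per_second(lines, source_ip="127.0.0.1"):
    """Count requests from source_ip, keyed by the (one-second) timestamp."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group(1) == source_ip:
            counts[match.group(2)] += 1
    return counts


sample = [
    '127.0.0.1 [01/Jun/2014:13:04:12 +0100] "GET /myadmin/scripts/setup.php HTTP/1.0" 500 193 "-" "ZmEu" "-" "127.0.0.1"',
    '127.0.0.1 [01/Jun/2014:13:04:12 +0100] "GET /myadmin/scripts/setup.php HTTP/1.0" 500 193 "-" "ZmEu" "-" "127.0.0.1"',
    # Made-up entry from a different client, which should be ignored.
    '192.0.2.10 [01/Jun/2014:13:04:13 +0100] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0" "-" "192.0.2.10"',
]

# The busiest second for requests that appear to come from the server itself.
busiest = requests_per_second(sample).most_common(1)
```

    On a real log, passing an open file object as `lines` works the same way, since the function only iterates over it.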

  • Recovering from InnoDB Page Corruption: A Post Mortem

    I recently wrote about how to Repair a database following InnoDB Page corruption.

    This post is a post-mortem of the circumstances that led to the corruption prompting that post. Some of it comes from log observation, other elements are from re-creating the circumstances in a VM.