A comparative analysis of search terms used on bentasker.co.uk and its Onion

My site has had search pretty much since its very inception. However, it is used relatively rarely - most visitors arrive at my site via a search engine, view whatever article they clicked on, perhaps follow related internal links, but otherwise don't feel the need to run manual searches (analysis in the past showed that use of the search function dropped dramatically when article tags were introduced).

But, search does get used. I originally thought it'd be interesting to look at whether searches were being placed for things I could (but don't currently) provide.

Search term analysis is interesting/beneficial because the terms represent things that users are actively looking for. Page views might be accidental (a user clicked your result in Google but it wasn't what they needed), but search terms indicate exactly what they're trying to get you to provide.

As an aside to that though, I thought it'd be far more interesting to look at which categories search terms fall under, and how the distribution across those categories varies depending on whether the search was placed against the Tor onion or the clearnet site.

 

This post details some of those findings, some of which were fairly unexpected (all images are clicky)

If you've unexpectedly found this in my site results, then congratulations: you've probably searched for a term surprising enough that I included it in this post.


Read more…

Nginx logs two upstream statuses for one upstream

I'm a big fan (and user) of Nginx.

Just occasionally, though, you'll find something that looks a little odd - despite having quite a simple underlying explanation.

This one falls firmly into that category.

When running Nginx using ngx_http_proxy_module (i.e. using proxy_pass), you may sometimes see two upstream status codes recorded (specifically, in the variable $upstream_status) despite only having a single upstream configured.

So, assuming a log format of

'$remote_addr\t-\t$remote_user\t[$time_local]\t"$request"\t'
'$status\t$body_bytes_sent\t"$http_referer"\t'
'"$http_user_agent"\t"$http_x_forwarded_for"\t"$http_host"\t$up_host\t$upstream_status';

You may, for example, see a log line like this

1.2.3.4	-	-	[11/Jun/2020:17:26:01 +0000]	"GET /foo/bar/test/ HTTP/2.0"	200	60345109	"-"	"curl/7.68.0"	"-"	"testserver.invalid"	storage.googleapis.com	502, 200

Note the two comma-separated status codes at the end of the line: we observed two different upstream statuses (though only the 200 was passed downstream).

This documentation helps explain why this happens.
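In short, each attempt Nginx makes against an upstream contributes its own entry to $upstream_status, so two values means two attempts. Whether a failed attempt gets retried is governed by proxy_next_upstream, so a config along these lines - a rough sketch rather than the exact config behind the log line above - can produce a "502, 200" pair when the first attempt fails and the retry succeeds:

location /foo/ {
    # The hostname resolves to several addresses, so Nginx has more than one
    # peer it can try, even though only a single upstream is configured
    proxy_pass https://storage.googleapis.com/;

    # Retry the next peer on connection errors, timeouts or a 502 response;
    # the status of every attempt ends up comma-separated in $upstream_status
    proxy_next_upstream error timeout http_502;
}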

 

Read more…

Onion Location Added to Site

Bentasker.co.uk has been multihomed on Tor and the WWW for over 5 years now.

Over that time, things have changed slightly - at first, although the site was multi-homed, the means of discovery really was limited to noticing the "Browse via Tor" link in the privacy bar on the right hand side of your screen (unless you're on a mobile device...).

When Tor Browser pulled in Firefox's changes to implement support for RFC 7838 Alt-Svc headers, I added support for that too. Since that change, quite a number of Tor Browser Bundle users have connected to me via Onion Services without even knowing they had that additional protection (and were no longer using exit bandwidth).

The real benefit of the Alt-Svc method, other than it being transparent, is that your browser will still receive and validate the SSL cert for my site - so you know you're hitting the correct endpoint, rather than some imposter wrapper site.

Which brings us to today.

Tor have released a new version - 9.5 - of the Tor Browser Bundle, which implements new functionality: Onion Location.
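
The mechanism itself is just an HTTP response header. As a rough sketch, on an Nginx-fronted site it'd be set with something like the following (the onion address here is obviously a placeholder):

# Advertise the onion service equivalent of the current page to Tor Browser 9.5+
# (only honoured when the clearnet page is served over HTTPS)
add_header Onion-Location http://youronionaddressgoeshere.onion$request_uri;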

 

Read more…

Cynet 360 Uses Insecure Control Channels

For reasons I won't go into here, I recently took a quick look over the "Cynet 360" agent, essentially an endpoint protection mechanism used as part of Cynet's "Autonomous Breach Protection Platform".

Cynet 360 bills itself as "a comprehensive advanced threat detection & response cybersecurity solution for for [sic] today's multi-faceted cyber battlefield". 

Which is all well and good, but what I was interested in was whether it could potentially weaken the security posture of whatever system it was installed on.

I'm a Linux bod, so the only bit I was interested in, or looked at, was the Linux server installer.

I ran the experiment in a VM which is essentially a clone of my desktop (minus things like access to real data etc).

Where you see [my_token] or (later) [new_token] in this post, there's actually a 32 byte alphanumeric token. [sync_auth_token] is an 88 byte token (it actually looks to be a hex encoded representation of a base64'd binary string).

Read more…

Writing (and backdooring) a ChaCha20 based CSPRNG

Recently I've been playing around with the generation of random numbers.

Although it's not quite ready yet, one of the things I've built is a source of (hopefully) random data. The writeup on that will come later.

But an interesting distraction (and, in some ways, the natural extension) is to then create a Pseudo-Random Number Generator (PRNG) seeded by data from that random source.

I wanted it to be (in principle) Cryptographically Secure (i.e. for this to be a CSPRNG). In practice it isn't really (we'll explore why later in this post). I also wanted to implement what Bernstein calls "Fast Key Erasure", along with some techniques discussed by Amazon in relation to their S2N implementation.
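
To give a rough idea of what "Fast Key Erasure" looks like in practice, here's a minimal Python sketch of the general pattern - it's illustrative rather than the implementation described in this post, uses the pyca/cryptography package, and (for brevity) seeds from os.urandom rather than my own random source:

import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms

class FastKeyErasureRNG:
    def __init__(self, seed=None):
        # 32 byte ChaCha20 key and 16 byte nonce
        self._key = seed or os.urandom(32)
        self._nonce = os.urandom(16)

    def read(self, n):
        # Generate enough keystream for the caller *plus* a replacement key/nonce
        cipher = Cipher(algorithms.ChaCha20(self._key, self._nonce), mode=None)
        block = cipher.encryptor().update(b"\x00" * (n + 48))

        # Fast key erasure: the key/nonce are overwritten before any output is
        # returned, so a later compromise of state can't recover earlier output
        self._key, self._nonce = block[:32], block[32:48]
        return block[48:]

rng = FastKeyErasureRNG()
print(rng.read(16).hex())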

In this post I'll be detailing how my RNG works, as well as looking at what each of those techniques does to the numbers being generated.

I'm not a cryptographer, so I'm going to try and keep this relatively light-touch, if only to avoid highlighting my own ignorance too much. Although this post (as a whole) has turned out to be quite long, hopefully the individual sections are relatively easy to follow.


Read more…

The Pitfalls of Building an Elasticsearch backed Search Engine

There are a ton of articles on the internet describing how to go about building a self-hosted full-text search engine using ElasticSearch.

Most of the tutorials I read describe a fairly simple process: install some software, write a little bit of code to insert and extract data.

The underlying principle really is:

  1. Install and set up ElasticSearch
  2. Create a spider/crawler or otherwise insert your content into Elasticsearch
  3. Create a simple web interface to submit searches to Elasticsearch
  4. ???
  5. Profit

At the end of it you get a working search engine. The problem is, that search engine is crap.

It's not that it can't be saved (it definitely can), so much as that most tutorials seem not to give any thought to improving the quality of search results - it returns some results, and that's considered good enough.
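
To give a single concrete example of the kind of thing that usually gets glossed over: the gap between "returns some results" and "returns useful results" can often be narrowed with a little query-time tuning. A minimal sketch, assuming an index called pages with title and body fields (both names are illustrative, and this isn't necessarily how my implementation queries things):

import requests

# Boost matches in the title over matches buried in the body, and require every
# term to match rather than any single one - two cheap relevance improvements
query = {
    "query": {
        "multi_match": {
            "query": "raspberry pi kiosk",
            "fields": ["title^3", "body"],
            "operator": "and"
        }
    }
}

resp = requests.post("http://localhost:9200/pages/_search", json=query)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])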

Over the years, I've built up a lot of internal notes, JIRA tickets etc., so for years I ran a self-hosted internal search engine based upon Sphider. Its code quality is somewhat questionable, and it's not been updated in years, but it sat there and it worked.

The time came to replace it, and experiments with off-the-shelf things like yaCy didn't go as well as hoped, so I hit the point where I considered self-implementing. Enter ElasticSearch, and enter the aforementioned Internet tutorials.

The intention of this post isn't to detail the process I followed, but really to document some of the issues I hit that don't seem (to me) to be too well served by the main body of existing tutorials on the net.

The title of each section is a clicky link back to itself.



Read more…

Building a Raspberry Pi Based Music Kiosk

I used to use Google's Play Music to host and play our music collection.

However, years ago, I got annoyed with Google's lacklustre approach to shared collections, and odd approach to VMs. So, our collection migrated into a self-hosted copy of Subsonic.

Other than a few minor frustrations, I've never looked back.

I buy my music through whatever music service I want, download it onto the NFS share and Subsonic picks up on it following the next library scan - we can then stream it to our phones (using DSub), to the TV (via a Kodi plugin) or to a desktop (generally, using Jamstash). In the kitchen, I tend to use a bluetooth speaker with the tablet that I use to look up recipes.

However, we're planning on repurposing a room into a puzzle and playroom, so I wanted to put some dedicated music playback in there.

Sonos devices have Subsonic support, but (IMO) that's a lot of money for something that's not great quality, and potentially has an arbitrarily shortened lifetime.

So, I decided to build something myself using a Raspberry Pi, a touchscreen and Chromium in kiosk mode. To keep things simple, I've used the audio out jack on the Pi, but if over time I find the quality isn't what I hope, it should just be a case of connecting a USB soundcard to resolve it.
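
For the avoidance of doubt, "Chromium in kiosk mode" amounts to little more than launching the browser full-screen against the player's URL - roughly the following (the URL is a placeholder for wherever your player lives):

# Launch Chromium full-screen, with no window chrome or error pop-ups
chromium-browser --kiosk --noerrdialogs --disable-infobars http://music.example/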

There's no reason you shouldn't be able to follow almost exactly the same steps if you're using Ampache or even Google Play Music as your music source.

Read more…

Recovering files from SD Cards and How to protect yourself

I was working on writing this up anyway, but as the UK Government's lawyers have recommended weakening protections around the Police's ability to search phones, I thought today might be a good day to get a post up about the protection of content on SD cards.

 

I never seem to have a micro-SD card to hand when I need one - they're generally all either in use or missing.

I tinker with Raspberry Pis quite a lot, so I ordered a job lot of used micro-SDs from eBay so that I could just have a pot of them sat there.

I thought it'd be interesting to see how many of the cards had been securely erased, and by extension what nature of material could wind up being restored off them.

Part of the point of this exercise was also to bring my knowledge of recovery back up to date: although I've done it from time to time, I've not really written anything on it since 2010 (An easier method for recovering deleted files on Linux, and the much earlier Howto recover deleted filenodes on an ext2 filesystem - yes, so old that it's ext2!).

In this post I'll walk through how I (trivially) recovered data, as well as an overview of what I recovered. I'll not be sharing any of the recovered files in any identifiable form - they are, after all, not my files to share.

I'll also detail a few techniques I tested for securely erasing the cards so that the data could no longer be recovered.
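
To give an idea of just how trivial "trivially" is, the general approach - sketched below with example device names, and not necessarily the exact tools or options covered in the post - is only a handful of commands:

# Image the card rather than working against it directly
dd if=/dev/mmcblk0 of=card.img bs=4M status=progress

# Carve recoverable files out of the image into ./recovered
photorec /d recovered/ card.img

# Later: overwrite the whole card so there's nothing left to carve
dd if=/dev/zero of=/dev/mmcblk0 bs=4M status=progress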


Read more…

Resolving GFID mismatch problems in Gluster (RHGS) volumes

Gluster is a distributed filesystem. I'm not a massive fan of it, but most of the alternatives (like Ceph) suffer from their own set of issues, so it's no better or worse than the competition for the most part.

One issue that can sometimes occur is Gluster File ID (GFID) mismatch following a split-brain or similar failure.

When this occurs, running ls -l in a directory will generally lead to I/O errors and/or question marks in the output

ls -i
ls: cannot access ban-gai.rss: Input/output error
? 2-nguoi-choi.rss ? game.rss

If you look within the brick's log (normally under /var/log/glusterfs/bricks) you'll see lines reporting Gfid mismatch detected 

[2019-12-12 12:28:28.100417] E [MSGID: 108008] [afr-self-heal-common.c:392:afr_gfid_split_brain_source] 0-shared-replicate-0: Gfid mismatch detected for <gfid:31bcb959-efb4-46bf-b858-7f964f0c699d>/ban-gai.rss>, 1c7a16fe-3c6c-40ee-8bb4-cb4197b5035d on shared-client-4 and fbf516fe-a67e-4fd3-b17d-fe4cfe6637c3 on shared-client-1.
[2019-12-12 12:28:28.113998] W [fuse-resolve.c:61:fuse_resolve_entry_cbk] 0-fuse: 31bcb959-efb4-46bf-b858-7f964f0c699d/ban-gai.rss: failed to resolve (Stale file handle)

This documentation details how to resolve GFID mismatches.
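
As a rough outline, one common manual fix boils down to removing the offending copy and its .glusterfs hardlink from the brick holding the bad copy, then letting a heal pull the good copy back. The brick paths below are examples, and you'll need to work out which brick holds the bad copy first:

# On the brick holding the bad copy, remove the file and its GFID hardlink
# (the GFID comes from the mismatch line in the brick log)
rm /bricks/shared/ban-gai.rss
rm /bricks/shared/.glusterfs/1c/7a/1c7a16fe-3c6c-40ee-8bb4-cb4197b5035d

# Then trigger a heal so the surviving copy is replicated back out
gluster volume heal shared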

 

Read more…