Ben Tasker's Blog

The Pitfalls of Building an Elasticsearch backed Search Engine

There are a ton of articles on the internet describing how to go about building a self-hosted full-text search engine using Elasticsearch.

Most of the tutorials I read describe a fairly simple process: install some software, then write a little bit of code to insert and extract data.

The underlying principle really is:

  1. Install and set up Elasticsearch
  2. Create a spider/crawler or otherwise insert your content into Elasticsearch
  3. Create a simple web interface to submit searches to Elasticsearch
  4. ???
  5. Profit

At the end of it you get a working search engine. The problem is, that search engine is crap.

It's not that it can't be saved (it definitely can), so much as that most tutorials don't seem to give any thought to improving the quality of search results - the engine returns some results, and that's treated as good enough.

Over the years, I've built up a lot of internal notes, JIRA tickets etc, so for years I ran a self-hosted internal search engine based upon Sphider. Its code quality is somewhat questionable, and it's not been updated in years, but it sat there and it worked.

The time came to replace it, and experiments with off-the-shelf things like YaCy didn't go as well as hoped, so I hit the point where I considered implementing something myself. Enter Elasticsearch, and enter the aforementioned Internet tutorials.

The intention of this post isn't to detail the process I followed, but really to document some of the issues I hit that don't seem (to me) to be too well served by the main body of existing tutorials on the net.

The title of each section is a clicky link back to itself.

Read more ...

Recovering files from SD Cards and How to protect yourself

I was working on writing this up anyway, but as the UK Government's lawyers have recommended weakening protections around the Police's ability to search phones, I thought today might be a good day to get a post up about the protection of content on SD cards.

I never seem to have a micro-SD card to hand when I need one: they're generally all either in use or missing.

I tinker with Raspberry Pis quite a lot, so I ordered a job lot of used micro-SDs from eBay so that I could just have a pot of them sat there.

I thought it'd be interesting to see how many of the cards had been securely erased, and by extension what kind of material could wind up being restored off them.

Part of the point of this exercise was also to bring my knowledge of recovery back up to date. Although I've done it from time to time, I've not really written anything on it since 2010 (An easier method for recovering deleted files on Linux, and the much earlier Howto recovered deleted filenodes on an ext2 filesystem - yes, so old that it's ext2!).

In this post I'll walk through how I (trivially) recovered data, as well as an overview of what I recovered. I'll not be sharing any of the recovered files in any identifiable form - they are, after all, not my files to share.

I'll also detail a few techniques I tested for securely erasing the cards so that the data could no longer be recovered.

Read more ...

Spamhaus still parties like it's 1999

I recently had visibility of a Spamhaus Block List (SBL) listing notification on the basis of malware being detected within a file delivered via HTTP/HTTPS.

As part of the report, they provide the affected URL (for the sake of this post we'll say it's https://foo.example.com/app.exe) along with details of the investigation they've done.

Ultimately, that investigation is done in order to boil things down to a set of IPs to add to their list.

Concerningly, this is literally just:

dig +short foo.example.com

Which gives them output of the form:

CNAME1
CNAME2
1.2.3.4
4.5.6.7

They then run a reverse lookup (using nslookup) on those IP addresses in order to identify the ISP. The IPs are added to the SBL, and a notification sent to the associated ISP.
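
As a rough illustration, the steps described above amount to something like the following Python sketch. It only parses output of the shape shown above and works out the name that a reverse (PTR) lookup would query; it doesn't touch the network. The function names are my own invention for this example, not anything Spamhaus publish, and I'm assuming that any line which isn't a valid IP address is a CNAME.

```python
import ipaddress


def split_dig_output(output):
    """Split `dig +short` style output into (cnames, ips).

    Assumption: any non-empty line that does not parse as an IP
    address is treated as a CNAME in the resolution chain.
    """
    cnames, ips = [], []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            ipaddress.ip_address(line)
            ips.append(line)
        except ValueError:
            # Not an IP, so assume it's a CNAME (drop any trailing dot)
            cnames.append(line.rstrip("."))
    return cnames, ips


def reverse_lookup_name(ip):
    """Return the DNS name that a reverse (PTR) lookup for `ip` would query."""
    return ipaddress.ip_address(ip).reverse_pointer


if __name__ == "__main__":
    cnames, ips = split_dig_output("CNAME1\nCNAME2\n1.2.3.4\n4.5.6.7\n")
    for ip in ips:
        print(ip, "->", reverse_lookup_name(ip))
```

Note that nothing in this flow looks at the CNAMEs at all: they're discarded, and only the IPs at the end of the chain are carried forward.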

In this case, the URL was a legitimate file, though it had been bundled with some software falling under the Potentially Unwanted Application (PUA) category. The point of this post, though, is not to argue about whether it should have been considered worthy of addition.

The issue is that Spamhaus' investigation techniques seem to be stuck in the last century, causing potentially massive collateral damage whilst failing to actually protect against the very file that triggered the listing in the first place.

In case you're wondering why Spamhaus are looking for malware delivery over HTTP/HTTPS, it's because the SBL has URI blocking functionality. When a spam filter (like SpamAssassin) detects a URL in a mail, it can check whether the hosting domain resolves back to an IP in the SBL, and mark the mail as spam if it does. In effect, this limits the ability to spread malware via links in email, which is undoubtedly a nice idea.
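
For illustration, DNS blocklists are conventionally queried by reversing the octets of an IPv4 address and appending the list's zone. A minimal Python sketch of building those query names is below. The function names are hypothetical, it only handles IPv4, and I'm assuming `sbl.spamhaus.org` is the zone being consulted; actually performing the DNS lookup (and interpreting the answer) is deliberately left out.

```python
import ipaddress


def dnsbl_query_name(ip, zone="sbl.spamhaus.org"):
    """Build the DNS name a filter would look up to check `ip` against `zone`.

    Assumption: the conventional DNSBL form, i.e. the IPv4 octets
    reversed and prefixed to the zone. IPv6 is not handled here.
    """
    addr = ipaddress.ip_address(ip)
    if addr.version != 4:
        raise ValueError("this sketch only handles IPv4 addresses")
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def query_names_for_host(resolved_ips, zone="sbl.spamhaus.org"):
    """Return one query name per IP that a URL's hostname resolved to."""
    return [dnsbl_query_name(ip, zone) for ip in resolved_ips]


if __name__ == "__main__":
    for name in query_names_for_host(["1.2.3.4", "4.5.6.7"]):
        print(name)
```

The thing to notice is that the check is keyed purely on IP address: every hostname which happens to resolve to a listed IP gets the same answer.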

 

Just to note, although they make it difficult to identify how to contact them about this kind of thing, I have attempted to contact Spamhaus about this (including via Twitter).

It also seems only fair (to Spamhaus) to note that I also saw a Netcraft incident related to the same file, and they don't even provide the investigative steps they followed. So not only might Netcraft be falling for the same traps, but there's a lack of transparency preventing issues from being found and highlighted.

Read more ...

(Hopefully) Rescuing a bottle of drink

With the change in weather, I'm having to take painkillers a lot more regularly, which means I can't drink.

I thought, as an option, I'd explore some non-alcoholic spirits - there seems to be quite a market for them, so there must be some good ones out there.

I did have some luck in finding some "gin". However, whilst searching, I stumbled upon "Xachoh Blend No. 7 Non Alcoholic Spirit", which lists the following tasting notes:

Xachoh Blend No. 7 has a warm and richly spiced aroma. The prominent flavours of ginger root and blades of mace strike a perfect blend of warmth, spice and a subtle fruitiness. The luxurious aroma of cinnamon quills brings sweetness to the nose and palate, balancing perfectly with saffron & the other spices. Dark crystal malt adds delicious toasted notes and a real depth of flavour, similar to that of a well-aged dark spirit. All of these rich and dark flavours are balanced by a refreshing acidity of sumac on the palate, leaving the way for a long finish and an eagerness for that next sip.

Sounds good eh? As with anything on Amazon, reviews were incredibly mixed: some love it, some hate it.

So, as it sounded good, I took a risk and ordered a bottle.

It arrived this morning.

So, having been looking forward to its arrival, I had a little taste.

It's got a nice and very varied aroma to it. But things go downhill once you get it to your mouth - if it was just a little less watery, I'd probably be looking to add Ribena to it. 

Disappointing doesn't cover it: the only trace of flavour it has is a somewhat unpleasant aftertaste. Unfortunately, if you mix it with ginger ale, it transpires that all you get is ginger ale with a horrendous aftertaste.

The reason why lies on the back label (and in fairness *is* shown on the Amazon listing):

Free from:

  • Alcohol
  • Extracts
  • Gluten
  • Sugar
  • Calories
  • Sweeteners

With the exception of a tiny bit of salt, the nutritional information is just 0's. This stuff is literally water with some Barley Malt and a few flavourings.

It's "natural", it's Gluten Free, it's vegan, it's... it's fucking shit and it's destined for the drain. Yuck.

But, rather than pour a £30 bottle of water down the drain, I thought I'd have a go at improving it first - worst comes to worst I'm just pouring a slightly more expensive bottle of water down the drain, and it's not like I could realistically make it much worse.

As I'm extremely unlikely to try making this again, and there's not a lot of room there for snark, I figured this was better placed here than on my recipes site.

Read more ...

Twitter Screws Up With Data It Shouldn't Hold

I recently had a (NSFW) grumble about Twitter. Part of that grumble was about the fact that Twitter insist you provide a mobile phone number in order to re-instate your account after a suspension.

As part of my appeal against the suspension I noted that that's arguably not GDPR compliant - a phone number is (undoubtedly) PII, and is not required in order to provide the service. For Twitter to hold that number requires consent, and it's unlawful for them to withhold the service if consent is not given for non-essential data processing.

Part of the reason for my objection was that Social Media companies (in the form of Facebook) have already proven they cannot be trusted with things like mobile phone numbers.

Presumably Twitter weren't happy with the fact that I needed to use Facebook as an example, as they've now gone ahead and had a data processing screw up of their own.

Read more ...