• The Pitfalls of Building an Elasticsearch-backed Search Engine

    There are a ton of articles on the internet describing how to go about building a self-hosted fulltext search engine using Elasticsearch.

    Most of the tutorials I read describe a fairly simple process: install some software, then write a little bit of code to insert and extract data.

    The underlying principle really is (a rough sketch of steps 2 and 3 follows the list):

    1. Install and set up Elasticsearch
    2. Create a spider/crawler or otherwise insert your content into Elasticsearch
    3. Create a simple web interface to submit searches to Elasticsearch
    4. ???
    5. Profit
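
    To make that concrete, steps 2 and 3 boil down to something like the sketch below. It talks to Elasticsearch's REST API using Python's requests library; the "pages" index, the field names and the localhost:9200 endpoint are assumptions made purely for illustration, not something any particular tutorial (or my eventual setup) prescribes.

        import requests

        ES = "http://localhost:9200"  # assumed local Elasticsearch node

        # Step 2: index a crawled page (PUT /<index>/_doc/<id> creates or overwrites a document)
        doc = {
            "title": "An example page",
            "url": "https://www.example.com/page",
            "content": "whatever text the crawler extracted from the page",
        }
        requests.put(f"{ES}/pages/_doc/1", json=doc).raise_for_status()

        # Step 3: the query a simple web interface would submit (a basic match query)
        resp = requests.post(
            f"{ES}/pages/_search",
            json={"query": {"match": {"content": "example search terms"}}},
        )
        resp.raise_for_status()

        # Print the hits in relevance order
        for hit in resp.json()["hits"]["hits"]:
            print(hit["_score"], hit["_source"]["url"])

    Wire a crawler up to the first half and a search form up to the second, and you've essentially covered steps 1 to 5 - which is exactly why the quality of the results deserves more attention than most tutorials give it.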

    At the end of it you get a working search engine. The problem is, that search engine is crap.

    It's not that it can't be saved (it definitely can), so much as that most tutorials don't seem to give any thought to improving the quality of search results - the engine returns some results and that's treated as good enough.

    Over the years I've built up a lot of internal notes, JIRA tickets etc., so for a long time I ran a self-hosted internal search engine based on Sphider. Its code quality is somewhat questionable, and it hasn't been updated in years, but it sat there and it worked.

    The time came to replace it, and experiments with off-the-shelf options like YaCy didn't go as well as hoped, so I hit the point where I considered rolling my own. Enter Elasticsearch, and enter the aforementioned Internet tutorials.

    The intention of this post isn't to detail the process I followed, but to document some of the issues I hit which don't seem (to me) to be particularly well covered by the existing tutorials on the net.

    The title of each section is a clicky link back to itself.