Migrating from Joomla! to Nikola

I've used Joomla for years: the first incarnation of BenTasker.co.uk used Joomla 1.5 back in 2011, then 2.5, with an eventual move to the 3.x branch of Joomla in 2013.

Recently though, I made the decision to migrate from Joomla! to Nikola, a change that went live in the last few days.

This post details how, and why.

Why

I'm a big believer in using the best tool for the job, and my needs/requirements have changed quite significantly in the 10 years since I first started using Joomla for this site.

Back then, I was working for Virya Technologies: a company that, amongst other things, specialised in Joomla! website design and hosting. I was launching a new site as a successor to my original benscomputer.no-ip.org and it made sense to use the same tool as I was working with during the daytime.

Using a dynamic CMS allowed me to do a range of things, including hosting a shop selling a broad range of downloadables (including Joomla extensions, stock images and the like).

But in January 2015, the EU changed its VAT rules in relation to place of supply, which meant either closing the shop or taking on some fairly onerous paperwork requirements. I ultimately decided that it wasn't worth the effort and closed the shop.

What this meant, in practical terms, is that the content of my site had returned to being largely static - there was no longer anything user-specific being displayed. Practicalities of management aside, the site could have worked as a collection of HTML files on disk.

Security and Performance

In security terms, the difference between dynamically generating pages and serving static pages is night and day.

The dynamic approach means that you're exposing interpreted code to the internet at large, whilst with static pages your attack surface is limited to that of your web server (which is, of course, also present in a dynamic site).

In practice, it's not just the CMS you're likely to be exposing either. My Joomla! site had a Web-Application-Firewall built into it (via the fabulous Akeeba Admin Tools) and I had another WAF running at the edge.

A dynamic site also necessitates running additional services (a database server etc.), which have their own security surface.

Dynamic sites also mean suffering the relative performance impact of generating pages on demand (though I've mitigated this via extensive use of caching for quite some time).

The truth is that, by all technical measures, I could (and should) have migrated to something else - something that's niggled at me for quite some time.

But, apathy is one hell of a drug: there's a non-negligible cost (in terms of effort) to migrating out of any CMS, I liked the template that I had, and everything was set up the way I wanted. Put simply, staying where I was cost me nothing, whilst moving to something else required planning and care.

Joomla! 4

The release of Joomla 4 changed that balance.

Upgrading from Joomla! 3 to Joomla! 4 is pretty straightforward (certainly nothing like the hellish procedure of getting from 1.5 to 2.x), but as with any major release there are associated considerations:

  • I'd need to find a new template (or port my existing one to Joomla! 4)
  • I'd need to make sure all of my extensions worked on Joomla! 4 (porting/replacing any that didn't)

It's only a couple of tasks, but they're both quite indeterminate in terms of time demand - there's no way to know if they're going to be quick/easy or long/hard until you've started down that path.

I've always found template selection, in particular, extremely hard - I struggle to find something I like, but also often can't identify what I don't like in order to figure out how to tweak it.

So, the move to Joomla! 4 meant an indeterminate amount of work for me.

The need to do that work meant that, suddenly, the choice between using "the right tool" and continuing with Joomla! was much more evenly balanced.

Ultimately, the security argument won out - it simply doesn't make sense for me to expose PHP to the world when I can, effectively, serve my site from static files.

This isn't a slight on Joomla! - if I needed a CMS I'd still choose it over WordPress or Drupal; it's just that a dynamic CMS isn't the best fit for my use case.

Enter Nikola

Nikola is a Static Site Generator (SSG).

With Nikola, you manage your content in text files (this post, at its core, is just a Markdown file - though Nikola supports a range of different formats). Just as with Joomla!, the content and template are separate entities - if I want to change something in the template, then I do so and tell Nikola to rebuild the site:

nikola build

So, I get all the content-management benefits of a dynamic CMS, whilst keeping all the security and speed advantages of serving static files.

One of the things I like about Joomla! is that I can easily extend it (because it's written in PHP). Nikola is written in Python, so I get much the same benefit there.

The more I thought about it, the more it became the only path I could really justify to myself.


Migrating Content

That, of course, left me with the challenge of exactly how to go about migrating my content into Nikola. With the site holding content spanning 17 years, the idea of having to import it all manually wasn't exactly appetising.

However, Nikola supports a range of plugins, including one called import_page. It's fairly generic and imports more of the page than I wanted, but I figured I could customise it so that it'd better understand my site template and only get the information I needed.

After a bit of testing, I settled on the following:

# -*- coding: utf-8 -*-

# Copyright © 2015 Roberto Alsina, B Tasker and others

# Permission is hereby granted, free of charge, to any
# person obtaining a copy of this software and associated
# documentation files (the "Software"), to deal in the
# Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish,
# distribute, sublicense, and/or sell copies of the
# Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice
# shall be included in all copies or substantial portions of
# the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY
# KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
# WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
# PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS
# OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR
# OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
# OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
# SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

from __future__ import unicode_literals, print_function

import codecs

try:
    import libextract.api
except ImportError:
    libextract = None
import lxml.html
import requests
import sys

from lxml import etree
from nikola.plugin_categories import Command
from nikola import utils

LOGGER = utils.get_logger('import_page', utils.STDERR_HANDLER)


doc_template = '''<!--
.. title: {title}
.. slug: {slug}
.. date: {date}
.. updated: {mdate}
.. tags: {tags}
.. category: {category}
.. author: Ben Tasker
.. previewimage: {preview_image}
.. description: {description}
-->

{pageimage}
{content}
'''


class CommandImportPage(Command):
    """Import a Page."""

    name = "import_bentasker_page"
    needs_config = False
    doc_usage = "[options] page_url [page_url,...]"
    doc_purpose = "import arbitrary web pages"

    def _execute(self, options, args):
        """Import a Page."""
        if libextract is None:
            utils.req_missing(['libextract'], 'use the import_page plugin')
        for url in args:
            self._import_page(url)

    def _import_page(self, url):
        r = requests.get(url)
        if 199 < r.status_code < 300:  # Got it
            # Use the page's title
            doc = lxml.html.fromstring(r.content)
            title = doc.find('*//title').text.replace(" - www.bentasker.co.uk","").strip()

            # Pull the original slug from the input URL
            slug = self.get_slug(url)

            #nodes = list(libextract.api.extract(r.content))

            article_body = doc.xpath("//div[@itemprop='articleBody']")[0]

            # Rewrite internal links
            for element, attribute, link, pos in article_body.iterlinks():
                if attribute == "href" and link[0] == "/":
                    # It's a link and it's relative
                    #print(f"found {link}")
                    newlink = self.process_url(link)
                    element.set('href', newlink)


            # Insert a Readmore (INDEX_TEASERS in Nikola)
            for element in article_body.xpath("//div[@itemprop='articleBody']/h3"):
                print(f"got {element.text}")
                c = etree.Comment(" TEASER_END ")
                element.addprevious(c)
                break


            meta_desc_e = doc.xpath("//meta[@name='description']")

            if len(meta_desc_e) > 0 and len(meta_desc_e[0].get("content")) > 0:
                # Use that
                meta_desc = meta_desc_e[0].get("content").replace("\n"," ").replace("\r","")
            else:
                # Turn the first few sentences into the meta description
                #print("Generating meta desc")
                b = []
                for t in article_body.itertext():
                    b.append(t)

                inp = "\n".join(b).strip().replace("\r","").replace("\n","")
                meta_desc = ".".join(inp.split(".")[0:3])
                del inp


            # Get tags
            tag_list = doc.xpath("//a[@itemprop='keywords relatedLink']")

            tags = []
            for tag in tag_list:
                tags.append(tag.text.strip())

            tags = ','.join(tags)

            # Get publish date
            # Need to convert it into the format Nikola expects
            pub_meta = doc.xpath("//meta[@itemprop='datePublished']")[0]
            pub_date = pub_meta.get('content')
            pdate = self.convert_datestring(pub_date)

            # Now do the same with modification date
            pub_meta = doc.xpath("//meta[@itemprop='dateModified']")[0]
            mdate_date = pub_meta.get('content')
            mdate = self.convert_datestring(mdate_date)

            # Get category details
            category = doc.xpath("//a[@itemprop='articleSection']")[0].text

            # There may also be a parent category listed
            parent_cat = False
            parent_cat_list = doc.xpath("//dd[@class='parent-category-name']/a")

            if len(parent_cat_list) > 0:
                parent_cat = parent_cat_list[0].text
                category = f"{parent_cat}/{category}"


            # Get the sharing image
            preview_image = ''
            img_list = doc.xpath("//meta[@itemprop='image']")
            if len(img_list) > 0:
                preview_image = img_list[0].get("content").replace("https://www.bentasker.co.uk/", "/")


            pageimage = ""
            if "/photos-archive/" in url:
                print("Photos page")
                # We need special handling for this because of how the images are embedded
                img_ele = doc.xpath("//div[@class='img-fulltext-']/img[@itemprop='image']")[0]

                pageimage = "<!-- TEASER_END -->\n{}".format(
                    lxml.html.tostring(img_ele, encoding='utf8', method='html', pretty_print=True).decode('utf8')
                    )


            document = doc_template.format(
                title=title,
                date=pdate,
                slug=slug,
                tags=tags,
                category=category.lower(),
                preview_image=preview_image,
                mdate=mdate,
                description=meta_desc,
                pageimage=pageimage,
                content=lxml.html.tostring(article_body, encoding='utf8', method='html', pretty_print=True).decode('utf8')
            )

            with codecs.open(slug + '.html', 'w+', encoding='utf-8') as outf:
                outf.write(document)

        else:
            LOGGER.error('Error fetching URL: {}'.format(url))

    def get_slug(self, url):
        ''' Extract the existing slug from the url
        '''

        url_parts = url.split("/")
        return url_parts[-1].lower()


    def process_url(self,link):
        ''' Take a relative link and process it into what we expect the Nikola URL to be
        '''

        # Force lowercase
        link = link.lower()

        if link.startswith("/images/") or link.startswith("/media/"):
            return link

        return f"/posts{link}.html"


    def convert_datestring(self, pub_date):
        ''' Take a YYYYMMDDTHHMMSS date string
            and convert it to the format Nikola uses
        '''
        y = pub_date[0:4]
        m = pub_date[4:6]
        d = pub_date[6:8]
        H = pub_date[9:11]
        M = pub_date[11:13]
        S = pub_date[13:15]
        return f"{y}-{m}-{d} {H}:{M}:{S} UTC"

The plugin fetches the page content and writes it into a text file in the current working directory.
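Incidentally, the index slicing in convert_datestring can be done a little more defensively with the standard library's datetime - a rough equivalent (the sample timestamp is made up for illustration):

```python
from datetime import datetime


def convert_datestring(pub_date):
    """Convert a YYYYMMDDTHHMMSS string into the format Nikola uses.

    Equivalent to the slicing approach in the plugin above, except that
    strptime raises a ValueError on malformed input rather than silently
    producing a nonsense date.
    """
    parsed = datetime.strptime(pub_date, "%Y%m%dT%H%M%S")
    return parsed.strftime("%Y-%m-%d %H:%M:%S UTC")


print(convert_datestring("20211017T093000"))  # → 2021-10-17 09:30:00 UTC
```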

The import should then largely have been a case of iterating over a list of links to my site and then moving the resulting files into the correct place in the hierarchy.

In order to get those, I passed a copy of my HTML sitemap into the following function:

import os

import lxml.html


def do_work(page_text):
    '''
        Pull links out
    '''
    doc = lxml.html.fromstring(page_text)
    sm_body = doc.xpath("//div[@id='xmap']")[0]

    # Pull out links
    for element, attribute, link, pos in sm_body.iterlinks():
        if attribute == "href" and link[0] == "/":
            # It's a link and it's relative

            link_parts = link.split("/")
            link_type = categorise_link(link_parts)

            # Write the details out in a format we can trivially use later
            linkline = f"{'/'.join(link_parts[:-1])} {link_parts[-1]}\n"

            with open(f"{os.getenv('OUT_DIR')}{link_type}.txt", 'a') as fh:
                fh.write(linkline)
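To make the output format concrete: each line that gets written is just the category path and the slug, space separated, which is what lets the shell loop further down split the two fields back apart. The path below is a made-up example, not a real URL from my site:

```python
# A hypothetical sitemap link, split the same way the function above splits it
link = "/blog/software/291-example-post"
link_parts = link.split("/")

# Category path and slug, space separated (the function also appends a newline)
linkline = f"{'/'.join(link_parts[:-1])} {link_parts[-1]}"
print(linkline)  # → /blog/software 291-example-post

# The shell loop later splits the fields back apart to rebuild the URL
category, page = linkline.split(" ")
url = f"https://www.bentasker.co.uk{category}/{page}"
```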

This categorised URLs (via categorise_link) as being one of:

  • tag (to be discarded - Nikola generates those for us)
  • videopost (I thought these might need special handling; they did not)
  • photopost (spun out to a separate site)
  • feed (to be discarded)
  • post (posts like this)
  • page (static pages like about me)
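categorise_link itself isn't included in this post, so here's a sketch of how such a function might look - the path prefixes below are illustrative assumptions about the site layout, not the real implementation:

```python
def categorise_link(link_parts):
    # Illustrative sketch only: the real categorise_link isn't shown in
    # this post, so these path checks are assumptions about the site layout
    path = "/".join(link_parts)

    if path.endswith("feed") or "format=feed" in path:
        return "feed"
    if "/tags/" in path or "/tag/" in path:
        return "tag"
    if "/photos-archive/" in path:
        return "photopost"
    if "/video-archive/" in path:
        return "videopost"
    if len(link_parts) <= 2:
        # A top-level link like /about-me
        return "page"
    return "post"
```

The real function may well key off different markers; the point is simply that each sitemap URL ends up in exactly one of the six buckets above.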

For post, page and videopost, I passed each page through a loop to run the import and move the result into the category hierarchy:

cat ${OUT_DIR}post.txt | sort | uniq |while read -r postline
do

    category=`echo "$postline" | cut -d\  -f1`
    page=`echo "$postline" | cut -d\  -f2`

    url="https://www.bentasker.co.uk${category}/$page"

    nikola import_bentasker_page "$url" 2>&1 | grep "ERROR"
    if [ "$?" == "0" ]
    then
        echo "$postline" >> errored.txt
        continue
    fi

    # We should now have a file on disk, so just need to move it into place
    mkdir -p "posts/${category}/"
    mv "${page}.html" "posts/${category}/"
    if [ ! "$?" == "0" ]
    then
        echo "$postline" >> errored.txt
    fi

done

I copied my images directory from the old site over, and then built the site:

nikola build

That, for all intents and purposes, was the migration done (in practice I've also done a bit of template tweaking, spot-checking etc).

Conclusion

Of course, changing content management system is quite a significant re-tooling, and no large migration is ever entirely seamless - there will be things that I find going forwards, and there will undoubtedly be things I wish I'd done differently/better.

But, the result is that I now have a site that doesn't expose any "active" code to the whims of visitors, beyond that required to serve any HTTP/HTTPS connection.

I do still have to tolerate some more template pain though, as I work to try and improve it (and the automated syntax highlighting could clearly do with some work).

I'd still use Joomla! as my tool of choice for anything requiring a dynamic CMS; it's just that it's no longer the right tool to meet the needs of www.BenTasker.co.uk.