Writing A Simple RSS To Mastodon Bot

Having recently set up a Mastodon instance I wanted to play around with using Mastodon's statuses API endpoint to create a simple bot that publishes toots.

As well as letting me play around a little with the API, the bot provides a way for others to follow my content on Mastodon without necessarily having to follow me: those who want to can follow @rssbot@mastodon.bentasker.co.uk without being subjected to any of my idle chatter.

In this post, I'll walk through the process that I followed to create a simple Python bot which periodically checks my site's RSS feed and toots out any new entries, applying a Content Warning where the page's tagset indicates that one is appropriate.


Creating a Bot Account

The first thing that's needed is a dedicated account for people to follow.

If you're running your own Mastodon instance, there should be no issue in creating a bot account.

If you're on someone else's instance though, be aware that not all instances are willing to play host to bots, so check your instance's rules (usually available at https://<instance address>/about). If your instance doesn't allow bots, there is an instance (botsin.space) dedicated to exactly this.

Once the account is created, you need to configure the profile.

Log in and go to Preferences -> Profile -> Appearance, then do the following:

  • Check This is a bot account
  • Make sure the account's bio is clear that it's a bot, what it does, and who to contact if it misbehaves
  • (optional) tick Hide your social graph so that your bot's followers aren't publicly listed

This should leave you with a bio that pretty clearly says "I'm a bot"

Profile clearly labels account as bot

Note: If you can, it's better to do development and testing against a non-federated development instance so that you don't annoy others. There are rate limits on things like toot deletion (30 deletions per 30 minutes), so you'll also find that correcting mistakes made in prod can be problematic.


API Credentials Creation

Because the bot is going to send toots, it needs to be able to authenticate with Mastodon.

Log into your bot account, choose Preferences -> Development, and then click New Application to access the Application creation screen

Application Creation

The only scope needed by this bot is write:statuses, so uncheck the others (and set a name).

Once you've hit SUBMIT and created the application, click back into it and Mastodon should provide you with some secrets - the one that you're interested in is labelled Your access token.

access token and client secret


Feed Parsing

Fetching and parsing the RSS feed itself is a solved problem: we can use Python's feedparser module. It'll transparently handle differences in feed structure and format (for example, if an Atom feed is used instead of RSS).

We do, however, need to build a little bit of logic to keep track of which feed items we've seen so that subsequent runs of the script do not re-toot the same content.

We use a file on disk to track state between runs

def process_feed(feed):
    ''' Process the RSS feed and generate a toot for any entry we haven't yet seen
    '''
    if os.path.exists(feed['HASH_FILE']):
        hashtracker = open(feed['HASH_FILE'],'r+')
        storedhash = hashtracker.read()
    else:
        hashtracker = open(feed['HASH_FILE'],'w')
        storedhash = ''

    # This will be overridden as we iterate through
    firsthash = False

    # Load the feed
    d = feedparser.parse(feed['FEED_URL'])

The function's feed argument is just a dict (because the feed config is loaded from a JSON file - we'll come to that a bit later).

Once the feed has been parsed, we need to iterate over the entries and build a dict for each containing the information that we want to use in our toot (we could, technically, just pass the feedparser object through, but converting to a dict helps make subsequent functions more re-usable).

    # Iterate over entries
    for entry in d.entries:
        en = {}
        en['title'] = entry.title
        en['link'] = entry.link
        en['author'] = False       
        en['tags'] = []

        if hasattr(entry, "tags"):
            # Extract the tag names
            en['tags'] = [x['term'] for x in entry.tags]

        en['cw'] = CW_TAG in en['tags']

        if INCLUDE_AUTHOR == "True" and hasattr(entry, "author"):
            en['author'] = entry.author

Before we do too much though, we need to check whether we've seen this entry before, so we generate a SHA1 checksum of the URL and compare that to the hash loaded from disk

        linkhash = hashlib.sha1(entry.link.encode('utf-8')).hexdigest()

        if storedhash == linkhash:
            print("Reached last seen entry")
            break

Assuming we didn't break out of the loop, we also need to check whether firsthash needs updating (it should only contain the hash of the first listed RSS item).

        # Keep a record of the hash for the first item in the feed
        if not firsthash:
            firsthash = linkhash

And then trigger sending of the toot, before updating the on-disk hash record

        # Send the toot
        if send_toot(en):
            # If that worked, write hash to disk to prevent re-sending
            hashtracker.seek(0)
            hashtracker.truncate()
            hashtracker.write(firsthash)

        # Don't spam the API
        time.sleep(1)

    # Close filehandle
    hashtracker.close()

We only ever store one hash per feed: that of the first item in the most recent run.

There's an implicit assumption here that the feed accurately represents publishing order (i.e. that items won't be inserted further down the feed). So there's a trade-off: the bot won't pick up on items which are later inserted earlier in the feed. We could store the hash of every item we've seen, but that record would quickly become unwieldy.
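For completeness, a sketch of what that alternative might look like (the helper names here are mine, not part of the bot): every seen hash is persisted in a JSON file, at the cost of a state file which grows with the feed.

```python
import hashlib
import json
import os
import tempfile

def load_seen(path):
    ''' Load the set of previously seen link hashes (empty on first run) '''
    if os.path.exists(path):
        with open(path) as f:
            return set(json.load(f))
    return set()

def mark_seen(path, seen, link):
    ''' Record a link's hash so that subsequent runs skip it '''
    seen.add(hashlib.sha1(link.encode('utf-8')).hexdigest())
    with open(path, 'w') as f:
        json.dump(sorted(seen), f)

# Demonstrate against a throwaway state file
state = os.path.join(tempfile.mkdtemp(), "seen_hashes.json")
seen = load_seen(state)
mark_seen(state, seen, "https://example.invalid/post-1")
print(len(load_seen(state)))
```

With this approach ordering no longer matters, but you'd probably also want some way of pruning hashes for items which have dropped out of the feed.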


Sending Toots via Mastodon's statuses API

Mastodon's API is fairly simple to use, and as a result the function responsible for actually sending the status update is too

def send_toot(en):
    ''' Turn the dict into toot text
        and send the toot
    '''

    # Turn the dict into a toot
    toot_txt = build_toot(en)

    # Build the dicts that we'll pass into requests
    headers = {
        "Authorization" : f"Bearer {MASTODON_TOKEN}"
        }

    # Build the payload
    data = {
        'status': toot_txt,
        'visibility': MASTODON_VISIBILITY
        }

    # Are we adding a content warning?
    if en['cw']:
        data['spoiler_text'] = en['title']

    # Don't send!
    if DRY_RUN == "Y":
        print("------")
        print(data['status'])
        print(data)
        print("------")
        return True

    try:
        resp = SESSION.post(
            f"{MASTODON_URL.rstrip('/')}/api/v1/statuses",
            data=data,
            headers=headers
        )

        if resp.status_code == 200:
            return True
        else:
            print(f"Failed to post {en['link']}")
            print(resp.status_code)
            return False
    except Exception as e:
        print(f"Urg, exception {en['link']}: {e}")
        return False

Note: HTTP 406

I initially modelled the API call using curl, and as a result found a useful factoid about calling this API endpoint (and presumably, some of the others).

A call with curl looks something like this

curl -v \
--data-urlencode "status=Hello World" \
--data-urlencode "visibility=unlisted" \
-H "Authorization: Bearer <token>" \
https://mastodon.example.invalid/api/v1/statuses

The documentation will tell you that there are 3 possible HTTP status codes:

  • 200: OK: toot sent
  • 401: Unauthorized: Auth failed (header missing or invalid)
  • 422: Unprocessable entity: Validation failed (you missed a required field)

However, this list is incomplete: there's at least one more status that can be returned.

This endpoint can also return an HTTP 406, and will do so if the Content-Type request header is not application/x-www-form-urlencoded or multipart/form-data (in the snippets above, both curl and requests set it automatically, but it's worth being aware of if you're in the habit of laying out request headers explicitly).
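To make that concrete, here's a stdlib-only sketch using urllib (nothing is actually sent - the hostname is deliberately invalid): the form-encoded body has to be paired with a matching Content-Type header, because Mastodon rejects anything else with the 406.

```python
import urllib.parse
import urllib.request

# Form-encode the payload, just as curl --data-urlencode does
payload = urllib.parse.urlencode({
    "status": "Hello World",
    "visibility": "unlisted"
}).encode()

req = urllib.request.Request(
    "https://mastodon.example.invalid/api/v1/statuses",
    data=payload,
    headers={
        "Authorization": "Bearer <token>",
        # Mastodon answers HTTP 406 unless this is
        # application/x-www-form-urlencoded or multipart/form-data
        "Content-Type": "application/x-www-form-urlencoded",
    },
)
print(req.get_header("Content-type"))
```

With requests, passing the payload via `data=` gets you the form-encoded header for free; passing it via `json=` would set application/json and trip the 406.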


Build Toot

There's nothing particularly special about the build_toot() function, it just accepts the dict that we built earlier, and constructs a string to be used as the toot text body.

def build_toot(entry):
    ''' Take the entry dict and build a toot
    '''
    skip_tags = SKIP_TAGS + ["blog", "documentation", CW_TAG]      
    toot_str = ''

    # Prefix Blog and Documentation entries
    if "blog" in entry['tags']:
        toot_str += "New #Blog: "
    elif "documentation" in entry['tags']:
        toot_str += "New #Documentation: "

    toot_str += f"{entry['title']}\n"

    if entry['author']:
        toot_str += f"Author: {entry['author']}\n"

    toot_str += f"\n\n{entry['link']}\n\n"

    # Tags to hashtags
    if len(entry['tags']) > 0:
        for tag in entry['tags']:
            if tag in skip_tags:
                # Skip the tag
                continue
            toot_str += f'#{tag.replace(" ", "")} '

    return toot_str

Wrapping it all up

Functions aren't much use without something to call them, so in __main__ we load config and then call process_feed()

# Load feed info
fh = open("feeds.json", "r")
FEEDS = json.load(fh)
fh.close()

# Get config from env vars
HASH_DIR = os.getenv('HASH_DIR', '')
INCLUDE_AUTHOR = os.getenv('INCLUDE_AUTHOR', "True")

# if Y, toots won't be sent and we'll write to stdout instead
DRY_RUN = os.getenv('DRY_RUN', "N").upper()

# Posts with this tag will toot with a content warning
CW_TAG = os.getenv('CW_TAG', "content-warning")

# Tags in this comma separated list won't be included in toots
SKIP_TAGS = os.getenv('SKIP_TAGS', "").lower().split(',')

# Mastodon config
MASTODON_URL = os.getenv('MASTODON_URL', "https://mastodon.social")
MASTODON_TOKEN = os.getenv('MASTODON_TOKEN', "")
MASTODON_VISIBILITY = os.getenv('MASTODON_VISIBILITY', 'public')

# We want to be able to use keep-alives if we're posting multiple things
# so set up a connection pool
SESSION = requests.session()

# Iterate over the feeds in the config file
for feed in FEEDS:
    # Define the state tracking file
    feed['HASH_FILE'] = HASH_DIR + hashlib.sha1(feed['FEED_URL'].encode('utf-8')).hexdigest()
    # Process the feed
    process_feed(feed)

Dockerfile

The heavy reliance on environment variables above probably made it fairly clear where this was heading: the bot is going to be wrapped in a Docker container.

We create a Dockerfile

FROM python

# Define some defaults
ENV HASH_DIR /hashdir/
ENV DRY_RUN "N"
ENV MASTODON_VISIBILITY "public"
ENV SKIP_TAGS ""

# Install deps, create basedirs and
# create a user to run things as
RUN pip install feedparser requests \
    && mkdir /app \
    && mkdir /hashdir \
    && useradd rssfeed \
    && chown -R rssfeed /hashdir

# oddly fetching arbitrary feeds 
# from the net as root
# seems like a bad idea
USER rssfeed

COPY py_post_on_rss_change.py /app/

WORKDIR /app
CMD /app/py_post_on_rss_change.py

This will be used to build an image with the script's dependencies baked in, as well as creating an unprivileged user to run the script, within a disposable containerised environment.


feeds.json

There isn't currently much in the config file

[
    {
        "FEED_URL": "https://www.bentasker.co.uk/rss.xml"
    }
]

It might seem that using JSON for this is overkill, because it absolutely is.

However, the intention, down the line, is to add the ability to define per-feed hashtags which should always be included in toots.

The bot's script is based upon one I created previously to send emails on RSS feed changes and the more complex config there used JSON, which carried over when I forked the repo.

So in summary, we're using JSON because a) I'm lazy and b) one day (if a) doesn't get in the way), the script's capabilities should justify it.
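As an indication of where that might head, a hypothetical per-feed hashtag option could look something like this (EXTRA_TAGS isn't currently supported; the name is purely illustrative):

```json
[
    {
        "FEED_URL": "https://www.bentasker.co.uk/rss.xml",
        "EXTRA_TAGS": ["BenTasker"]
    }
]
```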


py_post_on_rss_change.py

Once put together, the full script looks like this

#!/usr/bin/env python3
#
# Simple Python bot to retrieve a RSS feed 
# and toot items
#
# Copyright (c) B Tasker, 2022
#
# Released under BSD 3 Clause
# See https://www.bentasker.co.uk/pages/licenses/bsd-3-clause.html
#
import feedparser
import json
import hashlib
import os
import requests
import time

def build_toot(entry):
    ''' Take the entry dict and build a toot
    '''
    # Initialise
    skip_tags = SKIP_TAGS + ["blog", "documentation", CW_TAG]   
    toot_str = ''

    if "blog" in entry['tags']:
        toot_str += "New #Blog: "
    elif "documentation" in entry['tags']:
        toot_str += "New #Documentation: "

    toot_str += f"{entry['title']}\n"

    if entry['author']:
        toot_str += f"Author: {entry['author']}\n"

    toot_str += f"\n\n{entry['link']}\n\n"

    # Tags to hashtags
    if len(entry['tags']) > 0:
        for tag in entry['tags']:
            if tag in skip_tags:
                # Skip the tag
                continue
            toot_str += f'#{tag.replace(" ", "")} '

    return toot_str


def send_toot(en):
    ''' Turn the dict into toot text

    and send the toot
    '''
    toot_txt = build_toot(en)

    headers = {
        "Authorization" : f"Bearer {MASTODON_TOKEN}"
        }

    data = {
        'status': toot_txt,
        'visibility': MASTODON_VISIBILITY
        }

    if en['cw']:
        data['spoiler_text'] = en['title']

    if DRY_RUN == "Y":
        print("------")
        print(data['status'])
        print(data)
        print("------")
        return True

    try:
        resp = SESSION.post(
            f"{MASTODON_URL.rstrip('/')}/api/v1/statuses",
            data=data,
            headers=headers
        )

        if resp.status_code == 200:
            return True
        else:
            print(f"Failed to post {en['link']}")
            print(resp.status_code)
            return False
    except Exception as e:
        print(f"Urg, exception {en['link']}: {e}")
        return False


def process_feed(feed):
    ''' Process the RSS feed and generate a toot for any entry we haven't yet seen
    '''
    if os.path.exists(feed['HASH_FILE']):
        hashtracker = open(feed['HASH_FILE'],'r+')
        storedhash = hashtracker.read()
    else:
        hashtracker = open(feed['HASH_FILE'],'w')
        storedhash = ''

    # This will be overridden as we iterate through
    firsthash = False

    # Load the feed
    d = feedparser.parse(feed['FEED_URL'])

    # Iterate over entries
    for entry in d.entries:

        # compare a checksum of the URL to the stored one
        # this is used to prevent us re-sending old items
        linkhash = hashlib.sha1(entry.link.encode('utf-8')).hexdigest()

        if storedhash == linkhash:
            print("Reached last seen entry")
            break

        en = {}
        en['title'] = entry.title
        en['link'] = entry.link
        en['author'] = False       
        en['tags'] = []

        if hasattr(entry, "tags"):
            # Extract the tag names
            en['tags'] = [x['term'] for x in entry.tags]

        en['cw'] = CW_TAG in en['tags']

        if INCLUDE_AUTHOR == "True" and hasattr(entry, "author"):
            en['author'] = entry.author

        # Keep a record of the hash for the first item in the feed
        if not firsthash:
            firsthash = linkhash

        # Send the toot
        if send_toot(en):
            # If that worked, write hash to disk to prevent re-sending
            hashtracker.seek(0)
            hashtracker.truncate()
            hashtracker.write(firsthash)

        time.sleep(1)
    hashtracker.close()

fh = open("feeds.json", "r")
FEEDS = json.load(fh)
fh.close()

HASH_DIR = os.getenv('HASH_DIR', '')
INCLUDE_AUTHOR = os.getenv('INCLUDE_AUTHOR', "True")

# Posts with this tag will toot with a content warning
CW_TAG = os.getenv('CW_TAG', "content-warning")

MASTODON_URL = os.getenv('MASTODON_URL', "https://mastodon.social")
MASTODON_TOKEN = os.getenv('MASTODON_TOKEN', "")
MASTODON_VISIBILITY = os.getenv('MASTODON_VISIBILITY', 'public')
DRY_RUN = os.getenv('DRY_RUN', "N").upper()
SKIP_TAGS = os.getenv('SKIP_TAGS', "").lower().split(',')

# We want to be able to use keep-alive if we're posting multiple things
SESSION = requests.session()

for feed in FEEDS:
    feed['HASH_FILE'] = HASH_DIR + hashlib.sha1(feed['FEED_URL'].encode('utf-8')).hexdigest()
    process_feed(feed)

Building

There's nothing particularly special required to build.

Put Dockerfile and py_post_on_rss_change.py into the same directory and then run

docker build -t mastobot .

First Run

The container needs to be provided with a few things at call time

  • A copy of feeds.json
  • Somewhere to write out the tracking hash (otherwise it'll be lost between runs)
  • The Mastodon token and URL

So, when setting up for the first time we want to run

mkdir hashdir
sudo chown 1000:1000 hashdir/

And then invoke a dry run as follows

docker run --rm \
-e MASTODON_TOKEN="<token>" \
-e MASTODON_URL=https://mastodon.bentasker.co.uk \
-v $PWD/feeds.json:/app/feeds.json \
-v $PWD/hashdir:/hashdir/ \
-e SKIP_TAGS="howto,opinion" \
-e DRY_RUN="Y" \
mastobot

This should print output for each of the entries.

When invoked a second time, it should print Reached last seen entry.

Once you're happy it's not going to spam your feed, remove the contents of hashdir

sudo rm hashdir/*

And then trigger a for-real run

docker run --rm \
-e MASTODON_TOKEN="<token>" \
-e MASTODON_URL=https://mastodon.bentasker.co.uk \
-v $PWD/feeds.json:/app/feeds.json \
-v $PWD/hashdir:/hashdir/ \
-e SKIP_TAGS="howto,opinion" \
mastobot

You can add this command to crontab or wrap it in a bash script to be called by cron.
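For example, a crontab entry along these lines (the schedule and paths are illustrative, and the token remains a placeholder) would check the feed every 15 minutes:

```
*/15 * * * * docker run --rm -e MASTODON_TOKEN="<token>" -e MASTODON_URL=https://mastodon.bentasker.co.uk -v /path/to/feeds.json:/app/feeds.json -v /path/to/hashdir:/hashdir/ -e SKIP_TAGS="howto,opinion" mastobot
```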


The Toots

The result is a toot that looks something like this

A bot originated toot in Mastodon

It's considered polite to apply Content Warnings to certain topics (exactly which topics, inevitably, differs between people), so that other users can make an active choice about whether or not they want to view a toot, rather than having a feed that starts to lend itself to doomscrolling.

Current events at Twitter are a good example of potentially CW'able content: it's not offensive per se, but it does tend to be fairly consistently downbeat. Conveniently, I have a couple of Twitter-orientated blog posts in my feed to test with.

If a post in the feed has the (newly created) tag Content-Warning, the bot will include spoiler_text in its API call so that a Content Warning will be used:

A bot originated toot with a Content Warning in Mastodon

The title makes it clear what the post is about, whilst the CW prevents unwanted link previews and hashtags from appearing directly in the user's feed.

I'm not sure that I'll need it often, but it was so trivial to add that it seemed like something of a no-brainer.


Conclusion

It's not particularly advanced or shiny, but the result is a functional bot capable of posting periodic status updates into Mastodon via API.

There are improvements that can be made, including

  • Implementing the ability for a page to (waves magic wand) somehow specify the text to use on content warnings
  • Fetching the meta-description from pages and inserting it into the toot (I much prefer descriptive shares to those with just a title and link)
  • Adding a list of per-feed hashtags which should be added to each toot
  • Making the toot text templatable
  • Adding the ability to @ the author (easy to add, but not something that I currently want)
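As a sketch of the templating idea, Python's string.Template would probably suffice (the template string and field names here are hypothetical, not something the bot currently supports):

```python
from string import Template

# Hypothetical template - in the bot this could be loaded from an env var
TOOT_TEMPLATE = Template("New: $title\n\n$link\n\n$hashtags")

entry = {
    "title": "Demo post",
    "link": "https://example.invalid/demo",
    "tags": ["python", "rss"],
}
hashtags = " ".join(f"#{t}" for t in entry["tags"])
toot = TOOT_TEMPLATE.substitute(
    title=entry["title"], link=entry["link"], hashtags=hashtags
)
print(toot)
```

string.Template deliberately offers no logic, which keeps a user-supplied template from doing anything surprising.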

The main thing, though, is that this quick script has shown how accessible the Mastodon API is, and how easy it is to integrate against.

When creating a bot, it's well worth keeping in mind that bot activity is not always welcomed. Whatever bot you create should aim to be low volume and considerate of meat-bag users.