Nextcloud News Stopped Syncing My Feeds

I subscribe to various RSS feeds using Nextcloud News.

It makes them easily accessible on my phone whilst ensuring that things will be marked read if I instead read the feed on my desktop (that's not to say that it's perfect, though: as I've written previously, I also use fediverse notifications to help make sure that I see new posts).

Recently, however, my phone has generated fewer notifications for Nextcloud News. It actually took me a little while to notice: I was still getting some notifications, including for a once-weekly serial so it seemed like things were working (humans are bad at noticing the absence of things etc etc)

Eventually, though, it occurred to me that I hadn't seen a notification for https://neilzone.co.uk/ in quite a while. Neil posts semi-frequently and I'd definitely seen things going about in the Fediverse, so this struck me as odd.

When I went into Nextcloud News to check, the latest post showing for Neil's feed was dated February 2022. Either I'd failed to notice the feed not updating for 2 years or something else had broken.

The latter seemed much more likely - I remembered that at some point during the last couple of years, Neil redesigned his site and the feed moved, around that time I updated News to use the new url (but probably hadn't checked very carefully after that).


Likely Causes

For the record, the last post that Nextcloud News showed for Neil's site was "I have a complicated relationship with cryptocurrency":

Screenshot of the latest post showing for Neil's feed - it's dated Feb 23 2022

Fetching Neil's RSS feed with curl showed that it definitely included more recent posts, which suggests that the issue was being triggered by something within the feed itself.

I guessed that it would probably be one of two things:

  • Is there some character/issue in the next post that breaks things?
  • Is there some kind of limit in terms of article count?

The article count thing might sound silly, but I've seen similar unexpected limitations in the past - it only really takes someone using the wrong type on a primary key and you can run out of possible options (I once had a customer using forum software which used an auto-incrementing tinyint for category IDs - they'd created 254 categories and then couldn't create any more).

Article count was quite quick and easy to check:

# Extract titles
$ curl -s https://neilzone.co.uk/index.xml | grep '<title>' > tmp/titles

# Get Total count
$ wc -l tmp/titles 
310 tmp/titles

# Position of article
$ grep -n "complicated relationship" tmp/titles 
180:      <title>I have a complicated relationship with cryptocurrency</title>

The post that Nextcloud News had is 180 lines from the top. RSS feeds run in reverse chronological order, so that makes it the (310-180) 130th post.

130 felt like it'd be quite a strange number to break on, so I moved onto checking whether anything in the next post might be causing issues.


The Next Post

The next post in the feed is "New plans to protect people from anonymous trolls online: new UK government proposals".

Reading through the RSS entry, I spotted that Neil had used a couple of emojii's in his post (middle line):

Screenshot of the RSS feed, in the middle there's a peach emojii followed by an aubergine

Thinking that could very likely be it, I triggered a refresh and watched Nextcloud's logs, which soon showed it trying to insert that post:

"args":["INSERT INTO `oc_news_items` (`unread`, `url`, `guid`, `guid_hash`, `pub_date`, `last_modified`, `title`, `categories_json`, `body`, `search_index`, `fingerprint`, `content_hash`, `feed_id`) VALUES(?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
["*** sensitive parameters replaced ***","https://neilzone.co.uk/2022/02/new-plans-to-protect-people-from-anonymous-trolls-online-new-uk-government-proposals/"

Buried in the log object was an error

An exception occurred while executing a query: SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\\xF0\\x9F\\x8D\\x91\\xF0\\x9F...' for column 'body' at row 1

MySQL was complaining of an incorrect string value.

The error message provides the byte values of the characters, for example FO 9F 8D 91. If we look this up online, we can see the character that it maps too:

Picture of the Peach emojii along with its character code - f09f8d91

For the sake of completeness, I made a copy of the feed on my own server:

$ curl -s https://neilzone.co.uk/index.xml > feed.xml

I subscribed to that in Nextcloud News and it broke in exactly the same place. This time, though, a notification popped up complaining of an error

Screenshot of Nextcloud News in a browser. My copy of Neil's feed has stopped in the same place and there's an unobtrusive notification in the top right corner that some kind of error has been encountered

The error didn't appear in the android app, which is why I hadn't previously seen it - it's only emitted when the feed is read/parsed.

To verify that the emojii's were the only issue, I made a second copy of the feed with them stripped

cat feed.xml | tr -cd "[:print:]\n" > feed2.xml

Nextcloud News loaded that just fine:

Screenshot of nextcloud news having loaded the fixed feed - it now shows the latest post

So, I could reliably say that either the feed, or the handling of it, was broken.


Can I Blame Neil?

The first thing that I wondered was whether Neil's server was correctly declaring the content-encoding of the feed (if it wasn't, then perhaps Nextcloud News wasn't attempting to handle/escape non-ASCII chars).

A request to Neil's server showed that it didn't declare the encoding when providing Content-Type:

HTTP/1.1 200 OK
Date: Sat, 20 Apr 2024 16:40:09 GMT
Server: Apache/2.4.59 (Debian)
Strict-Transport-Security: max-age=15768000
Upgrade: h2
Connection: Upgrade
Last-Modified: Sat, 20 Apr 2024 16:30:05 GMT
ETag: "1ae76e-61689b77ebb04"
Accept-Ranges: bytes
Content-Length: 1763182
Vary: Accept-Encoding
Content-Type: application/xml

However, looking at the top of the response body showed that the feed itself did declare things properly

<?xml version="1.0" encoding="utf-8" standalone="yes"?>

The emojii's used are definitely part of UTF-8, so at this point, Neil was more or less off the hook.


Nextcloud Database

The error that appeared in the logs showed that MySQL was objecting to the characters, so the next thing to do was to look at the table itself - perhaps body was of a type that couldn't accept them?

mysql> describe oc_news_items;
+-------------------+-----------------+------+-----+---------+----------------+
| Field             | Type            | Null | Key | Default | Extra          |
+-------------------+-----------------+------+-----+---------+----------------+
| id                | bigint unsigned | NO   | PRI | NULL    | auto_increment |
| feed_id           | bigint unsigned | NO   | MUL | NULL    |                |
| guid_hash         | varchar(32)     | NO   | MUL | NULL    |                |
| fingerprint       | varchar(32)     | YES  | MUL | NULL    |                |
| content_hash      | varchar(32)     | YES  |     | NULL    |                |
| rtl               | tinyint(1)      | NO   |     | 0       |                |
| search_index      | longtext        | YES  |     | NULL    |                |
| guid              | longtext        | NO   |     | NULL    |                |
| url               | longtext        | YES  |     | NULL    |                |
| title             | longtext        | YES  |     | NULL    |                |
| author            | longtext        | YES  |     | NULL    |                |
| pub_date          | bigint          | YES  |     | NULL    |                |
| body              | longtext        | YES  |     | NULL    |                |
| enclosure_mime    | longtext        | YES  |     | NULL    |                |
| enclosure_link    | longtext        | YES  |     | NULL    |                |
| media_thumbnail   | longtext        | YES  |     | NULL    |                |
| media_description | longtext        | YES  |     | NULL    |                |
| unread            | tinyint(1)      | NO   | MUL | 0       |                |
| starred           | tinyint(1)      | NO   | MUL | 0       |                |
| last_modified     | bigint          | YES  | MUL | 0       |                |
| categories_json   | json            | YES  |     | NULL    |                |
| shared_by         | varchar(64)     | YES  |     | NULL    |                |
+-------------------+-----------------+------+-----+---------+----------------+
22 rows in set (0.00 sec)

I was sure that the type longtext should accept UTF-8 chars.

Whilst verifying online, I found information that showed that MySQL has used UTF-8 by default since version 4 (which was a long time ago). However, there was also a note which said that MySQL's UTF-8 charset only accepts 3 byte characters.

That's a bit of a problem, because both the peach and the aubergine are 4 byte characters:

Echoing the emojiis through XXD shows that the bytes are f0 94 8d 91 (peach) and f0 94 8d 86 (aubergine)

In order to be able to accept them, MySQL needs to be using the utf8mb4 character set.

The only thing was, the database already seemed to have that set as its default:

mysql> select default_character_set_name FROM information_schema.SCHEMATA WHERE schema_name = 'nextcloud';
+----------------------------+
| DEFAULT_CHARACTER_SET_NAME |
+----------------------------+
| utf8mb4                    |
+----------------------------+

When I checked that specific column, though, I found that it was indeed using utf8:

mysql> SELECT character_set_name FROM information_schema.`COLUMNS`  WHERE table_schema = 'nextcloud' AND TABLE_NAME = 'oc_news_items' AND
COLUMN_NAME = 'body';
+--------------------+
| CHARACTER_SET_NAME |
+--------------------+
| utf8               |
+--------------------+
1 row in set (0.01 sec)

I initially tried altering the charset just for the column

mysql> alter table nextcloud.oc_news_items MODIFY body longtext CHARACTER SET utf8mb4;
mysql> SELECT character_set_name FROM information_schema.`COLUMNS`  WHERE table_schema = 'nextcloud' AND TABLE_NAME = 'oc_news_items' AND
COLUMN_NAME = 'body';
+--------------------+
| CHARACTER_SET_NAME |
+--------------------+
| utf8mb4            |
+--------------------+
1 row in set (0.01 sec)

But this didn't prevent the error from occurring.

Searching around, I found a Nextcloud doc: Enabling MySQL 4-byte support, which included making a config change to Nextcloud itself, So, I whizzed through the process described there:

Check that innodb_file_per_table is enabled:

mysql> SHOW VARIABLES LIKE 'innodb_file_per_table';
+-----------------------+-------+
| Variable_name         | Value |
+-----------------------+-------+
| innodb_file_per_table | ON    |
+-----------------------+-------+

Put Nextcloud into maintenance mode

su -s /bin/bash www-data
php occ maintenance:mode --on

Update the DB to use the new charset

ALTER DATABASE nextcloud CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;

Tell Nextcloud to use the charset and to then repair tables

php occ config:system:set mysql.utf8mb4 --type boolean --value="true"
php occ maintenance:repair

Turn maintenance mode off:

php occ maintenance:mode --off

The entire process took about 10 minutes.

I re-added my test copy of the feed and everything loaded

Nextcloud news has now correctly loaded the feed

A minute or two later, I started getting notifications for other feeds that had less obviously fallen silent. It seems that the routine fetcher exited after encountering the error and so didn't update any feed which appears after Neil's in the database.


Conclusion

Although the issue occurred whilst processing Neil's feed, he'd done nothing other than having the sheer temerity to use UTF-8 characters within a feed that was (gasp) correctly declared as being UTF-8. In fact, it's even simply using UTF-8 characters was not enough, to cause the issue there had to be 4 byte characters in use.

The emojiis that Neil included in his post form a combination that most adults are probably fairly familiar with:

Screenshot of the emojiis - a peach followed by an aubergine

Given the connotations associated with this, there is a certain amount of irony in it being the thing that fucked my feed-reader.

Getting to the cause took a bit of troubleshooting, but once it was identified, the issue was quite easily remedied.

My Nextcloud install is quite old (I've been running it for years now), so I wondered whether the issue might occur on newer, clean installs.

A quick look at the commit history for Nextcloud suggests that it gained the ability to auto-detect and enable support for 4 byte characters, back in version 12, so newer installs shouldn't encounter this issue (upgrades, like mine, might though).