Resolving GFID mismatch problems in Gluster (RHGS) volumes

Gluster is a distributed filesystem. I'm not a massive fan of it, but most of the alternatives (like Ceph) suffer from their own set of issues, so it's no better or worse than the competition for the most part.

One issue that can sometimes occur is Gluster File ID (GFID) mismatch following a split-brain or similar failure.

When this occurs, running ls -i in the affected directory will generally lead to I/O errors and/or question marks in the output:

ls -i
ls: cannot access ban-gai.rss: Input/output error
? 2-nguoi-choi.rss ? game.rss

If you look within the brick's log (normally under /var/log/glusterfs/bricks) you'll see lines reporting Gfid mismatch detected:

[2019-12-12 12:28:28.100417] E [MSGID: 108008] [afr-self-heal-common.c:392:afr_gfid_split_brain_source] 0-shared-replicate-0: Gfid mismatch detected for <gfid:31bcb959-efb4-46bf-b858-7f964f0c699d>/ban-gai.rss>, 1c7a16fe-3c6c-40ee-8bb4-cb4197b5035d on shared-client-4 and fbf516fe-a67e-4fd3-b17d-fe4cfe6637c3 on shared-client-1.
[2019-12-12 12:28:28.113998] W [fuse-resolve.c:61:fuse_resolve_entry_cbk] 0-fuse: 31bcb959-efb4-46bf-b858-7f964f0c699d/ban-gai.rss: failed to resolve (Stale file handle)

This documentation details how to resolve GFID mismatches.


Topology

The loglines in the example used in this documentation are from a replica 2 setup comprising 2 Gluster nodes, with the gluster volume mounted on each of them using Gluster-FUSE.

In this instance the issue arose after a brick process repeatedly died on one of the Gluster nodes, leading to split-brain.
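If you want to confirm the topology of your own volume before going further, gluster volume info will show the replica count and the bricks involved. The output below is illustrative of a similar replica 2 volume (the node names are placeholders) rather than a capture from my setup:

gluster volume info shared

Volume Name: shared
Type: Replicate
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gluster-node1:/data1/gluster
Brick2: gluster-node2:/data1/gluster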


Understanding the logline

It's probably helpful here to break the format of that logline down a little:

  • 0-shared-replicate-0 - the volume in my example is called shared; if yours is called foo then this will probably read 0-foo-replicate-0
  • <gfid:31bcb959-efb4-46bf-b858-7f964f0c699d>/ban-gai.rss> - this tells us the affected GFID and the file that was being accessed. So in this case, the affected GFID is the parent of the file ban-gai.rss
  • 1c7a16fe-3c6c-40ee-8bb4-cb4197b5035d on shared-client-4 - this is the GFID of the file on one of the Gluster hosts
  • fbf516fe-a67e-4fd3-b17d-fe4cfe6637c3 on shared-client-1 - this is the GFID of the file on the other Gluster host


Understanding the Issue

GFIDs are unique identifiers for files and directories within your Gluster volume.

Although not quite how Red Hat would describe it, the "metadata" for your volume is basically a tangled mess of symlinks and hardlinks underneath the .glusterfs directory on each of your bricks, mapping out the structure of your volume in what is probably one of the most convoluted manners imaginable.

GFIDs are gluster's (rough) equivalent to an inode in a traditional filesystem - all replicated copies of a file should have the same GFID.

GFID mismatch occurs when different replica copies end up with a different GFID. This might be because a file was uploaded twice (once to each node) during a split-brain, or caused by some other gluster oddity.

This mismatch then confuses Gluster's replication module, causing it to return an I/O error.
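If you want to confirm a suspected mismatch for yourself, rather than relying on the logs, you can compare the trusted.gfid extended attribute on each brick's copy of the file (run this as root, against the brick path rather than the FUSE mount). Using the GFIDs from the example logline, and assuming the brick path used in the examples later in this doc, you'd expect to see something like this on one node:

getfattr -n trusted.gfid -e hex /data1/gluster/files/rss/ban-gai.rss

# file: data1/gluster/files/rss/ban-gai.rss
trusted.gfid=0x1c7a16fe3c6c40ee8bb4cb4197b5035d

With the other node reporting 0xfbf516fea67e4fd3b17dfe4cfe6637c3 instead - in a healthy volume the two values should match.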


Resolving the Issue

In order to resolve this, you'll need to remove files from one of the bricks and then trigger a heal so that Gluster re-replicates the files back over.

As a matter of best practice, you should try to preserve the copy with the most recent mtime as it's likely to be the most up-to-date copy. If there are a lot of files with little risk of them having changed, though, you might opt to just pick a Gluster node to sacrifice files on.
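A quick way to check which copy is newest is to stat the file on each brick (using the path under the brick root rather than the FUSE mount) and compare the modification times - the path here is the one from this doc's example:

stat -c '%y %n' /data1/gluster/files/rss/ban-gai.rss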

However, it seems that removing the GFID link for specific files isn't sufficient, and will just cause a different failure mode:

[2019-12-12 17:39:16.918389] W [fuse-bridge.c:582:fuse_entry_cbk] 0-glusterfs-fuse: 64264823: LOOKUP() /files/rss/ban-gai.rss => -1 (Input/output error)

What you need to do is find and remove the parent directory.

In the example above, we already know what the path will be, but if you're working from logs alone you may not. The logline provides the parent's GFID - <gfid:31bcb959-efb4-46bf-b858-7f964f0c699d>, so we need to resolve that GFID back to a path.

curl https://projects.bentasker.co.uk/static/resolve-gfid.sh -o ./resolve-gfid.sh
chmod +x resolve-gfid.sh
./resolve-gfid.sh /data1/gluster 31bcb959-efb4-46bf-b858-7f964f0c699d

The usage of resolve-gfid.sh is resolve-gfid.sh [path to brick] [gfid]

This should give some output like

31bcb959-efb4-46bf-b858-7f964f0c699d	==	Directory:	/data1/gluster/files/rss

This tells you where on the brick the directory is. Being a directory, it'll exist on all your bricks.
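As an aside, if you'd rather not pull a script down from the internet, the same resolution can be done by hand: on a brick, a directory's entry under .glusterfs is a symlink (so it can simply be followed), whereas a file's entry is a hardlink (so you have to go hunting for its sibling). The below is only a sketch of that approach, using the brick path from the example, and isn't the contents of resolve-gfid.sh:

GFID=31bcb959-efb4-46bf-b858-7f964f0c699d
BRICK=/data1/gluster
GFID_PATH="$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID"

if [ -h "$GFID_PATH" ]
then
    # Directories: following the symlink chain resolves to the directory's path on the brick
    readlink -f "$GFID_PATH"
else
    # Files: find the other hardlink to the same inode, ignoring the .glusterfs tree itself
    find "$BRICK" -samefile "$GFID_PATH" -not -path "*/.glusterfs/*"
fi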

What we need to do now is look at the GFID links themselves to see which is the most recent, so on each of your Gluster nodes, and for each of your bricks, do:

GFID=31bcb959-efb4-46bf-b858-7f964f0c699d
ls -l /path/to/brick/.glusterfs/`echo $GFID | cut -c1-2`/`echo $GFID | cut -c3-4`/$GFID

This should give you the modification time for each.

Now, it's time to move/remove that directory from the bricks on all but the gluster node with the most recent copy. My preference is always to move, verify and then remove later as it provides a route back if you've made a mistake.

It's the directory, rather than the GFID path, that we need to shift out of the way, so, for each of your bricks on each of the nodes you're sacrificing:

mv /data1/gluster/files/rss /data1/gluster/files/rss.old

Now, if you try to ls the files via the Gluster FUSE mount, you should get a result back:

-rw-r--r-- 1 userftp userftp 9405 Dec 9 15:17 /mnt/glusterfs/files/rss/ban-gai.rss

If you run gluster volume heal $volname info you should now see all the affected files listed, so it's time to kick off a heal:

gluster volume heal $volname

Gluster should now replicate the directory and its files back over, and the GFID mismatch should be resolved. Once happy, you can safely remove the directory from wherever you moved it to.
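If you want to check that the heal has actually completed before tidying up, you can keep re-running the info command until the entry counts drop back to 0. The output below is illustrative, with placeholder node names:

gluster volume heal shared info

Brick gluster-node1:/data1/gluster
Status: Connected
Number of entries: 0

Brick gluster-node2:/data1/gluster
Status: Connected
Number of entries: 0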


Automated Process

If you've a lot of files to deal with and/or you're certain the files themselves won't differ, then you could instead just pick a Gluster node to act as a "master" and sacrifice the copies on the other nodes.

I've published a snippet which can be run on each of the sacrificial nodes in order to automatically identify affected GFIDs and move them.

It essentially goes through the process detailed above, except that it skips checking mtimes. Once it's been run on all but one of the Gluster nodes, you'll need to trigger a heal

gluster volume heal $volname
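I won't reproduce the published snippet here, but the rough shape of it is below. Treat this as a sketch rather than the real thing - it assumes brick logs under /var/log/glusterfs/bricks and a brick at /data1/gluster, and you should check what it plans to move before letting it loose:

BRICK=/data1/gluster

# Extract the parent GFIDs from "Gfid mismatch detected" lines in the brick logs
grep -h "Gfid mismatch detected" /var/log/glusterfs/bricks/*.log \
    | grep -o "<gfid:[0-9a-f-]*>" \
    | sed -e 's/^<gfid://' -e 's/>$//' \
    | sort -u \
    | while read -r GFID
do
    GFID_PATH="$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID"

    # We're only expecting directory GFIDs (symlinks) here
    [ -h "$GFID_PATH" ] || continue

    # Resolve the GFID back to the directory's path on the brick
    DIR=$(readlink -f "$GFID_PATH")

    # Move the directory aside so that a heal can re-replicate it from the remaining node
    echo "Moving $DIR to $DIR.old"
    mv "$DIR" "$DIR.old"
done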