FUSE Loopback is Incomplete, Breaking NFS

Surprisingly, it’s not possible to make a loopback filesystem in FUSE that is complete. By “complete”, I mean one that correctly implements all filesystem operations in such a way that the loopback filesystem is fully transparent to applications which use it.

It’s mostly there, but there’s a niche of file I/O on Linux that you may not have heard of: file handles. As a result, exporting a FUSE filesystem over NFS is currently always brittle.

File Handles

No, I don’t mean file descriptors. File handles are created using name_to_handle_at(2) and opened using open_by_handle_at(2), and are represented by a struct file_handle (rather than an int, as file descriptors are).

These system calls were added in Linux 2.6.39, with some discussion on LWN.

… but why? Well, file handles offer something that isn’t available with file descriptors. With file descriptors you can open a file, or even open a path (to represent a stable location in the file hierarchy), but there is no way to reliably refer to a file without opening it. Consider, for example, wanting to refer to a file in a directory that is currently the mount point for another filesystem (thereby hiding the file). The only way to open that file is via a file handle. There are several cases like this, where a file exists but is either (1) not accessible via any path, or (2) no longer accessible at the path it had earlier (e.g. after a rename).

Additionally, these handles are serializable and storable. This means that you can generate a file handle, save it to disk, reboot the computer, load it from disk, and call open_by_handle_at and it is expected to succeed. This is important for NFS (which is the main user of file handles) in order to allow clients to resume talking to an NFS server after that server has been rebooted. Consider, for example, if your home directory is mounted over NFS and the NFS server is rebooted. You’ll want your vim session to continue working across that reboot.
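To make the two system calls concrete, here is a minimal sketch of obtaining a handle for a path and later reopening it. The helper names handle_for_path and open_from_handle are made up for illustration; the underlying calls are the ones from the man pages. Note that open_by_handle_at requires CAP_DAC_READ_SEARCH, and that some filesystems refuse to produce handles at all.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>

/* Hypothetical helper: ask the kernel for a handle to `path`.
 * Returns a malloc'd handle the caller must free, or NULL with errno set. */
struct file_handle *handle_for_path(const char *path, int *mount_id)
{
	assert(path != NULL && mount_id != NULL);

	struct file_handle *fh = calloc(1, sizeof(*fh) + MAX_HANDLE_SZ);
	if (fh == NULL)
		return NULL;
	fh->handle_bytes = MAX_HANDLE_SZ; /* space available in f_handle[] */

	if (name_to_handle_at(AT_FDCWD, path, fh, mount_id, 0) == -1) {
		int saved = errno;
		free(fh);
		errno = saved;
		return NULL;
	}
	return fh;
}

/* Hypothetical helper: reopen a file from a saved handle. `mount_fd` is any
 * open descriptor on the same filesystem the handle came from. */
int open_from_handle(int mount_fd, struct file_handle *fh)
{
	assert(fh != NULL);
	return open_by_handle_at(mount_fd, fh, O_RDONLY);
}
```

The bytes in fh (handle_bytes, handle_type and f_handle) are what you would write to disk and read back after a reboot.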

FUSE File Handles

Let’s look at how FUSE file handles are constructed. The code is here; slightly simplified for brevity, it looks like this:

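/* Encode a FUSE inode as a file handle made of three 32-bit words.
 * (parent is unused in this simplified version.) */
int fuse_encode_fh(struct inode *inode, u32 *fh, int *max_len, struct inode *parent) {
	u64 nodeid = get_fuse_inode(inode)->nodeid;
	u32 generation = inode->i_generation;

	/* Words 0 and 1: the 64-bit node ID, high half first. */
	fh[0] = (u32)(nodeid >> 32);
	fh[1] = (u32)(nodeid & 0xffffffff);
	/* Word 2: the generation number. */
	fh[2] = generation;

	*max_len = 3;	/* handle length, in 32-bit words */
	return FILEID_INO64_GEN;
}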

This means that the file handle returned is the file’s inode number (nodeid in the kernel code) combined with the generation number. Inode numbers are generated by the FUSE filesystem and sent to the kernel in response to calls like fuse_lookup. As an author of a FUSE filesystem, that means you can usually set them to pretty much whatever you want, but in the case of a loopback filesystem you’ll want to use exactly the same inode number as the underlying filesystem. Thankfully, generation numbers are set by the FUSE filesystem too, so you can also mirror those.

What are generation numbers? The libfuse comments give us a hint:

/** Generation number for this entry.
 *
 * If the file system will be exported over NFS, the
 * ino/generation pairs need to be unique over the file
 * system's lifetime (rather than just the mount time). So if
 * the file system reuses an inode after it has been deleted,
 * it must assign a new, previously unused generation number
 * to the inode at the same time.
 *
 */
uint64_t generation;

Consider a filesystem that reuses inode numbers (which is allowed!). For example, when you delete a file, the filesystem is allowed to use the same inode number next time another file is created. … but they’re not the same file. The generation number is therefore used to differentiate these two files. The rule is that an {inode number, generation} pair must be globally unique, forever, in the filesystem. Remember that this FUSE filesystem may be exported over NFS and we want that to keep working even after we get rebooted, meaning we can’t just keep an in-memory counter of generation numbers that resets on reboot.
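Here is a toy sketch of that rule, assuming a filesystem with a tiny fixed table of inode slots. Everything in it (NSLOTS, struct slot, alloc_inode, free_inode) is hypothetical, and a real filesystem would have to persist the per-slot generation on disk so that it survives a reboot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NSLOTS 4 /* tiny table, for illustration only */

struct slot {
	bool in_use;
	uint32_t generation; /* bumped every time the slot is (re)used */
};

/* Hypothetical: a real filesystem would keep this table on disk. */
static struct slot slots[NSLOTS];

struct node_id {
	uint64_t ino;
	uint32_t generation;
};

/* Hand out the lowest free inode number with a fresh generation.
 * Returns false if every slot is in use. */
bool alloc_inode(struct node_id *out)
{
	assert(out != NULL);
	for (size_t i = 0; i < NSLOTS; i++) {
		if (!slots[i].in_use) {
			slots[i].in_use = true;
			slots[i].generation++;
			out->ino = i + 1;
			out->generation = slots[i].generation;
			return true;
		}
	}
	return false;
}

/* Release an inode number. The generation is deliberately kept, so the
 * next file to get this number gets a different {ino, generation} pair. */
void free_inode(uint64_t ino)
{
	assert(ino >= 1 && ino <= NSLOTS);
	slots[ino - 1].in_use = false;
}
```

Allocating, freeing, and allocating again returns the same inode number with a different generation, so a stale handle to the deleted file can be told apart from a handle to the new one.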

Thankfully, there’s an ioctl that lets us get the generation number of files in our underlying filesystem, which we can then pass back to the kernel to show up in our FUSE filesystem. So we can mirror the underlying FS’s generation numbers in addition to the inode numbers by using ioctl(FS_IOC_GETVERSION).
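A minimal sketch of reading the generation number follows. The function name underlying_generation is made up, and the sketch assumes the underlying filesystem implements the ioctl at all; those that don’t will typically fail with ENOTTY.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical helper: read the generation number of `path` from the
 * underlying filesystem. Returns 0 and fills *generation on success,
 * or -1 with errno set on failure. */
int underlying_generation(const char *path, uint32_t *generation)
{
	assert(path != NULL && generation != NULL);

	int fd = open(path, O_RDONLY | O_NONBLOCK);
	if (fd == -1)
		return -1;

	/* The request is declared with a long argument, but filesystems
	 * store a 32-bit value, so reserve a long and read back an int. */
	union { long as_long; int as_int; } v = { 0 };
	int rc = ioctl(fd, FS_IOC_GETVERSION, &v);
	int saved = errno;
	close(fd);
	if (rc == -1) {
		errno = saved;
		return -1;
	}
	*generation = (uint32_t)v.as_int;
	return 0;
}
```

In a loopback filesystem, the value returned here is what you would copy into the generation field of the lookup reply.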

… but file handles are a problem. As discussed, the in-kernel FUSE code encodes the generation number in the file handle, but when an open_by_handle_at call happens, the kernel issues a lookup call to userspace that does not include the generation number! This makes it impossible to forward that same call to an underlying filesystem correctly: if you forward along only the inode number, and the inode number has since been reused with a new generation number, the lookup may return an entirely different file! This can have bad repercussions, potentially including data corruption (if, for example, the handle is then used for a write). The safest thing is to have name_to_handle_at calls fail with EOPNOTSUPP, disabling NFS support entirely. Alternatively, if your FUSE filesystem keeps the inode numbers it has handed out in an in-memory cache, return ESTALE from calls where the inode number isn’t found in the cache.
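The ESTALE approach might look something like the sketch below. It assumes the filesystem tracks every node it has told the kernel about in a small in-memory table; the table, CACHE_SLOTS, cache_insert and resolve_nodeid are all hypothetical names, not part of libfuse.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 8 /* illustrative size */

struct cache_entry {
	uint64_t nodeid; /* 0 marks an empty slot */
	int fd;          /* descriptor held open on the underlying file */
};

static struct cache_entry cache[CACHE_SLOTS];

/* Find the entry for `nodeid` (or the first empty slot when nodeid is 0). */
static struct cache_entry *find(uint64_t nodeid)
{
	for (size_t i = 0; i < CACHE_SLOTS; i++)
		if (cache[i].nodeid == nodeid)
			return &cache[i];
	return NULL;
}

/* Remember a node the kernel has been told about. Returns 0 or -ENOSPC. */
int cache_insert(uint64_t nodeid, int fd)
{
	assert(nodeid != 0 && fd >= 0);
	struct cache_entry *e = find(nodeid);
	if (e == NULL)
		e = find(0);
	if (e == NULL)
		return -ENOSPC;
	e->nodeid = nodeid;
	e->fd = fd;
	return 0;
}

/* Resolve a node ID arriving via a handle-driven lookup. Anything we have
 * not seen since startup is reported as stale instead of guessed at. */
int resolve_nodeid(uint64_t nodeid)
{
	struct cache_entry *e = (nodeid != 0) ? find(nodeid) : NULL;
	return (e != NULL) ? e->fd : -ESTALE;
}
```

Because the table starts out empty after a restart, every handle issued before the restart resolves to ESTALE, which is safe but is exactly why reboots break NFS clients.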

Libfuse even calls out this limitation in its example loopback (they call it “passthrough”) filesystem:

/* Disable NFS export support, which also disabled name_to_handle_at.
 * Goal is to make xfstests that test name_to_handle_at to fail with
 * the right error code (EOPNOTSUPP) than to open_by_handle_at to fail with
 * ESTALE and let those test fail.
 * Perfect NFS export support is not possible with this FUSE filesystem needs
 * more kernel work, in order to passthrough nfs handle encode/decode to
 * fuse-server/daemon.
 */
fuse_set_feature_flag(conn, FUSE_CAP_NO_EXPORT_SUPPORT);

Workarounds

Working around this limitation is fairly annoying. Reboots are the most obvious way to trigger this issue, but like I said this can happen any time the filesystem reuses inode numbers.

If you use a FUSE filesystem that doesn’t reuse inode numbers, you’re likely safe from the worst case, but still need to worry about reboots. I use MergerFS with devino-hash, and underneath is ZFS + BTRFS. ZFS reuses inodes, but I only rarely delete files, so it’s not too much of a concern for me.

The most important thing I have to remember to do is unmount my NFS filesystems before rebooting my NFS server. Otherwise, written data that’s not yet flushed over NFS will be lost (since rebooting invalidates all the handles).

That said, I think the only missing piece in the kernel is passing up the generation number. Maybe I’ll tackle writing that.

Machinae Elegantiam ← Russell Harmon
