FiberFS Technical Overview

What is FiberFS?

FiberFS is a POSIX compatible networked filesystem that uses S3 compatible storage as a backend. It has a custom distributed filesystem protocol built for S3 and it supports a virtually unlimited number of concurrent readers and writers across any number of hosts.

How do you use FiberFS?

All you need is the FiberFS binary (fiberfs) and a config (fiberfs.conf). The FiberFS config just needs to define your S3 endpoint. Starting FiberFS will then automatically mount your S3 endpoint using FUSE.

FiberFS can also be used directly as an API in your application, allowing you to bypass both the kernel and FUSE while still retaining full filesystem functionality. As of right now, there is no stable API for this, but we plan on formally releasing and supporting it in an upcoming version.

How is FiberFS designed internally?

FiberFS was designed from the ground up around S3, HTTP, and caching. These principles permeate all aspects of FiberFS’s design and architecture. FiberFS has an in-memory multi-state, multi-version core. All FiberFS states are versioned with no inter-dependencies. A state cannot encompass more than a single directory or file, and many different versions of the same state can co-exist and be operated on concurrently. Collectively, these versioned states are combined into a consistent and cohesive view of the mounted filesystem. FiberFS was designed this way because both clients and S3 can be slow and potentially error prone. Allowing operations to happen across many different versioned states on many different hosts, without any overarching IO locks or dependencies, lets every operation run smoothly, consistently, and potentially latency free.

FiberFS has 3 levels of distributed synchronization: file, host, and S3. Each level allows clients to operate concurrently, consistently, and most importantly, correctly.

Inodes within FiberFS map to files and are purely synthetic. They are primarily used to track and invalidate system cache (page cache and directory cache). Inodes are unique per mount and cannot be correlated across mounts or systems. When an external update comes in, those changes are isolated and made visible as a new inode. This means readers and their page cache will stay consistent and intact on whichever inode they are reading from regardless of external updates. New readers will pick up the new inodes and consume fresh content and attributes in a consistent manner based on global flush order.

The only exception to this rule is when an inode has a local writer. In this scenario, FiberFS will match writeback page cache behavior and make updates available on the existing inode. So for example, if an inode has both local readers and local writers concurrently, those writes will be seen by the local readers immediately based on local write order. Remote readers will continue to see updates isolated to newer inodes based on global flush order as described in the paragraph above. If remote writes happen concurrently with local writes, they will trigger a page cache invalidation and merge back into the local writer's inode based on global flush order, where they can be read by local readers on the same inode. This process is designed to allow all writers to correctly merge their writes with global writes in global flush order. In the future, FiberFS will introduce a special “isolation mode” configuration flag which, when used with O_WRONLY, will isolate local readers from local writers. The merge process then becomes private to the local writers, and readers return to a consistent view on their own inode. This would be similar to “open to close consistency” semantics without having to invalidate any page cache. Be aware that this process only happens when you have concurrent readers and writers on the same file and the same inode, which is a scenario that most applications explicitly avoid.

How does FiberFS store itself on S3?

While FiberFS was built from the ground up around S3, its core logic has no knowledge of S3. FiberFS simply flushes itself after an operation, and that state gets translated and written to S3 in a consistent and distributed-correct manner. FiberFS can also operate in reverse, where a state is read from S3 and loaded into the local system, again consistently, correctly, and internally versioned. This process is isolated to a single directory, so operations happening in different directories have no relation to or dependencies on each other. This means that if 100 clients are operating on 100 different directories concurrently, including sub directories, they will all operate independently of each other with zero impact between them.

FiberFS stores 4 types of files on S3: chunks, indexes, roots, and full files.

Chunks are file content, but they are the result of what’s given to FiberFS at the end of a write, close, or flush operation. FiberFS will split these buffers up into logical chunk sizes and operate on them concurrently. For example, if you write and flush 256KB, FiberFS will split that write into four 64KB chunks and upload them concurrently. The same happens on download: the chunks needed to fulfill a read are downloaded concurrently. Chunks can be variable length and can overlap. A file is simply a composition of chunks.
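
The chunking described above can be sketched roughly as follows. This is an illustration only, not FiberFS code: the function names, the put callback, and the rule that gaps read back as zero bytes are all assumptions. The 64KB chunk size comes from the example above, and later chunks overwrite earlier ones to mimic "last flush wins".

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 64 * 1024  # chunk size taken from the 256KB example above


def split_into_chunks(offset, data, chunk_size=CHUNK_SIZE):
    """Split one flushed buffer into (file_offset, bytes) chunks."""
    return [
        (offset + i, data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


def upload_concurrently(chunks, put, max_workers=4):
    """Hand every chunk to put(offset, data) from a pool of worker threads.

    put is a hypothetical uploader callback, not a FiberFS or S3 API.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() waits for all uploads and re-raises the first failure.
        list(pool.map(lambda chunk: put(chunk[0], chunk[1]), chunks))


def compose(chunks):
    """Rebuild file content from chunks given in flush order.

    Chunks may overlap; a later chunk overwrites an earlier one.
    Unwritten gaps are left as zero bytes (an assumption).
    """
    size = max((off + len(data) for off, data in chunks), default=0)
    out = bytearray(size)
    for off, data in chunks:
        out[off:off + len(data)] = data
    return bytes(out)
```

Splitting a 256KB buffer this way yields four chunks at offsets 0, 65536, 131072, and 196608, and composing them in order gives back the original buffer.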

Full files (in a FiberFS S3 context) are normal whole files in S3. A full file can pre-exist in S3 or can be uploaded and managed outside of FiberFS. These files can be instantly imported or refreshed into FiberFS with a single command. On the writing side, FiberFS can attempt to write a full file on its first flush (or on close if a flush never occurred) if the writes are linear starting from offset zero. Note that full files cannot be parallelized when writing but can be parallelized when reading. Also note that if later writes or flushes happen, those will appear as distinct chunks. All FiberFS files are only editable via chunk composition, always leaving any full files intact and untouched. A FiberFS file can have at most one S3 full file and any number of chunks composing it. FiberFS S3 full file importing and writing support is planned for a future release; currently FiberFS only supports reading and writing via chunks.

Indexes are complete directory listings stored as compressed JSON. Indexes contain everything FiberFS needs to know about a directory, its files, attributes, and content (chunks and files). When an index is loaded into FiberFS, a streaming decompression and parsing algorithm is used to build the directory into FiberFS memory. On a circa 2019 laptop, generating a JSON index of 10000 files, 250000 chunks, and 220GB of content takes about 240ms and produces a compressed index that is approx 1.1MB in size. Reading, decompressing, parsing, and composing the same index from local cache takes approx 260ms. A small directory of 16 files and 4 sub directories is less than 1KB of compressed index data. Once parsed and loaded, FiberFS can operate on a directory completely from memory at near zero latency. Indexes are tagged with unique ids, so any number of indexes can live side by side to each other in S3 and in cache.
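
To make the idea of a compressed JSON index concrete, here is a minimal round-trip sketch. The field names (files, size, chunks, id, offset, length) and the use of zlib are assumptions chosen for illustration; the actual FiberFS index schema and compression format are not described here. Decompression is done in pieces, but the JSON parse itself is not streaming, since Python's standard json module parses a complete document.

```python
import json
import zlib


def encode_index(listing):
    """Serialize a directory listing to compact JSON and compress it."""
    text = json.dumps(listing, separators=(",", ":"), sort_keys=True)
    return zlib.compress(text.encode("utf-8"))


def decode_index(blob, piece_size=4096):
    """Decompress an index piece by piece, then parse the JSON."""
    decompressor = zlib.decompressobj()
    parts = []
    for i in range(0, len(blob), piece_size):
        parts.append(decompressor.decompress(blob[i:i + piece_size]))
    parts.append(decompressor.flush())
    return json.loads(b"".join(parts))


# Hypothetical listing for a directory holding one 6-byte file.
example_listing = {
    "files": {
        "file.txt": {
            "size": 6,
            "chunks": [{"id": "1001", "offset": 0, "length": 6}],
        }
    }
}
```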

Roots point to indexes and are used for global FiberFS synchronization. Roots are tiny uncompressed JSON files, less than 100 bytes.

Chunks, indexes, and roots are all prefixed with the logical path of the file or directory they are associated with. So for example, if we have a file called “file.txt” in a directory “dir1/dir2”, its chunked layout in S3 will look like:

dir1/dir2/.fiberfsroot
dir1/dir2/.fiberfsindex.[fiberfs_id]
dir1/dir2/file.txt.fiberfschunk.[fiberfs_id].[offset]

Or with full files:

dir1/dir2/.fiberfsroot
dir1/dir2/.fiberfsindex.[fiberfs_id]
dir1/dir2/file.txt

Or both:

dir1/dir2/.fiberfsroot
dir1/dir2/.fiberfsindex.[fiberfs_id]
dir1/dir2/file.txt
dir1/dir2/file.txt.fiberfschunk.[fiberfs_id].[offset]
dir1/dir2/file.txt.fiberfschunk.[fiberfs_id].[offset]
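
The layouts above follow a simple naming pattern, which can be sketched as a few helper functions. The helper names are hypothetical, and the way the id and offset are rendered (plain decimal here) is an assumption; only the overall key shapes come from the listings above.

```python
def root_key(directory):
    """S3 key of the root object for a directory."""
    return f"{directory}/.fiberfsroot"


def index_key(directory, fiberfs_id):
    """S3 key of one index version for a directory."""
    return f"{directory}/.fiberfsindex.{fiberfs_id}"


def chunk_key(directory, filename, fiberfs_id, offset):
    """S3 key of one chunk of a file."""
    return f"{directory}/{filename}.fiberfschunk.{fiberfs_id}.{offset}"


def full_file_key(directory, filename):
    """S3 key of a full file, which is simply its logical path."""
    return f"{directory}/{filename}"


def is_reserved_name(filename):
    """True for user file names that would collide with FiberFS objects."""
    return ".fiberfs" in filename
```

For example, chunk_key("dir1/dir2", "file.txt", 1001, 65536) returns "dir1/dir2/file.txt.fiberfschunk.1001.65536". How the mount root (an empty directory path) is encoded is left out of this sketch.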

"fiberfs_id" is a random numerical id which is creation time sortable.

To recompose a file that has chunked composition as an S3 full file, simply read the file from FiberFS into an external S3 uploading utility (hopefully one that supports upload parallelization) and upload the file to its logical path. Once completed, refresh the file into FiberFS and its chunks will be deleted from S3 and the file will be restored to a full file state.

FiberFS only uses plain HEAD, GET, PUT, and DELETE S3 operations on regular files. For synchronization purposes, FiberFS requires “If-Match: [unique_id]” and “If-None-Match: *” support. FiberFS will return an EINVAL error if any file is generated with ".fiberfs" in its filename.
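
Conditional PUTs like these are commonly used for optimistic concurrency: create an object only if it does not exist yet, or replace it only if it is still the version you last read. The sketch below shows that pattern against an in-memory stand-in for a bucket. FakeS3, publish_root, and the root body layout are all hypothetical, and this is not a description of the actual FiberFS synchronization protocol.

```python
import hashlib
import json


class PreconditionFailed(Exception):
    """Stands in for an HTTP 412 Precondition Failed response."""


class FakeS3:
    """In-memory stand-in for a bucket; not a real S3 client."""

    def __init__(self):
        self._objects = {}  # key -> (etag, body)

    def get(self, key):
        """Return (etag, body) or None if the key does not exist."""
        return self._objects.get(key)

    def put(self, key, body, if_match=None, if_none_match=None):
        current = self._objects.get(key)
        # If-None-Match: * means "only create, never overwrite".
        if if_none_match == "*" and current is not None:
            raise PreconditionFailed(key)
        # If-Match: <id> means "only overwrite the version I read".
        if if_match is not None and (current is None or current[0] != if_match):
            raise PreconditionFailed(key)
        etag = hashlib.sha256(body).hexdigest()
        self._objects[key] = (etag, body)
        return etag


def publish_root(s3, key, index_id, max_attempts=5):
    """Point a root at a new index using compare-and-swap semantics."""
    body = json.dumps({"index": index_id}).encode("utf-8")
    for _ in range(max_attempts):
        current = s3.get(key)
        try:
            if current is None:
                return s3.put(key, body, if_none_match="*")
            return s3.put(key, body, if_match=current[0])
        except PreconditionFailed:
            # Another writer got there first. A real client would reload
            # and merge the newer state here before trying again.
            continue
    raise RuntimeError("root update kept losing the race")
```

A writer holding a stale id gets PreconditionFailed instead of silently overwriting a newer root, which is what makes the update safe across hosts.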

FiberFS Clustered Cache

FiberFS has a built-in distributed cache. The cache is LRU based and stores chunks, full files, indexes, and roots. Data is stored into cache on both read side and write side operations. Because writes are cached, reads after writes can come directly from cache.
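
As a reminder of how LRU eviction behaves, here is a small single-host sketch of a byte-budgeted LRU cache. The class and its interface are assumptions for illustration; the FiberFS cache itself is distributed and its internals are not shown here.

```python
from collections import OrderedDict


class LRUCache:
    """Byte-budgeted LRU cache mapping keys to bytes values."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self._items = OrderedDict()  # oldest entry first

    def get(self, key):
        """Return the cached value and mark it most recently used."""
        if key not in self._items:
            return None
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        """Store a value, evicting least recently used entries if needed."""
        if key in self._items:
            self.used -= len(self._items.pop(key))
        self._items[key] = value
        self.used += len(value)
        while self.used > self.capacity and self._items:
            _, evicted = self._items.popitem(last=False)
            self.used -= len(evicted)
```

Because both reads and writes go through put, a chunk that was just written can be served by get without another trip to S3, as long as it has not been evicted.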

When clustered, caches are shared across all hosts in the cluster. Chunks and full files are stored at most once per cluster. Indexes and roots are stored 1+1 in the cluster, meaning that on every read or write they are stored in 2 host cache locations. The eventual goal is to store indexes and roots in every host cache in the cluster.

Clustered caching increases overall performance of all clients as the cluster grows. If a single host has an 8GB cache, a cluster of 12 hosts increases the cache size to 96GB, meaning a lot more data will be available locally within the cluster. It also means that instead of having different clients requesting the same file or index from S3 over and over, the cluster requests it only once. There is also potentially more aggregate bandwidth available to the cluster since transfers are happening client to client (many to many) instead of clients to S3 (many to one).

FiberFS and CDNs

FiberFS supports utilizing a CDN for extra caching. This can even be combined with clustered caching. Any request sent to the CDN can be cached for as long as needed using just the URL as the cache key. Cached URLs never need to be invalidated. The CDN configuration is not limited to public internet facing CDNs; local caching proxies can be used for this purpose as well. If multiple CDN endpoints are configured, FiberFS will use a hashed distribution across them (i.e., build your own CDN cluster).
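
One simple way to spread requests across several endpoints is to hash the URL path and take the result modulo the number of endpoints, so the same URL always lands on the same cache. The sketch below shows that idea only; the function name, the choice of SHA-256, and the example hostnames are assumptions, and the hashing scheme FiberFS actually uses is not specified here.

```python
import hashlib


def pick_endpoint(url_path, endpoints):
    """Map a URL path to one endpoint, always the same one for that path."""
    digest = hashlib.sha256(url_path.encode("utf-8")).digest()
    slot = int.from_bytes(digest[:8], "big") % len(endpoints)
    return endpoints[slot]


# Hypothetical endpoint list for illustration.
example_endpoints = [
    "http://cache-a.example.internal",
    "http://cache-b.example.internal",
    "http://cache-c.example.internal",
]
```

Note that plain modulo hashing remaps most URLs when the endpoint list changes; a consistent hashing scheme would reduce that churn at the cost of more code.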

FiberFS Async

FiberFS has a custom async engine which is designed to allow for operations to burst into higher parallel performance. Parallel uploads and downloads allow for operations to complete faster and sustain higher aggregate transfer rates. When the engine reaches maximum capacity, FiberFS will degrade into fair share scheduling, allowing all operations to continue equally at lower or no parallelism.
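
The general idea of capping parallelism can be illustrated with a semaphore: transfers run in parallel up to a limit, and the rest wait their turn. This sketch covers only that cap, not the fair share scheduling described above, and it uses Python's asyncio rather than the FiberFS engine. The run_bounded name and its interface are assumptions.

```python
import asyncio


async def run_bounded(jobs, limit):
    """Run zero-argument coroutine factories with at most limit in flight.

    Returns (results in job order, peak number of concurrent jobs).
    """
    semaphore = asyncio.Semaphore(limit)
    active = 0
    peak = 0

    async def run_one(job):
        nonlocal active, peak
        async with semaphore:
            active += 1
            peak = max(peak, active)
            try:
                return await job()
            finally:
                active -= 1

    results = await asyncio.gather(*(run_one(job) for job in jobs))
    return results, peak


async def fake_transfer(number):
    """Placeholder for an upload or download of one chunk."""
    await asyncio.sleep(0.01)
    return number
```

Calling asyncio.run(run_bounded([lambda i=i: fake_transfer(i) for i in range(10)], 3)) completes all ten placeholder transfers while never running more than three at once.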

FiberFS and POSIX support

stat() -
mkdir() - This operation is globally atomic.
opendir() -
readdir() -
closedir() -
rmdir() -
open() -

O_RDONLY -
O_WRONLY - Future versions will introduce an optional local writer inode isolation flag.
O_RDWR -
O_CREAT -
O_SYNC - This will flush to S3 after every operation.
O_APPEND - This operation is globally atomic across all append writers. Note that if writeback page caching is enabled and you read from the same local inode doing O_APPEND, you might read duplicate and/or incomplete data. To work around this, do not read and append concurrently on the same mount, or disable writeback page caching (not recommended). Future local writer isolation will allow readers and appenders to operate concurrently with writeback page caching enabled.
O_EXCL - This operation is globally atomic on open().
O_TRUNC - Truncate happens on first flush or close(). When combined with O_SYNC truncate will happen on open().

read() -
write() - If writes overlap, last write wins locally, last flush wins globally.
lseek() -
fsync() - This triggers a FiberFS flush. Note that fsync() can fail due to network or S3 endpoint errors, in which case an error code will be returned. State will be preserved locally and future fsync() or close() calls can potentially recover all previous errors.
close() - Anything unflushed will be flushed at this point. Note that a failed flush to S3 cannot be retried after this call returns. An error code will be returned if there is a flush failure. A future feature will dump any unflushed buffers (due to error) into a configured recovery directory.
truncate() -
unlink() -
symlink() -
link() - Linking is only supported within the same directory.
rename() - Rename is only atomic within the same directory.
chmod() -
chown() -
utime() - Last access time is not writable. It's set to the local time the file was generated from S3.
xattr() -
flock() - Lost locks can be revoked via command line.
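
Because fsync() can fail and be retried while close() cannot, an application that cares about durability may want to call fsync() explicitly, with retries, before closing. Below is a minimal sketch of that pattern using standard Python os calls. The write_durably name, attempt count, and delay are assumptions. Retrying a failed fsync() relies on the state-preserving behavior described above and is not generally safe on other filesystems.

```python
import os
import time


def write_durably(path, data, attempts=3, delay=0.1):
    """Write data to path and fsync before close, retrying failed fsyncs."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)  # may write fewer bytes than asked
            view = view[written:]
        last_error = None
        for _ in range(attempts):
            try:
                os.fsync(fd)  # on FiberFS this triggers a flush to S3
                return
            except OSError as exc:
                last_error = exc
                time.sleep(delay)
        raise last_error
    finally:
        # By this point either fsync succeeded or the error is being raised.
        os.close(fd)
```

If every fsync() attempt fails, the last error is raised to the caller, who still has the original data in hand and can decide what to do with it.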

FiberFS and Security

FiberFS supports both POSIX style ACL file/directory permissions and S3 read/write permissions. Users can only write if they have both valid POSIX ACLs and a valid S3 write key. Users can only read if they have valid POSIX ACLs and a valid S3 read key. FiberFS's cache is secured using the single S3 key configured for the host. Support for multiple, more granular S3 read/write keys for user assignment and cache access is planned. When using a CDN, FiberFS does not provide any additional security mechanisms apart from standard S3 version 4 request signing, and it's up to the CDN to provide proper authorization, if desired.

FiberFS Logging

FiberFS uses a memory based circular logging system. This allows FiberFS to essentially log anything and everything it does in fine detail. Reading this log can be done outside of FiberFS using “fiberfs_log”. Logging is always on, which means at any time detailed logs can be found to assist with any situation.

FiberFS Meta Files

FiberFS directories will contain several .fiberfs special control files. These files will provide information and internal statistics regarding FiberFS and can even trigger behavior like an index refresh. A directory called .fiberfs_history will be the entrypoint for viewing previous revisions of the directory and allows for reading and recovering previous file states. For this reason, any user generated file containing “.fiberfs” in its name will be blocked with an EINVAL error. Note that revision history and recovery is planned for a future release.

Who is FiberFS for?

FiberFS has several use cases that are all centered around the ability to store data in S3, share that data, and read it back at low latency from cache.

Easily move files and data in and out of S3. Most S3 services have vast amounts of durable storage, so your data is safe but also easily accessible through FiberFS’s POSIX filesystem presentation.
Share your files and data across many different servers and workstations. You retain the ability to read and write data from anywhere and at any time.
High performance reading. Most of the time data is stored and read over and over again across many different servers and workstations. Using FiberFS’s clustered cache and CDN support, these workloads will never hit remote storage and can be read at high speed and low latency directly from cache.

Isn't S3 slow?

Yes and no. Large cloud based S3 services can have higher than expected latencies due to a lot of factors including geographic location, network congestion, and overall load. It’s for these reasons that FiberFS has put such a strong emphasis on caching and write efficiency. However, it is not uncommon to see high performance, on-premises S3 services with millisecond and sub-millisecond response latencies.

FiberFS and AI

Up to this point, no AI was used to write any FiberFS code. This was not a rule or anything intentional; AI was simply not used for code generation. Because FiberFS is a relatively small yet complex codebase, humans will likely continue to have a leading role in the project. AI can potentially have a role in the domains of security, code analysis, and overall fitness and safety. However, FiberFS's success lies in humans having a complete understanding of how everything works.

OS Support

Right now FiberFS can only run on Linux, but we are planning on supporting OSX and Windows in the future.

FiberFS Testing

FiberFS is very test centric. A significant portion of time and effort goes into testing FiberFS, especially when compared to actual feature development. All features of FiberFS have rigorous automated tests applied to them. This includes functional tests, correctness tests, safety tests, fuzzing tests, and concurrent access tests. Tests cover both FiberFS and its FUSE bindings and testing coverage currently stands at approx 95% of the codebase.
