Data Subsystem
The data subsystem is optimized for the storage of large and small files, which can be accessed in a sequential or random fashion.
Features
- Large File Storage
For large files, the contents are stored as a sequence of one or multiple extents, which can be distributed across different data partitions on different data nodes. Writing a new file to the extent store always causes the data to be written at the zero-offset of a new extent, which eliminates the need for the offset within the extent. The last extent of a file does not need to fill up its size limit by padding (i.e., the extent does not have holes), and never stores the data from other files.
- Small File Storage
The contents of multiple small files are aggregated and stored in a single extent, and the physical offset of each file content in the extent is recorded in the corresponding meta node. ChubaoFS relies on the punch hole interface, textit{fallocate()}footnote{url{http://man7.org/linux/man-pages/man2/fallocate.2.html}}, to textit{asynchronous} free the disk space occupied by the to-be-deleted file. The advantage of this design is to eliminate the need of implementing a garbage collection mechanism and therefore avoid to employ a mapping from logical offset to physical offset in an extent~cite{haystack}. Note that this is different from deleting large files, where the extents of the file can be removed directly from the disk.
- Replication
The replication is performed in terms of partitions during file writes. Depending on the file write pattern, ChubaoFS adopts different replication strategies.
When a file is sequentially written into ChubaoFS, a primary-backup replication protocol is used to ensure the strong consistency with optimized IO throughput.
When overwriting an existing file portion during random writes, we employ a MultiRaft-based replication protocol, which is similar to the one used in the metadata subsystem, to ensure the strong consistency.
- Failure Recovery
Because of the existence of two different replication protocols, when a failure on a replica is discovered, we first start the recovery process in the primary-backup-based replication by checking the length of each extent and making all extents aligned. Once this processed is finished, we then start the recovery process in our MultiRaft-based replication.
HTTP APIs
API | Method | Parameters | Description |
---|---|---|---|
/disks | GET | N/A | Get disk list and informations. |
/partitions | GET | N/A | Get parttion list and infomartions. |
/partition | GET | partitionId[int] | Get detail of specified partition. |
/extent | GET | partitionId[int]&extentId[int] | Get extent informations. |
/stats | GET | N/A | Get status of the datanode. |