Request for comments: ZenFS, a RocksDB FileSystem for Zoned Block Devices by yhr · Pull Request #6961 · facebook/rocksdb · GitHub


This is a request for comments for a new file system for zoned block devices.

This pull request is based on top of another open pull request, #6878, which enables db_bench and db_stress to be used with custom file systems.

With this pull request I am mainly asking for comments on the high-level architecture and feedback on the following:

What applicable testing is available? So far I have mainly run smoke tests using db_bench and db_stress, and I am looking for ways to do e.g. recovery/crash/power-fail testing.

Would a completely self-contained file system be preferable? Currently ZenFS stores logs and lock files on the default file system. The reason for this is to avoid duplicating already-working code and to allow easy access to the log files.

What kind of workloads are most interesting to optimize for?

Looking forward to feedback as I finish up my laundry list of todos and optimizations.

Thanks!

Overview

ZenFS is a simple file system that utilizes RocksDB's FileSystem interface to place files into zones on a raw zoned block device. By separating files into zones and utilizing write lifetime hints to co-locate data of similar lifetimes, the system write amplification is greatly reduced (compared to conventional block devices) while keeping the ZenFS capacity overhead at a very reasonable level.
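The co-location idea can be sketched as a simple allocation policy: prefer a zone that already holds data with the same lifetime hint, so all data in a zone tends to become invalid around the same time. This is a minimal, hypothetical illustration of the policy, not ZenFS's actual allocator; `Zone` and `pick_zone` are invented names.

```python
# Hypothetical sketch of lifetime-based zone allocation (not ZenFS code).
from dataclasses import dataclass
from typing import Optional

@dataclass
class Zone:
    capacity: int
    lifetime: Optional[int] = None  # lifetime hint of the data in this zone
    used: int = 0

    def fits(self, size: int) -> bool:
        return self.capacity - self.used >= size

def pick_zone(zones: list, hint: int, size: int) -> Zone:
    """Prefer a zone already holding data with the same lifetime hint, so
    data of similar lifetimes is invalidated (and reclaimed) together."""
    for z in zones:
        if z.lifetime == hint and z.fits(size):
            return z
    for z in zones:  # otherwise claim an empty zone for this lifetime class
        if z.lifetime is None and z.fits(size):
            z.lifetime = hint
            return z
    raise RuntimeError("no zone with sufficient free capacity")
```

With this policy, a whole zone tends to free up at once when short-lived files are deleted, which is what keeps the write amplification low.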

ZenFS is designed to work with host-managed zoned spinning disks as well as NVMe SSDs with Zoned Namespaces.

Some of the ideas and concepts in ZenFS are based on earlier work done by Abutalib Aghayev and Marc Acosta.

Dependencies

ZenFS depends on libzbd and Linux kernel 5.4 or later to perform zone management operations.

Architecture overview

[Figure: ZenFS stack]

ZenFS implements the FileSystem API and stores all data files onto a raw zoned block device. Log and lock files are stored on the default file system under a configurable directory. Zone management is done through libzbd, and ZenFS I/O is done through normal pread/pwrite calls.
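The data path can be sketched as plain positional I/O at zone-aligned offsets, with a per-zone write pointer enforcing the sequential-write constraint of zoned devices. This is an illustrative sketch only (using a regular file to stand in for the device, and an invented `ZonedDevice` class), not the ZenFS implementation.

```python
# Illustrative sketch: positional I/O via os.pwrite/os.pread, with a per-zone
# write pointer so each zone is only ever written sequentially (not ZenFS code).
import os
import tempfile

ZONE_SIZE = 4096  # toy zone size; real zones are much larger

class ZonedDevice:
    def __init__(self, path, nr_zones):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        os.ftruncate(self.fd, nr_zones * ZONE_SIZE)
        # Each zone starts with its write pointer at the zone start.
        self.wp = [z * ZONE_SIZE for z in range(nr_zones)]

    def append(self, zone, data):
        """Write at the zone's write pointer; returns the extent offset."""
        start = zone * ZONE_SIZE
        assert self.wp[zone] + len(data) <= start + ZONE_SIZE, "zone full"
        off = self.wp[zone]
        os.pwrite(self.fd, data, off)  # sequential within the zone
        self.wp[zone] += len(data)
        return off

    def read(self, offset, length):
        return os.pread(self.fd, length, offset)  # reads can be random
```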

Optimizing the IO path is on the TODO list.

Example usage

This example issues 100 million random inserts followed by as many overwrites on a 100G memory-backed zoned null block device. Target file sizes are set up to align with the zone size.

```
make db_bench zenfs

DEV=nullb1
ZONE_SZ_SECS=$(cat /sys/class/block/$DEV/queue/chunk_sectors)
FUZZ=5
ZONE_CAP=$((ZONE_SZ_SECS * 512))
BASE_FZ=$(($ZONE_CAP * (100 - $FUZZ) / 100))
WB_SIZE=$(($BASE_FZ * 2))
TARGET_FZ_BASE=$WB_SIZE
TARGET_FILE_SIZE_MULTIPLIER=2
MAX_BYTES_FOR_LEVEL_BASE=$((2 * $TARGET_FZ_BASE))

./zenfs mkfs --zbd=/dev/$DEV --aux_path=/tmp/zenfs_$DEV --finish_threshold=$FUZZ --force

./db_bench --fs_uri=zenfs://$DEV --key_size=16 --value_size=800 --target_file_size_base=$TARGET_FZ_BASE --write_buffer_size=$WB_SIZE --max_bytes_for_level_base=$MAX_BYTES_FOR_LEVEL_BASE --max_bytes_for_level_multiplier=4 --use_direct_io_for_flush_and_compaction --max_background_jobs=$(nproc) --num=100000000 --benchmarks=fillrandom,overwrite
```

The graph below shows the capacity usage over time. As ZenFS does not do any garbage collection, the write amplification is 1.

[Figure: ZenFS capacity usage over time]

File system implementation

Files are mapped into a set of extents:

- Extents are block-aligned, contiguous regions on the block device
- Extents do not span across zones
- A zone may contain more than one extent
- Extents from different files may share zones

Reclaim
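The extent rules above can be captured in a small validity check. This is a hypothetical illustration with invented names (`Extent`, `within_rules`) and toy sizes, not the actual ZenFS types.

```python
# Hypothetical illustration of the extent rules (not the actual ZenFS types).
from dataclasses import dataclass

BLOCK = 512
ZONE_SIZE = 8 * BLOCK  # toy zone size for illustration

@dataclass(frozen=True)
class Extent:
    zone: int    # the zone this extent lives in; extents never span zones
    offset: int  # block-aligned offset on the device
    length: int

def within_rules(e: Extent) -> bool:
    """An extent is block-aligned and lies entirely inside its zone."""
    zone_start = e.zone * ZONE_SIZE
    return (e.offset % BLOCK == 0
            and zone_start <= e.offset
            and e.offset + e.length <= zone_start + ZONE_SIZE)
```

Note that nothing prevents two extents of different files from sharing a zone; only the zone boundary is a hard limit.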

ZenFS is exceptionally lazy in the current state of the implementation and does not do any garbage collection whatsoever. As files get deleted, each zone's used-capacity counter drops, and when it reaches zero the zone can be reset and reused.
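The lazy reclaim bookkeeping can be sketched as a per-zone live-byte counter: writes increment it, deleting a file's extents decrements it, and a zone whose counter hits zero becomes resettable. This is an invented sketch (`Reclaimer` is not a ZenFS name) of the scheme described above.

```python
# Sketch of the lazy-reclaim bookkeeping (illustrative only, not ZenFS code).
from collections import defaultdict

class Reclaimer:
    def __init__(self):
        self.used = defaultdict(int)  # zone -> live bytes in that zone

    def on_write(self, zone, nbytes):
        self.used[zone] += nbytes

    def on_delete(self, extents):
        """extents: iterable of (zone, nbytes) freed by a file deletion.
        Returns the zones that are now empty and can be reset."""
        resettable = []
        for zone, nbytes in extents:
            self.used[zone] -= nbytes
            if self.used[zone] == 0:
                resettable.append(zone)  # zone can now be reset and reused
        return resettable
```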

Metadata

Metadata is stored in a rolling log in the first zones of the block device.

Each valid metadata zone contains:

- A superblock with the current sequence number and global file system metadata
- At least one snapshot of all files in the file system
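Recovery from a rolling log like this typically scans the metadata zones and restores from the record with the highest sequence number. The sketch below illustrates that idea under invented assumptions (a metadata zone modeled as a dict with `seq` and `snapshot` fields); the actual ZenFS on-disk format differs and is still experimental.

```python
# Hypothetical sketch of recovery from a rolling metadata log (not ZenFS code):
# pick the valid metadata zone with the highest sequence number and restore
# its snapshot.

def recover(meta_zones):
    """meta_zones: list of dicts like {"seq": int, "snapshot": {...}},
    or None for zones whose superblock is invalid or erased."""
    best = None
    for z in meta_zones:
        if z is not None and (best is None or z["seq"] > best["seq"]):
            best = z
    if best is None:
        raise RuntimeError("no valid metadata zone found")
    return best["snapshot"]
```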

The metadata format is currently experimental. More extensive testing is needed, and support for differential updates is planned before bumping the version up to 1.0.


