Sometimes it can be useful to share a single block device (e.g., an HDD or SSD) between multiple nodes (e.g., Linux OSes) with a coherent file system, so that modifications to files on the shared block device by one node are visible to the other nodes. Such a file system is called a clustered file system.
The difference from a distributed file system is that all the data lies on a single block device, whereas in a distributed file system data is spread over several nodes. This is also not a network file system, where data resides on one of the nodes and transits over the network to be read by the other nodes.
Here, all nodes access the block device directly, so they can write data without sending it to the other nodes. They do, however, need to be informed when other nodes modify data on the block device, and some locking is necessary to avoid data corruption. That is where the file system comes in. Two examples of such file systems are GFS2 (Global File System 2) and OCFS2 (Oracle Cluster File System 2).
The advantages are:
- All nodes see a unified and coherent file system
- Data does not transit between nodes over the network for reading or writing
The requirements are:
- A block device that can be shared between nodes (e.g., iSCSI, NVMe-oF)
Network file systems, distributed file systems, and cluster file systems all offer the first advantage; the second, however, is usually not the case for network file systems and distributed file systems.
Demonstration with QEMU
In this post we will show how to set up a shared block device between two QEMU (emulated) instances. For this we will install Ubuntu server 24.04.1 LTS and set up OCFS2 on two instances (nodes) that will both access a single block device.
First, on the host machine let’s create a directory to work in and download the Ubuntu server 24.04.1 ISO file.
mkdir ocfs2
cd ocfs2
wget https://releases.ubuntu.com/24.04.1/ubuntu-24.04.1-live-server-amd64.iso
Let's also create two virtual disks to install Ubuntu server on, one for each node.
qemu-img create -f qcow2 disk1.qcow 10G
qemu-img create -f qcow2 disk2.qcow 10G
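(Optional) You can double-check that the images were created as expected with qemu-img info, which reports the format and virtual size:
qemu-img info disk1.qcow
qemu-img info disk2.qcow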
Install Ubuntu 24.04.1 LTS
We will install Ubuntu server via the command line, without graphical output, on the QEMU instances; this makes the emulation lighter (no graphics) and also allows doing this on a remote machine, e.g., via SSH. In order to do this we need to tell the kernel that we want the console on a specific serial output, which means specifying the kernel and its command line to QEMU. We can do this by mounting the ISO file locally and passing the kernel to QEMU.
mkdir ubuntu_mount
sudo mount -o loop ubuntu-24.04.1-live-server-amd64.iso ubuntu_mount/
The contents of the ISO file are mounted in ubuntu_mount. We can now launch QEMU.
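(Before doing so, an optional sanity check: the kernel and initrd we are about to hand to QEMU live in the casper/ directory of the mounted ISO, the same paths used in the command below, so you can confirm they are there.)
ls ubuntu_mount/casper/
# the listing should include vmlinuz and initrd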
qemu-system-x86_64 -drive file=disk1.qcow -m 4G -smp 4 \
--enable-kvm -cpu host -nographic -no-reboot \
-cdrom ubuntu-24.04.1-live-server-amd64.iso \
-append "console=ttyS0" -kernel ubuntu_mount/casper/vmlinuz \
-initrd ubuntu_mount/casper/initrd
Here we use an x86_64 architecture; this will probably also work for other architectures. We specify that the instance has a single drive, where we will install Ubuntu (we will attach the shared block device later). The -m option specifies the amount of RAM, here 4 gigabytes, and the -smp option sets the number of cores, here 4. We chose to enable kvm (the Kernel-based Virtual Machine) because we run this on a Linux-based host that supports it, and we pass the CPU information from the host with -cpu host. That's it for the machine configuration.
We also pass the -nographic and -no-reboot options, because we don't want a graphical output and we don't want the machine to reboot, as we will relaunch it with another configuration.
Finally, we pass the CD-ROM to install Ubuntu server 24.04.1 LTS and ask QEMU to boot the kernel present on the CD-ROM, with the console=ttyS0 parameter passed to the kernel. The path to the initrd (initial ramdisk) is also given to QEMU.
The installation is illustrated below; all default settings are okay. We chose to name the machine node1 and to install an OpenSSH server.
Once installed, the system will turn off when we press ENTER, as we chose the -no-reboot option.
Perform the same installation for the second node with disk2.qcow, but this time name the server node2 instead of node1:
qemu-system-x86_64 -drive file=disk2.qcow -m 4G -smp 4 \
--enable-kvm -cpu host -nographic -no-reboot \
-cdrom ubuntu-24.04.1-live-server-amd64.iso \
-append "console=ttyS0" -kernel ubuntu_mount/casper/vmlinuz \
-initrd ubuntu_mount/casper/initrd
Now that both instances are installed we can unmount and delete the ISO file to save space.
sudo umount ubuntu_mount
rm ubuntu-24.04.1-live-server-amd64.iso
Launching the instances
The instances can be launched individually with:
qemu-system-x86_64 -drive file=disk1.qcow -m 4G -smp 4 \
--enable-kvm -cpu host -nographic
qemu-system-x86_64 -drive file=disk2.qcow -m 4G -smp 4 \
--enable-kvm -cpu host -nographic
QEMU Network setup between two instances
However, we will want to set up a network between the two machines, expose the SSH port to the host machine (if we want to SSH into the machines), and add a shared block device. Let's start by creating the backing file for the shared block device with qemu-img create. We use the raw format because other, size-optimised formats will cause issues when being shared.
qemu-img create -f raw shared.img 4G
And we will launch the first QEMU instance with:
qemu-system-x86_64 -drive file=disk1.qcow,if=ide -m 4G -smp 4 \
--enable-kvm -cpu host -nographic \
-device e1000,netdev=net0 \
-netdev user,id=net0,hostfwd=tcp::2222-:22 \
-netdev socket,id=net1,listen=:3333 \
-device e1000,netdev=net1,mac=52:54:00:12:34:66 \
-drive file=shared.img,if=none,id=nvm,file.locking=off \
-device nvme,serial=deadbeef,drive=nvm
The differences from the previous command are as follows:
- We define a first network device that forwards port 22 (SSH) of the guest to port 2222 on the host machine, instead of using the default emulated network device as was done during the installation.
- We add a second network device to connect the two QEMU instances with each other; this is a direct connection over a socket on port 3333. (This could also be set up with a network bridge on the host and TAP interfaces, but the direct connection is simpler and doesn't require root access.) We also specify a MAC address because we don't want this machine and the other one to end up with the same default value.
- Finally, we add the shared block device. Here we chose an NVMe device and added the file.locking=off option, otherwise the second QEMU instance will complain that the shared file is locked. Concurrent access will be managed by OCFS2.
The second instance is launched with:
qemu-system-x86_64 -drive file=disk2.qcow,if=ide -m 4G -smp 4 \
--enable-kvm -cpu host -nographic \
-device e1000,netdev=net0 \
-netdev user,id=net0,hostfwd=tcp::2223-:22 \
-netdev socket,id=net1,connect=127.0.0.1:3333 \
-device e1000,netdev=net1,mac=52:54:00:12:34:77 \
-drive file=shared.img,if=none,id=nvm,file.locking=off \
-device nvme,serial=deadbeef,drive=nvm
The second QEMU instance has the same parameters, with the exception of:
- The SSH port, which is 2223 on the host (we will use both forwarded ports to SSH into the nodes from the host, as shown below).
- The second network device, which connects to the network device of the first instance. (Because this interface connects to the first instance, which listens, the first instance should be launched before the second one.)
- The MAC address, which is different.
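Once both instances are running, you can SSH into them from the host through the forwarded ports. (The username below is a placeholder; use the one you chose during installation.)
ssh -p 2222 <username>@localhost   # node1
ssh -p 2223 <username>@localhost   # node2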
Setup of both QEMU instances
Launch both instances.
First we will set up the direct network link between the two instances. On both instances the second network card appears as ens4. You can check this with the ip addr command (you will see the MAC address we set). In each instance, set up the network with:
sudo nano /etc/netplan/01-netcfg.yaml
And fill it with the following for node1 (this sets the IP to 10.0.0.1):
network:
  version: 2
  ethernets:
    ens4:
      addresses:
        - 10.0.0.1/24 # For node1
      dhcp4: no
And with the following for node2 (this sets the IP to 10.0.0.2):
network:
  version: 2
  ethernets:
    ens4:
      addresses:
        - 10.0.0.2/24 # For node2
      dhcp4: no
Apply the changes (on both) with:
sudo netplan apply
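You can verify that the static address was applied (a quick sanity check, not part of the original steps):
ip addr show ens4
# ens4 should now show 10.0.0.1/24 on node1 and 10.0.0.2/24 on node2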
Upon reboot, the network service waits for all network interfaces to be online, and it hangs on the second network interface, so let's just remove the wait-online service units (on both instances).
sudo rm /lib/systemd/system/systemd-networkd-wait-online.service
sudo rm /lib/systemd/system/systemd-networkd-wait-online@.service
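(As an aside, a less destructive alternative, assuming a standard systemd setup, would be to mask these units instead of deleting the files; masking should have the same effect and is reversible with systemctl unmask.)
sudo systemctl mask systemd-networkd-wait-online.service
sudo systemctl mask systemd-networkd-wait-online@.service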
Now we want both nodes to resolve the hostname node1 to 10.0.0.1 and node2 to 10.0.0.2, so edit /etc/hosts on both instances.
sudo nano /etc/hosts
And add the following lines:
10.0.0.1 node1
10.0.0.2 node2
Now you should be able to ping node2 from node1 and vice versa.
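For example:
# On node1
ping -c 3 node2
# On node2
ping -c 3 node1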
Install the ocfs2-tools package on both instances.
sudo apt install -y ocfs2-tools
And configure the OCFS2 cluster on both instances:
sudo o2cb add-cluster mydatacluster
sudo o2cb add-node mydatacluster node1 --ip 10.0.0.1 --port 7777 --number 1
sudo o2cb add-node mydatacluster node2 --ip 10.0.0.2 --port 7777 --number 2
sudo o2cb register-cluster mydatacluster
sudo o2cb start-heartbeat mydatacluster
sudo dpkg-reconfigure ocfs2-tools
# Choose to enable at boot time
# The name of the cluster to start is mydatacluster
# All other options can be left as default
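If you want to see what the o2cb commands generated, the cluster layout is written to /etc/ocfs2/cluster.conf, which should list the cluster mydatacluster and both nodes:
cat /etc/ocfs2/cluster.conf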
On node1, prepare the shared disk:
sudo fdisk /dev/nvme0n1
# Create a new partition with 'n' new, 'p' primary, keep defaults and 'w' write.
# Format the partition as OCFS2 (-N 2 for 2 nodes)
sudo mkfs.ocfs2 -b 4K -C 1M -N 2 -L myshareddata /dev/nvme0n1p1
On node2, re-read the partition table of the shared disk with:
sudo fdisk /dev/nvme0n1
# Use 'p' to print the partitions, then 'q' to quit
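On both nodes you can also confirm that the kernel sees the partition and its OCFS2 file system (an optional check; the label is the one given to mkfs.ocfs2 above):
lsblk -f /dev/nvme0n1
# nvme0n1p1 should show FSTYPE ocfs2 and LABEL myshareddata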
On both instances, create a mount point and edit /etc/fstab to mount the shared file system automatically.
# Create the mount point
sudo mkdir /mnt/shared
sudo nano /etc/fstab
Add the following line to /etc/fstab:
/dev/nvme0n1p1 /mnt/shared ocfs2 defaults,_netdev 0 0
Mount the file system on both nodes with:
sudo mount -a
We can check that it mounted and that there are two nodes by reading the kernel messages:
sudo dmesg
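For example, to filter the OCFS2-related messages and confirm the mount itself:
sudo dmesg | grep -i ocfs2
findmnt /mnt/shared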
The OCFS2 setup is complete!
We can now place files in the shared directory /mnt/shared and the changes will be visible to both nodes. We have managed to set up a shared OCFS2 file system.
# Write some data on node1
sudo nano /mnt/shared/test_file.txt
# Read it from node2
less /mnt/shared/test_file.txt
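If you prefer a non-interactive check, something like the following also works (writing via sudo tee because the mounted directory is owned by root by default):
# On node1
echo "hello from node1" | sudo tee /mnt/shared/test_file.txt
# On node2
cat /mnt/shared/test_file.txt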
You can also demonstrate this continuously with:
# Edit the file on node1
sudo nano /mnt/shared/test_file.txt
# Watch the contents on node2
watch -n 1 cat /mnt/shared/test_file.txt
When you save the file on node1, the changes will be visible on node2.
If you try to edit the file on both instances at the same time, you will be greeted by a message warning that the file is already being edited; this is exactly what we want.
You can now turn off the QEMU instances; upon the next launch they will still be sharing the OCFS2 file system in /mnt/shared.
Why would one use a cluster file system?
Well, examples include high-availability systems and load balancing, where several physical nodes access a single storage device (e.g., connected directly to each node with Fibre Channel) and share a coherent file system. This way the multiple nodes can be used for availability (redundant nodes with a single storage device) or for load balancing.
Why are we interested in a cluster file system?
At REDS we are developing computational storage solutions based on NVMe. We developed computational storage drives (CSDs) that are seen as normal NVMe drives by the host computer. These CSDs however can also mount their own contents to perform computations on their data.
This creates a scenario where both the host computer and the drive itself have direct access to the block device that the NVMe drive exposes, so we have two nodes accessing a single block device. A cluster file system is therefore a good solution to maintain coherency between the two. As for the network link between the CSD and the host computer, we tunnel TCP/IP over NVMe; this way we can set up a virtual private network (VPN) between the CSD and the host. This is also why it is important to us that the data does not travel over the network: both sides have direct access to the block device, and the extra layers make the network link slower than the direct NVMe link.
If you have any interest in the technical details, feel free to read our paper or visit the project’s GitHub.
Notes
You may encounter an error upon startup related to mounting the shared file system. Don't worry about it: it appears because the system tries to mount /mnt/shared before the OCFS2 service is started; the OCFS2 service will mount /mnt/shared when it is launched.
References
- https://docs.kernel.org/filesystems/ocfs2.html
- https://blogs.oracle.com/cloud-infrastructure/post/a-simple-guide-to-oracle-cluster-file-system-ocfs2-using-iscsi-on-oracle-cloud-infrastructure
- https://www.oracle.com/us/technologies/linux/ocfs2-best-practices-2133130.pdf
- https://www.ibm.com/docs/en/linux-on-systems?topic=ocfs2-overview