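Preface
Our company recently built a cluster: three 90 TB storage servers and three compute servers (more compute servers will be added later), connected over the internal network through a 10 Gbit switch. At first we used NFS to share the storage servers' disk space with the compute servers. The compute servers, however, do a lot of small-file reads and writes, and as soon as the thread count went up we saw IO delay, even though the traffic on the NIC was not high.
In other words, the bottleneck is in network I/O (per-request latency), not in bandwidth!
Solution: switch to iSCSI and add a local cache.
Options I considered: ZFS (RAM cache), Bcache (SSD cache), LVM with a cache, and an ext4 journal (write journaling).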
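Why the ext4 journal
At first I also considered enabling a local cache directly on NFS (FS-Cache), but that cache has a number of known problems. For example, consistency errors can appear when the same export is mounted in more than one place. I am not the only person managing this platform, and I was worried that someone would forget to pass this on and cause an incident. A one-to-one iSCSI mapping is therefore the better fit, and block-level iSCSI also performs somewhat better than directory-level NFS (SCSI is just about the fastest protocol you'll find. It's basically straight disk block access over a wire).
With iSCSI in place, I originally wanted to use ZFS for the local cache. Since it is a single iSCSI disk, there is no need to worry about array-related problems or about ZFS tuning disk IO based on hardware information, and redundancy is already handled by the hardware RAID on the storage servers. Other people online have done exactly this, but its stability and safety have not been confirmed, and this is a production environment, so I preferred the more conservative route.
LVM and ext4 do not differ much here. If future expansion were the main concern, LVM would be the better choice, but the capacity allocated so far looks sufficient. In addition, virtual machine disks stored on ext4 only take up the space they actually use, whereas with LVM, unless thin provisioning is used, each disk occupies its full allocated size.
A cache performance comparison table is also attached here, for reference only.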
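Prerequisites
For the iSCSI target configuration and mounting, see: https://www.liujason.com/article/502.html
Then do the following:
- Format the iSCSI disk: mkfs.ext4 /dev/sdc
- Create the mount point: mkdir /iscsi
- Add this line to /etc/fstab: /dev/sdc /iscsi ext4 defaults 0 0
- Mount it: mount -a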
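Kernel names such as /dev/sdc are not guaranteed to stay the same across reboots or iSCSI re-logins (the listing later in this article shows the same disk as sdd). A minimal sketch of building the fstab entry from a stable path instead is shown here. The `fstab_line` helper and the `scsi-EXAMPLE` path are hypothetical, and adding `_netdev` (the standard mount option that marks a filesystem as needing the network) is my assumption, not something the original setup used. The function only prints the line and writes nothing.

```shell
#!/usr/bin/env bash
# Build an /etc/fstab entry for a network-backed ext4 disk.
# Nothing is written to disk; the line is only printed for review.
fstab_line() {
    local device="$1" mountpoint="$2" options="${3:-defaults,_netdev}"
    # Fields: device, mount point, fs type, options, dump, fsck pass
    printf '%s %s ext4 %s 0 0\n' "$device" "$mountpoint" "$options"
}

# Hypothetical stable path; substitute the real symlink from /dev/disk/by-id/.
fstab_line /dev/disk/by-id/scsi-EXAMPLE /iscsi
```

Review the printed line before appending it to /etc/fstab by hand.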
After mounting, check the disks:
    root@PVE-EU-1 ~ # df -h
    Filesystem                 Size  Used Avail Use% Mounted on
    udev                       126G     0  126G   0% /dev
    tmpfs                       26G  1.3M   26G   1% /run
    /dev/mapper/vg0-root       781G   37G  711G   5% /
    tmpfs                      126G   66M  126G   1% /dev/shm
    tmpfs                      5.0M     0  5.0M   0% /run/lock
    tmpfs                      126G     0  126G   0% /sys/fs/cgroup
    /dev/md0                   487M  110M  352M  24% /boot
    /dev/fuse                   30M  152K   30M   1% /etc/pve
    tmpfs                       26G     0   26G   0% /run/user/0
    172.17.2.1:/storage         74T  5.8T   69T   8% /mnt/pve/NFS-EU-1
    172.17.3.1:/pve-eu-3-zfs    54T  619G   54T   2% /mnt/pve/NFS-EU-3
    /dev/sdc                   5.0T   14G  4.7T   1% /iscsi
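Configuring the cache
First, carve out an LV from the VG on the SSD to serve as the cache (/dev/vg0/cache), then turn that block device into an ext4 journal device: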
    root@PVE-EU-1 ~ # mke2fs -O journal_dev /dev/vg0/cache
    mke2fs 1.44.5 (15-Dec-2018)
    Discarding device blocks: done
    Creating filesystem with 8388608 4k blocks and 0 inodes
    Filesystem UUID: d56e9b69-e012-4a6f-808c-474de9182b55
    Superblock backups stored on blocks:
    Zeroing journal device:
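The command used to create the LV is not shown. The mke2fs output reports 8388608 blocks of 4 KiB, which works out to 32 GiB, so a sketch of the missing step might look like the following. The helper names are hypothetical. The names `vg0` and `cache` come from the path /dev/vg0/cache, and the size is inferred from the block count rather than stated by the author. The script only prints the lvcreate command and does not run it.

```shell
#!/usr/bin/env bash
# Convert a block count and block size (bytes) into whole GiB.
blocks_to_gib() {
    local blocks="$1" block_size="$2"
    echo $(( blocks * block_size / 1024 / 1024 / 1024 ))
}

# Print (not run) an lvcreate command for a journal LV of the given size.
journal_lv_cmd() {
    local size_gib="$1" lv_name="$2" vg_name="$3"
    echo "lvcreate -L ${size_gib}G -n ${lv_name} ${vg_name}"
}

# Block count and block size are taken from the mke2fs output.
size=$(blocks_to_gib 8388608 4096)
journal_lv_cmd "$size" cache vg0
```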
Look up the ID of the iSCSI disk:
    root@PVE-EU-1 ~ # ls -l /dev/disk/by-path/ip-* /dev/disk/by-id/scsi-*
    lrwxrwxrwx 1 root root 9 Jan 30 10:12 /dev/disk/by-id/scsi-360000000000000000e00000000010001 -> ../../sdd
    lrwxrwxrwx 1 root root 9 Jan 30 10:12 /dev/disk/by-path/ip-172.17.3.107:3260-iscsi-iqn.2020-01.pve-eu-3:iscsieu1-lun-1 -> ../../sdd
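Create the ext4 filesystem that uses the external journal as its cache (unmount /iscsi first if it is still mounted from the earlier step, since mkfs will not run on a mounted device):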
    root@PVE-EU-1 ~ # mkfs.ext4 -J device=/dev/vg0/cache /dev/disk/by-id/scsi-360000000000000000e00000000010001
    mke2fs 1.44.5 (15-Dec-2018)
    Using journal device's blocksize: 4096
    Creating filesystem with 1342177280 4k blocks and 167772160 inodes
    Filesystem UUID: b217c0e8-3fc9-4b5b-97ef-629a2a295b22
    Superblock backups stored on blocks:
            32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
            4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
            102400000, 214990848, 512000000, 550731776, 644972544
    Allocating group tables: done
    Writing inode tables: done
    Adding journal to device /dev/vg0/cache: done
    Writing superblocks and filesystem accounting information: done
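One way to confirm that the new filesystem really points at the SSD journal is to compare the journal UUID stored in its superblock with the UUID mke2fs printed for /dev/vg0/cache. The sketch below assumes the `Journal UUID:` field that `dumpe2fs -h` prints for filesystems with an external journal. The helper names are hypothetical, and the `sample` text is illustrative input built from the two UUIDs above, not captured output.

```shell
#!/usr/bin/env bash
# Extract the "Journal UUID" field from `dumpe2fs -h`-style text on stdin.
journal_uuid() {
    awk -F': *' '/^Journal UUID:/ { print $2; exit }'
}

# Succeeds when the recorded journal UUID equals the expected one.
journal_matches() {
    local expected="$1" actual
    actual=$(journal_uuid)
    [ "$actual" = "$expected" ]
}

# Illustrative input only. On a real host, pipe in the output of:
#   dumpe2fs -h /dev/disk/by-id/scsi-360000000000000000e00000000010001
sample='Filesystem UUID:          b217c0e8-3fc9-4b5b-97ef-629a2a295b22
Journal UUID:             d56e9b69-e012-4a6f-808c-474de9182b55'

if printf '%s\n' "$sample" | journal_matches d56e9b69-e012-4a6f-808c-474de9182b55; then
    echo "external journal matches /dev/vg0/cache"
fi
```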
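Then, based on the performance chart mentioned at the beginning, set the fstab options (this line replaces the earlier /dev/sdc entry):
/dev/disk/by-id/scsi-360000000000000000e00000000010001 /iscsi ext4 rw,relatime,journal_checksum,journal_async_commit 0 0
Then mount it.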
Performance testing
I used fio to test IO read and write performance.
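The exact fio command lines are not given, so the following is only a reconstruction from the reports. The job name (mytest), 30 jobs, queue depth 1, a runtime of about 10 seconds, and group reporting are visible in the output, and a 16 KiB block size follows from dividing bandwidth by IOPS. The test file path, `--size`, and `--direct=1` are placeholders of my own, and the I/O engine is unknown and left at fio's default. The helper prints the commands without running them.

```shell
#!/usr/bin/env bash
# Print (not run) a fio command line for one access pattern.
# rw is one of: read, write, randread, randwrite.
fio_cmd() {
    local rw="$1" target="${2:-/iscsi/fio.test}" size="${3:-1G}"
    echo "fio --name=mytest --filename=${target} --rw=${rw} --bs=16k --size=${size}" \
         "--numjobs=30 --iodepth=1 --direct=1 --runtime=10 --group_reporting"
}

# One command per access pattern used in the reports.
for pattern in read write randread randwrite; do
    fio_cmd "$pattern"
done
```

Running the same four commands before and after changing the journal setup keeps the comparison like for like.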
Before caching:
    # Sequential read
    mytest: (groupid=0, jobs=30): err= 0: pid=18491: Wed Jan 29 23:06:05 2020
      read: IOPS=39.7k, BW=620MiB/s (651MB/s)(6204MiB/10001msec)
        clat (usec): min=112, max=31486, avg=753.33, stdev=435.77
         lat (usec): min=112, max=31487, avg=753.44, stdev=435.77
        clat percentiles (usec):
         | 1.00th=[ 400], 5.00th=[ 490], 10.00th=[ 545], 20.00th=[ 603],
         | 30.00th=[ 644], 40.00th=[ 685], 50.00th=[ 717], 60.00th=[ 750],
         | 70.00th=[ 791], 80.00th=[ 857], 90.00th=[ 971], 95.00th=[ 1090],
         | 99.00th=[ 1401], 99.50th=[ 1713], 99.90th=[ 4228], 99.95th=[ 6849],
         | 99.99th=[21890]
       bw ( KiB/s): min=16448, max=23776, per=3.32%, avg=21116.73, stdev=1854.02, samples=578
       iops : min= 1028, max= 1486, avg=1319.78, stdev=115.89, samples=578
      lat (usec) : 250=0.02%, 500=5.69%, 750=54.01%, 1000=31.81%
      lat (msec) : 2=8.15%, 4=0.22%, 10=0.08%, 20=0.02%, 50=0.01%
      cpu : usr=0.28%, sys=0.83%, ctx=397789, majf=0, minf=120
      IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
         submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         issued rwts: total=397086,0,0,0 short=0,0,0,0 dropped=0,0,0,0
         latency : target=0, window=0, percentile=100.00%, depth=1

    Run status group 0 (all jobs):
       READ: bw=620MiB/s (651MB/s), 620MiB/s-620MiB/s (651MB/s-651MB/s), io=6204MiB (6506MB), run=10001-10001msec

    Disk stats (read/write):
      sdc: ios=353516/28, merge=33863/2, ticks=265290/62, in_queue=2936, util=98.47%

    # Sequential write
    mytest: (groupid=0, jobs=30): err= 0: pid=19381: Wed Jan 29 23:08:33 2020
      write: IOPS=18.8k, BW=294MiB/s (309MB/s)(2944MiB/10002msec); 0 zone resets
        clat (usec): min=550, max=56581, avg=1590.78, stdev=1617.61
         lat (usec): min=550, max=56581, avg=1591.07, stdev=1617.61
        clat percentiles (usec):
         | 1.00th=[ 938], 5.00th=[ 1020], 10.00th=[ 1074], 20.00th=[ 1139],
         | 30.00th=[ 1188], 40.00th=[ 1254], 50.00th=[ 1319], 60.00th=[ 1401],
         | 70.00th=[ 1500], 80.00th=[ 1729], 90.00th=[ 2180], 95.00th=[ 2474],
         | 99.00th=[ 5669], 99.50th=[10814], 99.90th=[25822], 99.95th=[30802],
         | 99.99th=[54789]
       bw ( KiB/s): min= 5280, max=13024, per=3.33%, avg=10031.98, stdev=2006.96, samples=594
       iops : min= 330, max= 814, avg=626.99, stdev=125.44, samples=594
      lat (usec) : 750=0.02%, 1000=3.18%
      lat (msec) : 2=83.85%, 4=11.32%, 10=1.09%, 20=0.37%, 50=0.15%
      lat (msec) : 100=0.02%
      cpu : usr=0.19%, sys=1.08%, ctx=188705, majf=0, minf=0
      IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
         submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         issued rwts: total=0,188435,0,0 short=0,0,0,0 dropped=0,0,0,0
         latency : target=0, window=0, percentile=100.00%, depth=1

    Run status group 0 (all jobs):
      WRITE: bw=294MiB/s (309MB/s), 294MiB/s-294MiB/s (309MB/s-309MB/s), io=2944MiB (3087MB), run=10002-10002msec

    Disk stats (read/write):
      sdc: ios=0/184696, merge=0/4796, ticks=0/294868, in_queue=29140, util=95.73%

    # Random read
    mytest: (groupid=0, jobs=30): err= 0: pid=20702: Wed Jan 29 23:12:15 2020
      read: IOPS=39.0k, BW=610MiB/s (639MB/s)(6098MiB/10002msec)
        clat (usec): min=278, max=102296, avg=767.45, stdev=299.23
         lat (usec): min=279, max=102296, avg=767.57, stdev=299.24
        clat percentiles (usec):
         | 1.00th=[ 465], 5.00th=[ 562], 10.00th=[ 611], 20.00th=[ 660],
         | 30.00th=[ 693], 40.00th=[ 717], 50.00th=[ 742], 60.00th=[ 775],
         | 70.00th=[ 807], 80.00th=[ 848], 90.00th=[ 930], 95.00th=[ 1020],
         | 99.00th=[ 1254], 99.50th=[ 1532], 99.90th=[ 3261], 99.95th=[ 4555],
         | 99.99th=[ 8979]
       bw ( KiB/s): min=16672, max=22400, per=3.33%, avg=20813.92, stdev=821.16, samples=577
       iops : min= 1042, max= 1400, avg=1300.85, stdev=51.33, samples=577
      lat (usec) : 500=1.82%, 750=49.96%, 1000=42.44%
      lat (msec) : 2=5.58%, 4=0.14%, 10=0.06%, 20=0.01%, 100=0.01%
      lat (msec) : 250=0.01%
      cpu : usr=0.31%, sys=1.37%, ctx=390537, majf=0, minf=120
      IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
         submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         issued rwts: total=390285,0,0,0 short=0,0,0,0 dropped=0,0,0,0
         latency : target=0, window=0, percentile=100.00%, depth=1

    Run status group 0 (all jobs):
       READ: bw=610MiB/s (639MB/s), 610MiB/s-610MiB/s (639MB/s-639MB/s), io=6098MiB (6394MB), run=10002-10002msec

    Disk stats (read/write):
      sdc: ios=386399/7, merge=0/2, ticks=293426/13, in_queue=1516, util=99.04%

    # Random write
    mytest: (groupid=0, jobs=30): err= 0: pid=21040: Wed Jan 29 23:13:18 2020
      write: IOPS=1441, BW=22.5MiB/s (23.6MB/s)(227MiB/10068msec); 0 zone resets
        clat (usec): min=152, max=275020, avg=20724.80, stdev=31748.12
         lat (usec): min=152, max=275020, avg=20725.15, stdev=31748.13
        clat percentiles (usec):
         | 1.00th=[ 180], 5.00th=[ 204], 10.00th=[ 221], 20.00th=[ 269],
         | 30.00th=[ 8455], 40.00th=[ 10028], 50.00th=[ 10814], 60.00th=[ 11731],
         | 70.00th=[ 13042], 80.00th=[ 28967], 90.00th=[ 63177], 95.00th=[ 91751],
         | 99.00th=[141558], 99.50th=[158335], 99.90th=[274727], 99.95th=[274727],
         | 99.99th=[274727]
       bw ( KiB/s): min= 96, max= 2432, per=3.36%, avg=774.74, stdev=635.51, samples=597
       iops : min= 6, max= 152, avg=48.39, stdev=39.72, samples=597
      lat (usec) : 250=16.76%, 500=11.93%, 750=0.26%, 1000=0.08%
      lat (msec) : 2=0.05%, 4=0.10%, 10=10.40%, 20=38.11%, 50=9.51%
      lat (msec) : 100=8.40%, 250=4.31%, 500=0.11%
      cpu : usr=0.04%, sys=0.17%, ctx=29101, majf=0, minf=0
      IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
         submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         issued rwts: total=0,14515,0,0 short=0,0,0,0 dropped=0,0,0,0
         latency : target=0, window=0, percentile=100.00%, depth=1

    Run status group 0 (all jobs):
      WRITE: bw=22.5MiB/s (23.6MB/s), 22.5MiB/s-22.5MiB/s (23.6MB/s-23.6MB/s), io=227MiB (238MB), run=10068-10068msec

    Disk stats (read/write):
      sdc: ios=0/14333, merge=0/2, ticks=0/9903, in_queue=5636, util=53.16%
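The random write result is painful to look at. If random writes could be turned into sequential writes, performance would improve dramatically!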
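To put those numbers next to the 10 Gbit link, here is a small arithmetic sketch using only figures from the reports above. The helper names are hypothetical, and the link capacity is the raw 10 Gbit/s line rate with no allowance for protocol overhead, which is a simplification on my part.

```shell
#!/usr/bin/env bash
# Percentage of a link's raw capacity used by a given throughput.
# Arguments: throughput in MiB/s, link speed in Gbit/s.
link_utilisation() {
    awk -v mib="$1" -v gbit="$2" 'BEGIN {
        link_mib = gbit * 1e9 / 8 / 1048576   # Gbit/s -> MiB/s
        printf "%.1f\n", mib / link_mib * 100
    }'
}

# Average request size in KiB from bandwidth (MiB/s) and IOPS.
request_kib() {
    awk -v mib="$1" -v iops="$2" 'BEGIN { printf "%.0f\n", mib * 1024 / iops }'
}

echo "random write: $(link_utilisation 22.5 10)% of 10GbE, $(request_kib 22.5 1441) KiB per request"
echo "random read:  $(link_utilisation 610 10)% of 10GbE, $(request_kib 610 39000) KiB per request"
```

Random writes use roughly 2% of the link while random reads of the same request size use about half of it, which fits the observation that per-request latency, not bandwidth, is the limit.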
After caching:
    # Random read
    mytest: (groupid=0, jobs=30): err= 0: pid=3640: Wed Jan 29 23:57:58 2020
      read: IOPS=43.3k, BW=677MiB/s (710MB/s)(300MiB/443msec)
        clat (usec): min=157, max=1523, avg=683.14, stdev=120.39
         lat (usec): min=157, max=1523, avg=683.25, stdev=120.40
        clat percentiles (usec):
         | 1.00th=[ 388], 5.00th=[ 494], 10.00th=[ 545], 20.00th=[ 603],
         | 30.00th=[ 635], 40.00th=[ 660], 50.00th=[ 685], 60.00th=[ 709],
         | 70.00th=[ 734], 80.00th=[ 758], 90.00th=[ 807], 95.00th=[ 857],
         | 99.00th=[ 1057], 99.50th=[ 1188], 99.90th=[ 1369], 99.95th=[ 1401],
         | 99.99th=[ 1516]
      lat (usec) : 250=0.15%, 500=5.40%, 750=71.23%, 1000=21.57%
      lat (msec) : 2=1.65%
      cpu : usr=0.67%, sys=0.67%, ctx=19555, majf=0, minf=120
      IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
         submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         issued rwts: total=19200,0,0,0 short=0,0,0,0 dropped=0,0,0,0
         latency : target=0, window=0, percentile=100.00%, depth=1

    Run status group 0 (all jobs):
       READ: bw=677MiB/s (710MB/s), 677MiB/s-677MiB/s (710MB/s-710MB/s), io=300MiB (315MB), run=443-443msec

    Disk stats (read/write):
      sdc: ios=17055/0, merge=62/0, ticks=11589/0, in_queue=0, util=80.65%

    # Random write
    mytest: (groupid=0, jobs=30): err= 0: pid=2942: Wed Jan 29 23:56:17 2020
      write: IOPS=12.5k, BW=195MiB/s (205MB/s)(1952MiB/10002msec); 0 zone resets
        clat (usec): min=344, max=168170, avg=2393.38, stdev=6630.45
         lat (usec): min=345, max=168170, avg=2393.65, stdev=6630.46
        clat percentiles (usec):
         | 1.00th=[ 717], 5.00th=[ 930], 10.00th=[ 1057], 20.00th=[ 1188],
         | 30.00th=[ 1270], 40.00th=[ 1336], 50.00th=[ 1418], 60.00th=[ 1516],
         | 70.00th=[ 1647], 80.00th=[ 1844], 90.00th=[ 2573], 95.00th=[ 4752],
         | 99.00th=[ 23200], 99.50th=[ 38011], 99.90th=[106431], 99.95th=[116917],
         | 99.99th=[164627]
       bw ( KiB/s): min= 768, max=11488, per=3.30%, avg=6591.90, stdev=2992.13, samples=586
       iops : min= 48, max= 718, avg=411.97, stdev=187.00, samples=586
      lat (usec) : 500=0.02%, 750=1.38%, 1000=5.52%
      lat (msec) : 2=77.11%, 4=10.43%, 10=2.90%, 20=1.48%, 50=0.84%
      lat (msec) : 100=0.10%, 250=0.22%
      cpu : usr=0.13%, sys=0.50%, ctx=125486, majf=0, minf=0
      IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
         submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
         issued rwts: total=0,124957,0,0 short=0,0,0,0 dropped=0,0,0,0
         latency : target=0, window=0, percentile=100.00%, depth=1

    Run status group 0 (all jobs):
      WRITE: bw=195MiB/s (205MB/s), 195MiB/s-195MiB/s (205MB/s-205MB/s), io=1952MiB (2047MB), run=10002-10002msec

    Disk stats (read/write):
      sdc: ios=0/122966, merge=0/887, ticks=0/294579, in_queue=109328, util=79.43%
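For a quick before/after comparison of random writes, the ratios can be worked out from the two reports. The `ratio` helper is hypothetical and the inputs are copied from the fio output (IOPS, average clat, and 99th-percentile clat). The random-read runs are left out because the cached run lasted only 443 ms and read 300 MiB, so it is not a like-for-like comparison with the 10-second run.

```shell
#!/usr/bin/env bash
# Ratio of two measurements, rounded to one decimal place.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

# Arguments are after/before for IOPS and before/after for latency (usec).
echo "random write IOPS:        $(ratio 12500 1441)x higher"
echo "random write avg latency: $(ratio 20724.80 2393.38)x lower"
echo "random write p99 latency: $(ratio 141558 23200)x lower"
```

On these numbers, random write throughput and average latency both improve by a factor of about 8.7, with a smaller gain of about 6.1 at the 99th percentile.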