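Preface
Once a Ceph cluster grows large, failed disks become routine. This post records the steps for replacing one.
Procedure
Identifying the bad disk
First, check the state of the OSDs with ceph osd perf or directly in the management dashboard. An OSD with very high latency is the problem disk. If its utilization is low and its latency is still high, replace it without hesitation.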
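The latency check can be scripted. The following is a minimal sketch, assuming ceph osd perf prints a header line followed by one row per OSD with three columns (OSD ID, commit latency in ms, apply latency in ms). The function name flag_slow_osds and the threshold are hypothetical; pick a threshold that suits your own hardware.

```shell
#!/bin/sh
# flag_slow_osds THRESHOLD_MS  (hypothetical helper, not part of Ceph)
# Reads `ceph osd perf` text on stdin and prints the ID of every OSD whose
# commit or apply latency is above THRESHOLD_MS.
# Assumed columns: osd, commit_latency(ms), apply_latency(ms).
flag_slow_osds() {
    awk -v limit="$1" '
        # Skip the header: only rows whose first field is a numeric OSD ID.
        $1 ~ /^[0-9]+$/ && ($2 + 0 > limit + 0 || $3 + 0 > limit + 0) { print $1 }
    '
}
```

A possible invocation is ceph osd perf | flag_slow_osds 100, which lists only the IDs worth a closer look. High latency alone is a hint, not proof, so it is still worth comparing against utilization as described above.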
Marking the bad disk out
ceph osd out 23
Replace 23 with the ID of your own problem OSD.
Stopping the service
systemctl stop ceph-osd@23
Replace 23 with the ID of your own problem OSD.
Removing the disk
root@SH-1004:~# ceph osd safe-to-destroy osd.23
OSD(s) 23 are safe to destroy without reducing data durability.
# First confirm that the disk can be removed safely, then proceed:
root@SH-1004:~# ceph osd destroy 23 --yes-i-really-mean-it
destroyed osd.23
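If the OSD was marked out only recently, the safe-to-destroy check may not pass until the data has finished migrating. The sketch below polls the check before destroying, assuming ceph osd safe-to-destroy exits with a non-zero status while the OSD is still needed. The function name wait_safe_to_destroy and its interval and retry defaults are hypothetical.

```shell
#!/bin/sh
# wait_safe_to_destroy OSD_ID [INTERVAL_SECONDS] [MAX_TRIES]
# (hypothetical helper; the defaults of 10 s and 60 tries are arbitrary)
# Polls `ceph osd safe-to-destroy` until it succeeds, or gives up.
# Assumption: the command exits non-zero while the OSD is not yet safe.
wait_safe_to_destroy() {
    osd_id=$1
    interval=${2:-10}
    max_tries=${3:-60}
    tries=0
    until ceph osd safe-to-destroy "osd.${osd_id}"; do
        tries=$((tries + 1))
        if [ "$tries" -ge "$max_tries" ]; then
            echo "osd.${osd_id} is still not safe to destroy, giving up" >&2
            return 1
        fi
        sleep "$interval"
    done
}
```

One way to use it is wait_safe_to_destroy 23 && ceph osd destroy 23 --yes-i-really-mean-it, so the destroy step only runs after the check has passed.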
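Replacing the drive
First mark the disk with MegaCli, then have the data center staff swap it. The first step is to find the serial number of the disk behind this device name: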
root@SH-1004:~# smartctl -a /dev/sda -d megaraid,0
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.34-1-pve] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HP
Product:              MB2000FAMYV
Revision:             HPD7
Compliance:           SPC-3
User Capacity:        2,000,398,934,016 bytes [2.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c500211c625f
Serial number:        9WM15QAD0000C050EV4F
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Mon Dec 21 20:04:53 2020 CST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Current Drive Temperature:     0 C
Drive Trip Temperature:        0 C
Elements in grown defect list: 4096
Error Counter logging not supported
Device does not support Self Test logging
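Only the serial number is needed from this output. As a small sketch, it can be pulled out with awk, assuming the "Serial number:" label shown in the output above (some drives print "Serial Number:" instead, so both are matched). The function name disk_serial is hypothetical.

```shell
#!/bin/sh
# disk_serial  (hypothetical helper)
# Reads `smartctl -a` text on stdin and prints the value of the first
# "Serial number:" / "Serial Number:" line.
disk_serial() {
    # Split on the colon plus any following spaces, so $2 is the bare value.
    awk -F': *' '/^Serial [Nn]umber:/ { print $2; exit }'
}
```

For example: smartctl -a /dev/sda -d megaraid,0 | disk_serial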
Then use PDlist (MegaCli64 -PDlist -a0) to find the corresponding slot:
MegaCli64 -PdLocate -start -physdrv[32:0] -a0
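Matching the serial number against the PDlist output by eye is error-prone on a full chassis. The sketch below does the lookup with awk, under the assumption that each drive in the PDlist output has "Enclosure Device ID:", "Slot Number:" and "Inquiry Data:" lines, in that order, and that the Inquiry Data line contains the serial number. Check these field names against your own controller's output. The function name find_slot_by_serial is hypothetical.

```shell
#!/bin/sh
# find_slot_by_serial SERIAL  (hypothetical helper)
# Reads `MegaCli64 -PDlist -a0` text on stdin and prints ENCLOSURE:SLOT for
# the first drive whose "Inquiry Data:" line contains SERIAL.
# Assumed field labels: "Enclosure Device ID:", "Slot Number:", "Inquiry Data:".
find_slot_by_serial() {
    awk -v serial="$1" '
        /^Enclosure Device ID:/ { enc = $NF }    # remember the current enclosure
        /^Slot Number:/         { slot = $NF }   # remember the current slot
        /^Inquiry Data:/ && index($0, serial) > 0 { print enc ":" slot; exit }
    '
}
```

A possible invocation is MegaCli64 -PDlist -a0 | find_slot_by_serial 9WM15QAD0000C050EV4F. The printed pair is the value that goes inside -physdrv[...] in the MegaCli commands.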
After that, take the drive offline and turn on its locate LED:
MegaCli64 -pdoffline -physdrv[32:0] -a0
MegaCli64 -PdLocate -start -physdrv[32:0] -a0