Diagnose and Replace a Defective Hard Drive (Linux Dedicated Server with Software RAID)
In this article, we'll show you how to identify a defective hard disk on a Linux Dedicated Server with software RAID and prepare the server for the replacement of the defective disk.
Please Note
This article assumes you have basic knowledge of server administration with Linux. If you have any questions regarding the replacement of a defective hard disk or need assistance, please contact IONOS Customer Service.
In order to ensure the highest possible reliability, it is necessary that you monitor the software RAID of your Dedicated Server. If you discover that a hard disk is defective or you receive a notification email about a defective hard disk, you must contact IONOS Customer Service to arrange for the hard disk to be replaced. This requires that you identify the defective hard disk and prepare the server to replace the defective disk.
Attention
RAID systems allow for greater fail-safety and/or speed. However, they are not a substitute for regular backups. To avoid data loss, we recommend that you back up regularly. Also, be sure to back up before performing the steps below to ensure the safety of your data.
Checking the Status of the Software RAID
To check the status of the software RAID, enter the following command in the shell:
[root@host ~]# cat /proc/mdstat
If both disks are present and mounted correctly, the following message is displayed:
[root@localhost ~]# cat /proc/mdstat
Personalities : [raid1]
read_ahead 1024 sectors
md2 : active raid1 sda3[1] sdb3[0]
262016 blocks [2/2] [UU]
md1 : active raid1 sda2[1] sdb2[0]
119684160 blocks [2/2] [UU]
md0 : active raid1 sda1[1] sdb1[0]
102208 blocks [2/2] [UU]
unused devices: <none>
The above example shows three multiple devices or logical drives (md0, md1, md2). For each of these logical drives, it is indicated which partitions they are composed of and on which drives these partitions are located.
Example: The logical drive md0 is composed of the partitions sda1 and sdb1.
In the line listed below the respective logical drive, the state of the individual partitions is shown at the end of the line in the square brackets. A U means that the respective disk is mounted (up) in the RAID.
In the following example, all logical drives have only one partition mounted, which is located on the sda hard disk. The respective partition located on the second hard disk sdb is not mounted. You can recognize this also by the entry [U_]. The unmounted partitions of the hard disk sdb indicate that there is an error or a defect with this hard disk.
[root@localhost ~]# cat /proc/mdstat
Personalities : [raid1]
read_ahead 1024 sectors
md0 : active raid1 sda1[1]
102208 blocks [2/1] [U_]
md1 : active raid1 sda2[1]
119684160 blocks [2/1] [U_]
md2 : active raid1 sda3[1]
262016 blocks [2/1] [U_]
unused devices: <none>
In the following example, a defective disk is still mounted in the RAID:
[root@localhost ~]# cat /proc/mdstat
Personalities : [raid1]
md3 : active raid1 sda3[0] sdb3[2](F)
439553856 blocks super 1.0 [2/1] [U_]
bitmap: 1/4 pages [4KB], 65536KB chunk
md1 : active raid1 sdb1[2](F) sda1[0]
19529600 blocks super 1.0 [2/1] [U_]
unused devices: <none>
The entry (F) in this example shows that the partition is marked as faulty.
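The status check described above can also be scripted. The following is a minimal sketch, not an official tool: the function name degraded_arrays and its optional file argument are made up for illustration. It lists every logical drive whose status brackets contain an underscore or whose member line carries the (F) flag.

```shell
# Sketch: list md arrays that are degraded or contain a faulty member.
# degraded_arrays and its optional file argument are illustrative names.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ {
            name = $1
            faulty = ($0 ~ /\(F\)/)   # a member on this line is flagged (F)
            next
        }
        name != "" && /blocks/ {
            # The last field is the status, e.g. [UU] or [U_].
            if (faulty || $NF ~ /_/) print name
            name = ""
        }
    ' "${1:-/proc/mdstat}"
}
```

Called without an argument, the function reads /proc/mdstat. For the second example above it would print md0, md1 and md2, and for a healthy RAID it prints nothing.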
Error Diagnosis and Finding the Necessary Data for Hard Disk Replacement
To detect hard disk errors, we recommend that you do the following:
Install the Smartctl program, which is a command-line program to monitor disks using SMART (Self-Monitoring, Analysis and Reporting Technology). With this program you can check if a disk is defective. It is a part of Smartmontools. The Smartmontools are available as packages for many Linux distributions.
Please Note
In some cases, a hard disk defect may not be detected by means of the smart values. Therefore, we recommend that you also analyze the /var/log/messages log file.
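As a rough sketch of such a log analysis, the helper below searches the log file for kernel I/O error messages that name a given disk. The function name disk_io_errors is hypothetical, and the search string assumes the common kernel message form "I/O error, dev sdb, sector ..."; other error messages are not covered.

```shell
# Sketch: show kernel I/O error lines for one disk in a log file.
# disk_io_errors is an illustrative name; the default path is the log
# file named in this article and can be overridden with a second argument.
disk_io_errors() {
    # -F: fixed-string match; the trailing comma avoids matching other disks.
    grep -F "I/O error, dev $1," "${2:-/var/log/messages}"
}
```

For example, disk_io_errors sdb prints all matching lines for the sdb disk and prints nothing if no such entries exist.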
Install Smartctl
To install Smartctl, enter the following command:
CentOS
yum install smartmontools
Ubuntu
sudo apt-get install smartmontools
Get information about the hard disk
To access a list of disks, enter the following command:
smartctl --scan
Example:
[root@8E8885C ~]# smartctl --scan
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device
To access detailed information for error diagnostics, enter the following command:
smartctl -iHAl error [DISK_NAME]
Please Note
Device interfaces must be specified in the following format:
SCSI / SATA devices:
smartctl -iHAl error /dev/sd[a-z]
Example:
[root@localhost ~]# smartctl -iHAl error /dev/sda
After entering the command, the following information is displayed, for example:
[root@8E8885C ~]# smartctl -iHAl error /dev/sda
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-862.14.4.el7.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: HGST HUS722T1TALA604
Serial Number: WMC6N0K2RW66
LU WWN Device Id: 5 0014ee 004722db0
Firmware Version: RAGNWA07
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Fri May 3 07:45:14 2019 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always 0
3 Spin_Up_Time 0x0027 183 183 021 Pre-fail Always 3833
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always 9
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always 0
9 Power_On_Hours 0x0032 097 097 000 Old_age Always 2560
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always 9
16 Unknown_Attribute 0x0022 000 200 000 Old_age Always 26802171994
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always 0
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always 4
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always 67
194 Temperature_Celsius 0x0022 116 111 000 Old_age Always 31
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always 0
200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline 0
SMART Error Log Version: 1
No Errors Logged
Interpretation of Parameters and Fault Diagnosis
Analyze the detailed information that you retrieved with the command smartctl -iHAl error [DISK_NAME]. The first section lists information that you can use to identify the hard disk:
=== START OF INFORMATION SECTION ===
Device Model: HGST HUS722T1TALA604
Serial Number: WMC6N0K2RW66
LU WWN Device Id: 5 0014ee 004722db0
Firmware Version: RAGNWA07
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Fri May 3 07:45:14 2019 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
This section displays, among other things, the device model and serial number of the checked hard disk.
In the second section, the current state of the hard disk is assessed by Smartctl. If the value "PASSED" is not displayed but, for example, the value "Failed" or "UNKNOWN", you should arrange for the hard disk in question to be replaced as soon as possible.
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
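If you want to check this assessment from a script, a small filter such as the following can extract the result. This is only a sketch: the function name smart_health is illustrative, and it assumes the result line has the wording shown in the ATA example above. If no such line is found, it prints UNKNOWN.

```shell
# Sketch: extract the overall health result from smartctl output on stdin.
# smart_health is an illustrative name, not part of Smartmontools.
smart_health() {
    awk -F': *' '
        /^SMART overall-health self-assessment test result/ {
            print $2
            found = 1
        }
        END { if (!found) print "UNKNOWN" }
    '
}
```

A possible call is smartctl -H /dev/sda | smart_health. Any result other than PASSED should be treated as a reason to arrange for a replacement.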
In the third section, the SMART values are listed in detail. Next to each current normalized value (VALUE), the worst value ever measured (WORST) and the respective limit value (THRESH) are listed. Higher normalized values are better: if the current value (VALUE) or the worst value ever measured (WORST) falls to or below the limit value (THRESH), a SMART warning is displayed in the WHEN_FAILED column (e.g. FAILING_NOW).
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always 0
3 Spin_Up_Time 0x0027 183 183 021 Pre-fail Always 3833
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always 9
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always 0
9 Power_On_Hours 0x0032 097 097 000 Old_age Always 2560
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always 9
16 Unknown_Attribute 0x0022 000 200 000 Old_age Always 26802171994
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always 0
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always 4
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always 67
194 Temperature_Celsius 0x0022 116 111 000 Old_age Always 31
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always 0
200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline 0
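The comparison against the limit value can be sketched as a filter over the attribute table. The function name failing_attributes is made up for this example. It relies on the Smartmontools convention that an attribute counts as failed when its normalized value is less than or equal to the threshold, and it skips rows with a threshold of 0 on the assumption that such attributes have no meaningful limit.

```shell
# Sketch: report attributes whose VALUE or WORST is at or below THRESH.
# Reads the smartctl attribute table from stdin; failing_attributes is an
# illustrative name. Rows with THRESH 0 are skipped (assumption: no limit).
failing_attributes() {
    awk '
        $1 ~ /^[0-9]+$/ && $3 ~ /^0x/ {
            value = $4 + 0; worst = $5 + 0; thresh = $6 + 0
            if (thresh > 0 && (value <= thresh || worst <= thresh))
                print $2, "VALUE=" value, "WORST=" worst, "THRESH=" thresh
        }
    '
}
```

A possible call is smartctl -A /dev/sda | failing_attributes. For the healthy disk in the example above, the filter prints nothing.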
The following parameters can indicate an impending hard disk failure before a SMART warning is displayed:
Reallocated_Sector_Ct: Indicates the number of sectors that have been reallocated due to read errors. If a sector can no longer be read, written to or checked correctly, a replacement sector is automatically allocated to it. The faulty sector is permanently marked as unreadable. This is a clear warning sign of incipient surface problems. If this value is not zero, a hard disk failure is often imminent. This value is the most important indicator for a hard disk replacement.
Current_Pending_Sector: Indicates the number of unstable sectors waiting to be remapped. If a sector cannot be read or written to correctly, it initially receives the status Current Pending Sector. The sector is not reallocated in this state because the data located on the sector is unknown. Only after several unsuccessful read or write attempts is a replacement sector allocated and the faulty sector permanently marked as unreadable. The Current_Pending_Sector value is an important indicator for a hard disk replacement. If this value is not zero, a hard disk failure is often imminent.
Offline_Uncorrectable: Indicates the number of uncorrectable errors during read and write access to sectors.
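To watch these three parameters without reading the whole table, a filter along the following lines can be used. It is a sketch under the assumption that the raw value is a plain number in the last column of each row, as in the example output above. The function name early_warnings is illustrative.

```shell
# Sketch: print the early-warning attributes whose raw value is not zero.
# Reads the smartctl attribute table from stdin; early_warnings is an
# illustrative name. Assumes RAW_VALUE is the last field of each row.
early_warnings() {
    awk '
        $2 == "Reallocated_Sector_Ct" ||
        $2 == "Current_Pending_Sector" ||
        $2 == "Offline_Uncorrectable" {
            if ($NF + 0 > 0) print $2 "=" $NF
        }
    '
}
```

A possible call is smartctl -A /dev/sda | early_warnings. Any line printed by the filter names a parameter that deserves attention.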
The last section deals with the internal error log of the hard disk. Errors are recorded here if requests from the server to the hard disk were not processed properly. If an error count with two or more digits is displayed in this section, you should arrange for the hard disk to be replaced as soon as possible.
SMART Error Log Version: 1
No Errors Logged
Required Information for Hard Disk Replacement
The following information is required to initiate the replacement of the defective hard disk:
Designation of the hard disk in the RAID (e.g. sda)
Serial number
Model
Log file (optional)
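The model and serial number can be pulled out of the Smartctl information section with a short filter. The sketch below uses the hypothetical function name disk_identity and assumes the labels "Device Model" and "Serial Number" as they appear in the example output above.

```shell
# Sketch: extract model and serial number from smartctl -i output on stdin.
# disk_identity is an illustrative name; the labels match the ATA example
# shown in this article.
disk_identity() {
    awk -F': *' '
        /^Device Model:/  { print "model=" $2 }
        /^Serial Number:/ { print "serial=" $2 }
    '
}
```

A possible call is smartctl -i /dev/sdb | disk_identity, which prints one model= line and one serial= line for the disk.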
Creating a SMART Log
To create a full SMART log, enter the following command:
smartctl -x [DISK_NAME]
Example:
[root@localhost ~]# smartctl -x /dev/sda
If the hard disk can no longer be accessed using Smartctl, you can use the hdparm program to retrieve the necessary information. How to install hdparm:
CentOS
yum -y install hdparm
Ubuntu/Debian
sudo apt-get update
sudo apt-get install hdparm
Then enter the following command to retrieve the information required for disk replacement:
hdparm -i /dev/sda
Notes
If the SMART log was created as described above, this is sufficient information. You can then arrange for the defective hard disk to be replaced. Please contact IONOS Customer Service for this.
If you cannot call up the serial number of the defective hard disk using Smartctl, you can alternatively provide the serial number of the working hard disk(s) to the customer service.
Preparing a Server for Hard Disk Replacement
The following example assumes that the second hard disk (sdb) is to be replaced. For example, the following status of the software RAID is displayed during the status check:
[root@host ~]# cat /proc/mdstat
Personalities : [raid1]
md3 : active raid1 sda3[0] sdb3[2]
439553856 blocks super 1.0 [2/2] [UU]
md1 : active raid1 sdb1[2] sda1[0]
19529600 blocks super 1.0 [2/2] [UU]
unused devices: <none>
The second hard disk (sdb) is still mounted in the RAID in this example and is therefore still in use.
Manually mark RAID device as "faulty" to remove it from RAID
To mark the defective disk as "faulty" so that it can be removed from RAID, enter the following command:
[root@host ~]# mdadm PATH_OF_RAID_ARRAY -f PATH_OF_PARTITION
In the examples below, the sdb3 and sdb1 partitions are marked as faulty:
[root@host ~]# mdadm /dev/md3 -f /dev/sdb3
mdadm: set /dev/sdb3 faulty in /dev/md3
[root@host ~]# mdadm /dev/md1 -f /dev/sdb1
mdadm: set /dev/sdb1 faulty in /dev/md1
After entering the command, the RAID has the following status:
[root@host ~]# cat /proc/mdstat
Personalities : [raid1]
md3 : active raid1 sda3[0] sdb3[2](F)
439553856 blocks super 1.0 [2/1] [U_]
md1 : active raid1 sdb1[2](F) sda1[0]
19529600 blocks super 1.0 [2/1] [U_]
unused devices: <none>
Remove partition from the multiple device
To remove a partition from the multiple device, issue the following command:
[root@host ~]# mdadm -r PATH_OF_RAID_ARRAY PATH_OF_PARTITION
In the examples below, the sdb3 and sdb1 partitions are removed from the multiple devices md3 and md1, respectively:
[root@host ~]# mdadm -r /dev/md3 /dev/sdb3
mdadm: hot removed /dev/sdb3 from /dev/md3
[root@host ~]# mdadm -r /dev/md1 /dev/sdb1
mdadm: hot removed /dev/sdb1 from /dev/md1
Then check the status of the RAID. In this example, the RAID that was prepared for disk replacement has the following final state:
[root@host ~]# cat /proc/mdstat
Personalities : [raid1]
md3 : active raid1 sda3[0]
439553856 blocks super 1.0 [2/1] [U_]
md1 : active raid1 sda1[0]
19529600 blocks super 1.0 [2/1] [U_]
unused devices: <none>
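On servers with many logical drives, the mdadm commands for marking and removing partitions can be generated from /proc/mdstat instead of being typed by hand. The following sketch only prints the commands so that you can review them before running anything. The function name plan_disk_removal is hypothetical, and the pattern assumes sdX-style disk names as used in this article.

```shell
# Sketch: print (not run) the mdadm commands that mark and remove every
# RAID member partition located on one disk.
# plan_disk_removal is an illustrative name; $1 is the disk (e.g. sdb),
# $2 an optional mdstat file (default: /proc/mdstat).
plan_disk_removal() {
    awk -v disk="$1" '
        /^md[0-9]+ :/ {
            for (i = 3; i <= NF; i++) {
                part = $i
                sub(/\[.*/, "", part)     # sdb3[2](F) -> sdb3
                if (part ~ ("^" disk "[0-9]+$")) {
                    print "mdadm /dev/" $1 " -f /dev/" part
                    print "mdadm -r /dev/" $1 " /dev/" part
                }
            }
        }
    ' "${2:-/proc/mdstat}"
}
```

With the RAID status from the beginning of this section, plan_disk_removal sdb prints the same four mdadm commands that are entered manually in the examples above.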
Check which swap partitions are used
Check which swap partitions are used by the operating system. To do this, type the following command:
[root@host ~]# cat /proc/swaps
Filename Type Size Used Priority
/dev/sda2 partition 9765884 0 -1
/dev/sdb2 partition 9765884 0 -2
Alternatively, you can check which swap partitions are defined in fstab by entering the following command:
[root@host ~]# grep swap /etc/fstab
/dev/sda2 none swap sw
/dev/sdb2 none swap sw
Disable swap partition on the defective device
Disable the swap partition on the defective disk so that the disk can be replaced. To do this, type the following command:
[root@host ~]# swapoff PATH_OF_SWAP_PARTITION
Example:
[root@host ~]# swapoff /dev/sdb2
Please Note
If the swap partition on the defective disk is not deactivated and a disk replacement is performed, the swap partition in /proc/swaps receives the deleted status.
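To find the active swap partitions of the defective disk in one step, the contents of /proc/swaps can be filtered by disk name. This is a minimal sketch with the hypothetical function name swap_on_disk. It again assumes sdX-style disk names and only lists the partitions; it does not disable them.

```shell
# Sketch: list active swap partitions that are located on one disk.
# swap_on_disk is an illustrative name; $1 is the disk (e.g. sdb),
# $2 an optional swaps file (default: /proc/swaps).
swap_on_disk() {
    awk -v disk="$1" '
        # Skip the header line, then match /dev/<disk><number>.
        NR > 1 && $1 ~ ("^/dev/" disk "[0-9]+$") { print $1 }
    ' "${2:-/proc/swaps}"
}
```

With the /proc/swaps contents shown above, swap_on_disk sdb prints /dev/sdb2, which is the partition to pass to swapoff.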
Arranging for Hard Disk Replacement
Now the replacement of the defective hard disk can be arranged. For this purpose please contact IONOS Customer Service.
Required Steps After Replacing the Hard Disk
After replacing the defective hard disk, it is necessary that you rebuild the software RAID. For more information about rebuilding a software RAID, click here: