Is Your SSD Failing? Learn to Check Its Health on Linux
If your data center runs Linux machines, one of your management tasks is to regularly check the health of the SSDs in those machines. Why? Because, although solid-state drives significantly outlast spinning-platter hard drives, they do have a limited lifespan. The last thing you want is to be caught off guard on the day a drive finally fails.
How do you check the health of these drives? As with everything in Linux, there are options. Although GUI solutions exist (such as GNOME Disks), I highly recommend using command line tools for this task. Why? Most of the time, your Linux server won’t have a GUI, and with a command line tool you can simply secure shell into a remote Linux server and run the tests from the terminal.
The tool in question is smartctl. With this command, you can quickly get a picture of your SSD’s health. Of course, how much you get out of the command depends on the make and model of your SSD. Unfortunately, the SMART (Self-Monitoring, Analysis and Reporting Technology) tools aren’t always in sync with every SSD.
Because of that, you may not be able to determine exactly how many writes remain before the SSD dies. Even so, you can get a good estimate of the wear and tear on your drive.
Let’s install and use smartctl.
Install
I’ll demonstrate on Ubuntu. The required package is available in the standard repositories of most distributions, so adapt the installation command to your distribution of choice.
The smartctl utility is part of the smartmontools package, which can be installed with a single command:
sudo apt install smartmontools
Note that the above command will also install libgsasl7, libkyotocabinet16v5, libmailutils5, libntlm0, mailutils, mailutils-common, and postfix.
After installing the package, you’re ready to start using it.
Usage
To use the smartctl tool, the first thing you have to do is collect information about the drive. This is done with the following command:
sudo smartctl -i /dev/sdX
where sdX is the name of the drive to be tested.
The above command will print detailed information about your drive, such as the model, serial number, firmware version, and capacity.
If the output shows that the drive is in the smartctl database, the reported information should be accurate and up to date.
Let’s run a short test on the drive. The self-tests give you the most accurate data on your drive, so it’s important to use them. Issue the command:
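If you are not sure which device name to substitute for sdX, the following is a minimal sketch for finding candidates. It assumes lsblk (from util-linux) is installed; filter_ssds is a hypothetical helper name, not part of smartmontools. Non-rotational virtual devices (loop devices, for instance) can also show up, so treat the result as a starting point.

```shell
#!/bin/sh
# Sketch: pick out non-rotational disks (likely SSDs) from lsblk output.
# lsblk reports ROTA as 0 for non-rotational devices and 1 for spinning disks.

# Hypothetical helper: reads "NAME ROTA" pairs on stdin and prints the
# /dev path of every device whose ROTA column is 0.
filter_ssds() {
    awk '$2 == "0" { print "/dev/" $1 }'
}

# Usage on a real system (not executed here):
#   lsblk -d -n -o NAME,ROTA | filter_ssds
#   sudo smartctl --scan     # lists the devices smartctl itself detects
```

Any path printed this way can then be passed to sudo smartctl -i to confirm it is the drive you expect.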
sudo smartctl -t short -a /dev/sdX
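This starts the test in the background and immediately prints the drive’s current information. The results of the test itself are not available until the test has finished.
I recommend that you run both short and long tests on your drives on a weekly (or monthly) basis. To run a long test, the command is: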
sudo smartctl -t long -a /dev/sdX
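One way to keep to a weekly or monthly routine is to start the tests from root’s crontab. The sketch below is only an illustration: start_selftest, the SMARTCTL override variable, the script path, and the schedule are all assumptions, not part of smartmontools.

```shell
#!/bin/sh
# Sketch: a small wrapper for starting self-tests from cron.
# SMARTCTL may be overridden (for a dry run, say); it defaults to smartctl.
SMARTCTL="${SMARTCTL:-smartctl}"

# Hypothetical helper: start a short or long self-test on one device.
start_selftest() {
    kind="$1"
    dev="$2"
    case "$kind" in
        short|long)
            "$SMARTCTL" -t "$kind" "$dev"
            ;;
        *)
            echo "usage: start_selftest short|long /dev/sdX" >&2
            return 2
            ;;
    esac
}

# To use this as a script, add the line  start_selftest "$@"  at the end and
# save it somewhere such as /usr/local/sbin/smart-selftest (assumed path).
# Example root crontab entries (assumed schedule):
#   0 2 * * 0  /usr/local/sbin/smart-selftest short /dev/sda   # Sundays at 02:00
#   0 3 1 * *  /usr/local/sbin/smart-selftest long  /dev/sda   # 1st of the month at 03:00
```

The smartmontools package also includes the smartd daemon, which can schedule self-tests on its own; the cron approach is shown here only because it reuses the commands covered in this article.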
One of the first things to look at is the SMART overall-health self-assessment test result. It should read PASSED. If it doesn’t, you know right away that something is wrong with your SSD.
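If you want to check that result from a script, a minimal sketch might look like the following. The helper name health_passed is hypothetical, and the pattern assumes the ATA-style result line; some device types (SCSI/SAS drives, for example) word their health status differently.

```shell
#!/bin/sh
# Sketch: succeed only if smartctl output on stdin reports a PASSED health check.

# Hypothetical helper: exit status 0 when the overall-health line says PASSED.
health_passed() {
    grep -q 'overall-health self-assessment test result: PASSED'
}

# Usage on a real system (not executed here):
#   if sudo smartctl -H /dev/sdX | health_passed; then
#       echo "health check passed"
#   else
#       echo "health check did not pass; inspect the drive" >&2
#   fi
```

The -H option prints only the health summary, which keeps the output short when you are checking several drives.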
The short test will check the following:
- Electrical properties: The controller tests its own electronics; the exact checks are specific to each manufacturer.
- Mechanical properties: Servos and positioning mechanisms are tested (also manufacturer-specific). This applies to spinning drives, since an SSD has no moving parts.
- Read/verify: A portion of the disk is read to verify that data can be retrieved (the size and location of the region read are specific to each manufacturer).
The long test runs everything included in the short test, with two differences:
- There is no time limit on the read/verify segment.
- The entire disk is checked (instead of just a portion).
The short test takes approximately two minutes to complete, while the long test takes 20-60 minutes (depending on your hardware). To view the test results, issue the command sudo smartctl -a /dev/sdX (where sdX is the name of the drive being tested).
This command will print the test results and all the information needed to verify the health of the SSD.
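If you only want the self-test history rather than the full report, smartctl -l selftest prints just that log. The sketch below pulls out the most recent entry; latest_selftest is a hypothetical helper, and it assumes the ATA-style log layout in which entries are numbered from "# 1" (the newest).

```shell
#!/bin/sh
# Sketch: show the most recent entry from the SMART self-test log.

# Hypothetical helper: reads self-test log text on stdin and prints the
# first line that starts with "# 1", i.e. the newest test in the log.
latest_selftest() {
    awk '/^# *1 / { print; exit }'
}

# Usage on a real system (not executed here):
#   sudo smartctl -l selftest /dev/sdX | latest_selftest
```

A status of "Completed without error" on that line is what you hope to see; anything else is worth investigating with the full -a output.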
In addition to the self-test log, there are two values in the output worth checking:
- Power_On_Hours: The number of hours the drive has been powered on. Each make and model of drive has a recommended lifespan, expressed as the number of hours it can be used. The longevity of most modern SSDs is pretty incredible, so you’re unlikely to hit end-of-life, although it may be a concern if you are using an older drive.
- Wear_Leveling_Count: Represents the percentage of the drive’s remaining endurance (it starts at 100 and decreases linearly as data is written to the drive).
It is important to look at both the VALUE and WORST columns. For example, my Samsung SSD currently reports a Wear_Leveling_Count value of 99, which indicates a very healthy drive.
One thing to keep in mind is that different manufacturers report different data through smartctl. For example, I have older Intel and Kingston SSDs connected to the same machine. Both drives report similar (and more comprehensive) data, but neither reports Wear_Leveling_Count. Why? They are older drives that do not report ID 177 (Wear_Leveling_Count). In that case, your best option is to run short and long tests and verify the drive’s health through those reports.
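To read these attributes from a script, something like the following sketch could work. The helper names attr_value and attr_raw are hypothetical, and the column positions assume the usual ATA attribute table printed by smartctl -A (VALUE in the fourth column, RAW_VALUE starting at the tenth). If a drive does not report the attribute, the helpers print nothing.

```shell
#!/bin/sh
# Sketch: extract single attributes from `smartctl -A` output on stdin.

# Hypothetical helper: print the normalized VALUE column for one attribute.
# Adding 0 strips the leading zeros smartctl prints (099 becomes 99).
attr_value() {
    awk -v name="$1" '$2 == name { print $4 + 0 }'
}

# Hypothetical helper: print the first token of the RAW_VALUE column.
attr_raw() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Usage on a real system (not executed here):
#   sudo smartctl -A /dev/sdX | attr_value Wear_Leveling_Count
#   sudo smartctl -A /dev/sdX | attr_raw Power_On_Hours
```

An empty result for Wear_Leveling_Count is the scripted equivalent of the older Intel and Kingston drives described above: the attribute is simply not reported, so fall back on the self-test results.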
The caveats
There are two caveats to using smartctl.
First, the reported data can easily be misinterpreted, so you must know the make and model of the drive you are testing. Once you have that information, you can research what the reported data means for that drive and investigate any anomalies.
Second, it is crucial to actually run the tests. Although you can issue a command such as smartctl -A /dev/sdX, which prints only the attribute table, you don’t get the added benefit of the test results. Make sure to run short and long tests regularly to get the most up-to-date information possible about your SSDs.
This article was originally published in October 2017.
2024-12-17 21:30:11