Zabbix v6 SMART HDD and CPU Temperature Check

TST, Hong Kong

Install Smartmontools and LM Sensors

apt install lm-sensors smartmontools

Hard Drive Monitoring

S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology; often written as SMART) is a monitoring system included in computer hard disk drives (HDDs), solid-state drives (SSDs), and eMMC drives.

The smartmontools package comes with two utilities: smartctl, which you can use to check your drives on the command line, and smartd, a daemon that checks your drives at a specified interval. smartd logs warnings and errors to syslog and can also send them to a specified email address (usually the admin of the system).

smartctl -V

smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.0-11-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

Using Smartctl

Hard Drives

Find partition:

df -h
Filesystem Size Used Avail Use% Mounted on
tmpfs 6.3G 2.3M 6.3G 1% /run
/dev/mapper/ubuntu--vg-ubuntu--lv 98G 36G 58G 38% /
tmpfs 32G 36M 32G 1% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
/dev/md126p2 2.0G 428M 1.4G 24% /boot
/dev/md126p1 1.1G 6.1M 1.1G 1% /boot/efi

smartctl --scan
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device
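If you want to loop over every detected drive later on, the scan output is easy to reduce to device path and type. The following is only a minimal sketch; the function name list_smart_devices is my own and not part of smartmontools. It reads the scan output from stdin, so it can be tried without root access:

```shell
#!/bin/bash
# Reduce `smartctl --scan` output (read from stdin) to "<device> <type>" pairs.
# Input lines look like: /dev/sda -d scsi # /dev/sda, SCSI device
# list_smart_devices is a hypothetical helper name, not a smartmontools command.
list_smart_devices() {
    awk '$2 == "-d" { print $1, $3 }'
}

# Usage on a real host (as root):
#   smartctl --scan | list_smart_devices
```

Each output line can then be fed to smartctl -a to query the drives one by one.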

Keep in mind that a virtual machine, like my test server, does not have access to the underlying HDD hardware, so SMART data can only be read on the physical host. There, querying /dev/sda returns:

smartctl -a /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-97-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Red
Device Model: WDC WD10EFRX-68FYTN0
Serial Number: WD-WCC4J4NHYJJ2
LU WWN Device Id: 5 0014ee 269c5648a
Firmware Version: 82.00A82
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sat Mar 9 09:21:06 2024 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (14100) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 160) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x303d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 169 133 021 Pre-fail Always - 2516
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 63
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 075 075 000 Old_age Always - 18640
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 62
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 59
193 Load_Cycle_Count 0x0032 199 199 000 Old_age Always - 5458
194 Temperature_Celsius 0x0022 121 094 000 Old_age Always - 22
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline - 0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 18594 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
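Most of this report is noise for monitoring purposes. What you usually want is the RAW_VALUE (tenth column) of a handful of attributes such as Temperature_Celsius, Reallocated_Sector_Ct or Current_Pending_Sector. A minimal sketch of a generic extractor, assuming the ATA attribute table layout shown above; the function name smart_attr_raw is made up for this example:

```shell
#!/bin/bash
# Print the RAW_VALUE column of one attribute from `smartctl -a` output on stdin.
# $1: attribute name exactly as printed in the ATTRIBUTE_NAME column
# smart_attr_raw is a hypothetical helper name.
smart_attr_raw() {
    awk -v name="$1" '$2 == name { print $10; exit }'
}

# Usage on a real host (as root):
#   smartctl -a /dev/sda | smart_attr_raw Temperature_Celsius
#   smartctl -a /dev/sda | smart_attr_raw Reallocated_Sector_Ct
```

Matching on the exact attribute name in column two avoids accidental hits in other parts of the report.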

NVMe Drives

Additional NVMe drive:

df -h

Filesystem Size Used Avail Use% Mounted on
tmpfs 32G 2.9M 32G 1% /run
/dev/mapper/ubuntu--vg--1-ubuntu--lv 438G 81G 338G 20% /
tmpfs 63G 914M 62G 2% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
/dev/nvme0n1p2 2.0G 304M 1.5G 17% /boot
/dev/nvme0n1p1 1.1G 6.1M 1.1G 1% /boot/efi
tmpfs 13G 4.0K 13G 1% /run/user/1000

smartctl --scan
/dev/nvme0 -d nvme # /dev/nvme0, NVMe device
/dev/nvme1 -d nvme # /dev/nvme1, NVMe device

smartctl -a /dev/nvme0

smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-91-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number: SanDisk Extreme Pro 500GB
Serial Number: 212181449612
Firmware Version: 111130WD
PCI Vendor/Subsystem ID: 0x15b7
IEEE OUI Identifier: 0x001b44
Total NVM Capacity: 500,107,862,016 [500 GB]
Unallocated NVM Capacity: 0
Controller ID: 8215
NVMe Version: 1.3
Number of Namespaces: 1
Namespace 1 Size/Capacity: 500,107,862,016 [500 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 001b44 4a4944db11
Local Time is: Sat Mar 9 09:45:02 2024 CET
Firmware Updates (0x14): 2 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x0e): Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size: 128 Pages
Warning Comp. Temp. Threshold: 84 Celsius
Critical Comp. Temp. Threshold: 88 Celsius
Namespace 1 Features (0x02): NA_Fields

Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 5.50W - - 0 0 0 0 0 0
1 + 3.50W - - 1 1 1 1 0 0
2 + 3.00W - - 2 2 2 2 0 0
3 - 0.0700W - - 3 3 3 3 4000 10000
4 - 0.0035W - - 4 4 4 4 4000 40000

Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 2
1 - 4096 0 1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 51 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 3%
Data Units Read: 7,580,500 [3.88 TB]
Data Units Written: 37,759,770 [19.3 TB]
Host Read Commands: 38,034,128
Host Write Commands: 1,870,013,477
Controller Busy Time: 279
Power Cycles: 25
Power On Hours: 13,542
Unsafe Shutdowns: 17
Media and Data Integrity Errors: 0
Error Information Log Entries: 1
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged
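NVMe drives report their health as "Label: value" pairs instead of an attribute table. Here is a rough sketch of how a single decimal field such as Temperature or Percentage Used could be pulled out, assuming the layout shown above. The function name nvme_health_value is my own invention:

```shell
#!/bin/bash
# Print the first number of one field from the NVMe SMART/Health section on stdin.
# $1: exact field label without the colon, e.g. "Temperature" or "Percentage Used"
# Intended for decimal fields only (not for hex values like Critical Warning).
# nvme_health_value is a hypothetical helper name.
nvme_health_value() {
    awk -F': *' -v label="$1" '
        $1 == label {
            split($2, parts, " ")      # keep only the first token after the colon
            value = parts[1]
            gsub(/[^0-9]/, "", value)  # drop thousands separators and units such as %
            print value
            exit
        }'
}

# Usage on a real host (as root):
#   smartctl -a /dev/nvme0 | nvme_health_value "Temperature"
#   smartctl -a /dev/nvme0 | nvme_health_value "Percentage Used"
```

Because the label has to match exactly, the lines "Warning Comp. Temperature Time" and "Critical Comp. Temperature Time" are not picked up by accident.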

If the output shows SMART support is: Disabled, run the following command against the drive (not a partition) to enable it:

smartctl -s on -a /dev/sda
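To avoid switching SMART on blindly, you can check the information section first. A small sketch, assuming the "SMART support is:" lines look like the ones in the report above; smart_is_enabled is a made-up helper name:

```shell
#!/bin/bash
# Exit 0 if `smartctl -i` output on stdin reports SMART as enabled, else non-zero.
# smart_is_enabled is a hypothetical helper name.
smart_is_enabled() {
    grep -Eq '^SMART support is:[[:space:]]+Enabled'
}

# Usage on a real host (as root): enable SMART only when it is not already on.
#   smartctl -i /dev/sda | smart_is_enabled || smartctl -s on /dev/sda
```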

CPU Temperature

sensors

coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +50.0°C (high = +100.0°C, crit = +100.0°C)
Core 0: +50.0°C (high = +100.0°C, crit = +100.0°C)
Core 1: +42.0°C (high = +100.0°C, crit = +100.0°C)
Core 2: +46.0°C (high = +100.0°C, crit = +100.0°C)
Core 3: +48.0°C (high = +100.0°C, crit = +100.0°C)
Core 4: +44.0°C (high = +100.0°C, crit = +100.0°C)
Core 5: +49.0°C (high = +100.0°C, crit = +100.0°C)
Core 6: +44.0°C (high = +100.0°C, crit = +100.0°C)
Core 7: +47.0°C (high = +100.0°C, crit = +100.0°C)

nvme-pci-0900
Adapter: PCI adapter
Composite: +48.9°C (low = -5.2°C, high = +83.8°C)
(crit = +87.8°C)

acpitz-acpi-0
Adapter: ACPI interface
temp1: +16.8°C (crit = +20.8°C)
temp2: +27.8°C (crit = +105.0°C)

iwlwifi_1-virtual-0
Adapter: Virtual device
temp1: N/A

enp7s0-pci-0700
Adapter: PCI adapter
PHY Temperature: +47.9°C
MAC Temperature: +48.5°C

nvme-pci-0300
Adapter: PCI adapter
Composite: +50.9°C (low = -5.2°C, high = +83.8°C)
(crit = +87.8°C)
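For a single "how hot is the CPU" value, the hottest core is often more useful than eight separate readings. A minimal sketch that assumes the coretemp output format shown above; max_core_temp is my own helper name:

```shell
#!/bin/bash
# Print the highest "Core N" temperature (whole degrees Celsius) from `sensors`
# output on stdin. Package, NVMe and ACPI readings are ignored.
# max_core_temp is a hypothetical helper name.
max_core_temp() {
    awk -F'[+.]' '
        /^Core [0-9]+:/ {
            t = $2 + 0
            if (!seen || t > max) { max = t; seen = 1 }
        }
        END { if (seen) print max }'
}

# Usage:
#   sensors | max_core_temp
```

The function prints nothing when no Core line is present, for example inside a virtual machine without the coretemp driver.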

Zabbix

Preparing Zabbix-Agent2

Add the zabbix user to sudoers. Edit the file with visudo, which checks the syntax before saving:

visudo
# Zabbix user SMART control
zabbix ALL=(ALL) NOPASSWD:/usr/sbin/smartctl

# Zabbix user LM Sensors
zabbix ALL=NOPASSWD:/bin/sensors
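An alternative to editing the main sudoers file is a separate drop-in file. The following sketch assumes that your distribution includes /etc/sudoers.d (the Debian and Ubuntu default). The file name zabbix-sensors and the function name render_zabbix_sudoers are placeholders of mine:

```shell
#!/bin/bash
# Print the sudoers rules for the zabbix user; the paths are the ones used above.
# render_zabbix_sudoers is a hypothetical helper name.
render_zabbix_sudoers() {
    cat <<'EOF'
# Zabbix user SMART control
zabbix ALL=(ALL) NOPASSWD: /usr/sbin/smartctl
# Zabbix user LM Sensors
zabbix ALL=(ALL) NOPASSWD: /bin/sensors
EOF
}

# Usage (as root); zabbix-sensors is a placeholder file name:
#   render_zabbix_sudoers > /etc/sudoers.d/zabbix-sensors
#   chmod 0440 /etc/sudoers.d/zabbix-sensors
#   visudo -c -f /etc/sudoers.d/zabbix-sensors
```

The final visudo -c -f call only checks the syntax of the new file and does not open an editor.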

Let's start by allowing the Zabbix server to execute ANY script (parental supervision advised):

nano /etc/zabbix/zabbix_agent2.conf
### Option: AllowKey
# Allow execution of item keys matching pattern.
# Multiple keys matching rules may be defined in combination with DenyKey.
# Key pattern is wildcard expression, which support "*" character to match any number of any characters in ce>
# Parameters are processed one by one according their appearance order.
# If no AllowKey or DenyKey rules defined, all keys are allowed.
#
# Mandatory: no
AllowKey=system.run[*]

Direct CLI Command Execution

Now prepare a few sensors/smartctl commands to extract single values of interest. The Zabbix agent runs them as the zabbix user, so smartctl has to be called through sudo:

sudo smartctl -a /dev/sda | grep Temperature_Celsius | awk '{print $10}'
23

sudo smartctl -a /dev/nvme0 | grep Temperature | awk '{print $2}' | grep -o '[0-9]\+'
51

sudo sensors | grep 'Core 0' | awk -F'[+|.]' '{print $2}'
30
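A pipeline like the ones above silently returns an empty string (or several lines) when the drive or sensor is missing, which tends to end in an unsupported item on the Zabbix side. As a hedged sketch, a small guard can make that failure explicit. The function name require_integer is my own:

```shell
#!/bin/bash
# Pass stdin through only if it is exactly one non-negative integer.
# Anything else (empty output, several lines, text) is reported on stderr
# and the function returns a non-zero exit code.
# require_integer is a hypothetical helper name.
require_integer() (
    value="$(cat)"
    case "$value" in
        ''|*[!0-9]*)
            echo "unexpected sensor output: '$value'" >&2
            exit 1
            ;;
        *)
            printf '%s\n' "$value"
            ;;
    esac
)

# Usage:
#   sudo smartctl -a /dev/sda | grep Temperature_Celsius | awk '{print $10}' | require_integer
```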

We can now add these commands as scripts in the Scripts configuration of our Zabbix server.

As manual script items, we can now execute those scripts directly from our global dashboard.

If you run into the error message Cannot execute script. Unknown metric system.run, the AllowKey rule from the step above is missing from zabbix_agent2.conf, or you forgot to restart the Zabbix Agent service after changing it.

If everything is set up correctly, your server should now be able to retrieve the temperature value from your host system.

Working with Shell Scripts

To get rid of the nasty wildcard execution permission, we can now move the direct commands into shell scripts. Put each CLI command you want to execute into its own shell file in a directory accessible to the Zabbix Agent:

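/opt/zabbix/temp_sda.sh

#!/bin/bash
# Raw value (column 10) of SMART attribute 194, the drive temperature in °C
sudo smartctl -a /dev/sda | grep Temperature_Celsius | awk '{print $10}'

/opt/zabbix/temp_nvme0.sh

#!/bin/bash
# NVMe temperature in °C; smartctl needs root, hence sudo
sudo smartctl -a /dev/nvme0 | grep Temperature | awk '{print $2}' | grep -o '[0-9]\+'

/opt/zabbix/temp_core0.sh

#!/bin/bash
# Integer part of the Core 0 temperature in °C
sudo sensors | grep 'Core 0' | awk -F'[+|.]' '{print $2}'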

and so on...
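Writing eight nearly identical core scripts by hand gets tedious. As a rough sketch, they could also be generated in a loop. The function name generate_core_scripts is an assumption of mine, and the generated command mirrors the Core 0 script above, with a colon added to the pattern so that Core 1 does not also match Core 10 on larger CPUs:

```shell
#!/bin/bash
# Write one temp_coreN.sh per CPU core into a target directory.
# $1: target directory, $2: number of cores
# generate_core_scripts is a hypothetical helper name.
generate_core_scripts() {
    dir="$1"
    count="$2"
    i=0
    while [ "$i" -lt "$count" ]; do
        # Unquoted heredoc: $i is expanded now, \$2 is kept literally for awk.
        cat > "$dir/temp_core$i.sh" <<EOF
#!/bin/bash
sudo sensors | grep 'Core $i:' | awk -F'[+|.]' '{print \$2}'
EOF
        chmod +x "$dir/temp_core$i.sh"
        i=$((i + 1))
    done
}

# Usage (as a user with write access to the directory):
#   generate_core_scripts /opt/zabbix 8
```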

Now replace the wildcard with the explicit script file calls to exclude any script not specifically defined by you:

nano /etc/zabbix/zabbix_agent2.conf
### Option: AllowKey
# Allow execution of item keys matching pattern.
# Multiple keys matching rules may be defined in combination with DenyKey.
# Key pattern is wildcard expression, which support "*" character to match any number of any characters in ce>
# Parameters are processed one by one according their appearance order.
# If no AllowKey or DenyKey rules defined, all keys are allowed.
#
# Mandatory: no
AllowKey=system.run[sh /opt/zabbix/temp_nvme0.sh]
AllowKey=system.run[sh /opt/zabbix/temp_nvme1.sh]
AllowKey=system.run[sh /opt/zabbix/temp_sda.sh]
AllowKey=system.run[sh /opt/zabbix/temp_sdb.sh]
AllowKey=system.run[sh /opt/zabbix/temp_core0.sh]
AllowKey=system.run[sh /opt/zabbix/temp_core1.sh]
AllowKey=system.run[sh /opt/zabbix/temp_core2.sh]
AllowKey=system.run[sh /opt/zabbix/temp_core3.sh]
AllowKey=system.run[sh /opt/zabbix/temp_core4.sh]
AllowKey=system.run[sh /opt/zabbix/temp_core5.sh]
AllowKey=system.run[sh /opt/zabbix/temp_core6.sh]
AllowKey=system.run[sh /opt/zabbix/temp_core7.sh]
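The AllowKey list has to stay in sync with the files in the script directory, and a key only matches if it is identical to the command the server sends. One way to avoid typos is to print the lines from the directory content. This is only a sketch, and the function name allowkey_lines is my own:

```shell
#!/bin/bash
# Print one AllowKey line per *.sh file found in the given directory.
# $1: script directory, e.g. /opt/zabbix
# allowkey_lines is a hypothetical helper name.
allowkey_lines() {
    for script in "$1"/*.sh; do
        [ -e "$script" ] || continue   # directory without any *.sh files
        printf 'AllowKey=system.run[sh %s]\n' "$script"
    done
}

# Usage: review the output, then paste it into zabbix_agent2.conf.
#   allowkey_lines /opt/zabbix
```

I would review the output by hand rather than append it to the configuration automatically, since every listed script becomes remotely executable.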

Now change the scripts accordingly on the Zabbix server, so that each one calls its shell file exactly as written in the AllowKey rule.

Finally, verify that it is still working by executing one of the scripts from the dashboard again.
