Zabbix v6 SMART HDD and CPU Temperature Check
Install Smartmontools and LM Sensors
apt install lm-sensors smartmontools
Hard Drive Monitoring
S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology; often written as SMART) is a monitoring system included in computer hard disk drives (HDDs), solid-state drives (SSDs), and eMMC drives.
The smartmontools package comes with two utilities: smartctl, which checks your drives from the command line, and smartd, a daemon that checks them at a specified interval, logs warnings and errors to the syslog, and can also send them to a specified email address (usually the admin of the system).
smartctl -V
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.0-11-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
Using Smartctl
Hard Drives
Find partition:
df -h
Filesystem Size Used Avail Use% Mounted on
tmpfs 6.3G 2.3M 6.3G 1% /run
/dev/mapper/ubuntu--vg-ubuntu--lv 98G 36G 58G 38% /
tmpfs 32G 36M 32G 1% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
/dev/md126p2 2.0G 428M 1.4G 24% /boot
/dev/md126p1 1.1G 6.1M 1.1G 1% /boot/efi
smartctl --scan
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device
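The scan output is regular enough to drive a loop over every detected drive. A minimal sketch, assuming the two-line output shown above; the helper name `list_smart_devices` is hypothetical, and the sample text stands in for a live `smartctl --scan` call:

```shell
#!/bin/sh
# Hypothetical helper: read `smartctl --scan` output on stdin and print
# "<device> <type>" for every detected drive.
list_smart_devices() {
    # Field 1 is the device node, field 3 is the type reported after -d.
    awk '$2 == "-d" { print $1, $3 }'
}

# Sample input copied from the scan output above. On a real host use:
#   smartctl --scan | list_smart_devices
scan_output='/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device'

printf '%s\n' "$scan_output" | list_smart_devices | while read -r dev type; do
    echo "found $dev (type: $type)"
done
```

Each device name printed this way can then be passed to `smartctl -a`.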
Note that on a virtual machine, like my test server, smartctl cannot query the drive because the guest has no access to the underlying HDD hardware (/dev/sda). On a physical host the full report looks like this:
smartctl -a /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-97-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Red
Device Model: WDC WD10EFRX-68FYTN0
Serial Number: WD-WCC4J4NHYJJ2
LU WWN Device Id: 5 0014ee 269c5648a
Firmware Version: 82.00A82
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sat Mar 9 09:21:06 2024 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (14100) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 160) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x303d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 169 133 021 Pre-fail Always - 2516
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 63
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 075 075 000 Old_age Always - 18640
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 62
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 59
193 Load_Cycle_Count 0x0032 199 199 000 Old_age Always - 5458
194 Temperature_Celsius 0x0022 121 094 000 Old_age Always - 22
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 18594 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
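For monitoring, only a handful of values from this report matter. The following is a sketch of how they could be pulled out by attribute ID rather than by name; the helper names `smart_attr_raw` and `smart_health_flag` are hypothetical, and the sample lines are copied from the report above so the snippet runs without a drive attached:

```shell
#!/bin/sh
# Hypothetical helpers that parse `smartctl -a` output from stdin.

# Print the RAW_VALUE column (field 10) of the attribute with the given ID.
# The NF check skips short lines that merely start with the same number.
smart_attr_raw() {
    awk -v id="$1" '$1 == id && NF >= 10 { print $10 }'
}

# Print 1 if the overall health self-assessment is PASSED, otherwise 0.
smart_health_flag() {
    if grep -q 'overall-health self-assessment test result: PASSED'; then
        echo 1
    else
        echo 0
    fi
}

# Sample lines from the report above. On a real host use e.g.:
#   smartctl -a /dev/sda | smart_attr_raw 194
report='SMART overall-health self-assessment test result: PASSED
  9 Power_On_Hours 0x0032 075 075 000 Old_age Always - 18640
194 Temperature_Celsius 0x0022 121 094 000 Old_age Always - 22'

printf '%s\n' "$report" | smart_attr_raw 194   # temperature in Celsius
printf '%s\n' "$report" | smart_attr_raw 9     # power-on hours
printf '%s\n' "$report" | smart_health_flag    # 1 = PASSED
```

Matching on the ID column avoids surprises when a vendor names the same attribute slightly differently.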
NVMe Drives
Additional NVMe drive:
df -h
Filesystem Size Used Avail Use% Mounted on
tmpfs 32G 2.9M 32G 1% /run
/dev/mapper/ubuntu--vg--1-ubuntu--lv 438G 81G 338G 20% /
tmpfs 63G 914M 62G 2% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
/dev/nvme0n1p2 2.0G 304M 1.5G 17% /boot
/dev/nvme0n1p1 1.1G 6.1M 1.1G 1% /boot/efi
tmpfs 13G 4.0K 13G 1% /run/user/1000
smartctl --scan
/dev/nvme0 -d nvme # /dev/nvme0, NVMe device
/dev/nvme1 -d nvme # /dev/nvme1, NVMe device
smartctl -a /dev/nvme0
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-91-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: SanDisk Extreme Pro 500GB
Serial Number: 212181449612
Firmware Version: 111130WD
PCI Vendor/Subsystem ID: 0x15b7
IEEE OUI Identifier: 0x001b44
Total NVM Capacity: 500,107,862,016 [500 GB]
Unallocated NVM Capacity: 0
Controller ID: 8215
NVMe Version: 1.3
Number of Namespaces: 1
Namespace 1 Size/Capacity: 500,107,862,016 [500 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 001b44 4a4944db11
Local Time is: Sat Mar 9 09:45:02 2024 CET
Firmware Updates (0x14): 2 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x0e): Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size: 128 Pages
Warning Comp. Temp. Threshold: 84 Celsius
Critical Comp. Temp. Threshold: 88 Celsius
Namespace 1 Features (0x02): NA_Fields
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 5.50W - - 0 0 0 0 0 0
1 + 3.50W - - 1 1 1 1 0 0
2 + 3.00W - - 2 2 2 2 0 0
3 - 0.0700W - - 3 3 3 3 4000 10000
4 - 0.0035W - - 4 4 4 4 4000 40000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 2
1 - 4096 0 1
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 51 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 3%
Data Units Read: 7,580,500 [3.88 TB]
Data Units Written: 37,759,770 [19.3 TB]
Host Read Commands: 38,034,128
Host Write Commands: 1,870,013,477
Controller Busy Time: 279
Power Cycles: 25
Power On Hours: 13,542
Unsafe Shutdowns: 17
Media and Data Integrity Errors: 0
Error Information Log Entries: 1
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged
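The NVMe health section uses "Name: value" lines instead of an attribute table, so it needs a slightly different parser. A minimal sketch, assuming the field names shown above; the helper name `nvme_field` is hypothetical and the sample text is copied from the report:

```shell
#!/bin/sh
# Hypothetical helper: print the leading number of one "Name: value" field
# from the SMART/Health section of `smartctl -a` NVMe output (stdin).
nvme_field() {
    awk -F': *' -v key="$1" '
        $1 == key {
            gsub(/,/, "", $2)            # drop thousands separators
            sub(/[^0-9].*$/, "", $2)     # drop units such as "%" or " Celsius"
            print $2
            exit
        }'
}

# Sample lines from the report above. On a real host use e.g.:
#   smartctl -a /dev/nvme0 | nvme_field 'Temperature'
health='Temperature: 51 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 3%
Data Units Written: 37,759,770 [19.3 TB]'

printf '%s\n' "$health" | nvme_field 'Temperature'
printf '%s\n' "$health" | nvme_field 'Percentage Used'
printf '%s\n' "$health" | nvme_field 'Data Units Written'
```

Because the field name has to match exactly, `Available Spare` and `Available Spare Threshold` are kept apart.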
If the report shows SMART support is: Disabled, run the following command to enable it:
smartctl -s on -a /dev/sda1
CPU Temperature
sensors
coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +50.0°C (high = +100.0°C, crit = +100.0°C)
Core 0: +50.0°C (high = +100.0°C, crit = +100.0°C)
Core 1: +42.0°C (high = +100.0°C, crit = +100.0°C)
Core 2: +46.0°C (high = +100.0°C, crit = +100.0°C)
Core 3: +48.0°C (high = +100.0°C, crit = +100.0°C)
Core 4: +44.0°C (high = +100.0°C, crit = +100.0°C)
Core 5: +49.0°C (high = +100.0°C, crit = +100.0°C)
Core 6: +44.0°C (high = +100.0°C, crit = +100.0°C)
Core 7: +47.0°C (high = +100.0°C, crit = +100.0°C)
nvme-pci-0900
Adapter: PCI adapter
Composite: +48.9°C (low = -5.2°C, high = +83.8°C)
(crit = +87.8°C)
acpitz-acpi-0
Adapter: ACPI interface
temp1: +16.8°C (crit = +20.8°C)
temp2: +27.8°C (crit = +105.0°C)
iwlwifi_1-virtual-0
Adapter: Virtual device
temp1: N/A
enp7s0-pci-0700
Adapter: PCI adapter
PHY Temperature: +47.9°C
MAC Temperature: +48.5°C
nvme-pci-0300
Adapter: PCI adapter
Composite: +50.9°C (low = -5.2°C, high = +83.8°C)
(crit = +87.8°C)
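Instead of tracking every core separately, one value for the hottest core is often enough. A sketch under the assumption that the core lines look like the coretemp output above; the helper name `max_core_temp` is hypothetical and the sample is a shortened copy of that output:

```shell
#!/bin/sh
# Hypothetical helper: read `sensors` output on stdin and print the highest
# "Core N" temperature as a whole number of degrees Celsius.
max_core_temp() {
    # Splitting on "+" and "." leaves the integer part of the reading in $2.
    awk -F'[+.]' '
        /^Core [0-9]+:/ { if ($2 + 0 > max) max = $2 + 0 }
        END { print max + 0 }'
}

# Shortened sample of the output above. On a real host use:
#   sensors | max_core_temp
sample='Package id 0: +50.0°C (high = +100.0°C, crit = +100.0°C)
Core 0: +50.0°C (high = +100.0°C, crit = +100.0°C)
Core 1: +42.0°C (high = +100.0°C, crit = +100.0°C)
Core 2: +46.0°C (high = +100.0°C, crit = +100.0°C)
Core 3: +48.0°C (high = +100.0°C, crit = +100.0°C)'

printf '%s\n' "$sample" | max_core_temp
```

The pattern only accepts lines starting with `Core`, so the package, NVMe and network adapter readings are ignored.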
Zabbix
Preparing Zabbix-Agent2
Add the zabbix agent to sudoers:
nano /etc/sudoers
# Zabbix user SMART control
zabbix ALL=(ALL) NOPASSWD:/usr/sbin/smartctl
# Zabbix user LM Sensors
zabbix ALL=NOPASSWD:/bin/sensors
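A syntax error in /etc/sudoers can lock you out of sudo, so it can be worth validating the rules before they go live. The sketch below writes the same two rules to a temporary file and checks them with `visudo -cf`; the drop-in name `/etc/sudoers.d/zabbix` is an assumption, and the install step is only printed because it needs root:

```shell
#!/bin/sh
# Write the two rules from above to a scratch file instead of editing
# /etc/sudoers in place.
fragment="$(mktemp)"
cat > "$fragment" <<'EOF'
# Zabbix user SMART control
zabbix ALL=(ALL) NOPASSWD:/usr/sbin/smartctl
# Zabbix user LM Sensors
zabbix ALL=NOPASSWD:/bin/sensors
EOF

# visudo -c only checks the syntax; -f points it at our scratch file.
if command -v visudo >/dev/null 2>&1; then
    visudo -cf "$fragment" && echo "syntax OK"
else
    echo "visudo not found, skipping syntax check"
fi

# Install step, printed rather than executed (assumed drop-in name):
echo "install -m 0440 -o root -g root $fragment /etc/sudoers.d/zabbix"
```

Keeping the rules in their own drop-in file also makes them easy to remove again later.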
Let's start by allowing the Zabbix server to execute ANY script (parental supervision advised):
nano /etc/zabbix/zabbix_agent2.conf
### Option: AllowKey
# Allow execution of item keys matching pattern.
# Multiple keys matching rules may be defined in combination with DenyKey.
# Key pattern is wildcard expression, which support "*" character to match any number of any characters in ce>
# Parameters are processed one by one according their appearance order.
# If no AllowKey or DenyKey rules defined, all keys are allowed.
#
# Mandatory: no
AllowKey=system.run[*]
Direct CLI Command Execution
Now prepare a few sensors and smartctl commands that each extract a single value of interest:
smartctl -a /dev/sda | grep Temperature_Celsius | awk '{print $10}'
23
smartctl -a /dev/nvme0 | grep Temperature | awk '{print $2}' | grep -o '[0-9]\+'
51
sudo sensors | grep 'Core 0' | awk -F'[+|.]' '{print $2}'
30
We can add these scripts to our Zabbix Server Scripts config:
As a manual script item we can now execute those scripts directly from our global dashboard:
If you run into the error message Cannot execute script. Unknown metric system.run, you either skipped the AllowKey step above or forgot to restart the Zabbix Agent service.
If everything is set up right your server should now be able to retrieve the Temperature value from your host system:
Working with Shell Scripts
To get rid of the nasty wildcard execution permission, we can replace the direct commands with shell scripts. Add each CLI command you want to execute to a separate shell file in a directory accessible to the Zabbix Agent:
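/opt/zabbix/temp_sda.sh
#!/bin/bash
# Column 10 of the attribute table is RAW_VALUE (degrees Celsius)
sudo smartctl -a /dev/sda | grep Temperature_Celsius | awk '{print $10}'
/opt/zabbix/temp_nvme0.sh
#!/bin/bash
# The zabbix user needs sudo here as well to read the drive
sudo smartctl -a /dev/nvme0 | grep Temperature | awk '{print $2}' | grep -o '[0-9]\+'
/opt/zabbix/temp_core0.sh
#!/bin/bash
# Splitting on "+" and "." leaves the integer part of the reading in $2
sudo sensors | grep 'Core 0' | awk -F'[+|.]' '{print $2}'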
and so on...
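Writing eight nearly identical core scripts by hand gets tedious, so they could also be generated. A minimal sketch; the function name `generate_core_scripts` and its arguments are hypothetical, and the demo writes to a scratch directory rather than /opt/zabbix:

```shell
#!/bin/sh
# Hypothetical generator: create temp_core0.sh .. temp_core<N-1>.sh in a
# directory. Usage: generate_core_scripts <directory> <number-of-cores>
generate_core_scripts() {
    dir="$1"
    cores="$2"
    i=0
    mkdir -p "$dir"
    while [ "$i" -lt "$cores" ]; do
        # The trailing colon keeps "Core 1" from also matching "Core 10"
        # on CPUs with more cores. \$2 stays literal for awk.
        cat > "$dir/temp_core$i.sh" <<EOF
#!/bin/bash
sudo sensors | grep 'Core $i:' | awk -F'[+|.]' '{print \$2}'
EOF
        chmod 755 "$dir/temp_core$i.sh"
        i=$((i + 1))
    done
}

# Demo against a scratch directory. On the monitored host you would pass
# /opt/zabbix and the core count reported by sensors (8 in the output above).
demo_dir="$(mktemp -d)"
generate_core_scripts "$demo_dir" 8
ls "$demo_dir"
```

The drive scripts differ in both the device name and the parsing pipeline, so they are probably easier to keep hand-written.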
Now replace the wildcard with the explicit script file calls to exclude any script not specifically defined by you:
nano /etc/zabbix/zabbix_agent2.conf
### Option: AllowKey
# Allow execution of item keys matching pattern.
# Multiple keys matching rules may be defined in combination with DenyKey.
# Key pattern is wildcard expression, which support "*" character to match any number of any characters in ce>
# Parameters are processed one by one according their appearance order.
# If no AllowKey or DenyKey rules defined, all keys are allowed.
#
# Mandatory: no
AllowKey=system.run[sh /opt/zabbix/temp_nvme0.sh]
AllowKey=system.run[sh /opt/zabbix/temp_nvme1.sh]
AllowKey=system.run[sh /opt/zabbix/temp_sda.sh]
AllowKey=system.run[sh /opt/zabbix/temp_sdb.sh]
AllowKey=system.run[sh /opt/zabbix/temp_core0.sh]
AllowKey=system.run[sh /opt/zabbix/temp_core1.sh]
AllowKey=system.run[sh /opt/zabbix/temp_core2.sh]
AllowKey=system.run[sh /opt/zabbix/temp_core3.sh]
AllowKey=system.run[sh /opt/zabbix/temp_core4.sh]
AllowKey=system.run[sh /opt/zabbix/temp_core5.sh]
AllowKey=system.run[sh /opt/zabbix/temp_core6.sh]
AllowKey=system.run[sh /opt/zabbix/temp_core7.sh]
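The AllowKey list has to stay in sync with the scripts on disk, so it can be derived from the directory content. A sketch, with the helper name `print_allow_keys` being hypothetical; the demo uses empty placeholder files in a scratch directory and prints the lines instead of touching the agent config:

```shell
#!/bin/sh
# Hypothetical helper: print one AllowKey line per temp_*.sh script.
# Usage: print_allow_keys <directory-to-scan> <path-prefix-for-the-key>
print_allow_keys() {
    dir="$1"
    prefix="$2"
    for script in "$dir"/temp_*.sh; do
        # If nothing matches, the glob stays literal; skip that case.
        [ -e "$script" ] || continue
        echo "AllowKey=system.run[sh $prefix/$(basename "$script")]"
    done
}

# Demo with two placeholder scripts. On the monitored host both arguments
# would be /opt/zabbix.
demo_dir="$(mktemp -d)"
touch "$demo_dir/temp_sda.sh" "$demo_dir/temp_core0.sh"
print_allow_keys "$demo_dir" /opt/zabbix
```

The printed lines can be pasted into zabbix_agent2.conf in place of the wildcard rule. The agent has to be restarted afterwards for the change to take effect.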
Now change the scripts accordingly on the Zabbix server:
Verify that it is still working: