Node Down Because Of /var Reaching 100% Disk Space

Overview

You may see alarms reporting that an STP node is down. This happens when the /var file system reaches 100% disk usage. In syslog, you may find these two errors:

Audit daemon is low on disk space for logging
The audit daemon is now halting the system
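
To confirm that a node went down for this reason, you can search the system log for the two messages above. The following is a minimal sketch; the function name is hypothetical and the default log path (/var/log/messages) is an assumption that may differ on your system.

```shell
# Print the log lines that match either of the two audit daemon errors.
# $1 is the log file to search; /var/log/messages is an assumed default.
find_audit_disk_errors() {
    grep -E 'Audit daemon is low on disk space for logging|The audit daemon is now halting the system' \
        "${1:-/var/log/messages}"
}
```

For example, `find_audit_disk_errors /var/log/messages` prints the matching lines, or nothing if the errors were never logged.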

Solution

The audit daemon reports low disk space for logging because the /var file system has run out of free space, and it halts the server as a result. Start by removing unwanted files from this partition to free up some space.
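
Before deleting anything, it helps to see how full /var is and where the space went. The sketch below uses only df, du, awk, sort, and head; the function names are hypothetical.

```shell
# Read `df -P <mountpoint>` output on stdin and print the usage
# percentage of the first file system, without the % sign.
usage_percent_from_df() {
    awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# List the ten largest directories under $1 (sizes in KB),
# without crossing into other file systems.
largest_under() {
    du -xk "$1" 2>/dev/null | sort -rn | head -n 10
}
```

Typical use is `df -P /var | usage_percent_from_df` followed by `largest_under /var` to decide which files to remove.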

The audit daemon is probably flooding the /var/log/messages file, which consumes space on the /var partition and leads to the issue. If that is the case, temporarily stop the audit service on all servers to prevent it from shutting down any further node. To do so, run the following commands as root:

systemctl stop auditd
systemctl disable auditd
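
If there are many servers, the same two commands can be run over SSH in a loop. This is only a sketch: the host list file (one hostname per line), root SSH access, and the function name are assumptions. By default it only prints what it would run; set DRY_RUN=0 to execute.

```shell
# Run the two commands above on every host listed in the file $1.
# DRY_RUN=1 (the default) prints the ssh commands instead of running them.
stop_auditd_everywhere() {
    while IFS= read -r host; do
        # Skip blank lines and comment lines.
        case "$host" in ''|'#'*) continue ;; esac
        cmd="systemctl stop auditd; systemctl disable auditd"
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "ssh root@$host '$cmd'"
        else
            # -n keeps ssh from consuming the rest of the host list on stdin.
            ssh -n "root@$host" "$cmd"
        fi
    done < "$1"
}
```

Review the dry-run output first, for example with `stop_auditd_everywhere hosts.txt`, where hosts.txt is a hypothetical file name.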

Please note that the application raises no alarm when disk usage reaches a critical threshold. Such alerting must be configured at the OS level, because the application does not manage OS traps. If you want to create the OS trap manually, please follow these steps:

Update /etc/sysconfig/snmpd as follows:

OPTIONS="-Ln -I -smux -p /var/run/snmpd.pid -c /etc/snmp/snmpd.conf udp:11114 udp:161"

Edit /etc/sysconfig/snmptrapd as follows:

OPTIONS=-Lsd -m ALL -F "%02.2h:%02.2j:%02.2k TRAP%w.%q %v from %B" 11173 162

Make sure disableAuthorization is set as follows in /etc/snmp/snmptrapd.conf:
```
disableAuthorization yes
```
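
A quick way to verify this setting is to look for an uncommented line in the file. The function name below is hypothetical; the default path is the one given above.

```shell
# Succeed (exit status 0) if the file contains an uncommented
# "disableAuthorization yes" line.
has_disable_authorization() {
    grep -Eq '^[[:space:]]*disableAuthorization[[:space:]]+yes([[:space:]]|$)' \
        "${1:-/etc/snmp/snmptrapd.conf}"
}
```

For example: `has_disable_authorization && echo "ok" || echo "missing"`.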

Edit the /etc/snmp/snmpd.conf file as follows, so that the traps are also sent to a remote node instead of appearing only in the /var/log/messages file on the local host. Replace [remote_IP_address] with the IP address of the node that should receive the trap, and set the threshold to the desired value:

###########################################################################
# snmpd.conf
###########################################################################

rocommunity public
rwcommunity private

syslocation test
syscontact mbalance

sysservices 78
trapsink 127.0.0.1 public

trapcommunity public

authtrapenable 1

createUser internal MD5 "snmp1234"
iquerySecName internal
rouser internal

defaultMonitors yes
monitor -r 1 -u internal -o dskErrorMsg "Disk Usage Error" dskErrorFlag != 0

trap2sink [remote_IP_address] public 162

# Minimum free space per disk, as a percentage (set the value that you want)
includeAllDisks 70%

Restart the SNMPD service: systemctl restart snmpd
Restart the SNMPTrapD service: systemctl restart snmptrapd
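
After the restart, you may want to check that snmpd answers and whether any disk is already flagged. The sketch below assumes the net-snmp snmpwalk tool is installed, that the agent answers on the default port with the public community from the configuration above, and that dskErrorFlag values are printed in the usual noError(0) / error(1) form. The function names are hypothetical.

```shell
# Count disks whose dskErrorFlag is raised, reading snmpwalk output on stdin.
count_disk_errors() {
    grep -c 'error(1)'
}

# Query the agent ($1, default localhost) with community $2 (default public)
# and print how many monitored disks are currently below the threshold.
check_disk_flags() {
    snmpwalk -v 2c -c "${2:-public}" "${1:-localhost}" UCD-SNMP-MIB::dskErrorFlag |
        count_disk_errors
}
```

Running `systemctl is-active snmpd snmptrapd` and then `check_disk_flags` gives a quick picture of the service state and the number of flagged disks.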

Make sure the OS MIBs listed below are installed on the supervision system:

HOST-RESOURCES-MIB.txt
HOST-RESOURCES-TYPES.txt
UCD-DISKIO-MIB.txt
RFC1213-MIB
UCD-SNMP-MIB
DISMAN-EVENT-MIB

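The MIB files can be extracted directly from the servers under /usr/local/share/snmp/mibs. With this configuration, snmpd sends an SNMP trap when a disk crosses the includeAllDisks threshold. Note that net-snmp interprets this value as the minimum free space: includeAllDisks 70% raises the error flag when free space falls below 70%, so to be alerted when disk usage exceeds 70%, set the value to 30%.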
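
To check which of the listed MIBs are present in a given directory, something like the following can be used. It is a sketch only: the function name is hypothetical, the default directory is the server path mentioned above, and the MIB directory on your supervision system is likely different. A module is accepted with or without the .txt extension.

```shell
# Print the name of every required MIB that is missing from directory $1.
missing_mibs() {
    dir="${1:-/usr/local/share/snmp/mibs}"
    for mib in HOST-RESOURCES-MIB HOST-RESOURCES-TYPES UCD-DISKIO-MIB \
               RFC1213-MIB UCD-SNMP-MIB DISMAN-EVENT-MIB; do
        if [ ! -e "$dir/$mib" ] && [ ! -e "$dir/$mib.txt" ]; then
            echo "$mib"
        fi
    done
}
```

An empty result from `missing_mibs /path/to/mibs` means all six modules were found.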


Priyanka Bhotika
Posted
