Issue
Files or directories go missing or are deleted unexpectedly, and the root cause of the deletion is unknown.
Use Case Implementations
When we encounter unexpected missing files or folders and cannot ascertain the root cause from any available logs or live observations of the system, the most robust tool available at the time of writing is `auditd`. It comes installed by default on almost all Linux systems, the notable exception being that it is not readily available on Kubernetes pods.
Files going missing can be one of the most challenging problems you will face, especially when the source of the missing files is outside the JVM and cannot be traced from within the JVM.
Since all file operations are done in kernel space, `auditd` can be configured to watch and report with great granularity and effectiveness.
- NOTE: This article does not address potential underlying issues such as networking or hardware problems with the underlying storage device (SAN, NAS, local, etc.).
- NOTE: For the purposes of this article it is assumed that all connectivity and storage solutions are operating without error, and that you can read and write to `${JENKINS_HOME}` or whichever other target location on a valid filesystem auditd will be configured to watch. For any hardware concerns, please see your `dmesg` logs or command output, which are beyond the scope of this article.
Resolution
The resolution for troubleshooting missing files/folders is to install and configure `auditd` to monitor the appropriate location for modification or removal. This article is meant as a quick-start guide and is by no means exhaustive; it is highly encouraged to read through the RHEL documentation for auditd.
Installation:
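auditd is available from the standard package repositories of most distributions. A minimal install sketch is below; the exact package names and service commands may differ slightly by distribution:

```
# RHEL / CentOS / Fedora
sudo yum install audit

# Debian / Ubuntu
sudo apt-get install auditd audispd-plugins

# Enable and start the service
sudo systemctl enable --now auditd
```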
Configuration:
Prerequisites:
For our configuration example, let us assume the following:
- We have a CloudBees CI Controller, any version;
- We have numerous jobs, and for some unknown reason, the `config.xml` files keep going missing;
- We have looked through our `/etc/cron*` folders and files for anything that may be scheduled to clean up these files inadvertently;
- We have looked through our SCM and `${JENKINS_HOME}/init.groovy.d/` scripts to ensure we do not have any cleanup tasks running at startup;
- We have audited our Approved Scripts for any possibility of custom scripts being run that could remove files;
- We have installed/configured and reviewed the audit-trail-plugin and/or audit-log-plugin and do not see the root cause.
- An `item` in this case can be any configurable item that generates a `config.xml` file in Jenkins, e.g. Job, Folder, Cloud, Agent, etc.
At this point, we suspect some external factor is present and need to configure `auditd` to find out what is removing the `config.xml` files from our item.
Information Needed:
`auditd` can generate an extensive volume of data, and it is important that we do not create a secondary issue by exhausting disk space while we are troubleshooting.
Thus, it is important to know the following details about your system:
- Disk Free: `df -h ${Target Storage Device}`
  - NOTE: `/var/log` is the default location for auditd logs and is often part of the root filesystem `/`;
- Disk IO is important as well; we do not want to run auditd on a system that is already taxed on its disk IO activity (see the spot-check sketch after this list).
  - See Required Data: IO Issues on Linux for more details on testing your disk IO.
- How long do we want to keep audit logs?
- How large do we want the files to be before auditd rotates them?
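As a quick spot check before enabling auditing (assuming the sysstat package is installed to provide iostat), something like the following shows the available space and current disk utilization:

```
df -h /var/log      ### confirm free space where the audit logs will be written
iostat -x 5 3       ### sample extended device utilization three times at 5-second intervals
```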
The Rules:
In this case we will assume we have enough space to store 10GB of audit log files;
Our Setup:
- CloudBees CI Traditional;
- `/var/log` has 20GB of free storage space;
- `${JENKINS_HOME}` is on NFS storage mounted at `/var/lib/jenkins_home/`;
Configuration of auditd:
Since `auditd` is a system-level auditing package, `root` or `sudo`-level access to the system is required.
To control disk space and the location of the logging (in cases where we need to relocate to available storage), we edit `/etc/audit/auditd.conf`.
The variables we need to edit are:
```
log_file = /var/log/audit/audit.log  ### Ensure the target has sufficient available storage, per the prerequisites above
max_log_file = 100                   ### Default: 8
num_logs = 10                        ### Default: 5; max_log_file * num_logs = total storage in MB
```
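After editing auditd.conf, the daemon needs to be restarted for the new settings to take effect. On many distributions auditd refuses a direct systemctl restart, so the service wrapper is commonly used; for example:

```
sudo service auditd restart    ### apply the new auditd.conf settings
sudo auditctl -s               ### print the daemon status and confirm it is running
```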
Define the rules:
The following commands, run from a terminal by a user with `sudo` access, will establish temporary rules:
```
sudo auditctl -D  ### Clear rulesets
sudo auditctl -a always,exit -F arch=b64 -S rename,rmdir,unlink,unlinkat,renameat -F dir=/var/lib/jenkins_home/ -F key=jenkins_home  ### Set the rules for the directory specifications to watch
sudo auditctl -w /var/lib/jenkins_home -p rwxa -k jenkins_home  ### Set the watch, recursively
```
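To confirm the rules are active before going any further, list the currently loaded rule set:

```
sudo auditctl -l    ### both the syscall rule and the watch above should be listed
```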
To make them persistent so they survive a system reboot, we need to back up the existing rules file and append our rules to the `/etc/audit/audit.rules` file:
```
cp /etc/audit/audit.rules{,.bak}
auditctl -l >> /etc/audit/audit.rules
```
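On newer distributions that use the augenrules wrapper, persistent rules are typically kept under /etc/audit/rules.d/ instead. In that case, something along these lines achieves the same result (a sketch; paths may vary on your system):

```
sudo cp /etc/audit/rules.d/audit.rules{,.bak}                 ### back up the existing persistent rules
sudo sh -c 'auditctl -l >> /etc/audit/rules.d/audit.rules'    ### append the currently loaded rules
sudo augenrules --load                                        ### regenerate and load the merged rule set
```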
Search the Results:
The following command will enable us to filter the audit logs for specific files or folders:
```
sudo ausearch -k jenkins_home -f config.xml  ### Or any file or folder name in place of config.xml
```
Since there are going to be many logs, it will behoove you to become familiar with bash built-ins such as `for` or `while` loops and commands such as `sed`/`awk`/`grep`, or any combination you prefer, to search the log files.
An example used to search for specific `rm` commands, run from within the auditd `log_file` folder, could be:
```
$ for file in *.log*; do sudo ausearch -if "${file}" -k jenkins_home -f config.xml | awk '{if ($26~/rm/){print $2}}'; done
```
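Depending on your ausearch version, you may also be able to filter on the originating command directly instead of parsing fields with awk; for example:

```
sudo ausearch -k jenkins_home -f config.xml -c rm -i    ### -c matches the command name, -i resolves numeric IDs to names
```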
Example Troubleshooting:
Assuming all settings have been configured:
We have a folder called `testing` at `/var/lib/jenkins_home/testing/`. Its `config.xml` is removed; let's find out what did it:
```
[root@localhost testing]# ausearch -k jenkins_home -f config.xml | grep rm
type=SYSCALL msg=audit(1679595989.055:1202): arch=c000003e syscall=263 success=yes exit=0 a0=ffffff9c a1=55c84b152640 a2=0 a3=0 items=2 ppid=221912 pid=225396 auid=1000 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts1 ses=8 comm="rm" exe="/usr/bin/rm" subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key="jenkins_home"
[root@localhost testing]# aureport -f | grep rm
276. 03/23/2023 14:26:29 /var/lib/jenkins_home/testing 263 yes /usr/bin/rm 1000 1202
```
In this output, the pertinent data we're looking for is `ppid=221912 pid=225396`, which gives us the parent ID and the process ID of the command run to remove the file;
We know this is a `config.xml` file because the `-f config.xml` flag isolates the search to that file.
Given this, we can see that `success=yes` means the removal of the file succeeded;
Further, we have the user ID and group ID: `uid=0 gid=0`; in this case it was done by root, which owns those IDs;
From there, you can work with your system administrators to identify the process, monitor for any reaction when files are removed, and scour the system for whatever is causing the removal.
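If the offending process or its parent is still running, the IDs from the audit record can be looked up directly. A quick sketch using the PIDs from the example output above (they will differ on your system):

```
ps -o pid,ppid,user,lstart,cmd -p 225396,221912    ### show the removing process and its parent, if still alive
```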
This isn't limited to just `rm` commands; you can filter for anything you can imagine with simple greps or with more complex searches, limited only by your creativity.