Issue
Files or directories go missing or are deleted unexpectedly, and the root cause of the deletion is unknown.
Use Case Implementations
When we encounter unexpected missing files or folders and cannot ascertain the root cause from any available logs or live observations of the system, the most robust tool available at the time of writing is `auditd`. It comes installed by default on almost all Linux systems, the notable exception being that it is not readily available on Kubernetes pods.
Files going missing can be one of the most challenging problems you will face, especially when the source of the missing files is outside the JVM and cannot be traced from within the JVM.
Since all file operations are done in kernel space, `auditd` can be configured to watch and report with great granularity and effectiveness.
- NOTE: This article does not address potential underlying issues such as networking or hardware problems with the underlying storage device (SAN, NAS, local, etc.).
- NOTE: For the purposes of this article it is assumed that all connectivity and storage solutions are operating without error, and that you can read and write to `${JENKINS_HOME}` or whichever other target location on a valid filesystem auditd will be configured to watch. For any hardware concerns, please see your `dmesg` logs or command output, which are beyond the scope of this article.
Resolution
The resolution for troubleshooting missing files/folders is to install and configure `auditd` to monitor the appropriate location for modification or removal. This article is meant as a quick-start guide and is by no means exhaustive; it is highly encouraged to read through the RHEL documentation for auditd.
Installation:
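auditd is available from the standard package repositories of most distributions. A minimal install sketch is below; the exact package names and service commands may differ slightly by distribution:

```
# RHEL / CentOS / Fedora
sudo yum install audit

# Debian / Ubuntu
sudo apt-get install auditd audispd-plugins

# Enable and start the service
sudo systemctl enable --now auditd
```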
Configuration:
Prerequisites:
For our configuration example, let us assume the following:
- We have a CloudBees CI Controller, any version;
- We have numerous jobs, and for some unknown reason, the `config.xml` files keep going missing;
- We have looked through our `/etc/cron*` folders and files for anything that may be scheduled to clean up these files inadvertently;
- We have looked through our SCM and `${JENKINS_HOME}/init.groovy.d/` scripts to ensure we do not have any cleanup tasks running at startup;
- We have audited our Approved Scripts for any possibility of custom scripts being run that could remove files;
- We have installed/configured and reviewed the audit-trail-plugin and/or audit-log-plugin and do not see the root cause.
- An `item` in this case can be any configurable item that generates a `config.xml` file in Jenkins, e.g. Job, Folder, Cloud, Agent, etc.
At this point, we suspect some external factor is present and need to configure `auditd` to find out what is removing the `config.xml` files from our item.
Information Needed:
`auditd` can generate an extensive volume of data, and it is important that we do not create a secondary issue by exhausting disk space while we are troubleshooting.
Thus, it is important to know the following details about your system:
- Disk Free: `df -h ${Target Storage Device}`
  - NOTE: `/var/log` is the default location for auditd logs and is often part of the root filesystem `/`;
- Disk IO is important as well; we do not want to run auditd on a system that is already taxed on its disk IO activity (see the spot-check sketch after this list).
  - See Required Data: IO Issues on Linux for more details on testing your disk IO.
- How long do we want to keep audit logs?
- How large do we want the files to be before auditd rotates them?
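As a quick spot check before enabling auditing (assuming the sysstat package is installed to provide iostat), something like the following shows the available space and current disk utilization:

```
df -h /var/log      ### confirm free space where the audit logs will be written
iostat -x 5 3       ### sample extended device utilization three times at 5-second intervals
```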
The Rules:
In this case we will assume we have enough space to store 10GB of audit log files;
Our Setup:
- CloudBees CI Traditional;
- `/var/log` has 20GB of free storage space;
- `${JENKINS_HOME}` is on NFS storage mounted at `/var/lib/jenkins_home/`;
Configuration of auditd:
Since `auditd` is a system-level auditing package, `root` or `sudo`-level access to the system is required.
To control disk space and the location of the logging (in cases where we need to relocate to available storage), we edit `/etc/audit/auditd.conf`.
The variables we need to edit are:
```
log_file = /var/log/audit/audit.log  ### Ensure the target has sufficient available storage, per the prerequisites above
max_log_file = 100                   ### Default: 8
num_logs = 10                        ### Default: 5; max_log_file * num_logs = total storage in MB
```
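After editing auditd.conf, the daemon needs to be restarted for the new settings to take effect. On many distributions auditd refuses a direct systemctl restart, so the service wrapper is commonly used; for example:

```
sudo service auditd restart    ### apply the new auditd.conf settings
sudo auditctl -s               ### print the daemon status and confirm it is running
```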
Define the rules:
The following commands, run from a terminal by a user with `sudo` access, will establish temporary rules:
```
sudo auditctl -D  ### Clear rulesets
sudo auditctl -a always,exit -F arch=b64 -S rename,rmdir,unlink,unlinkat,renameat -F dir=/var/lib/jenkins_home/ -F key=jenkins_home  ### Set the rules for the directory specifications to watch
sudo auditctl -w /var/lib/jenkins_home -p rwxa -k jenkins_home  ### Set the watch, recursively
```
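To confirm the rules are active before going any further, list the currently loaded rule set:

```
sudo auditctl -l    ### both the syscall rule and the watch above should be listed
```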
To make them persistent so they survive a system reboot, we need to back up the existing rules file and append our rules to the `/etc/audit/audit.rules` file:
```
cp /etc/audit/audit.rules{,.bak}
auditctl -l >> /etc/audit/audit.rules
```
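On newer distributions that use the augenrules wrapper, persistent rules are typically kept under /etc/audit/rules.d/ instead. In that case, something along these lines achieves the same result (a sketch; paths may vary on your system):

```
sudo cp /etc/audit/rules.d/audit.rules{,.bak}                 ### back up the existing persistent rules
sudo sh -c 'auditctl -l >> /etc/audit/rules.d/audit.rules'    ### append the currently loaded rules
sudo augenrules --load                                        ### regenerate and load the merged rule set
```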
Search the Results:
The following command will enable us to filter the audit logs for specific files or folders:
```
sudo ausearch -k jenkins_home -f config.xml  ### Or any file or folder name in place of config.xml
```
Since there are going to be many logs, it will behoove you to become familiar with bash built-ins such as `for` or `while` loops and commands such as `sed`/`awk`/`grep`, or any combination you prefer, to search the log files.
An example used to search for specific `rm` commands, run from within the auditd `log_file` folder, could be:
```
$ for file in *.log*; do sudo ausearch -if "${file}" -k jenkins_home -f config.xml | awk '{if ($26~/rm/){print $2}}'; done
```
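Depending on your ausearch version, you may also be able to filter on the originating command directly instead of parsing fields with awk; for example:

```
sudo ausearch -k jenkins_home -f config.xml -c rm -i    ### -c matches the command name, -i resolves numeric IDs to names
```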
Example Troubleshooting:
Assuming all settings have been configured:
We have a folder called `testing` at `/var/lib/jenkins_home/testing/`. Its `config.xml` is removed; let's find out what did it:
```
[root@localhost testing]# ausearch -k jenkins_home -f config.xml | grep rm
type=SYSCALL msg=audit(1679595989.055:1202): arch=c000003e syscall=263 success=yes exit=0 a0=ffffff9c a1=55c84b152640 a2=0 a3=0 items=2 ppid=221912 pid=225396 auid=1000 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts1 ses=8 comm="rm" exe="/usr/bin/rm" subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key="jenkins_home"
[root@localhost testing]# aureport -f | grep rm
276. 03/23/2023 14:26:29 /var/lib/jenkins_home/testing 263 yes /usr/bin/rm 1000 1202
```
In this output, the pertinent data we're looking for is `ppid=221912 pid=225396`, which gives us the parent ID and the process ID of the command run to remove the file;
We know this is a `config.xml` file because the `-f config.xml` flag isolates the search to that file.
Given this, we can see that `success=yes` means the removal of the file succeeded;
Further, we have the user ID and group ID: `uid=0 gid=0`; in this case it was done by root, which owns those IDs;
From there, you can work with your system administrators to identify the process, monitor for any reaction when files are removed, and scour the system for whatever is causing the removal.
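If the offending process or its parent is still running, the IDs from the audit record can be looked up directly. A quick sketch using the PIDs from the example output above (they will differ on your system):

```
ps -o pid,ppid,user,lstart,cmd -p 225396,221912    ### show the removing process and its parent, if still alive
```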
This isn't limited to just `rm` commands; you can filter for anything you can imagine with simple greps or with more complex searches, limited only by your creativity.