KBEC-00476 - FAQ: Data Management considerations in CloudBees CD

Article ID: 360055775032

Problem

What are some considerations for using the Data Management feature of CloudBees CD?

Solution

As of v9.1, CloudBees CD offers a built-in solution for setting up data retention/management rules.

Here are some best practice considerations around managing runtime data:

The examples here mostly focus on jobs, but the same considerations apply to the other object types managed through the Data Management feature (releases, pipelines, deployments).

1) Overall system performance

Not surprisingly, a relational database will perform better when there is less data in the system to be searched. Another way to state this: there is some infinitesimally small effect on performance for every object that you add to the database. The inverse is also true - removing data can give back some of that performance. It’s with these realities in mind that we encourage all customers to implement data management rules that work best for their system performance needs and requirements. This is best achieved through a combination of trial-and-error exploration, team negotiation and any regulatory commitments.

Defining a blanket rule like "Keep all data for 1 year" or "3 months" is a good starting point, but every company has to weigh the desire to have historical data accessible against the performance impact of keeping that data inside their CD database.

The Data Retention feature (first released in v9.1) offers a means to either delete or archive your data.

Archiving gives you the potential to review that data at some later date, whereas outright deletion will simply make this information unavailable for future reference. That said, the ability to re-load archived data back into a CloudBees CD system is a future consideration that is not yet implemented.

2) When should we start applying retention rules?

It’s best to define rules sooner rather than later. Creating an initial default "system-wide" baseline rule is important.
Then project-based rules can be set up as new projects get created and grow.

Prior to v9.1, data retention was handled through a custom (that is, not supported) plugin. As such, some customers were not aware of the plugin and would let data growth continue for a number of years before facing issues with either DB performance or disk space. Since CloudBees CD often becomes a critical element of your software development lifecycle, setting up retention rules early can help to mitigate the risk that an unexpected slowdown creates a crisis that impacts your users.

Ideally you want your rules to work so that you reach some form of equilibrium. That is to say, roughly as many jobs are removed by your data retention rules each day as are newly run. This will generally keep the DB size somewhat stable (ignoring additional usage growth and system definitions).

For example, if you set up a rule keeping 30 days of data for Project X, which typically runs ~500 jobs per day, then we can approximate that the database would be holding ~15K jobs for this project.
Consider that the assumed equilibrium point. If a year down the road the project has grown to ~1000 jobs per day, then the equilibrium point would be ~30K.

Summing such project equilibrium levels could help in understanding whether the total job counts in the database are in line with expectations.
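As a rough illustration of this arithmetic, here is a minimal Python sketch (the project names and rates are hypothetical examples, not real data) that estimates per-project equilibrium counts and their sum:

```python
# Rough equilibrium estimate: jobs retained ~= retention window (days) x jobs per day.
# Project names and rates are hypothetical examples.
projects = {
    "Project X": {"jobs_per_day": 500, "retention_days": 30},
    "Project Y": {"jobs_per_day": 1000, "retention_days": 30},
}

total = 0
for name, p in projects.items():
    equilibrium = p["jobs_per_day"] * p["retention_days"]
    total += equilibrium
    print(f"{name}: ~{equilibrium:,} jobs at equilibrium")

print(f"Expected total job count across projects: ~{total:,}")
```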

3) What combination of rules should we use?

The Data Retention feature allows multiple rules to be set up. You can set rules that are tied to particular projects, or rules that apply across all projects. There are also separate rules for data types like jobs, pipelines, etc.

The combination of rules to use, or the number of such rules, will vary for each customer.

As an example, we would encourage you to set up an over-arching "default" rule (for each object type) that limits how long ALL jobs are retained. Perhaps this will be set at 1 year. Once that is in place, you may start to look at setting more restrictive rules for specific projects.

Some projects may require longer retention, others significantly less. For example, jobs responsible for testing work might only require 1-2 weeks of history, whereas a project with audit requirements may prefer not to have any unique rules and will rely on your custom "default" of 1 year. Meanwhile, another project may feel that 2 months of data is all it requires. These rule decisions should be negotiated with each team involved to help keep your overall system DB performance as effective as possible.
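To make that layering concrete, here is a purely illustrative sketch in Python (plain data structures, not CloudBees CD’s actual rule schema or API; the project names and retention windows are made up) of a default rule plus project-specific overrides:

```python
# Illustrative model only: not CloudBees CD's rule schema.
default_rule = {"object_type": "job", "retain_days": 365}  # over-arching 1-year default

project_overrides = {
    "TestingProject": {"object_type": "job", "retain_days": 14},  # short-lived test jobs
    "InternalTools":  {"object_type": "job", "retain_days": 60},  # 2 months is enough
    # "AuditedProject" has no override, so the 1-year default applies.
}

def effective_retention(project_name: str) -> int:
    """Return the retention window (in days) that applies to a project's jobs."""
    return project_overrides.get(project_name, default_rule)["retain_days"]

for name in ("TestingProject", "InternalTools", "AuditedProject"):
    print(f"{name}: keep jobs for {effective_retention(name)} days")
```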

As use of your system spreads to more teams and more overall data is being generated, it may be necessary at a point in the future to revisit these rules for specific teams to re-negotiate what may be appropriate.

4) Is there any impact to running cleanup rules?

Possibly.

It’s first important to understand that deletions inside CloudBees CD are not "immediate". Whether you use these rules, the API, or the UI to delete an object, the "deletion" only marks the object in the database to be deleted later. The activation of any rule will query the database to find the set of objects that meet the rule criteria, and then mark this set of objects for deletion. Once objects are marked, it’s up to the "background deleter" process to remove them from the DB. Deletions are handled as a multi-stage process because many of the items involved (jobs, releases, pipelines, etc.) have numerous child objects that all need to be removed together. For example, deleting a job requires the removal of its jobSteps, properties, notifiers and more.
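To picture that mark-then-delete pattern, here is a minimal conceptual sketch in Python. It is not CloudBees CD code and does not reflect the product’s internals; it only models why the initial "delete" is cheap while the batched background pass does the heavy lifting:

```python
from dataclasses import dataclass, field

# Conceptual model only, not the product's implementation.
@dataclass
class JobRecord:
    job_id: int
    children: list = field(default_factory=list)  # jobSteps, properties, notifiers, ...
    marked_for_deletion: bool = False

def apply_rule(jobs, rule_matches):
    """Phase 1: a rule (or API/UI delete) only flips a flag - quick and cheap."""
    for job in jobs:
        if rule_matches(job):
            job.marked_for_deletion = True

def background_deleter_pass(jobs, batch_size=100):
    """Phase 2: the background deleter removes marked jobs and their children in batches."""
    batch = [j for j in jobs if j.marked_for_deletion][:batch_size]
    for job in batch:
        job.children.clear()   # removing the child rows is the expensive part
        jobs.remove(job)
    return len(batch)          # items actually removed this pass
```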

With this in mind, recognize that the removal of such a large block of items can tax the database.
It may also be competing with other in-progress work trying to add new items to the same tables. If a rule is launched to clear out a few jobs every day, the impact will be negligible or very brief (tens of milliseconds). If instead you are running 1000s of jobs every day, and are also trying to delete 1000s of jobs at some point in the day, there will likely be some temporary impact on DB performance while the cleanup runs. Knowing the rate at which jobs are launched and removed, what times of day new jobs are typically launched, and when other data rules are scheduled, can guide you in selecting better times of day to schedule your rules.

5) Is there some limit on the number of objects that will be removed by a rule?

Yes - see your Admin settings for these system values:

Data Retention Management batch size = 100
Maximum iterations in a Data Retention Management cycle = 10
Data Retention Management service frequency in minutes = 1440 (Repeat every day)

This means that 100 x 10 = 1000 items is the current default limit on the number of items of one type that can be removed by a single rule in one Data Retention cycle.
This cycle is repeated every 1440 minutes, or once per day.

All rules will kick off for the first time 1 day after the last system restart.
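As a quick sanity check on these defaults, the short Python sketch below derives the per-cycle and per-day deletion capacity from the setting values shown above:

```python
# Default Data Retention Management settings (values from the Admin settings above).
batch_size = 100          # Data Retention Management batch size
max_iterations = 10       # Maximum iterations in a Data Retention Management cycle
frequency_minutes = 1440  # Service frequency in minutes (1440 = once per day)

per_cycle_capacity = batch_size * max_iterations        # items removable per cycle
cycles_per_day = (24 * 60) // frequency_minutes         # how often the service runs
per_day_capacity = per_cycle_capacity * cycles_per_day  # items removable per day

print(f"Per cycle: {per_cycle_capacity} items")  # 1000
print(f"Cycles per day: {cycles_per_day}")       # 1
print(f"Per day: {per_day_capacity} items")      # 1000
```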

6) What should we do if our system runs more than 1000 jobs per day?

Your choices are either to set up multiple rules that clean things up a few times through the day, or to adjust the default settings (see (5)) to allow more work to be removed on each pass (see (7) below).

Say, for example, that you typically run ~1200 jobs per day on weekdays, and ~200 jobs per day on weekends. This sums to around 6400 jobs per week being launched, while the default settings can clear up to 7000 jobs per week. Therefore a single daily rule with the default 1000-item limit will still work out for you, since the weekend cleanups will help to catch up on items missed through the week. However, if you are getting to this point, you should take it as an alert to revisit your rule and setting choices. Otherwise your creation rate may soon exceed your deletion rate and your DB growth will slowly inch upwards.

Another solution for this example is to change the service frequency from 1440 minutes to 720 minutes, so that the rules run twice on any given day.
This will double the volume of what can be deleted on any given day, and depending on growth patterns, this may be a good safety net that lasts many months.

The other approach would be to adjust how many objects can be deleted on each pass by modifying the settings shown above. But you don’t want these values to become too large, due to the impact that the background deleter can have on the DB (see (1)). So this is where some experimentation may be necessary to understand the impact on users and to decide which rule combinations and settings look best for your needs, environment and existing average workloads.
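The weekly arithmetic in this example can be checked with a short Python sketch (the job rates are the illustrative numbers from above):

```python
# Weekly creation vs. deletion capacity, using the example rates above.
weekday_jobs, weekend_jobs = 1200, 200
weekly_created = weekday_jobs * 5 + weekend_jobs * 2   # 6400 jobs/week

per_cycle_capacity = 100 * 10                          # batch size x max iterations

def weekly_deletion_capacity(frequency_minutes: int) -> int:
    cycles_per_day = (24 * 60) // frequency_minutes
    return per_cycle_capacity * cycles_per_day * 7

print("Created per week:", weekly_created)                                              # 6400
print("Deletable per week at 1440-minute frequency:", weekly_deletion_capacity(1440))   # 7000
print("Deletable per week at 720-minute frequency:", weekly_deletion_capacity(720))     # 14000
```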

7) Are there risks to running cleanup rules at different times of the day?

A traditional approach might be to launch such work at an early-morning hour, like 2:00am. In a global organization, however, work may well be running through the CloudBees CD system at all hours of the day. This can make the question of when to schedule data retention rules a little more tricky. A single rule that removes a large block of sub-items may take the background deleter minutes to complete, which might place a temporary drag on overall performance, meaning that newly launched work or users in the UI may notice some slight delays.

Thus it is important that your administrators are aware of the times of day when rules are being activated, and that they try to set times for these rules that will minimize the impact on all users.

A future version of the product will allow you to set the start times for rules explicitly.

8) Is there a way to limit the number of deletions on a given rule, without altering the system defaults?

Yes. The Data Management rules allow for both hours-based and minutes-based granularity. This can be helpful for splitting your cleanup requests into smaller chunks of data without altering the default settings.

For example, let’s say there are 1000 jobs run every day for a procedure that has many steps and child objects. This means that if you have a single rule that removes jobs after, say, 30 days, there is an expectation for ~1000 jobs to be cleared out each time this rule is triggered. Each deletion of such a large job may be "expensive" in terms of the database time needed to handle all the child objects. If you wanted to cut the impact of your intended rule in half, you could distribute the effort by creating 2 rules, both using hours-based granularity.

For example, use 720 hours (30 days x 24 hours = 720 hours) and run the rules 2 times per day, say at 3:00am and 3:00pm. The 3:00am run would pick up jobs that ran 30 days ago between 3:00pm and 3:00am, and likewise the 3:00pm run would pick up the jobs that ran 30 days ago in the subsequent 12-hour block.
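To illustrate how the two runs partition the work, here is a small Python sketch (the run times and job timestamp are made-up examples) that checks which run a 30-day-old job falls into under a 720-hour cutoff:

```python
from datetime import datetime, timedelta

RETENTION = timedelta(hours=720)   # 30 days x 24 hours

def eligible(job_finished: datetime, run_time: datetime) -> bool:
    """A job becomes eligible for cleanup once it is older than the retention window."""
    return job_finished <= run_time - RETENTION

# Two hypothetical cleanup runs on the same day.
run_am = datetime(2024, 7, 1, 3, 0)    # 3:00am run
run_pm = datetime(2024, 7, 1, 15, 0)   # 3:00pm run

# A job that finished 30 days earlier, at 10:00am.
job = datetime(2024, 6, 1, 10, 0)

print("Picked up by 3:00am run:", eligible(job, run_am))  # False - not yet 720 hours old
print("Picked up by 3:00pm run:", eligible(job, run_pm))  # True - the later run collects it
```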

9) We have been using CloudBees CD for months now, but are only now learning about Data Management - can I just turn on rules now to help remove this "backlog"?

Proceed with caution, as this could be a dangerous assumption.

If your team is running fewer than 1000 jobs per day, a single rule run every day to remove up to 1000 items would likely reach some equilibrium after a number of days.

However, if your DB already has a large history, say 100K jobs stored, then such a rule won’t do much to attack your backlog.
Assuming you gain some clawback on weekends and holidays, it would probably take months to reach the expected equilibrium state. On the other hand, if you are already running more than 1000 jobs per day and do not realize that your rule will only clear 1000 jobs per day with the default settings, then adding a single rule will only slow down the DB growth, and your backlog will keep expanding.

The best approach for attacking an existing backlog is to aim to get to equilibrium using an iterative plan.

Say you have >100K jobs in your backlog: attempting to remove them all at one time by adjusting the default settings could bog down your DB and impact regular work for hours. Instead it is best to set up a series of rules and watch their progress to get a sense of the impact each cleanup pass may be having. Typically, iterating over such cases during a weekend, or at known low-usage time windows, will help to limit risk. For example:

  • Customer used CloudBees CD for 20 months without any management rules being set

  • They generate about 30K jobs per month (~1000 per day)

  • They decide to target keeping 1 year’s worth of data, or ~360K jobs of history (meaning ~240K jobs need removal)

  • Simply setting a single rule like "delete all items older than 1 year" would remove at most ~1000 jobs per day, which is largely consumed by the ~1000 jobs newly crossing the 1-year mark each day; net progress on the backlog would only be ~2K jobs per week, which would take years to reach equilibrium

  • So they might choose to set the Retention frequency to run 4x per day (a setting of 360 minutes), and change that on the weekends to run more frequently, say 12x per day (a setting of 120 minutes).

  • This would result in a net ~40K jobs/week being cleared from the backlog (+3K/day during weekdays, +12K/day during weekends), meaning that equilibrium would arrive at around the 6-week point (240K / 40K).

It is important to recognize that removing the first X jobs from your backlog will take the DB longer to process than the last X jobs, since queries get faster and there is less re-indexing as the DB content is reduced. So understand how long each cycle takes to complete before you set your repetition intervals too tightly. A repetition setting of 120 minutes would give the background deleter 2 hours to clean up the 1000 items, which is likely safe, but this is where local usage makes a difference.
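The arithmetic in this worked example can be reproduced with a short Python sketch (all rates and settings are the hypothetical values from the bullets above; the weekend figure rounds to the example’s +12K/day):

```python
# Backlog clean-up estimate using the hypothetical numbers from the example above.
backlog = 240_000             # jobs older than the desired 1-year window
jobs_aging_in_per_day = 1000  # jobs newly crossing the 1-year mark each day
per_cycle_capacity = 1000     # batch size (100) x max iterations (10)

def net_cleared_per_week(weekday_cycles: int, weekend_cycles: int) -> int:
    weekday_net = weekday_cycles * per_cycle_capacity - jobs_aging_in_per_day
    weekend_net = weekend_cycles * per_cycle_capacity - jobs_aging_in_per_day
    return weekday_net * 5 + weekend_net * 2

weekly = net_cleared_per_week(weekday_cycles=4, weekend_cycles=12)  # 360-min / 120-min settings
print(f"Net backlog cleared per week: ~{weekly:,}")            # ~37,000 (roughly the 40K above)
print(f"Weeks to reach equilibrium: ~{backlog / weekly:.1f}")  # ~6.5 weeks
```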

10) We just went through some cycles to get our database trimmed down, but this didn’t appear to reclaim any disk space. Why is that?

You should really speak with a DBA about this concern, as the behaviour depends on how your DB claims disk space. Databases like MySQL will take up blocks of disk space as new space is required, but will not return those blocks when data is removed. Subsequent new data will simply use up this previously allotted space. If you ever need to reclaim actual disk space, look into the best practices for your selected database to make that happen. Saving a dump and reloading the data into a fresh DB is one such technique.

11) Can retention rules also manage workspace data?

The option exists to have your rule also remove the workspace artifacts tied to the steps/tasks being removed (job, pipeline, etc.). This effort is granular and aims to remove workspace data on a per-step basis. It requires communicating with the same agent that ran the associated step or task (as it has the workspace connection), which adds more time to the effort of performing the deletions. If the agent in question is very busy, or completely offline, this may temporarily block progress and slow down the completion of the rule.

The workspace cleanup option is OFF by default, and because of the impact mentioned, it may often be more effective to keep your workspace cleanup effort outside of your data retention rules. Disk space tends to be fairly cheap these days, so wiping out workspace history can often follow less restrictive rules, and disk cleanups can be achieved with appropriate "rm" or "del" scripts. For example, if your default Retention rule removes jobs after 1 year, then setting up a self-created monthly schedule that runs regularly to handle workspace cleanup on files older than 18 months would achieve the same outcome and perform more efficiently than the granular cleanup employed by the Data Retention system.
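As an illustration of such a self-managed cleanup, here is a minimal Python sketch. The workspace root path and the 18-month cutoff are assumptions to adapt to your environment, and it defaults to a dry run so you can review the candidate list before deleting anything:

```python
import shutil
import time
from pathlib import Path

# Assumed workspace root - adjust to your own agent workspace location.
WORKSPACE_ROOT = Path("/opt/electriccloud/workspace")
CUTOFF_SECONDS = 18 * 30 * 24 * 3600   # roughly 18 months
DRY_RUN = True                         # set to False once the candidate list looks right

now = time.time()
for entry in WORKSPACE_ROOT.iterdir():
    if not entry.is_dir():
        continue
    age = now - entry.stat().st_mtime  # last-modified time as a rough age indicator
    if age > CUTOFF_SECONDS:
        if DRY_RUN:
            print(f"Would remove: {entry}")
        else:
            shutil.rmtree(entry, ignore_errors=True)
```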

12) Can we also set rules to help remove older or stale procedures, processes and pipelines?

The data management feature currently focuses only on run-time data. The use of programmed items is not presently tracked in a way that can identify rarely used definitions. As such, there is no built-in means of trimming older definitions using an automated set of rules.

Another point of consideration on this question is how your code is being managed.
The best practice recommendation is to keep a Development CD system separate from your Production CD system. With that in place, the desire would be to keep all code iterations and experiments on the DEV system until proven safe, before being promoted to the Production system. As such, the need to retire old code from your Production environment shouldn’t come into play.

That said, this remains a question on your Development system that requires some plan.
But if regular updates of your code are being saved and stored in your SCM, then no older version will be lost.