‘What do you mean they physically delete the data!!!’

Introduction

I have heard myself say these words many times, and yet it always comes as a surprise, surely people know that deleting data is a bad thing, apparently not. For the sake of this example I will talk about Purchase Orders (POs) but you can replace this with Home Loan Application, Insurance Claim, Invoice, Client Record, etc.

Of course, with my Data, Info and Analytics glasses on, I would never dream of deleting data, I mean how can I answer questions around voided Purchase Order for heavens sake? that is assuming that someone would ever ask such a pointless question.

Our frontend friends often don’t have the same reservations, I mean if the Purchase Order was only ever in a Draft Status, who cares if they physically Delete it, to them it never ‘really’ existed anyway!!

And so these are the two paradigms, an analytics system that keeps copies of everything, just in case, and a source system that wants to get rid of the junk so it can remain lean and performant.

So why is this a constant surprise, surely I have learnt my lessons by now? Well the issue is I do always ask, ‘so any deletes in the source?’ to which I am invariably told, ‘No, no, no!!! we do soft deletes’. The nuances are:

  • Soft deletes are only done after the Purchase order is Approved, before that they can be deleted without a hesitation to get rid of the ‘junk’.
  • Complex systems like CRM or ERP will have many developers working to customise the solution for a specific client. Some of the developers may not have got the email about physical deletes being bad news.
  • So source system vendors physically move the record elsewhere and refer to it as ‘archiving’, or may only introduce archiving several months after going live.

Preparation

So let’s imagine we are about to embark of a new project, what do we do? I suggest you:

  • At the kick off the meeting, share with your client the risks that source system deletes do sometimes happen. Even if they tell you it never happens, at least you’ve raised the issue.
  • Suggest that the testing team, assuming you are lucky enough to have one, look out for this scenario. They should have a test case at least.
  • Push for the inclusion of a reconciliation feature for whatever you are building. It will make sure that target doesn’t diverge from the source system. It doesn’t need to be fancy, it can be a weekly check of row counts and totals that give IT and the business confidence
  • Be prepared that the test data you use during development will not have the variety of changes you can expect in production when you have gone live. This means you may only experience deletes in BAU\DevOps mode so look out for them!!
  • If you have access to a Change Data Capture (CDC) tool then think about using it as it will capture deletes, however, this comes with an overhead. You will get a whole load of noise and you will need to filter out inserts/updates if you don’t care about them.
  • If there is a choice between a traditional RDBMS like SQL Server and a cloud native DB like SnowFlake, push for SnowFlake, or similar, as they have CDC features out of the box.

Solutions

Ok, so what to do when you don’t have CDC, the client and source system vendor have told you physical deletes don’t happen but you get a service ticket raised to tell you that a report still contains a purchase order that no longer exists in the source?

Ask yourself:

  1. If deletes are only being done of POs in a draft state, can we exclude them from the pipeline and stop the data ever being loaded in into the analytics repository?
  2. Does the Fact, Table or Dim, that contains the deleted record identified in the service ticket, actually need history? I know, heresy, but for simple Data Mart solutions just go and get all the current POs.
  3. Do the deletes happen within a time window (i.e. an archive process that runs every month)? If this is the case delete and reload all POs created in the last months.
  4. If the data is too large, use a pipeline that runs nightly upserts of the Dim/Fact/Table and a separate one to carry out a weekly bulk reload of all POs.
  5. Does your database have inherent functionality to merge changes I.e INSERT, UPDATE and DELETE like SQL Server? If so this is an easy option but you will need to stage a complete set of the Post before merging them.
  6. If you have a lot of data, and want to retain history, use a pipeline that runs nightly upserts of the Dim/Fact/Table and a separate one to carry out a nightly bulk reload of all PO business keys into a staging area (e.g. stg.PO_BK). The staged table will be quick to load as it will be very narrow and then can be used to identify all POs that exist in the Dim/Fact/Table that are no longer present in the stage table (i.e. st.PO_BK). Once you have identified the rogue POs, do a soft delete in the appropriate Dim/Fact/Table.

Conclusion

It can be hard, if not impossible, to identify upfront this type of issue, often you will not be able to get evidence of this issue until the solution has been running for weeks, if not months. And please don’t suggest it is a data quality issue, it’s not, as analytical types we simply care about different things to the source system guys.

Being prepared for deletes is the secret with an integration pattern that can be quickly applied should get you out of trouble, hopefully this post has provided some options.

Image courtesy of pixabay.com

Leave a Reply

Your email address will not be published. Required fields are marked *