Data Latency

(lāt′n-sē)

n. pl. la·ten·cies

The time interval between initiating a query, transmission, or process, and detecting an outcome.

The importance of timely data delivery continues to grow. More specifically, what matters is the latency of the data and, consequently, whether it is fit for the purpose of making decisions. These decisions can be based on what has just happened or on modelling a scenario that is about to happen.

This blog explores the challenges that arise when delivering Business Intelligence (BI) data. I have deliberately avoided discussing specific technologies in any detail; instead, my intention is to keep the discussion high level.

You will notice that I have so far avoided using the term ‘real time’. This is because the typical response when anyone expresses a need for ‘real time’ data is ‘what do you mean by real time?’, which isn’t particularly helpful. Other terms that are frequently used, and are similarly ambiguous, include:

  • Near Real Time 
  • Micro Batch 
  • Trickle Feed

I think the easiest way to start the discussion on data latency is to look at typical subject areas that are delivered in BI. Hopefully, this provides some context to this topic.

| Area | Example Data | Typical Latency Expectation |
|---|---|---|
| Budget/Forecasting | Monthly Sales Targets, Annual Operating Plans | 10 to 30 seconds during Business Planning or Forecasting phases. Not applicable outside these times. |
| Human Resources | Leave Balance, Sick Leave Without a Certificate | 12 to 24 hours |
| Material Handling | Stock Ageing | 12 to 24 hours |
| Finance | P and L, GL Transactions | 12 to 24 hours when not month end. 15 to 30 minutes at month end or year end. |
| Operational | | |
| SCADA Systems | Circuit Breaker Operations | Sub-second |
| Point of Sale | Sales Transactions | 1 to 5 minutes |
| Vehicle Movements | Longitude and Latitude Measurements | 10 to 30 seconds |
| Financial Markets | Equity, Options Trades | Sub-second |
| Logistics | Train, Crane or Container Movements | 1 to 5 minutes |

Table: Typical Latency

As you can see, there is a range of latencies, driven by what the business requires.
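Once latency expectations like these have been agreed, they can be written down as data and checked automatically. The following is a minimal sketch in Python, not a definitive implementation: the names `MAX_LATENCY` and `is_fresh` are hypothetical, and the limits simply take the upper bound of three of the ranges listed above.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical targets: the upper bound of the agreed range for each area.
MAX_LATENCY = {
    "human_resources": timedelta(hours=24),
    "point_of_sale": timedelta(minutes=5),
    "vehicle_movements": timedelta(seconds=30),
}

def is_fresh(area, last_loaded_at, now):
    """Return True if the newest loaded record is within the agreed latency."""
    return now - last_loaded_at <= MAX_LATENCY[area]

# A fixed "now" keeps the example repeatable.
now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
last_load = now - timedelta(minutes=3)

print(is_fresh("point_of_sale", last_load, now))      # True: 3 minutes is within 5 minutes
print(is_fresh("vehicle_movements", last_load, now))  # False: 3 minutes exceeds 30 seconds
```

The same three-minute-old load is acceptable for one subject area and stale for another, which is why a single definition of ‘real time’ rarely helps.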

If I were a business user with little knowledge of the technical challenges in delivering reliable, low-latency data, I would quickly become frustrated with the subtleties and simply say to my Data and Analytics department ‘just make everything real-time!’, without realising that this comes at a price that depends on the:

  • Definition of Real-Time
  • Technology Required
  • Support, Service Level Agreements
  • Hardware Required

I recommend that:

  • You ask your business users for their definition of acceptable data latency. Be prepared to guide them and to challenge them where necessary. Don’t go near the term ‘real time’.
  • If the business defines acceptable latency as the data being less than 30 minutes old, challenge them. What decisions are they going to make using the data, and do these decisions justify the requirement? I assume that you’re not working in an industry, like financial markets, where the need is blatantly obvious.

Okay, let’s assume you and the business have convinced each other that ‘low’ latency data is necessary. Now consider:

  • If the source is an AS400, ERP system, or similar that needs to run an overnight batch process (e.g. to calculate loan balances, leave balances, stock levels, etc.), there is little point delivering data to a user in real time, as it is not ready for consumption until the nightly batch process has run in the source system.
  • Is the data quality so poor that cleansing, matching, de-duplicating and other time-expensive tasks will be needed, preventing the data latency objective expressed by the business from being met?
  • Is the data volume so high as to give the system administrators or network guys a headache if the data integration occurs during business hours? 
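The first point can be shown with some simple arithmetic. This is an illustrative sketch only: `worst_case_age` is a hypothetical helper, and it assumes the figure of interest is recalculated once every 24 hours by the nightly batch.

```python
from datetime import timedelta

def worst_case_age(source_refresh_interval, delivery_latency):
    """Oldest a value can be when the user sees it: the source only
    recalculates once per interval, and delivery adds its own delay."""
    return source_refresh_interval + delivery_latency

nightly_batch = timedelta(hours=24)

print(worst_case_age(nightly_batch, timedelta(hours=1)))    # 1 day, 1:00:00
print(worst_case_age(nightly_batch, timedelta(seconds=1)))  # 1 day, 0:00:01
```

Cutting the delivery latency from an hour to a second makes almost no difference to how old the figure can be, because the source system's own batch cycle dominates.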

Now that we know the need is justified, and that the source and the infrastructure are suitable or can be made suitable:

  • Consider what sort of analytics you are delivering. If they are operational reports, why not create them directly off the source, or use an approach like database replication or materialised views to keep reports off the source system while still delivering acceptable latency?
  • Make a decision on what you are building as a data repository. Is it a data lake, data vault, data mart, data warehouse, Operational Data Store (ODS), 3rd normal form database as part of an application, etc.?
  • Decide on the minimum number of architectural layers (e.g. land, stage, key generation, presentation) that you can get away with. Less is more when reducing latency.
  • Decide how you want to structure the data for presentation. The traditional approaches (e.g. dimensional) may not be the most appropriate; other models, like the ensemble approach, can offer advantages in parallel data processing and therefore speed.
  • Consider how you will achieve the initial historical load of data into your chosen repository. If you need to load several years of data before you start to capture changes, this will affect the solution you implement. And don’t be tempted to think ‘I will never need to reload the data from source’. More often than not you will, so don’t burn your bridges.
  • Think about how you are going to test the solution being requested. A low latency stream of data can be a real challenge to test if you are not able to re-run the test scenario based on a known dataset. 
  • Evaluate which of the many techniques and technology combinations will meet the unique data latency requirements you have agreed with your business users, for example: 
      • Enterprise Service Bus (ESB) with the required performance and throughput.
      • Change Data Capture (CDC) replication between the source and destination data repository.
      • Database Trigger replication if you can convince the DBA that this is a good idea.
      • Trickle Feed/Micro Batch via Extract, Transform and Load (ETL).
      • Batch via Extract, Load and Transform (ELT) where the data is exposed via a Web Service.
      • Reporting directly from source if you can convince the DBA that this is a good idea.
      • Materialised Views with an appropriate refresh rate.
      • Event Stream Processing (ESP) or Complex Event Processing (CEP) technologies.
      • Etc.
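To make one of these options concrete, here is a minimal sketch of a trickle feed/micro batch that polls for changed rows using a high-water mark. It is an illustration under stated assumptions, not a recommendation of a specific technology: the tables `sales` and `sales_copy`, the integer `updated_at` column and the function `extract_micro_batch` are all hypothetical, and in-memory SQLite databases stand in for the real source and repository. Log-based CDC tools work differently, reading the database's transaction log rather than polling. Because the dataset is small and known, the run can be repeated, which also helps with the testing concern raised above.

```python
import sqlite3

def extract_micro_batch(source, target, watermark):
    """Copy rows changed since `watermark`; return the new watermark."""
    rows = source.execute(
        "SELECT id, amount, updated_at FROM sales "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    # INSERT OR REPLACE makes the load idempotent for a given primary key.
    target.executemany(
        "INSERT OR REPLACE INTO sales_copy (id, amount, updated_at) "
        "VALUES (?, ?, ?)",
        rows,
    )
    target.commit()
    # Keep the old watermark when nothing has changed.
    return rows[-1][2] if rows else watermark

# A known, re-runnable dataset standing in for the source system.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL, updated_at INTEGER)"
)
source.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)", [(1, 10.0, 100), (2, 20.0, 110)]
)

target = sqlite3.connect(":memory:")
target.execute(
    "CREATE TABLE sales_copy (id INTEGER PRIMARY KEY, amount REAL, updated_at INTEGER)"
)

watermark = extract_micro_batch(source, target, 0)  # first cycle: both rows
source.execute("UPDATE sales SET amount = 25.0, updated_at = 120 WHERE id = 2")
watermark = extract_micro_batch(source, target, watermark)  # second cycle: one change

print(watermark)  # 120
print(target.execute("SELECT id, amount FROM sales_copy ORDER BY id").fetchall())
# [(1, 10.0), (2, 25.0)]
```

The achievable latency of this approach is roughly the polling interval plus the time each cycle takes. Note that the strict `>` comparison can miss a row that is committed later with the same timestamp as the watermark, so a real implementation would need to handle ties and deletes.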

Even if the topic of data latency has not raised its head, in general I recommend that you build solutions that:

  • Have intraday integration capabilities by default, even if they are not immediately used. 
  • Have a parallel data processing behaviour by default. Just because you have been given a ‘load window’ of three hours per night, don’t use it all as a result of lazy programming or sequencing.
  • Minimise source system impact. Implement a solution that avoids locks on source tables and takes the minimum amount of time to extract the data. 
  • Do not ‘shoehorn’ a technology (e.g. a micro batch ETL solution with queueing) into meeting the data latency requirements if it isn’t natively able to. If there is an off-the-shelf application that meets the needs, then use it. You will save yourself a lot of pain.
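As an illustration of parallel processing by default, the sketch below loads four independent tables concurrently instead of one after another. It is a toy example under assumptions: `load_table` is a hypothetical stand-in that merely sleeps to simulate waiting on the source or network, and the table names are invented.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def load_table(name):
    """Stand-in for an independent extract-and-load step (hypothetical)."""
    time.sleep(0.2)  # simulates I/O wait on the source or network
    return name

# Tables with no dependencies on each other can be loaded side by side.
tables = ["customers", "products", "stores", "calendar"]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    loaded = list(pool.map(load_table, tables))  # results keep input order
elapsed = time.perf_counter() - start

print(loaded)
print(f"{elapsed:.1f}s in parallel vs roughly {0.2 * len(tables):.1f}s sequentially")
```

Threads suit this case because the work is mostly waiting on I/O. Loads that depend on each other (e.g. facts that need dimension keys) still have to be sequenced, but everything else can run in parallel.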
