Risks when working with data

Data Risk

Every IT project has a risk profile that we need to manage in order to increase the likelihood of success.

So I have decided to list the most frequent risks I face in delivering projects that involve data.

This means I have deliberately excluded generic IT project risks such as Executive Sponsorship, Change Management and Unrealistic Expectations, as you can come across these in any project.

So, without further ado, here they are with their mitigations:

Data Source

Risk

Absence of one or more of the following with regard to the data source(s):

  • Subject Matter Expertise (SME).
  • Entity Relationship Diagram (ERD).
  • Data Dictionary.

These are vital to ensure that the source data is well understood from the perspective of:

  • Which tables and columns contain the data required.
  • How the tables are populated.
  • What the relationships are between the tables.
  • When the data is ready for extraction (e.g. between 2:00am and 6:00am).
  • What inherent business rules exist within the data.

Consequence

As the data source(s) and business rules are only discovered during the life of the project, the result is:

  • Reworking of the Reports, Datasets, Dashboards, etc.
  • Loss of confidence in the project and team by sponsors.
  • Additional costs and effort required.

Mitigation

This risk is not easy to manage, and it is likely that it will only be partially mitigated by:

  • Use of a data profiling tool to learn about the business rules and patterns.
  • Reverse engineering the source ERD using Erwin, PowerDesigner, Visio, etc.
  • Requesting access to the source system front end so that you can interact with screens and learn about the data that way.
  • Reverse engineering existing source reports, if they are available. This will help you discover the data relationships and rules embedded within them.
  • Scheduling time in the project plan for data discovery. Focus on nightly batch processes, month ends, year ends, etc.
  • Communicating with your sponsors to keep them aware of the problems being faced and ask them to escalate the issue(s) if necessary. Vendors are much more forthcoming with an ERD when the CIO is on the phone!
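
To give a feel for what a data profiling tool reports, here is a minimal sketch in plain Python. It is not a substitute for a real profiling product. The sample rows, the column names and the `profile_column` helper are all made up for illustration. It counts nulls and distinct values for a column and reduces each value to a simple shape (digits become 9, letters become A), so that odd values stand out.

```python
import re
from collections import Counter


def value_pattern(value):
    """Reduce a value to a shape: digits become 9, letters become A, the rest is kept."""
    text = re.sub(r"[0-9]", "9", str(value))
    return re.sub(r"[A-Za-z]", "A", text)


def profile_column(rows, column):
    """Summarise one column from a list of dict rows."""
    values = [row.get(column) for row in rows]
    # Treat None and empty strings as missing values.
    present = [v for v in values if v not in (None, "")]
    patterns = Counter(value_pattern(v) for v in present)
    return {
        "rows": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
        "top_patterns": patterns.most_common(3),
    }


# Hypothetical sample rows standing in for an extract of a source table.
sample_rows = [
    {"customer_id": "C001", "postcode": "2000"},
    {"customer_id": "C002", "postcode": "3000"},
    {"customer_id": "C003", "postcode": ""},
    {"customer_id": "C004", "postcode": "N/A"},
]

for name in ("customer_id", "postcode"):
    print(name, profile_column(sample_rows, name))
```

In this made-up sample the postcode column shows one missing value and one value with the shape A/A, which is the kind of undocumented business rule ("N/A means unknown") you want to find early rather than during UAT.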

Data Quality

Risk

Data quality issues in the source data that make it unfit for the intended purpose.

Consequence

As issues are slowly discovered during the project, they will result in:

  • Additional effort for the project team to diagnose the issues and arrange to have them addressed at the source if possible.
  • The project team turning themselves inside out trying to code for data anomalies.
  • Loss of confidence in the project and team by the business sponsors.
  • Reworking of the Integration, Reports, Datasets, Dashboards, etc.

Mitigation

Prior to starting the project, evaluate the data quality (DQ) via:

  • Interviews with source system administrators, data owner/steward.
  • Workshops with users to obtain their opinion.
  • Use of a DQ Profiling tool to measure the actual quality.

If the quality is considered ‘poor’, then:

  • Allocate additional time for the project team to address DQ, or
  • Set up a separate project to address DQ prior to commencing the BI project.
  • Keep your sponsors in the loop, without pointing too many fingers at the poor source system administrator.
  • Create simple dashboards so that the actual DQ can be monitored and a sense of ownership can be created with the source system owner(s). See my blog for an example.
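
The numbers behind such a dashboard can be very simple. The sketch below assumes a handful of hypothetical DQ rules (the rule names, the columns and the allowed status values are invented for illustration) and reports the percentage of rows that pass each one. A real project would agree the rules with the data owner first.

```python
from datetime import date

# Hypothetical rules: each name maps to a function returning True when a row passes.
DQ_RULES = {
    "email_present": lambda row: bool(row.get("email")),
    "birth_date_not_in_future": lambda row: (
        row.get("birth_date") is not None and row["birth_date"] <= date.today()
    ),
    "status_is_known": lambda row: row.get("status") in {"ACTIVE", "CLOSED"},
}


def score_rules(rows, rules):
    """Return the percentage of rows passing each rule (None if there are no rows)."""
    scores = {}
    for name, check in rules.items():
        passed = sum(1 for row in rows if check(row))
        scores[name] = round(100.0 * passed / len(rows), 1) if rows else None
    return scores


# Made-up customer rows standing in for a source extract.
dq_sample = [
    {"email": "a@example.com", "birth_date": date(1980, 1, 1), "status": "ACTIVE"},
    {"email": "", "birth_date": date(1990, 5, 17), "status": "PENDING"},
    {"email": "c@example.com", "birth_date": None, "status": "active"},
    {"email": "d@example.com", "birth_date": date(2001, 12, 31), "status": "ACTIVE"},
]

for rule, pct in score_rules(dq_sample, DQ_RULES).items():
    print(f"{rule}: {pct}% of rows pass")
```

Running the same rules on a schedule and charting the percentages over time is usually enough to start the ownership conversation with the source system owner(s).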

Architecture and Technology

Risk

This is a topic that is dear to my heart, as I think it can add huge risk to projects. Examples of issues relating to architecture include:

  • Excessive architectural layers, each of which results in more integration.
  • Integration technology that requires hand coding and does not complement the architecture chosen.

Consequence

The consequences are:

  • Waterfall-type project delivery, even if you are aiming for agile.
  • Slow progress as each layer of the architecture requires more ‘plumbing’.
  • Frustration from the sponsor as they see very little tangible progress. All the effort is behind the scenes.
  • Additional costs and effort required.

Mitigation

The following steps should be taken:

  • Adopt an architecture that leverages a Data Lake layer where data can be rapidly ingested and exposed to end users. This will enable a bimodal delivery of information to the end user.
  • Make sure that ‘hand coding’ is kept to a minimum. Choose technology that generates as much of the integration as possible from metadata.
  • Keep the number of architectural layers to a minimum. Each one can introduce significant cost.
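
To show what generating integration from metadata can look like, here is a small sketch. The mapping structure, the table and column names and the `build_load_sql` helper are assumptions made up for this example, not the output of any particular tool. The point is that adding a table becomes a metadata change rather than another hand-coded job.

```python
# Hypothetical mapping metadata; in practice this might live in a table or a config file.
MAPPINGS = [
    {
        "source": "crm.customer",
        "target": "lake.customer",
        "columns": {"cust_id": "customer_id", "cust_nm": "customer_name"},
    },
    {
        "source": "crm.orders",
        "target": "lake.sales_order",
        "columns": {"ord_id": "order_id", "cust_id": "customer_id", "ord_dt": "order_date"},
    },
]


def build_load_sql(mapping):
    """Generate an INSERT ... SELECT statement from one mapping entry.

    The identifiers are assumed to come from trusted metadata, not from user input.
    """
    select_list = ",\n    ".join(
        f"{src} AS {tgt}" for src, tgt in mapping["columns"].items()
    )
    target_columns = ", ".join(mapping["columns"].values())
    return (
        f"INSERT INTO {mapping['target']} ({target_columns})\n"
        f"SELECT\n    {select_list}\n"
        f"FROM {mapping['source']};"
    )


for mapping in MAPPINGS:
    print(build_load_sql(mapping))
    print()
```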

Scheduling and Delivery

Risk

This risk relates to the delivery approach chosen (e.g. iterative, agile, etc.) and commencing report or dashboard build when the source system or data foundation is not mature. This is typically due to having:

  • Too many resources, too early in the project life and therefore struggling to keep them busy.
  • Too aggressive a timeline.
  • Too many architectural layers for the data to go through before it reaches a consumer.
  • Building a data solution at the same time as the source system is being built.

Consequence

The consequences are:

  • Rework of the reports, dashboards, etc. as the data and data model continue to evolve.
  • Team tension and dysfunction as they trip over each other’s activities.
  • Additional costs and effort required.

Mitigation

The following steps should be taken:

  • Resource the project team with multi-skilled staff who can both prepare and present data. This way the team transitions simply through the various project phases and task allocation is much easier.
  • Adopt an architecture that leverages a Data Lake layer where data can be rapidly ingested and exposed to end users. This will enable a bimodal delivery of information to the end user.
  • Wait until the source system (e.g. CRM) has been built, and is live, before you start building the analytics (e.g. an EDW) that leverage it.

Evolving Requirements

Risk

Requirements changing when users finally see the reports and data they asked for. You may think this is true of all projects, but I believe it is more significant with data, as it is always difficult for end users to imagine the data, and the message it delivers, until they actually see it.

Consequence

  • Scope creep and/or numerous change requests.
  • Tension between the project team and the business.
  • Rework of the reports, dashboards, etc. as they evolve.
  • Additional costs and effort required.

Mitigation

  • ‘Show and tell’ frequently and include budget for making incremental changes.
  • Adopt an architecture that leverages a Data Lake layer where data can be rapidly ingested and exposed to end users. This will enable a bimodal delivery of information to the end user.
  • Provide users ad hoc access to the data via Power BI, Tableau, etc. This will allow them to become familiar with the data and make adjustments earlier in the project.
  • Schedule and estimate for report refinements after/during UAT.

Data Volume and Latency

Risk

Large data volumes or ‘real time’ data expectations.

Consequence

  • More data needs more time to load or reload. It’s that simple!
  • Greater demands are placed on the technology available which may not be met by the available stack.
  • Reduced ability to be iterative in applying changes requested as it takes too long to reload the data when required.
  • Greater difficulty loading the data within the available window.
  • Additional costs and effort required.
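
A quick back-of-the-envelope calculation makes the load window problem concrete. All the figures below (row counts, throughput, a four-hour window such as 2:00am to 6:00am) are assumptions chosen for illustration. Measure your own throughput before relying on numbers like these.

```python
def estimated_load_hours(row_count, rows_per_second):
    """Estimate load time in hours from a row count and a sustained throughput."""
    return row_count / rows_per_second / 3600


def fits_window(row_count, rows_per_second, window_hours):
    """Return True when the estimated load time fits inside the load window."""
    return estimated_load_hours(row_count, rows_per_second) <= window_hours


# Assumed figures for illustration only.
full_rows = 500_000_000        # full reload
daily_changed_rows = 2_000_000  # new or changed rows per day
throughput = 20_000             # rows per second
window = 4                      # hours, e.g. 2:00am to 6:00am

for label, rows in (("full reload", full_rows), ("daily changes", daily_changed_rows)):
    hours = estimated_load_hours(rows, throughput)
    print(f"{label}: {hours:.2f} hours, fits window: {fits_window(rows, throughput, window)}")
```

With these assumed numbers the full reload takes close to seven hours and misses the window, while the daily changes load in a couple of minutes.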

Mitigation

  • Only load a subset of data during the initial phases. This is when the largest number of data model changes will be occurring. Don’t be blindsided, though, by performance issues in the presentation layer once the full dataset is loaded.
  • Identify the attributes in the source that are guaranteed to identify new or changed records. Make sure they are updated both by screen changes and by internal source processes.
  • Use a Change Data Capture (CDC) technology, if available, to identify new or changed records.
  • Really ask the question, ‘Do I need real-time data?’ We often feel we need it, but so often it is not needed.
  • Use cloud-based storage and integration that can scale easily as the data volume or latency requirements change.
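
Where CDC is not available, a common fallback is a 'high-water mark' on a last-modified column. The sketch below uses an in-memory SQLite database so it can be run as-is. The `customer` table, its `modified_at` column and the `extract_changes` helper are hypothetical, and the approach only works if the source really does maintain that column on every change.

```python
import sqlite3


def extract_changes(conn, last_watermark):
    """Return rows modified after last_watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, name, modified_at FROM customer "
        "WHERE modified_at > ? ORDER BY modified_at",
        (last_watermark,),
    ).fetchall()
    # Keep the old watermark when nothing changed.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark


# Hypothetical source table; ISO-8601 timestamps sort correctly as text.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, modified_at TEXT)"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [
        (1, "Ann", "2024-01-01T09:00:00"),
        (2, "Bob", "2024-01-02T10:30:00"),
        (3, "Cy", "2024-01-03T08:15:00"),
    ],
)

changes, watermark = extract_changes(conn, "2024-01-01T23:59:59")
print(changes)
print(watermark)
```

Store the returned watermark after each successful load and pass it to the next run. Note that a strict "greater than" comparison can miss rows that share the watermark's timestamp but are committed later, so some teams re-read a small overlap and de-duplicate on the key.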

Hopefully you can use this blog to augment your project-specific risks and improve the chance of a successful outcome.
