Cross-system source-data discovery and data mapping are among the most important, and the most overlooked, steps in a master data management (MDM) implementation. Before you can populate a new master with “trusted” data, you must analyze each potential source system individually and then perform cross-system analysis to define the survivorship rules that dictate which data, under which circumstances, will be used to populate the master. Once the master is populated, you must also map it to the downstream “consuming” applications, yet another level of cross-system data analysis. The bottom line is that cross-system data analysis and mapping are on the critical path for most MDM projects, often consuming over 40 percent of the time and effort in an MDM deployment.
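To make the idea of survivorship rules concrete, here is a minimal sketch in Python of how attribute-level rules might be expressed. The source systems, field names, sample data, and trust ordering are purely illustrative assumptions, not a prescription for any particular MDM tool.

```python
# Illustrative sketch of attribute-level survivorship rules (all names and data hypothetical).
# Each rule picks the "surviving" value for one attribute from competing source records.
from datetime import date

# Hypothetical candidate records for the same customer from two source systems
candidates = [
    {"source": "CRM", "email": "j.doe@example.com", "phone": None,
     "last_updated": date(2011, 3, 2)},
    {"source": "BILLING", "email": "jdoe@old-domain.com", "phone": "555-0100",
     "last_updated": date(2010, 11, 15)},
]

SOURCE_PRIORITY = {"CRM": 1, "BILLING": 2}   # assumed trust ordering

def survive_by_source(attribute):
    """Take the value from the most-trusted system that has one."""
    for rec in sorted(candidates, key=lambda r: SOURCE_PRIORITY[r["source"]]):
        if rec[attribute]:
            return rec[attribute]
    return None

def survive_by_recency(attribute):
    """Take the most recently updated non-null value."""
    for rec in sorted(candidates, key=lambda r: r["last_updated"], reverse=True):
        if rec[attribute]:
            return rec[attribute]
    return None

master = {
    "email": survive_by_recency("email"),   # freshest value wins for email
    "phone": survive_by_source("phone"),    # most-trusted system wins for phone
}
print(master)  # {'email': 'j.doe@example.com', 'phone': '555-0100'}
```

In practice the rules vary per attribute and per circumstance, which is exactly why the cross-system analysis that informs them takes so much of the project's effort.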
Most companies underestimate the difficulty of this work and tend to rely on the corporate memory of subject matter experts (SMEs). However, most SMEs know only their own system and not how its data relates to data in other systems. Determining the trusted source for a specific attribute then becomes a matter of SME opinion, and depending on those opinions, the result may be questionable data quality in the master and increased risk to the overall success of the deployment.
Fortunately, new best practices in data analysis automate single-system data profiling, cross-system source-data discovery, and detailed data mapping between systems. This article discusses each type of analysis and its appropriate use throughout an MDM project.
Single-System Data Profiling: A Necessary but Not Sufficient First Step
When it comes to data analysis, single-system data profiling is what data warehousing practitioners are most familiar with. Data profiling software scans the data values within the tables of a single data source and, based on those values, generates statistics about the data in each column, including, but not limited to, data type, value frequency, length, precision, scale, format, cardinality, mean, median, mode, minimum, and maximum.
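As an illustration of the kind of statistics a profiling tool derives, here is a small Python/pandas sketch that profiles the columns of a toy table. The table and column names are hypothetical, and real profiling tools compute far more than is shown here.

```python
# Minimal single-column profiling sketch using pandas (table and column names hypothetical)
import pandas as pd

df = pd.DataFrame({
    "customer_id": [101, 102, 103, 104, 104],
    "postal_code": ["02139", "10001", "9410", "02139", None],
})

def profile_column(series: pd.Series) -> dict:
    """Derive basic profiling statistics for one column."""
    non_null = series.dropna()
    stats = {
        "inferred_type": str(series.dtype),
        "null_count": int(series.isna().sum()),
        "cardinality": int(non_null.nunique()),                   # distinct values
        "min_length": int(non_null.astype(str).str.len().min()),
        "max_length": int(non_null.astype(str).str.len().max()),
        "top_values": non_null.value_counts().head(3).to_dict(),  # value frequency
    }
    if pd.api.types.is_numeric_dtype(series):
        stats.update({"mean": float(non_null.mean()),
                      "min": non_null.min().item(),
                      "max": non_null.max().item()})
    return stats

for col in df.columns:
    print(col, profile_column(df[col]))
```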
Profiling will also identify outliers within a given column. For instance, if most of the values in a column are integers from 1 to 100, a profiling tool will flag and report any values that are characters (a, b, c, etc.). Some of the better profiling tools will also identify and validate primary/foreign-key relationships, which are useful for understanding the structure of individual source systems.
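The following sketch shows, in simplified form, how those two checks might work: flagging values that do not conform to a column's dominant pattern, and testing whether one column's values are contained in another's (a candidate foreign-key relationship). The sample values, the 75 percent conformance threshold, and the column names are assumptions for illustration only.

```python
# Sketch of two profiling checks: pattern outliers and candidate foreign keys.
# Sample values, column names, and thresholds are illustrative only.

# 1) Outlier detection: most values are integer-like, so flag the ones that aren't.
values = ["12", "47", "88", "3", "b", "95", "seven", "61"]
conforming = [v for v in values if v.isdigit()]
outliers = [v for v in values if not v.isdigit()]

if len(conforming) / len(values) >= 0.75:   # only report when a clear majority conforms
    print("suspected outliers:", outliers)  # ['b', 'seven']

# 2) Candidate foreign key: every value in one column also appears in another column.
orders_customer_id = {101, 102, 104}
customers_id = {101, 102, 103, 104}
if orders_customer_id <= customers_id:      # set containment test
    print("orders.customer_id is a candidate foreign key to customers.id")
```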
Profiling is a necessary first step in almost any data analysis. In fact, both cross-system source-data discovery and data mapping, described below, rely on profiling as a first step. By itself, however, profiling only provides information about individual data sources, while MDM is by definition about driving consistency across multiple data sources. As a result, after profiling, most of the hard work is still left as a manual exercise for the user. This is why profiling provides only about 5 percent of the analysis you will need for your MDM project, and why automating cross-source data discovery and data mapping is critical for accelerating an MDM deployment.
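To show the flavor of what automated cross-source discovery does beyond single-system profiling, here is a rough sketch that compares the distinct values of columns in two hypothetical systems and suggests candidate mappings when the overlap is high. The system names, column names, sample values, and the overlap threshold are all assumptions; real tools use much richer matching techniques.

```python
# Rough sketch of cross-system source-data discovery via value overlap.
# System names, columns, sample values, and the threshold are illustrative.
crm = {
    "cust_no": {"101", "102", "103"},
    "email": {"a@x.com", "b@y.com", "c@z.com"},
}
billing = {
    "account_id": {"101", "102", "104"},
    "contact_email": {"a@x.com", "c@z.com", "d@w.com"},
}

def jaccard(a: set, b: set) -> float:
    """Overlap between two sets of distinct column values."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Suggest a candidate mapping whenever distinct values overlap heavily.
for crm_col, crm_vals in crm.items():
    for bill_col, bill_vals in billing.items():
        score = jaccard(crm_vals, bill_vals)
        if score >= 0.4:
            print(f"candidate mapping: CRM.{crm_col} <-> BILLING.{bill_col} "
                  f"(overlap {score:.2f})")
```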