Part of the the series: Doing Power BI the Right Way
Moving important business data into a data model for analytic reporting can often be a two-edge sword. Data retrieval is fast and can support all kinds of analytic trending and comparisons. But, data in the model may be one or two layers away from the original source data, making it more challenging to compare with familiar user reports. Often the first validation effort after transforming and loading data into the model and then visualizing the initial results is having a business user say “yeah, that looks about right.” Then, sometime later after more development and extensive user testing, the feedback might be “hmm, that seems a bit off.” …not exactly scientific.
I have been doing a lot of data validation work lately – both formally and informally. Informally: Validating calculation results from a Power BI data model is just a normal part of the development process. Formally: After developing and delivering an entire business dashboard solution, a formal validation process is used to validate the ongoing results after future data refresh cycles and to certify reports so that business leaders know they can trust them.
Particularly for database developers, making the leap to using reports built on a tabular data model instead of the source database is a mental transition. The developer has to stop thinking in SQL and start thinking in model mechanics. This can pose a challenge when needing to validate numbers in a report, which should be done using the data model and not only by running by a SQL query.
Validation 101: Create & Use COUNT Measures
When designing a new data model, this is typically the first thing I do… For every fact table and for each large dimension table, I create a measure that returns the record count for that table. Users normally think about their data in business terms (like sums, ratios and averages) and not about how many records there are in a table. Record counts are a convenient sanity check for record completeness and distribution across groups; and may also be a good indicator for model size and performance.
Let’s say I have a data model containing the tables: Sales, Orders, Customers, Products and Date. At minimum, I will create the following measures with code similar to this:
Sales Row Count = COUNTROWS( ‘Sales’ )
Orders Row Count = COUNTROWS( ‘Orders’ )
Customer Count = COUNTROWS( ‘Customers’ )
These measures allow you to quickly troubleshoot problems that might otherwise go unnoticed – with model relationships, key matching, orphaned records and missing records. Just drag-and-drop the measure and a field to group the values. For example, if I drag the Sales Row Count measure and Year from the Date table onto the report canvas, I can immediately see sales records across years. If I do the same thing with the customer region field, I might be able to spot a gap in the region distribution of records over time. If every column in a column chart (created by default when dragging a numeric field) shows 100%, then I know that I either forget to create a relationship, have a mismatched data type or forgot the “Set as Date Table” setting for the Date table. Easy peasy.
Can’t I just count any field value rather that creating explicit measures? – Yes but don’t. Coming from an SSAS multidimensional background, I learned to define measures for every useful numeric metric. If we needed to show the SUM or the COUNT of anything, we needed to define a measure for that. In Power BI and SSAS Tabular, you can simply use implicit measures – meaning you can just drag-and-drop any field and then indicate that you want to count each occurrence or sum up the numeric value of the field. For quick-and-dirty Power BI reports, this is a convenient way to get simple aggregate results but not a best practice in my opinion because of inconsistent behavior and other limitations. There is a growing consensus in the Power BI community that using explicit measures should be the norm and not the exception. If you always use explicit measures rather than aggregating numeric columns, users and report developers will always have a consistent experience.
Validation 201: Measure Dependency Chains
This is a fairly normal pattern:
- Create a simple base measure using a simple expression, let’s say:
- Create additional measures that reference and then perform some logic on the first measure. For example:
[Employee Count Females] and [Employee Count Males]
- Layer another new measure on the second layer of measures. Like:
[Female to Male Employees Ratio]
- Create another measure to apply special business logic, let’s say to conditionally apply row-level security and non-visual totals:
[Female to Male Employees Ratio (RLS Adjusted)]
- ..and so on
This is a simple example but it is not unusual. Reusing measures helps to separate reusable calculation logic but it can also create deep dependency chains. At some point, you will inevitably discover a calculation error that you will need to track-down and debug. I can’t remember my own name some mornings so I certainly cannot work through the mental gymnastics needed to keep track of every dependent measure.
When I have a visual that is not displaying the expected results, I usually do the following:
- Duplicate the report page and prefix the page name with TEST: (something meaningful)
- Delete visuals not needed for testing
- Convert the visual to a table so I can just see the numbers grouped by a field
- Drag the referenced measures into the table to create multiple columns. Validate each dependent measure.
This helps to break the problem down into manageable steps and components. If you know that the first measure is always behaving as expected, move on to the next measure that depends on it. If the problem isn’t immediately apparent, recreate the same conditions with slicers. By isolating the problem, I can find and fix the issue every time.
How to Track Object Dependencies
Keeping track of dependencies between layers of measures can be challenging. I keep a scratchpad on my desk to make lists of things I need to remember. Sometimes I don’t even save my notes but it just helps me organize my thoughts. I use a whiteboard for a similar purpose, diagramming dependencies between measures and other objects with boxes and lines. Just standing and diagramming on the whiteboard is a very effective problem-solving technique for me. Electronic tools like OneNote and Microsoft Whiteboard can be just as effective if you are in the habit of using them.
The Object Dependencies feature in Tabular Editor is pure magic and a tool I am using more frequently to test, debug and to document my data models.
On a different but related topic… Create separate measures for calculations that will be used in visuals and reused by other measures but having too many measures can be difficult to manage. Strike a balance between code reuse and a more succinct, easier-to-manage portfolio of measures. If you have to jump through seven hoops (e.g. trace multiple dependent layers of measures) every time you need to debug or validate a calculated value, you probably have too many dependent measures.
Organize measures but don’t over-organize them… I prefer to put all my measures in a central table or at least assign them to just a few subject area fact tables. Don’t leave measures in different tables, scattered throughput the data model. Although I’m a fan of using display folders, it can be more work to assign measures to several different folders than it is just to use meaningful naming conventions. The implementation of display folders in Power BI Desktop in my opinion is an unfinished feature. If you want to use display folders to organize measures within container tables, use Tabular Editor. Name measures so they’re predictable and easy to find in a sorted list. For example, if you will have a series of measures based on “Sales Amount“, don’t name the variations “MTD Sales” or “Ratio of Sales to Prior Year“. Instead name them with the same prefix, like “Sales Amount“, “Sales Amount MTD“, “Sales Amount YTD“, “Sales Amount Ratio to Prior Year” and “Sales Amount Region % of All Regions“. This way, they all get sorted and grouped together in the field list. The theme here should be finding the right balance to organize your objects while keeping things simple.
Validation 301: Creating a Validation Report Solution, Part 1
This is a big topic and I have a learned some great techniques which will be covered in a follow-up post. So, let’s call this teaser.
How do users and business leaders know with certainty that they can trust the information in a Power BI report? There are a variety of factors that can affect data reliability. If the data source is known to be correct, there could be issues in the extraction, staging, loading, transformation or the calculations used to present the information in an analytical report. Testing the report with data at a point in time is a good place to start but it doesn’t absolutely guarantee accuracy down the road. If we had a report with a simple set of aggregate values read directly from the business source of truth alongside calculated values from the data model after it has been refreshed, this can provide a litmus test to check and instill congoing confidence. When (I said “when” not “if”) a user tells us that report information isn’t right, we can confidentially and easily prove that analytic report information from the data model is consistent with the source. This is what a validation report is all about.
To set the stage for a longer post on this topic, think about this typical scenario:
Every month we load fresh manufacturing production volume and forecast records into our data mart or data lake, and then into the Power BI data model. The results of the data model can be used as a data source for the validation report. Using the XMLA endpoint, you can run a DAX query against the published data model to get the values shown in a report visuals complete with filters and slicers. For comparison, you can load data from a report produced by the business system, load those values into a the validation data model and compare the monthly totals. We can ne flag the data that does and doesn’t match, and calculate the differences.