Doing Power BI the Right Way: 2. Preparing, shaping & transforming source data

Paul Turley

4 years ago

Part of the the series: Doing Power BI the Right Way (link)

In a business intelligence solution, data must be shaped and transformed. Your data is rarely, if ever, going to be in the right format for analytic reporting. It may need to be consolidated into related fact and dimension tables, summarized, grouped or just cleaned-up before tables can be imported into a data model for reporting. If source data is in a relational database, chances are that it is optimized for transactional processing or application development. The column names, data types and formats are likely not suitable nor user-friendly for simple reporting. If source data exists in files or spreadsheets extracted from business reports, without data preparation you are likely just to reproduce the existing reports without adding business value.

This important topic raises one of the most critical questions to consider when designing a Power BI project:

Where should I shape and transform data?

Data can be shaped and transformed in at least three different layers in the business intelligence process:

At the source
In Power Query
In the BI data model

Your options for transforming data at the source depend on the capabilities of the data source. In general, a query running closer to a data source will be more efficient than one that must transfer a set of data, cache and manipulate it downstream. If you are importing text files or Excel files, then your only choice is to import an entire file and then potentially throw away what you don’t need – which is fine for small-ish data sets.

The right side of the previous diagram depicts data transformed in the data model. Some data morphing can be performed in DAX using calculated columns or calculated tables. In rare cases, there may be good reasons to do this but if you can perform row-level calculations in a database view or in Power Query, you will be better off. It is also easier to maintain the solution if you can keep all that query logic in the same layer so you are only using one design tool.

Data shaping at the Source

If we break the first choice down a little further, we could prepare data in an Extract, Transform & Load (ETL) process and populate a new table in the source database with correctly-shaped and formatted data for BI reporting. This is the approach used in traditional data warehouse solutions, typically under the control of an IT-managed data warehouse development group.

Data Warehouse, Data Mart or Data Lake

Back in the day, the conventional approach was to move data from sources into a staging database, and then to move it again into a dimensional database that was structured to support analytic reporting. I say this in past-tense because that was the way we always did it 10-20 years ago. Now, in the age of self-service analytics, it has become more common to build reporting projects on-the-fly and to transform data without moving all the data from one location to another before it arrives in the BI data model. However, there are limits to just how much data we can transform and how much complex transformation can be done in an ad-hoc fashion. The water here is muddy and it is difficult to say that one approach is always preferable to another but one thing is very clear: a Data Warehouse is very necessary in cases where data governance, high volume and centralized management are required.

If requirements are well defined and unlikely to change, building a data warehouse or dimensional data mart might be the ideal approach. However, data warehouse projects can get stuck in long cycles of over-planning and requirement gathering which has led many organizations to think about about BI as an extension of business rather than IT processes. Most organizations have come to grips with the reality that the vast majority of BI projects are conducted after databases and data warehouses have been completed. Over decades, we have seen hundreds of companies chasing the Utopian dream of the “perfect data warehouse” along with Master Data Management, data cataloging, cleansing, governance and stewardship. Each of these goals can be achievable but are potential rabbit holes. Experienced data practitioners have learned to apply their experience to time-boxed and resource constrained projects with the goal to create “good enough” solutions that don’t completely drain IT budgets and resources.

The data lake has become a viable option for staging and storing data before moving it into a structured model because relational storage can be expensive. Data lakes provide a low-cost means to store a lot of data in the cloud. We usually think of the “modern data warehouse” platform consisting of a data lake and dedicated data warehouse database. Flavors of the lake/warehouse approach could include using a lake AS a warehouse, using a standard relational database to store a small/medium scale warehouse or using a dedicated parallel processing DB platform (e.g. Synapse, Snowflake or Cosmos DB) to store a large-scale data warehouse.

Power BI and the Data Warehouse

Bringing the discussion back to focus on Power BI queries, data models and reports; let me put a bottom line on this multifaceted topic. I believe that there are three data transformation scenarios related to Power BI design:

Business-Centric, Informal and Small Model/Report Solutions
Power BI makes it easier to transform data as you go. In small projects, using this approach saves time and money, allowing business professionals to meet your own reporting needs. End of story… win-win.
Interim POC Solutions
Without a formal data warehouse, Power BI can be used to munge-together data from multiple sources and discover requirements that a data warehouse solution would provide. Use Power BI to build a prototype or proof-of-concept solution that can eventually evolve into a formal solution. Accept the follies and trade-offs of ad hoc data models and use them to evolve toward the right solution using iterative design.
Use Power Query to perform all necessary transformations and cleansing operations.
Formal Data Model & Report Solutions
Once properly governed and secured data warehouse/marts exist, rebuild Power BI data models and reports. The “rebuild” effort could encompass anything from switching data sources in an existing report project or starting from scratch. Either way, you are saving time by applying learnings and design elements. If additional query logic is needed, use database views to promote query folding in Power Query.

The simple fact is the making such a transition is never “done”. It is a journey that requires iterative thinking and management. Accepting and embracing this reality will save everyone a lot of angst. With data cleansing and transformation applied to ingested data warehouse data using tools like Azure Data Factory (which may utilize Power Query or SSIS), SSIS or Databricks, data sources for the Power BI project may simply be tables or views.

Where and what is Power Query?

Power Query is the data transformation technology built into Power BI. It is an extremely flexible and extensible data shaping platform that evolved from tools like SSIS, DTS, Excel, Microsoft Query and Access. When you choose Get Data or Transform Data from Power BI Desktop, this opens the Power Query editor. The Power BI service includes an online implementation of Power Query called Dataflows which allows queries and entities to be share between Analysts creating different data models. Power Query is also now integrated with Excel (when you select Get Data), the Visual Studio SSAS Tabular model designer and Azure Data Factory.

Learn to leverage Power Query to perform data transformation efficiently to perform all row-level transformation and calculations. In general, Column naming and data type changes should be performed in Power Query rather than embedded SQL or DAX.

Understand Query Folding

This is a critical concept to understand about Power Query. For supported data sources and whenever possible, queries you create in the Power BI query editor will be converted to native queries to be executed at the source. This means that if you are using SQL Server, rather that transferring all the data from the database engine to the Power BI service before performing transformations, steps may be converted to an optimized TSQL query. This is an ideal scenario for high-volume fact tables.

Query folding only works when Power Query can select records from a single database object like a database table, view or table-valued user-defined function.

Hard pills to swallow

For relational database professionals, there are a few realities about Power BI that can be a shock to their pattern for solving data problems. Without getting into all the details about why things are the way they are, it is very important to understand and embrace these points:

SQL queries do not fold – If you hand-write a SQL query or copy and paste SQL script into a new query within Power BI, Power Query cannot fold – and therefor cannot optimize – that query. This doesn’t mean that this is wrong or necessarily a bad practice. If you need to populate a relative small table in the data model that doesn’t need to be filtered, grouped, sorted or otherwise transformed in ways that could otherwise be optimized; this is probably fine.

Selectable database objects can be folded – If you can use a single database table or create a view in the database, particularly for high-volume fact tables, these queries will be faster and will consume fewer server resources.

Stored procedures – Power Query doesn’t include specific optimizations or support for stored procedures. I have successfully used stored procs as data sources, at the insistence of DBAs, but this is not a recommended design practice.

Parameterized filters

Queries for large tables should be designed with filters so you can easily control the records imported – both at design time and when you need to populate the data model with a full set of records. Even if you don’t need to manage millions of records or load a table incrementally, this is just a good idea and will save a lot of redesign later on.

I use parameters to manage anything that might change, like folder paths and server names.

The example shown here conforms to the requirements needed to implement the incremental refresh feature if Power BI:

The RangeStart and RangeEnd parameters are then used in a Filter Rows step in the query to limit the imported records within the selected date range.

Incremental refresh

One of the perks of using a date range filter is that it enables incremental refresh. This is one of the coolest features to come along in a long while and now it works with both Premium and share capacity licensing. If queries support query folding and use a set of date range filters to limit the records into a table in the data model, Power BI will automatically generate table partitions. For example, if I were to partition a table by month, each partition includes records for only one month and data in each monthly partition is is only loaded when new records are available or when data has changed.

I design my fact tables to support incremental refresh as a standard design pattern whether we have a requirement to implement it or not. This gives me the flexibility to only load a small set of data I need for development. Publishing a smaller PBIX file is faster and then more data can be loaded after the dataset has been published to the service.

If you don’t need it, throw it away

The presence of every single column in your data model comes at a cost. It uses memory, storage space and can potentially impact performance. If you aren’t absolutely certain that you need a column, just omit it from the table in Power Query. If you or you users change their minds, columns can easily be added.

Before you ask, this is not a picture of my garage but according to my wife, it could be. Keeping all the columns in your data that you “might need someday” is a lot like all that crap I’ve been collecting in my garage and on my workbench that I tell myself might be useful some day.

Error on the side of omitting columns that you don’t need to meet immediate reporting requirements. The difference between the cluttered garage analogy and reality of Power BI design is that you immediately benefit by getting rid of unneeded data right now and it costs you little or nothing to go back and add one of those columns later, after your business stakeholders decide they need it.

The easiest way to manage this is to use the Choose Columns button on the Home ribbon, and then just deselect the columns you don’t need. In the future, use the gear icon next to the generated step to make changes to the column selection.

Column naming

Please read this aloud and commit it to memory: Rename all columns that will be visible in the data model using concise, self-describing, meaningful and user-friendly names. I can’t emphasize how important this is. Not only is this good practice in BI solutions, but there is a psychology to the way business users perceive what they see in reports. When they see cryptic labels and field names, they become uncomfortable and may distrust the information.

Your data model should be designed for users and not for developers to design reports and consume. Even if technical professionals are designing reports, field names are going to show up as labels and titles. Consider the following examples:

Friendly Column Name	Technical Name
Customer Number Customer Nbr Customer #	customerNumber CustomerNbr CUST_NBR
Product Brand Name	productBrandName ProdBrandNm PRD_BRAND_NM

Technical naming conventions are deeply engrained into the culture of application and database design and for good reason. For developers and database professionals, certain naming patterns like Camel Case (camelCase) and capitalized word abbreviations separated by underscore characters are almost a right of passage; but object names in this format should not exist in a data model for analytic reporting. Spaces in column and measure names are welcome, as are punctuation and other special characters like #, $, %, etc.

Table and fields should be named so they are familiar and easy to understand. When you or a user designing a self-service report drags and drops a field or measure into a report visual, these names are used as labels and titles that should also be meaningful and self-explanatory.

Field names should also completely describe the object. Consider that if the Product and Customer tables both have “Name” columns, if you were to drag the Name field from the Customer table into one visual and the Name field from the Product table into another visual, each would only be labeled “Name”. It would be more appropriate to name these fields “Customer Name” and “Product Name” in their respective tables.

Naming & annotating steps

Rename the steps in Power Query so that they are self-describing (like the following example). Your future self will thank you when you revisit the project in six months. This is done by right-clicking the step and choosing Rename from the menu.

A new feature in Power Query displays a tooltip with annotations you add to the M script or the text you enter in the Description property of the step.

The names of each step are applied as variable names in the Power Query “M” language script. Variables that contain spaces or special characters are proceeded with # and are encapsulated with double quotes. The comments in the code I added with // prefixes result in the tooltip annotations I mentioned in the previous paragraph.

A recent suggestion from one of the blog readers is to use underscores rather than spaces in these step names. This allows you to double-click the variable name and copy it to the clipboard.

Consolidating steps

Building queries using linear steps helps break-down complicated logic but it can become inefficient as the project evolves. In the course of query design, you are likely to repeat steps like those in the second example below where I removed three different columns after using them in other steps. I also changed the data type each time I created a custom or duplicated column.

If you select an existing step and then perform the same action on a different column or columns, Power Query will add the new transformation to the selected step. In the query on the left, I consolidated all these steps, removing columns and changing data types only once.

The optimized query on the left runs in about one-third the time is the original query on the right.

Power Query Best Practice Guidelines

Here is a summary of best practice recommendations:

Set EVERY column to the right data type. Watch for untyped columns with “ABC123” in the column heading.
Rename all columns that won’t be hidden in data model (keys and sort columns are OK).
Consolidate redundant steps.
Rename steps to document transformations.
Perform row-level calculations in Power Query instead of DAX calculated columns.
Use database views & avoid SQL statements so Power Query can implement query folding.
Use parameters & range filters to reduce working set for development, reduce PBIX file size, speed up desktop refresh and reduce time to publish dataset.
Use date range parameters in service to expand dataset size during data refresh & optionally implement incremental refresh.