Doing Power BI the Right Way: 2. Preparing, shaping & transforming source data

Part of the series: Doing Power BI the Right Way

In a business intelligence solution, data must be shaped and transformed. Your data is rarely, if ever, going to be in the right format for analytic reporting. It may need to be consolidated into related fact and dimension tables, summarized, grouped or just cleaned-up before tables can be imported into a data model for reporting. If source data is in a relational database, chances are that it is optimized for transactional processing or application development. The column names, data types and formats are likely neither suitable nor user-friendly for simple reporting. If source data exists in files or spreadsheets extracted from business reports, without data preparation you are likely just to reproduce the existing reports without adding business value.

This important topic raises one of the most critical questions to consider when designing a Power BI project:

Where should I shape and transform data?

Data can be shaped and transformed in at least three different layers in the business intelligence process:

  1. At the source
  2. In Power Query
  3. In the BI data model

Your options for transforming data at the source depend on the capabilities of the data source. In general, a query running closer to a data source will be more efficient than one that must transfer a set of data, cache and manipulate it downstream. If you are importing text files or Excel files, then your only choice is to import an entire file and then potentially throw away what you don’t need – which is fine for small-ish data sets.

The right side of the previous diagram depicts data transformed in the data model. Some data morphing can be performed in DAX using calculated columns or calculated tables. In rare cases, there may be good reasons to do this but if you can perform row-level calculations in a database view or in Power Query, you will be better off. It is also easier to maintain the solution if you can keep all that query logic in the same layer so you are only using one design tool.

Data shaping at the Source

If we break the first choice down a little further, we could prepare data in an Extract, Transform & Load (ETL) process and populate a new table in the source database with correctly-shaped and formatted data for BI reporting. This is the approach used in traditional data warehouse solutions, typically under the control of an IT-managed data warehouse development group.

If requirements are well defined and unlikely to change, this might be the ideal approach. However, data warehouse projects can get stuck in long cycles of over-planning and requirement gathering, which has led many organizations to think about BI as an extension of business rather than IT processes. Most organizations have come to grips with the reality that the vast majority of BI projects are conducted after databases and data warehouses have been completed.

Where and what is Power Query?

Power Query is the data transformation technology built into Power BI. It is an extremely flexible and extensible data shaping platform that evolved from tools like SSIS, DTS, Excel, Microsoft Query and Access. When you choose Get Data or Transform Data from Power BI Desktop, this opens the Power Query editor. The Power BI service includes an online implementation of Power Query called Dataflows, which allows queries and entities to be shared between analysts creating different data models. Power Query is also now integrated with Excel (when you select Get Data), the Visual Studio SSAS Tabular model designer and Azure Data Factory.

Learn to leverage Power Query to perform all row-level transformations and calculations efficiently. In general, column naming and data type changes should be performed in Power Query rather than in embedded SQL or DAX.

Understand Query Folding

This is a critical concept to understand about Power Query. For supported data sources and whenever possible, queries you create in the Power BI query editor will be converted to native queries to be executed at the source. This means that if you are using SQL Server, rather than transferring all the data from the database engine to the Power BI service before performing transformations, steps may be converted to an optimized T-SQL query. This is an ideal scenario for high-volume fact tables.

Query folding only works when Power Query can select records from a single database object like a database table, view or table-valued user-defined function.
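For example, a query built against a single SQL Server table with filter and column-selection steps can fold into one native query. This is a minimal sketch; the server, database, table and column names are hypothetical, and the OrderDate column is assumed to be a datetime:

```m
let
    // Hypothetical SQL Server source
    Source = Sql.Database("MyServer", "MyDatabase"),
    FactSales = Source{[Schema="dbo", Item="FactSales"]}[Data],
    // Both of these steps can fold into a single T-SQL SELECT with a WHERE clause
    FilteredRows = Table.SelectRows(FactSales, each [OrderDate] >= #datetime(2020, 1, 1, 0, 0, 0)),
    SelectedColumns = Table.SelectColumns(FilteredRows, {"OrderDate", "CustomerKey", "SalesAmount"})
in
    SelectedColumns
```

You can confirm whether a step folds by right-clicking it in the query editor and checking whether View Native Query is enabled.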

Hard pills to swallow

For relational database professionals, there are a few realities about Power BI that can be a shock to their pattern for solving data problems. Without getting into all the details about why things are the way they are, it is very important to understand and embrace these points:

SQL queries do not fold – If you hand-write a SQL query or copy and paste a SQL script into a new query within Power BI, Power Query cannot fold – and therefore cannot optimize – that query. This doesn’t mean that it is wrong or necessarily a bad practice. If you need to populate a relatively small table in the data model that doesn’t need to be filtered, grouped, sorted or otherwise transformed in ways that could otherwise be optimized, this is probably fine.

Selectable database objects can be folded – If you can use a single database table or create a view in the database, particularly for high-volume fact tables, these queries will be faster and will consume fewer server resources.

Stored procedures – Power Query doesn’t include specific optimizations or support for stored procedures. I have successfully used stored procs as data sources, at the insistence of DBAs, but this is not a recommended design practice.
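To illustrate the first point, here is a sketch (hypothetical server, database and table names) of a hand-written SQL statement passed to the SQL Server connector. Because the statement is opaque to Power Query, any steps you add after it run in the mashup engine rather than folding back to the server:

```m
let
    // The Query option sends the statement as-is; Power Query cannot fold
    // subsequent steps (filters, column removal, etc.) into this query
    Source = Sql.Database("MyServer", "MyDatabase",
        [Query = "SELECT ProductKey, ProductName FROM dbo.DimProduct"])
in
    Source
```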

Parameterized filters

Queries for large tables should be designed with filters so you can easily control the records imported – both at design time and when you need to populate the data model with a full set of records. Even if you don’t need to manage millions of records or load a table incrementally, this is just a good idea and will save a lot of redesign later on.

I use parameters to manage anything that might change, like folder paths and server names.

The example shown here conforms to the requirements needed to implement the incremental refresh feature of Power BI:

The RangeStart and RangeEnd parameters are then used in a Filter Rows step in the query to limit the imported records within the selected date range.
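As a sketch, assuming RangeStart and RangeEnd have already been defined as DateTime parameters (those exact names are required by the incremental refresh feature), and assuming a hypothetical FactSales table with a datetime OrderDate column, the filter step looks like this:

```m
let
    Source = Sql.Database("MyServer", "MyDatabase"),
    FactSales = Source{[Schema="dbo", Item="FactSales"]}[Data],
    // Use >= on one boundary and < on the other so no record
    // can fall into two adjacent partitions
    FilteredRows = Table.SelectRows(FactSales,
        each [OrderDate] >= RangeStart and [OrderDate] < RangeEnd)
in
    FilteredRows
```

At design time, set the parameter values to a narrow date range so only a small slice of records is imported into the Desktop file.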

Incremental refresh

One of the perks of using a date range filter is that it enables incremental refresh. This is one of the coolest features to come along in a long while, and it now works with both Premium and shared capacity licensing. If queries support query folding and use a set of date range filters to limit the records imported into a table in the data model, Power BI will automatically generate table partitions. For example, if I were to partition a table by month, each partition would include records for only one month, and data in each monthly partition is only loaded when new records are available or when data has changed.

I design my fact tables to support incremental refresh as a standard design pattern whether we have a requirement to implement it or not. This gives me the flexibility to only load a small set of data I need for development. Publishing a smaller PBIX file is faster and then more data can be loaded after the dataset has been published to the service.

If you don’t need it, throw it away

The presence of every single column in your data model comes at a cost. It uses memory and storage space, and can potentially impact performance. If you aren’t absolutely certain that you need a column, just omit it from the table in Power Query. If you or your users change your minds, columns can easily be added later.

Before you ask, this is not a picture of my garage but according to my wife, it could be. Keeping all the columns in your data that you “might need someday” is a lot like all that crap I’ve been collecting in my garage and on my workbench that I tell myself might be useful some day.

Err on the side of omitting columns that you don’t need to meet immediate reporting requirements. The difference between the cluttered garage analogy and the reality of Power BI design is that you immediately benefit by getting rid of unneeded data right now, and it costs you little or nothing to go back and add one of those columns later, after your business stakeholders decide they need it.

The easiest way to manage this is to use the Choose Columns button on the Home ribbon, and then just deselect the columns you don’t need. In the future, use the gear icon next to the generated step to make changes to the column selection.
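Behind the scenes, Choose Columns generates a single Table.SelectColumns step. A minimal sketch with hypothetical source and column names:

```m
let
    Source = Sql.Database("MyServer", "MyDatabase"){[Schema="dbo", Item="FactSales"]}[Data],
    // Generated when you deselect unneeded columns with Choose Columns;
    // revise the list later via the gear icon next to this step
    #"Removed Other Columns" = Table.SelectColumns(Source,
        {"OrderDate", "CustomerKey", "ProductKey", "SalesAmount"})
in
    #"Removed Other Columns"
```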

Column naming

Please read this aloud and commit it to memory: rename all columns that will be visible in the data model using concise, self-describing, meaningful and user-friendly names. I can’t emphasize enough how important this is. Not only is this good practice in BI solutions, but there is a psychology to the way business users perceive what they see in reports. When they see cryptic labels and field names, they become uncomfortable and may distrust the information.

Your data model should be designed for the users who will design reports and consume them, not for developers. Even if technical professionals are designing reports, field names are going to show up as labels and titles. Consider the following examples:

Friendly Column Name     | Technical Name
Customer Number          | Customer Nbr, Customer #
Product Brand Name       | productBrandName

Technical naming conventions are deeply ingrained in the culture of application and database design, and for good reason. For developers and database professionals, certain naming patterns like Camel Case (camelCase) and capitalized word abbreviations separated by underscore characters are almost a rite of passage; but object names in this format should not exist in a data model for analytic reporting. Spaces in column and measure names are welcome, as are punctuation and other special characters like #, $, %, etc.

Tables and fields should be named so they are familiar and easy to understand. When you or a user designing a self-service report drags and drops a field or measure into a report visual, these names are used as labels and titles, which should also be meaningful and self-explanatory.

Field names should also completely describe the object. If the Product and Customer tables both have a “Name” column, and you drag the Name field from the Customer table into one visual and the Name field from the Product table into another, each would be labeled only “Name”. It would be more appropriate to name these fields “Customer Name” and “Product Name” in their respective tables.

Naming & annotating steps

Rename the steps in Power Query so that they are self-describing (like the following example). Your future self will thank you when you revisit the project in six months. This is done by right-clicking the step and choosing Rename from the menu.

A new feature in Power Query displays a tooltip with annotations you add to the M script or the text you enter in the Description property of the step.

The names of each step are applied as variable names in the Power Query “M” language script. Variables that contain spaces or special characters are preceded by # and enclosed in double quotes. The comments I added to the code with // prefixes result in the tooltip annotations I mentioned in the previous paragraph.

A recent suggestion from one of the blog readers is to use underscores rather than spaces in these step names. This allows you to double-click the variable name and copy it to the clipboard.
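Putting these points together, a sketch of an M script (all source and column names are hypothetical) shows both naming conventions side by side:

```m
let
    // Hypothetical SQL Server source
    Source = Sql.Database("MyServer", "MyDatabase"),
    FactSales = Source{[Schema="dbo", Item="FactSales"]}[Data],
    // A step name containing spaces becomes a #-prefixed, double-quoted variable
    #"Filtered to Recent Orders" = Table.SelectRows(FactSales,
        each [OrderDate] >= #datetime(2020, 1, 1, 0, 0, 0)),
    // An underscored name needs no # or quotes, so it can be
    // double-clicked, copied and pasted into the next step
    Removed_Working_Columns = Table.RemoveColumns(#"Filtered to Recent Orders",
        {"RowGuid", "ModifiedDate"})
in
    Removed_Working_Columns
```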

Consolidating steps

Building queries using linear steps helps break down complicated logic, but it can become inefficient as the project evolves. In the course of query design, you are likely to repeat steps like those in the second example below, where I removed three different columns after using them in other steps. I also changed the data type each time I created a custom or duplicated column.

If you select an existing step and then perform the same action on a different column or columns, Power Query will add the new transformation to the selected step. In the query on the left, I consolidated all these steps, removing columns and changing data types only once.

The optimized query on the left runs in about one-third the time of the original query on the right.
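A consolidated query might look like this sketch (source and column names are hypothetical): all the working columns are removed in one step, and every data type is set in another, rather than one step per column:

```m
let
    Source = Sql.Database("MyServer", "MyDatabase"),
    FactSales = Source{[Schema="dbo", Item="FactSales"]}[Data],
    // One step removes all three working columns instead of three separate steps
    Removed_Working_Columns = Table.RemoveColumns(FactSales,
        {"RowGuid", "ModifiedDate", "ETLBatchID"}),
    // One step sets every column's data type instead of one step per column
    Changed_Types = Table.TransformColumnTypes(Removed_Working_Columns,
        {{"OrderDate", type date}, {"SalesAmount", type number}, {"OrderQty", Int64.Type}})
in
    Changed_Types
```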

Power Query Best Practice Guidelines

Here is a summary of best practice recommendations:

  • Set EVERY column to the right data type. Watch for untyped columns with “ABC123” in the column heading.
  • Rename all columns that won’t be hidden in the data model (keys and sort columns are OK).
  • Consolidate redundant steps.
  • Rename steps to document transformations.
  • Perform row-level calculations in Power Query instead of DAX calculated columns.
  • Use database views & avoid SQL statements so Power Query can implement query folding.
  • Use parameters & range filters to reduce working set for development, reduce PBIX file size, speed up desktop refresh and reduce time to publish dataset.
  • Use date range parameters in service to expand dataset size during data refresh & optionally implement incremental refresh.

19 thoughts on “Doing Power BI the Right Way: 2. Preparing, shaping & transforming source data”

  1. Hey Paul, great article. Can’t say I agree with PQ being the right place to do most of your data manipulation and cleansing though. Really still prefer an underlying cleaned up data model. Makes it far easier when you want to connect other reporting tools, Excel, etc. etc. and it’s easier to cleanse data there anyway, as apart from simple logic, doing so will often require other data from a DB.

    • Hi Greg, I couldn’t agree more. My point, which I may not have made effectively, was that many BI professionals don’t have rights or the opportunity to build a proper ETL process, data mart or data warehouse. In a proper, formal IT-driven BI solution, I would advocate for loading fact and dimension tables with cleansed data, properly shaped for reporting. Secondly, I would want to use views containing only the necessary rows and columns. But for non-IT analysts who don’t have the opportunity to work with data before loading or who are working with non-relational sources, Power Query might be the only transformation tool they can use.

  2. Great article Paul! Already looking forward to the next one. Regarding selecting columns in Query Editor. Would you prefer to do it in the query editor instead of modifying a database view to include/exclude whatever columns you need? I am thinking that if a db view returns all columns and you want to exclude many of them, that will lead to bad performance in the query editor, as all data needs to be loaded first and then removed, and it would be more efficient to change the db view instead.

    Also. I’d like your opinion on when to use dataflows as the data source instead of connecting to e.g. database tables/views.

  3. Hi,
    Thanks for good article! Do you have any recommendations on where I can find instruction on how to get started with using a database instead of using Excel sheets/Queries as data source for PowerBI?

    • If you mean you are using Power Query in Excel to first download data, then import the data into Power BI, you CAN bypass Excel. Power BI also has Power Query, so you could connect directly to the DB using Power Query in Power BI

      • Hi Ron, I’m not sure where this impression would come from. I’m speaking of only using Power Query within Power BI. I wouldn’t advocate using Excel to extract and transform data unless the goal were to only consume the results in the Excel workbook.

  4. Fully agree with all of that Paul, good stuff

    Under the Properties of a step I also encourage people to document the “Why I did this….” that’s often the part that is left out and is the hardest to work out…

  5. Excellent article, Paul. I am one of those people who still keep using technical ColNames instead of Column Names with spaces. I am also learning to use Query Parameters, but here you’re using CamelCase for the ParameterNames. Could I be using spaced names for these parameter names, or is there an advantage for having them in CamelCase?

  6. Paul, what have you been doing in my garage?
    Great article, bizarrely I have just been through an exercise of reviewing a Power BI architecture at a new contract and this has been helpful, having picked up a few extras. Look forward to your next posts

  7. Pingback: Doing Power BI the Right Way | Paul Turley's SQL Server BI Blog

  8. All fantastic advice. I would go further with your step naming point. I always use underscores instead of spaces to name steps – if a user ever gets to manipulate the code directly in the Advanced Editor, or an external editor, then the ability to double-click a spaceless step name to find and/or replace other instances, and the removal of the unnecessary ‘fog’ of three extra characters (a # and two sets of quotes) per step is very useful.

    • Lee, that is an excellent suggestion. I am constantly struggling with highlighting the M variable names to copy and paste them into the next step, and often miss the “#” at the beginning.

  9. Pingback: Preparing Source Data in Power BI – Curated SQL

  10. I agree, renaming and annotating query steps is a good idea for long-term maintenance. Rename is easy enough to figure out: right-click on the step.
    In each of the sections, if you don’t say how to, could you please make it a point to include a link to “how to” instructions, e.g. a link to an article about annotating/commenting steps (simple to do: right-click > select “Properties” for both rename and comment).
