Doing Power BI the Right Way: 2. Preparing, shaping & transforming source data

Part of the series:  Doing Power BI the Right Way (link)

In a business intelligence solution, data must be shaped and transformed. Your data is rarely, if ever, going to be in the right format for analytic reporting. It may need to be consolidated into related fact and dimension tables, summarized, grouped or just cleaned-up before tables can be imported into a data model for reporting. If source data is in a relational database, chances are that it is optimized for transactional processing or application development. The column names, data types and formats are likely neither suitable nor user-friendly for simple reporting. If source data exists in files or spreadsheets extracted from business reports, without data preparation you are likely to simply reproduce the existing reports without adding business value.

This important topic raises one of the most critical questions to consider when designing a Power BI project:

Where should I shape and transform data?

Data can be shaped and transformed in at least three different layers in the business intelligence process:

  1. At the source
  2. In Power Query
  3. In the BI data model

Your options for transforming data at the source depend on the capabilities of the data source. In general, a query running closer to a data source will be more efficient than one that must transfer a set of data, cache and manipulate it downstream. If you are importing text files or Excel files, then your only choice is to import an entire file and then potentially throw away what you don’t need – which is fine for small-ish data sets.

At the other end of the spectrum, data can be transformed within the data model itself. Some data morphing can be performed in DAX using calculated columns or calculated tables. In rare cases, there may be good reasons to do this, but if you can perform row-level calculations in a database view or in Power Query, you will be better off. It is also easier to maintain the solution if you can keep all that query logic in the same layer so you are only using one design tool.
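To illustrate the point, here is a minimal Power Query (“M”) sketch – with hypothetical server, view and column names – of a row-level calculation handled in the query instead of in a DAX calculated column:

```
let
    // Hypothetical SQL Server source and view names
    Source = Sql.Database("MyServer", "MyDatabase"),
    Sales_Detail = Source{[Schema = "Sales", Item = "vSalesOrderDetail"]}[Data],

    // Row-level calculation performed in Power Query rather than as a DAX calculated column
    Added_Extended_Amount = Table.AddColumn(
        Sales_Detail,
        "Extended Amount",
        each [UnitPrice] * [OrderQty],
        Currency.Type
    )
in
    Added_Extended_Amount
```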

Data Shaping at the Source

If we break the first choice down a little further, we could prepare data in an Extract, Transform & Load (ETL) process and populate a new table in the source database with correctly-shaped and formatted data for BI reporting. This is the approach used in traditional data warehouse solutions, typically under the control of an IT-managed data warehouse development group.

Data Warehouse, Data Mart or Data Lake

Back in the day, the conventional approach was to move data from sources into a staging database, and then to move it again into a dimensional database that was structured to support analytic reporting. I say this in the past tense because that was the way we always did it 10-20 years ago. Now, in the age of self-service analytics, it has become more common to build reporting projects on-the-fly and to transform data without moving all of it from one location to another before it arrives in the BI data model. However, there are limits to just how much data we can transform and how much complex transformation can be done in an ad hoc fashion. The water here is muddy and it is difficult to say that one approach is always preferable to another, but one thing is very clear: a data warehouse is necessary in cases where data governance, high volume and centralized management are required.

If requirements are well defined and unlikely to change, building a data warehouse or dimensional data mart might be the ideal approach. However, data warehouse projects can get stuck in long cycles of over-planning and requirement gathering, which has led many organizations to think about BI as an extension of business rather than IT processes. Most organizations have come to grips with the reality that the vast majority of BI projects are conducted after databases and data warehouses have been completed. Over decades, we have seen hundreds of companies chasing the Utopian dream of the “perfect data warehouse” along with Master Data Management, data cataloging, cleansing, governance and stewardship. Each of these goals is achievable, but each is a potential rabbit hole. Experienced data practitioners have learned to apply their experience to time-boxed and resource-constrained projects with the goal of creating “good enough” solutions that don’t completely drain IT budgets and resources.

The data lake has become a viable option for staging and storing data before moving it into a structured model because relational storage can be expensive. Data lakes provide a low-cost means to store a lot of data in the cloud. We usually think of the “modern data warehouse” platform as consisting of a data lake plus a dedicated data warehouse database. Flavors of the lake/warehouse approach could include using a lake AS a warehouse, using a standard relational database to store a small- to medium-scale warehouse, or using a dedicated parallel-processing platform (e.g. Synapse, Snowflake or Cosmos DB) to store a large-scale data warehouse.

Power BI and the Data Warehouse

Bringing the discussion back to focus on Power BI queries, data models and reports, let me put a bottom line on this multifaceted topic. I believe that there are three data transformation scenarios related to Power BI design:

  • Business-Centric, Informal and Small Model/Report Solutions
    Power BI makes it easier to transform data as you go. In small projects, using this approach saves time and money, allowing business professionals to meet their own reporting needs. End of story… win-win.
  • Interim POC Solutions
    Without a formal data warehouse, Power BI can be used to munge together data from multiple sources and to discover the requirements that a data warehouse solution would provide. Use Power BI to build a prototype or proof-of-concept solution that can eventually evolve into a formal solution. Accept the follies and trade-offs of ad hoc data models and use them to evolve toward the right solution through iterative design.
    Use Power Query to perform all necessary transformations and cleansing operations.
  • Formal Data Model & Report Solutions
    Once a properly governed and secured data warehouse or data mart exists, rebuild the Power BI data models and reports. The “rebuild” effort could encompass anything from switching data sources in an existing report project to starting from scratch. Either way, you save time by applying earlier learnings and design elements. If additional query logic is needed, use database views to promote query folding in Power Query.

The simple fact is that making such a transition is never “done”. It is a journey that requires iterative thinking and management. Accepting and embracing this reality will save everyone a lot of angst. With data cleansing and transformation applied to the ingested data warehouse data using tools like Azure Data Factory (which may utilize Power Query or SSIS), SSIS or Databricks, the data sources for the Power BI project may simply be tables or views.

Where and what is Power Query?

Power Query is the data transformation technology built into Power BI. It is an extremely flexible and extensible data shaping platform that evolved from tools like SSIS, DTS, Excel, Microsoft Query and Access. When you choose Get Data or Transform Data in Power BI Desktop, this opens the Power Query editor. The Power BI service includes an online implementation of Power Query called Dataflows, which allows queries and entities to be shared between analysts creating different data models. Power Query is also now integrated with Excel (when you select Get Data), the Visual Studio SSAS Tabular model designer and Azure Data Factory.

Learn to leverage Power Query to perform all row-level transformations and calculations efficiently. In general, column naming and data type changes should be performed in Power Query rather than in embedded SQL or DAX.

Understand Query Folding

This is a critical concept to understand about Power Query. For supported data sources and whenever possible, queries you create in the Power BI query editor will be converted to native queries to be executed at the source. This means that if you are using SQL Server, rather than transferring all the data from the database engine to Power BI before performing transformations, the steps may be converted to a single optimized T-SQL query. This is an ideal scenario for high-volume fact tables.

Query folding only works when Power Query can select records from a single database object like a database table, view or table-valued user-defined function.
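To make this concrete, here is a minimal sketch (hypothetical server, database and view names) of a query built from a single selectable object; the filter and column-selection steps can fold back to SQL Server as one native T-SQL statement, which you can verify by right-clicking a step and choosing View Native Query:

```
let
    // Hypothetical database view used as a single selectable source object
    Source = Sql.Database("MyServer", "MyDatabase"),
    Internet_Sales = Source{[Schema = "dbo", Item = "vFactInternetSales"]}[Data],

    // These steps can fold to the source as one native T-SQL query
    Filtered_Rows = Table.SelectRows(Internet_Sales, each [OrderDate] >= #date(2020, 1, 1)),
    Selected_Columns = Table.SelectColumns(
        Filtered_Rows,
        {"OrderDate", "ProductKey", "CustomerKey", "SalesAmount"}
    )
in
    Selected_Columns
```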

Hard pills to swallow

For relational database professionals, there are a few realities about Power BI that can be a shock to their pattern for solving data problems. Without getting into all the details about why things are the way they are, it is very important to understand and embrace these points:

SQL queries do not fold – If you hand-write a SQL query or copy and paste SQL script into a new query within Power BI, Power Query cannot fold – and therefore cannot optimize – that query. This doesn’t mean that this approach is wrong or necessarily a bad practice. If you need to populate a relatively small table in the data model that doesn’t need to be filtered, grouped, sorted or otherwise transformed in ways that could be optimized at the source, this is probably fine.

Selectable database objects can be folded – If you can use a single database table or create a view in the database, particularly for high-volume fact tables, these queries will be faster and will consume fewer server resources.

Stored procedures – Power Query doesn’t include specific optimizations or support for stored procedures. I have successfully used stored procs as data sources, at the insistence of DBAs, but this is not a recommended design practice.
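To contrast the patterns above, here is a hedged sketch using hypothetical object names: the first source passes a hand-written SQL statement to the connector, so downstream steps cannot fold; the second references a view, so they can:

```
let
    // Pattern 1: hand-written SQL passed to the connector.
    // Power Query treats the result as opaque, so later steps do not fold.
    Native_SQL_Source = Sql.Database(
        "MyServer", "MyDatabase",
        [Query = "SELECT OrderDate, ProductKey, SalesAmount FROM dbo.FactInternetSales"]
    ),

    // Pattern 2: a single selectable object (a view), which supports folding
    // for the filter, column-selection and other steps that follow.
    DB = Sql.Database("MyServer", "MyDatabase"),
    View_Source = DB{[Schema = "dbo", Item = "vFactInternetSales"]}[Data]
in
    View_Source
```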

Parameterized filters

Queries for large tables should be designed with filters so you can easily control the records imported – both at design time and when you need to populate the data model with a full set of records. Even if you don’t need to manage millions of records or load a table incrementally, this is just a good idea and will save a lot of redesign later on.

I use parameters to manage anything that might change, like folder paths and server names.

The following example conforms to the requirements needed to implement the incremental refresh feature of Power BI:

The RangeStart and RangeEnd parameters are then used in a Filter Rows step in the query to limit the imported records within the selected date range.
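Here is a minimal sketch of that filter step, assuming RangeStart and RangeEnd have been created as Date/Time parameters through Manage Parameters, and using placeholder table and column names. Note the >= / < pattern, which keeps boundary records from being loaded into two partitions:

```
let
    Source = Sql.Database("MyServer", "MyDatabase"),
    Online_Sales = Source{[Schema = "dbo", Item = "FactOnlineSales"]}[Data],

    // Filter Rows step referencing the RangeStart and RangeEnd parameters
    Filtered_By_Date_Range = Table.SelectRows(
        Online_Sales,
        each [OrderDateTime] >= RangeStart and [OrderDateTime] < RangeEnd
    )
in
    Filtered_By_Date_Range
```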

Incremental refresh

One of the perks of using a date range filter is that it enables incremental refresh. This is one of the coolest features to come along in a long while, and it now works with both Premium and shared capacity licensing. If queries support query folding and use a set of date range filters to limit the records loaded into a table in the data model, Power BI will automatically generate table partitions. For example, if I were to partition a table by month, each partition would include records for only one month, and data in each monthly partition is only loaded when new records are available or when data has changed.

I design my fact tables to support incremental refresh as a standard design pattern whether we have a requirement to implement it or not. This gives me the flexibility to only load a small set of data I need for development. Publishing a smaller PBIX file is faster and then more data can be loaded after the dataset has been published to the service.

If you don’t need it, throw it away

The presence of every single column in your data model comes at a cost. It uses memory and storage space, and can potentially impact performance. If you aren’t absolutely certain that you need a column, just omit it from the table in Power Query. If you or your users change your minds, columns can easily be added later.

Before you ask, this is not a picture of my garage but according to my wife, it could be. Keeping all the columns in your data that you “might need someday” is a lot like all that crap I’ve been collecting in my garage and on my workbench that I tell myself might be useful some day.

Err on the side of omitting columns that you don’t need to meet immediate reporting requirements. The difference between the cluttered garage analogy and the reality of Power BI design is that you benefit immediately by getting rid of unneeded data right now, and it costs you little or nothing to go back and add one of those columns later, after your business stakeholders decide they need it.

The easiest way to manage this is to use the Choose Columns button on the Home ribbon, and then just deselect the columns you don’t need. In the future, use the gear icon next to the generated step to make changes to the column selection.
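For reference, the Choose Columns action generates a Table.SelectColumns step along these lines (table and column names are hypothetical):

```
let
    Source = Sql.Database("MyServer", "MyDatabase"),
    Customer = Source{[Schema = "dbo", Item = "DimCustomer"]}[Data],

    // Keep only the columns needed for current reporting requirements
    Removed_Other_Columns = Table.SelectColumns(
        Customer,
        {"CustomerKey", "FirstName", "LastName", "EmailAddress", "City"}
    )
in
    Removed_Other_Columns
```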

Column naming

Please read this aloud and commit it to memory: Rename all columns that will be visible in the data model using concise, self-describing, meaningful and user-friendly names. I can’t emphasize enough how important this is. Not only is this good practice in BI solutions, but there is a psychology to the way business users perceive what they see in reports. When they see cryptic labels and field names, they become uncomfortable and may distrust the information.

Your data model should be designed for the business users who will design reports and consume them, not for developers. Even if technical professionals are designing the reports, field names are going to show up as labels and titles. Consider the following examples:

Friendly Column Name                      | Technical Name
Customer Number, Customer Nbr, Customer # | customerNumber, CustomerNbr, CUST_NBR
Product Brand Name                        | productBrandName, ProdBrandNm, PRD_BRAND_NM

Technical naming conventions are deeply engrained in the culture of application and database design, and for good reason. For developers and database professionals, certain naming patterns like camel case (camelCase) and capitalized word abbreviations separated by underscore characters are almost a rite of passage; but object names in this format should not exist in a data model for analytic reporting. Spaces in column and measure names are welcome, as are punctuation and other special characters like #, $, %, etc.

Tables and fields should be named so they are familiar and easy to understand. When you or a user designing a self-service report drags and drops a field or measure into a report visual, these names are used as labels and titles, so they should be meaningful and self-explanatory.

Field names should also completely describe the object. Consider a case where the Product and Customer tables both have “Name” columns: if you were to drag the Name field from the Customer table into one visual and the Name field from the Product table into another visual, each would only be labeled “Name”. It would be more appropriate to name these fields “Customer Name” and “Product Name” in their respective tables.
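In Power Query, a single Table.RenameColumns step takes care of this; a brief sketch with hypothetical technical names:

```
let
    Source = Sql.Database("MyServer", "MyDatabase"),
    Product = Source{[Schema = "dbo", Item = "DimProduct"]}[Data],

    // Replace technical names with concise, user-friendly names that will
    // appear as labels and titles in report visuals
    Renamed_Columns = Table.RenameColumns(
        Product,
        {
            {"PRD_NM", "Product Name"},
            {"PRD_BRAND_NM", "Product Brand Name"},
            {"LIST_PRC", "List Price"}
        }
    )
in
    Renamed_Columns
```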

Naming & annotating steps

Rename the steps in Power Query so that they are self-describing (like the example at the end of this section). Your future self will thank you when you revisit the project in six months. This is done by right-clicking the step and choosing Rename from the menu.

A new feature in Power Query displays a tooltip with annotations you add to the M script or the text you enter in the Description property of the step.

The names of each step are applied as variable names in the Power Query “M” language script. Variables that contain spaces or special characters are preceded with # and enclosed in double quotes. Comments added to the code with // prefixes result in the tooltip annotations mentioned in the previous paragraph.

A recent suggestion from one of the blog readers is to use underscores rather than spaces in these step names. This allows you to double-click the variable name and copy it to the clipboard.
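Here is a short sketch (with hypothetical names) showing renamed, underscore-style steps and // comments that surface as step tooltips:

```
let
    // Connect to the source database (server and database names are placeholders)
    Source = Sql.Database("MyServer", "MyDatabase"),

    // Navigate to the sales view; underscores avoid the #"..." wrapping
    // required for step names that contain spaces
    Navigate_To_Sales_View = Source{[Schema = "dbo", Item = "vFactSales"]}[Data],

    // Keep only recent rows while developing
    Filter_Recent_Orders = Table.SelectRows(
        Navigate_To_Sales_View,
        each [OrderDate] >= #date(2020, 1, 1)
    )
in
    Filter_Recent_Orders
```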

Consolidating steps

Building queries as a series of linear steps helps break down complicated logic, but it can become inefficient as the project evolves. In the course of query design, you are likely to repeat steps – for example, removing three different columns in separate steps after using them in other steps, or changing the data type each time you create a custom or duplicated column.

If you select an existing step and then perform the same action on a different column or columns, Power Query adds the new transformation to the selected step rather than creating a new one. Using this technique, I consolidated all of those repeated steps, removing columns and changing data types only once.

The consolidated query ran in about one-third the time of the original.
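The consolidated pattern looks something like this sketch (hypothetical names): one Table.RemoveColumns step lists every column to drop, and one Table.TransformColumnTypes step sets all of the data types:

```
let
    Source = Sql.Database("MyServer", "MyDatabase"),
    Sales = Source{[Schema = "dbo", Item = "vFactSales"]}[Data],

    // Remove all unneeded columns in a single step
    Removed_Columns = Table.RemoveColumns(Sales, {"rowguid", "ModifiedDate", "CarrierTrackingNumber"}),

    // Set every data type in a single step
    Changed_Types = Table.TransformColumnTypes(
        Removed_Columns,
        {{"OrderDate", type date}, {"SalesAmount", Currency.Type}, {"OrderQuantity", Int64.Type}}
    )
in
    Changed_Types
```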

Power Query Best Practice Guidelines

Here is a summary of best practice recommendations:

  • Set EVERY column to the right data type. Watch for untyped columns with “ABC123” in the column heading.
  • Rename all columns that won’t be hidden in the data model (keys and sort columns are OK to leave as-is).
  • Consolidate redundant steps.
  • Rename steps to document transformations.
  • Perform row-level calculations in Power Query instead of DAX calculated columns.
  • Use database views & avoid SQL statements so Power Query can implement query folding.
  • Use parameters & range filters to reduce working set for development, reduce PBIX file size, speed up desktop refresh and reduce time to publish dataset.
  • Use date range parameters in service to expand dataset size during data refresh & optionally implement incremental refresh.

Paul Turley

Microsoft Data Platform MVP, Principal Consultant for 3Cloud Solutions, specializing in Business Intelligence, SQL Server solutions, Power BI, Analysis Services & Reporting Services.

34 thoughts on “Doing Power BI the Right Way: 2. Preparing, shaping & transforming source data”

  1. Hi Paul! Thank you for your very useful article! You wrote: “This gives me the flexibility to only load a small set of data I need for development. Publishing a smaller PBIX file is faster and then more data can be loaded after the dataset has been published to the service.” It seems I don’t quite understand this point. Can you clarify how exactly you manage to load only a small set and load the bulk of data after publishing the report?

    1. Set the date range parameters to a small window in Desktop and refresh the model to bring in a small set of data. Then, after publishing, change the date range parameters in the service, and then refresh again to import a larger set of data.

  2. As always, packed full of good advice AND great comments. I love Lee S comment where s/he advises to use underscores instead of spaces for ease of renaming/editing the step name. Great tidbit.

  3. Fantastic series Paul, thanks for sharing. I’d be interested in your and others’ views on how you respond to a client who just ‘wants it all’ in the warehouse. I understand why you’d want nicely designed star schemas for core KPIs and high-level metrics, but what about when you want to ask questions across datasets? e.g. Rental Income in one dataset, Property Repairs done in another dataset. Each has their own KPIs, but what if the user wants to ‘ask questions’ of the data across both schemas? Does it always involve starting with a new set of requirements? Some of my clients see Power BI as a window into the data that they can self-serve at will. Do you have a ‘standard’ approach to this kind of thing? Thanks again for a really informative series. Looking forward to the next post.

  4. Article explained a lot for me, but I am still struggling with a data modeling question. I have a data warehouse with the dimensional model that has everything I need and only has columns that I care about. I want to move this data mart to a Power BI data model. This data mart has 1 fact table of about 1 million rows and has 30 dimensions. What are the pros and cons of importing all 31 of the tables and creating relationships in PBI vs importing a single view that uses SQL to join the fact table to all the dimensions I need and flattening the data? Which would you recommend?

    Thanks.

      1. Thanks for the quick reply and the reference article and a connected article, both of which helped solidify your advice about using a star schema in the data model side.

  5. Great article! In your code consolidation step, I noticed that you removed the columns and filtered the rows late in the query. From a performance standpoint, I have found it to be better done at the beginning. What is the reason for doing it this way?

  6. Hey Paul, great article. Can’t say I agree with PQ being the right place to do most of your data manipulation and cleansing though. Really still prefer an underlying cleaned up data model. Makes it far easier when you want to connect other reporting tools, Excel, etc. etc. and it’s easier to cleanse data there anyway, as apart from simple logic, doing so will often require other data from a DB.

    1. Hi Greg, I couldn’t agree more. My point, which I may not have made effectively, was that many BI professionals don’t have the rights or the opportunity to build a proper ETL process, data mart or data warehouse. In a proper, formal IT-driven BI solution, I would advocate for loading fact and dimension tables with cleansed data, properly shaped for reporting. Secondly, I would want to use views containing only the necessary rows and columns. But for non-IT analysts who don’t have the opportunity to work with data before loading, or who are working with non-relational sources, Power Query might be the only transformation tool they can use.

  7. Great article Paul! Already looking forward to the next one. Regarding selecting columns in Query Editor: would you prefer to do it in the query editor instead of modifying a database view to include/exclude whatever columns you need? I am thinking that if a db view returns all columns and you want to exclude many of them, that will lead to bad performance in the query editor as all data need to be loaded first and then removed, and it would be more efficient to change the db view instead.

    Also. I’d like your opinion on when to use dataflows as the data source instead of connecting to e.g. database tables/views.

    1. If you have the freedom to create and manage database views, I would return only the columns needed. However, if a view exists with a superset of the columns needed for a Power Query source, choosing a subset of the columns will probably be efficient.

  8. Hi,
    Thanks for a good article! Do you have any recommendations on where I can find instructions on how to get started with using a database instead of Excel sheets/queries as a data source for Power BI?

    1. If you mean you are using Power Query in Excel to first download data, then import the data into Power BI, you CAN bypass Excel. Power BI also has Power Query, so you could connect directly to the DB using Power Query in Power BI.

      1. Hi Ron, I’m not sure where this impression would come from. I’m speaking of only using Power Query within Power BI. I wouldn’t advocate using Excel to extract and transform data unless the goal were to only consume the results in the Excel workbook.

  9. Fully agree with all of that Paul, good stuff

    Under the Properties of a step I also encourage people to document the “Why I did this….” that’s often the part that is left out and is the hardest to work out…

  10. Excellent article, Paul. I am one of those people who still keep using technical ColNames instead of Column Names with spaces. I am also learning to use Query Parameters, but here you’re using CamelCase for the ParameterNames. Could I be using spaced names for these parameter names, or is there an advantage for having them in CamelCase?

  11. Paul, what have you been doing in my garage?
    Great article, bizarrely I have just been through an exercise of reviewing a Power BI architecture at a new contract and this has been helpful, having picked up a few extras. Look forward to your next posts

  12. All fantastic advice. I would go further with your step naming point. I always use underscores instead of spaces to name steps – if a user ever gets to manipulate the code directly in the Advanced Editor, or an external editor, then the ability to double-click a spaceless step name to find and/or replace other instances, and the removal of the unnecessary ‘fog’ of three extra characters (a # and two sets of quotes) per step is very useful.

    1. Lee, that is an excellent suggestion. I am constantly struggling with highlighting the M variable names to copy and paste them into the next step, and often miss the “#” at the beginning.

  13. I agree, renaming and annotating query steps is a good idea for long-term maintenance. Rename is easy enough to figure out: right-click on the step.
    In each of the sections, if you don’t say how to, could you please make it a point to include a link to “how to” instructions, i.e. a link to an article about annotating/commenting steps (simple to do: right-click > select “Properties” for both rename and comment).
