The December 2021 Power BI Desktop update introduced a long-awaited upgrade to the partitioning and Incremental Refresh feature set. The update introduces Hybrid Tables, a new Premium feature that combines the advantages of in-memory Import Mode storage with real-time DirectQuery data access; this is a big step forward for large model management and real-time analytic reporting.
Partitions allow us to break large tables in a data model into smaller, manageable storage units for a variety of reasons. Until recently, each table in the data model could be stored only in Import storage mode or as a DirectQuery table, which stored no data in the model. Generally, Import mode is fast and full-featured but large tables take up memory, while DirectQuery uses no memory for storage but is typically slower and limits many reporting features. Incremental Refresh can automatically break a table down so that only partitions containing new or changed records can be refreshed. There are two scenarios where combining Import and DirectQuery mode partitions might be optimal: 1. providing real-time access to recently inserted records, or: 2. proving access to infrequently accessed old records that take up a lot of storage space. This demonstration addresses the first scenario.
Hybrid tables, an extension to the Incremental Refresh feature set, can consist of both Import Mode and DirectQuery partitions, so that some data is cached in memory and some records are read directly from the data source as users interact with the data model. Setting up Incremental Refresh is the same as it was when introduced a few years ago, but the options have been enhanced.
Let’s take a look at the updated Incremental Refresh dialog for a fact table. I will demonstrate with three years of historic fact data so you can see the results. I’ve already setup the required query parameters and filter step needed to enable the feature, that I described in this post. Choosing the Incremental Refresh option from the ellipsis menu for the Online Sales table opens this dialog. There’s a lot going on in this window but pay attention to the following options that I have called-out in the graphic:
Enable automatic partitioning by switching on “Incrementally refresh this table”
Archive data starting 3 Years before refresh date
Incrementally refresh data starting 12 Months before refresh date
Enable Get the latest data in real time with DirectQuery
For simplicity, I have not set the Detect data changes option, which is something I typically would also do but we’re just keeping this demonstration focused on the topic of how partitions are created.
Note the graphic added to the bottom of the dialog window, which displays a graphical representation of the results. I will break this down into more detail using the following graphic. Let’s see the effect of these settings The graphic depicts the partitioning scheme for the Online Sales fact table in this data model. Bases on the settings in this dialog, the Power BI service will generate partitions using the RangeStart and RangeEnd query parameters to inject dates into the WHERE clause of each partition query. A few query examples are shown for demonstration purposes. The dark-colored boxes represent the Import Mode partitions that cache a copy of data in the model, and the light-colored box represents a DirectQuery partition that runs in real time when the model is used for reporting. Note that I have abbreviated things for illustration purposes, and there will actually be more monthly partitions than those represented here.
Referencing the earlier Incremental Refresh settings dialog, the Archive date starting option specifies three years, which generates three yearly partitions, each with a relative date range. The Incrementally refresh data starting option specifies 12 months, which generates 12 monthly partitions (only 3 are illustrated here). The example query shows a range of one month per partition query. Lastly, a DirectQuery partition uses a range of dates to include all records in the current month and possible future dates.
These partitions are created by the Power BI service the first time the dataset is processed, after the data model is deployed to a Premium Capacity or Premium Per User workspace.
Let’s take a look at the actual partitions. Using the XMLA endpoint, I can use the workspace URI address to connect to the service using SQL Server Management Studio and show the partitions for the Online Sales table. With 15 partitions, it takes longer initially, to process the dataset than it would if it were not partitioned. This is because the database engine only processes a few queries at a time. However, subsequent refresh cycles should be much faster since only a few partitions will need to be processed with additions and changes.
Connecting to the published dataset using Tabular Editor shows even more partition details. Here, you see the monthly partition for November 2021. The Start and End dates for this partition are 11/1/2021 and 12/1/2021 (US date format), respectively.
The last partition, which is in DirectQuery mode has a Start date of 12/1/2021 and an End date several years in the future. This range is a catch-all for records that could potentially have future dates.
What are the performance implications of a hybrid table?
How can you add an historic partition that uses DirectQuery mode?
I will address these questions in a later post. Please share your experience using the comments for this post.
The documentation on this feature is very brief right now, but here are some additional resources:
Last year I wrote this post about the new composite model feature in Power BI that enables datasets to be chained together using a featured called “DirectQuery for PBI Datasets and AS“. The prospect of creating a data model that virtually layers existing data models together without having to repeat the design, sounds like nothing less than Utopia. We can leverage existing datasets, with no duplicate models, no duplication of business logic, and no duplication of effort. The promise of this capability is that data models may be referenced from other data models without duplicating data. So, is this really possible?
The feature we are discussing in this post is not for beginners and it is in Preview, with some important limitations that should be considered in the overall solution architecture. Make sure you understand these trade-offs before using this feature.
When the new composite model feature (using DirectQuery for AS) was released to Preview last year, I gave it a thorough test as did others in the community. It was cool and I was able to whip up a few demos, but right from the gate, I could see that it wasn’t ready for production use. The product engineering teams have been working on this dream feature for years, and there are several dependencies in the Power BI service (Gen2 architecture), the VertiPaq engine, and the Power BI Desktop model designer that needed to be orchestrated. It is complex and will be slow to mature.
The most significant shortcomings, in my opinion, were that in the local model (where you add DirectQuery connections to existing, published data models), you couldn’t add objects with duplicate names. If any measures with the same name exist in the remote data model, you couldn’t add a reference to the remote model. If you could get past resolving the measure name conflicts, when connecting to a remote data model, you get every single table in the remote model; thus, compounding the duplicate object name issue. Cool technology with lots of promise but there were significant implementation blockers from the get go.
The product teams were notably quiet over the past year, with very little news about enhancements to this feature. With all the fanfare about the Preview launch last year, I was looking for more news about critical feature updates. There have been a lot of organization changes and feature reassignments at Microsoft this year, so maybe the work on this feature just hasn’t been visible externally. But alas, there is good news! As of the October Desktop update, the composite model design experience has improved significantly. Any duplicate measure names are now resolved by the designer by adding numbers to the end. You can now select only the tables you want to bring into the local model after connecting to a remote model.
New Concepts and Terminology
Gen2 composite models introduce some new concepts and terms that are important to understand. Before you can grasp these new composite model concepts, you must understand the fundamental data modeling architecture in the SSAS Tabular/VertiPaq modeling engine, which is utilized within Power BI.
Direct Connect and DirectQuery
The first thing is that when you create a new report connected to a published Power BI dataset, this uses “Live Connect” which is the default mode for Power BI reports connected to a published dataset. To enable a Gen2 composite model, this connection must be changed to use DirectQuery, using the XMLA endpoint, to connect to the remote published data model as an Analysis Services connection.
A data model that references another data model is referred to as a chain of models. Currently, up to three levels of chaining are supported. This means that if you have a data model that imports or connects to data sources, that published dataset can be referenced by another data model. The second model can then be published as a dataset. The latter dataset may reference multiple published datasets but it cannot be used as a DirectQuery source for a third data model. The product documentation uses the following illustration to demonstrate the limits of a three-level chain. In this example, the Sales data model is an Azure Analysis Services data model referenced as a remote dataset by the Sales and Budget dataset.
Note: The Sales AAS model in this example could just as well be a published Power BI dataset because the XMLA endpoint sees them as the same type of SSAS Tabular model connection.
The Sales and Budget dataset, in turn, is referenced as a remote dataset by the Sales and Budget Europe dataset. As illustrated by the Budget Excel source, any of these data models can have a variety of connectors used to import or connect to various data sources.
You can create relationships between the tables from different sources as long as they don’t violate the rules for existing relationships, either in the local data model or relationships that exist in the underlying remote data model. Since the local model developer may not have visibility to the entire schema for the remote model, this is a complexity to consider when using composite models. This is an example of why composite models are not for beginners.
Remote data model
From the perspective of the “local” data model that is open in the Desktop designer, when a published dataset is referenced by another data model, it is referred to as a Remote dataset.
Local data model
Any published dataset (within the limits of this feature) can be used in a remote connection. The data model that connects to a remote dataset is referred to as the Local dataset.
Relationships may be created between tables from different remote datasets. These relationships should generally behave like they do in standard data models but there are some limits and subtle behavioral differences. There are still features of Power BI and Azure Analysis Services features that are unsupported in Gen2 composite models, so use this with caution and consult the documentation for details: Using DirectQuery for datasets and Azure Analysis Services (preview) – Power BI | Microsoft Docs.
Remote table relationships
What are those funny icons on the relationship lines between local and remote tables? These denote that the relationship is “limited”, otherwise known as a “weak” relationship. This means, in simple terms, that since the relationship is defined using columns from tables in different DirectQuery sources, the VertiPaq engine cannot guarantee that the usual cardinality and key uniqueness rules can be enforced. These rules are inferred by the modelling engine but could potentially be violated and produce inconsistent results. By defining a relationship, as the model author, you are essentially stating that you expect the data to conform to cardinality and key-matching rules without the ability to force the model to behave that way. Alberto Ferrari wrote an extensive article explaining these relationship types in detail.
New Composite Model Features
I’m not going to walk through an entire demo because this would repeat the same steps as my previous post from December 2020, but I will get started (again). If you are interested in stepping through the process, I suggest following the instructions in that post. There are a few significant improvements from the previous design experience:
Duplicate measure names are automatically resolved
Individual tables can be selected when adding a new remote dataset connection
Tables can be added and removed after the initial design
Objects can be renamed after connecting
To start using this feature, make sure you have the DirectQuery for PBI datasets and AS switched on in the Power BI Desktop Options dialog. Restart Desktop to apply any changes.
The Power BI tenant must have support for XMLA endpoints enabled. This is managed in the Admin Portal under Tenant settings > Integration settings.
After all datasets that you intend to use in remote connections have been published and access is granted from the respective workspaces, open or create a new PBIX file in Power BI Desktop. Here are the two published datasets I’ll use in my new composite model:
Use Get Data and connect to a published Power BI dataset. After the live connection is made to the remote dataset, the message shown below will appear in the status bar. This is subtle and easy to miss.
Click the link titled Make changes to this model to update the data model and convert the connection to use DirectQuery for AS.
You will be prompted to confirm the change.
It is a good idea to make a backup of the PBIX before using a preview feature but there is little risk since this is a brand-new file for experimentation. However, keep this in mind for complex projects.
After making this change, using Get Data to connect additional datasets will result in DirectQuery connections. For each, you are prompted to select only the tables you want to include in the local data model.
If any objects in the remote model have the same name as those in the existing local model, those names will have a number appended to ensure uniqueness. This is true for tables and measures.
In the model diagram view, tables from the same remote model are displayed with a unique color in the heading. The default colors for the first two remote datasets are blue and purple, shown here. These colors are similar so watch carefully.
Just like a like a standard Import model, you can create relationships between any of the tables by dragging and dropping the key fields. Always double-check the relationship type and cardinality. Note that the relationship I just created between the Store and Geography 2 tables is incorrect (1:1 and bidirectional) which I will correct by editing the relationship, changing it to be one-to-many and a single-direction filter.
You cannot create a relationship between two tables (whether local or remote), if it conflicts with an existing relationship. This is bound to become a challenge in more complex composite models, but it is a reality in dimensional model design.
Objects can be added, removed and renamed
This is a huge improvement over the previous design experience. Tables can be removed one-by-one from the data model and additional tables from the same source can be added later by repeating the steps using the Get Data options. Tables, fields and measures can also be renamed. In short, I think we’re in a good place now, and this is a big improvement.
What About Performance?
There are many factors that can adversely affect the performance of reports, calculations and user interactions with a data model. Categorically, composite data models are bound to be less performant than a simple Import data model having all data residing within the same memory space. That’s just a fact. Experience has shown that DirectQuery has performance impacts just due to the nature of query mechanics, and interactions between different databases, and network latency. In many cases, performance will be a trade-off for the convenience of data model consolidation using DirectQuery and composite models in general. Optimization techniques like aggregations will help. Time will tell with this new capability. With this incredibly powerful new tool there will be many opportunities to implement poor design choices and to take Power BI to – and beyond – its limits.
There are several methods to tune and optimize models to run faster and perform well. I’m personally anxious to start using production-scale data with Gen2 composite models to test the water beyond small demo datasets. Please share your experience and I’ll continue to share mine.
You may not know that I have been travelling the country interviewing community leaders from the Microsoft data platform, on a new video blog: Data On The Road (dataontheroad.blog). Earlier this year, we travelled across the US in our motorhome, “The Data Bus” and stopped in several cities to chat with friends and colleagues who speak at conferences, write books, lead user groups and teach people about thriving in the data field. I had a chance to chat with folks like Bob Ward, Buck Woody, Adam Saxton, Kelly Kaye, Chris Finlan, Matthew Roche, Kellyn Pot’Vin Gorman, Leslie Andrews, Oz Du Soliel and many others. We also talk about how to work effectively from home and remotely, the importance of community and how these intense data experts unplug to balance work and life.
Seven days and counting… For the past eighteen years, thousands of data tech professionals would travel to the PASS Summit conference held somewhere in the US, paying for airfare, hotel and conference admission. This year, the PASS organization (Professional Association for SQL Server) is rebooted as the PASS Data Community, led by Red-Gate Software, and the virtual summit is free to attend online. Hundreds of speakers will deliver training sessions and informative lectures on a variety of important data tools and industry skills. This year, I am speaking about Paginated Reports in the Power BI service. I will reflect on the progression of SQL Server Reporting Services (SSRS) through past PASS presentations and then talk about how Paginated Reports integrates with the Power BI platform.
Two of my colleagues from 3Cloud are also presenting at the PASS Community Summit next week. There is just no good reason not to attend the summit his year, so register and make time to attend selected sessions. To reach the global audience, some of the keynotes and sessions are presented live. Others have recorded content followed by a live Q&A chat with the presenter.
May I remind you again that the online summit is FREE, so participate, volunteer and attend! There is so much to learn from world-renowned experts. PASS Community Summit speakers are the subject matter experts who write the books, teach the classes, provide expert-level consulting and actually develop many of these products and Azure services.
I just wanted to give a big big Thank You to Adam Saxton and Patrick Leblanc at GuyInACube.com for inviting me to participate in the Power BI Livestream yesterday. It was a blast and an honor to hang with you and the Saxton clan (even the sheep and the great dane!). You can watch the replay for the Saturday morning YouTube livestream here:
I’m trying to minimize excessive cross-posting between the my long-time blog, SQLServerBI.blog and the new blog: DataOnTheRoad.blog, but we are still getting started and just wanted to make sure you all got the word. Please subscribe to DataOnTheRoad.blog if you are interested in following the data community leader interviews along our journey in the Data Bus. Regular technical blog posts and videos will continue to be here on SqlServerBI.blog.
Here are a few quick shots from yesterday. Interviews with Adam will follow in the next few days.
After the livestream, Adam and I sat down and talked about the genesis and evolution of the Guy In A Cube effort. In our interview, he told me about how he got started in his 16 year career with Microsoft, and his love of data, community, gadgets, games and video. You can watch the interview in this post: An Interview from The Cube, with Adam Saxton – Data On The Road
Only the most dedicated data professionals are willing to invest their personal time and spend any part of a weekend improving their skills. Having organized many SQL Saturday events, I know that there is a big difference between those who punch out on the proverbial time clock on Friday and those who love working with data so much that they will invest their personal time on a Saturday. One of the greatest examples of this kind of dedication is the group who participate in the weekly Power BI Live Stream from Guy In A Cube.
When Adam invited me to participate in the weekly Guy In A Cube Power BI livestream on Oct 23, I was elated. I am honored to join. If you are not already familiar with the Saturday morning live stream, you can join by subscribing to the Guy In A Cube YouTube channel. The channel is member-supported. Patrick and Adam do all of this outside of their day jobs at Microsoft and then member donations help cover their costs. I encourage you to support the channel with a donation but it is not a requirement. Even without supporting the channel, you can attend the hour-long livestream. Everyone can chat and submit questions during the first half and then the second half of the session is open for members to submit their questions.
So, why am I going to be “in the cube” rather than joining remotely? Well, Houston is one of the stops in our DataOnTheRoad journey in the Data Bus! I’m interviewing Adam and other data professionals as we tour the country, working on the road, on the way to the Orlando Live! conference in November.
There e are two conferences in November where I am thrill to be presenting sessions. The second annual PASS Data Community Summit is on November 8-12. Like last year, this is an online event and IT IS FREE! I will be presenting a session titled Power BI Paginated Reports: the New Old Operational Reporting Platform. Join me on Thursday, Nov 11 at 12:30 PM Eastern Time.
The Orlando Live conference is an in-person event, and in-case the name of the event didn’t make it obvious, it will be on Orlando, Florida the week of November 15-19. This event is multiple co-located conferences in one, for attendees from multiple technical disciplines.
I will be presenting Reporting Services and Paginated Report Recipes and Doing Power BI The Right Way For Database Developers. Sessions are on November 17 at 8:00 AM and 4:00 PM Eastern, respectively.
We’re heading to Orlando in the Data Bus now and taking our time to get there. To follow our journey, visit DataOnTheRoad.blog.
It occurred to me that we have put so much effort into promoting best practices and proper design that there is far less information about how to create bad reports and data models. In that light, the purpose of this article is to talk about what to do if you want things to go poorly and make sure your projects fail – if not immediately, then sometime in the future – and if not for you then for whoever inherits the work that you have done.
I welcome your comments and thoughts about the worst practices you have seen in Power BI. Post your ideas in the comments below. What examples have you encountered of sure-fire ways to make projects go badly?
Keep in mind that the rules of “good” and “bad” design might be different for small, informal projects vs larger-scale, formal projects where longevity and scale are more important. Arguably, bad practices for enterprise-scale, multi-user report projects might be just peachy keen for small, informal desktop reports for one user with limited data. This article is written from the perspective of formal, larger-scale projects.
So, you’ve taught yourself how to use Power BI! You’ve Googled some questions and let YouTube show you how to cobble together a little data from different places, wire-up a few tables and drag-and-drop fields to make some attractive visual reports. It wasn’t all that hard to do, right? You’ve demonstrated your new Power BI design chops and now all the managers in the company want more cool charts and reports. Since the first round of reports were so well received, the company bought licenses so everyone in the organization can use Power BI to run their business. Welcome to the big leagues, folks! Just import data from all the corporate systems, desktop Excel files and whatever web feeds you can find. If it’s on the Internet it MUST be right! Munge all this data together and create dashboards to help the executives run the business!
This guide demonstrates common examples of how enterprise Power BI projects are bound to fail using self-taught undisciplined techniques to build larger, enterprise solutions. Most, if not all of the examples in this and the posts that follow, are examples that my colleagues and I encounter when we go in the fix failed Power BI projects.
Bad practice #1: Keep PBIX files on your local computer rather than shared storage
Many Power BI projects evolve through stages of prototypes, so they might create multiple PBIX files. After some trial-and-error, solution developers often save-off multiple files with specially-encoded names including things like the date, their initials and other information each one with different features and techniques. One example might be: FinancialAccountingReport_20210615_BobR_TestDisconnectedSlicer and another one might be FinancialAccountingReport_20210618_BobR_FixedForecastMeasureButNowItRunsSuperSlow. Reminds me of when I was eight. We would write notes with cyphers that only the writer and recipient would understand. There is nothing wrong with making a few backup copies of your work in a convenient place where you will find it. someone has to keep track of it all and recover work when there is a staff change.
Without a deliberate effort to exercise discipline, delete or archive the old files and then store the most current copies in a central storage location that gets backup and synced to the cloud; you are sure to have a mess on your hands. Worse than that, if the project ever needs to be turned over to someone else, they wouldn’t have a clue about how to continue the work. I always ask myself what would happen if I had to hand-over my work to another developer tomorrow? Would they have what they need to continue and would they know where to pick things up? Some transition documentation is good to have in this case, but make sure the master files are in a safe location. Better yet, keep them in a source code repository like Azure DevOps or GitHub, or in OneDrive or SharePoint with automatic file versioning.
Oh, wait… this article is about creating bad projects! Bad practice is to just leave files on your local drive. If you want things to go badly, that’s what you should do. When you leave this mess for someone else, they will have no idea about what you did or how to continue the work.
Bad practice #2: Keep the only copy of a report and dataset published in the online workspace to download when you need it
Today, we’re all about using the cloud to safely store files. This is a good practice. You can store all your PBIX and other important files in a OneDrive folder where they are backed up to the cloud and the files are even versioned so you can revert back to an earlier version if you do something silly and need to recover. However, relying on the option to store your one-and-only copy of a report or data model file in a Power BI workspace is optimistic at best.
There are several conditions that will disable to ability to download a PBIX file. The thing is that enterprise-scale data models need to take advantage of features that are not supported in Power BI Desktop (and probably never will be). This is by design. Desktop is primarily a self-service tool and most of the feature we use in enterprise-scale projects require the model definition file to either be saved to a JSON model definition file (e.g. “Model.BIM”) or to a PBIX file with an internal model version number unusable by Power BI Desktop. I’ve heard from numerous novice developers who were accustomed to publishing their latest PBIX file to the service, only to discover that after enabling some advanced feature; they could no longer download the file.
If you want to shoot yourself in the foot, do not keep a master copy of your model and report files (which you are managing separately, right?) in OneDrive and roll the dice that you will be able to download a copy when you need to edit the model or report in Desktop.
Bad practice #3: Import source data files from your local file system and then publish the report to the service
If your source data files are on your local computer, after you publish the data model to the service as a dataset; you will not be able to refresh the dataset from the service unless you take a few more steps. There are a few options to remedy this situation, such as:
Move the source file folder to a shared OneDrive folder or SharePoint library, and then use that connectors to refresh the dataset from the service.
Move the source file folder to a network share and then setup the enterprise gateway to import and refresh the published dataset from the service.
Install and setup a personal gateway to fresh files from your local computer.
Option #3 is a sure fire way to run into problems down the road. This might work for a small project managed by a single developer. But, if you have source files on your laptop which happens to be asleep or shut off during a scheduled refresh event, that’s not gunna work, and the refresh will fail.
Another common gotcha is that if you import source files that are stored in a OneDrive folder, and use a local syched folder as the source; the connection is to the local file system (like “C:\Users\Paul\OneDrive\…”) and not to the OneDrive service in the cloud. In this case, the data refresh will try to read from the local folder which is inaccessible from the Power BI service.
Bad practice #4: Use hand-written SQL queries to import large database tables
SQL is the tool of choice for most database professionals, so it would stand to reason that you should write a SQL query when connecting to a relational database (like SQL Server or Oracle.) Right? Sure, if you want things to be slow and inefficient. Best practice is to connect your queries in Power BI to read from a database table or view, and not to use SQL statements so that Power Query can employ a method called “query folding”. This means that the Power Query processing steps, where possible, will be translated into SQL or the native query language for the data provider. There are advanced techniques to promote query folding over a SQL statement, but this is the exception to the rule. If you need to write SQL logic to join, filter, group or sort data from multiple tables before it hits Power Query, write a view and store it in the source database.
Simple answer: If you want things to go badly, write SQL queries in your Power Query queries using Power BI Desktop.
Bad practice #5: Use DirectQuery for “real-time” analytics
Power BI includes an awesome capability called “DirectQuery” that enables certain data source engines to handle query operations as user interacts with report visual, rather than using the VertiPaq in-memory analytic engine, to process queries. DirectQuery doesn’t use the VertiPaq storage engine, which is really the heart and soul of Power BI and SSAS Tabular data models. With DQ, records are read live from the source. The downside to DQ is huge. Most interactive report visuals can take several seconds to load, and some DAX functions simply can’t be used due to resource overhead and performance cost.
Database professionals often consider DQ to be a cure-all solution because it enables familiar database technologies. Everything is managed centrally, there is no record duplication in the data model and users see the current state of a record when they interact with reports. That’s all fine and dandy but compared to Import mode, DirectQuery performance can be several orders of magnitude slower.
Import mode is fast because because data sits in memory, compressed by column. DQ is slow because queries must be translated and pushed down to the data store, then results must be sent back. DirectQuery is a powerful feature when used strategically to address a few specific problems. It is a good solution for massive fact tables that would otherwise use a lot of memory. Just filter large tables and don’t do a lot of grouping and aggregation over a large set of records.
In a Composite data model, with most tables using Import mode, and one or a few detail tables using DirectQuery, you can a achieve “best-of-both-worlds” solution by allowing users to drill to details against a filtered set a records.
If you want your visual, analytic reports to be really slow, use DirectQuery for all the tables.
Bad practice #6: Combine source data and load one big table into the Power BI data model
People who approach data problems in Power BI the same way they do in Excel, tend to combine source tables into a large spreadsheet-like table. That approach creates an seemingly easy-to-navigate grid of data with a lot of columns; but you can lose a lot of functionality and create a very inefficient model this way. The VertiPaq, Tabular modeling engine was architected around the premise of “star schema” dimensional models.
It’s not that Power BI won’t work with flattened data, or that you “have to” build a star schema, with separate dimension tables for every attribute! The the natural rules of data behavior just cause Power BI to work better when data is modeled in related fact and dimension tables. There are several DAX functions that expect records to be filtered through a relationship. One of the most common examples is a date dimension table, used to filter a fact table to perform time-series calculations. But that same behavior can exist for Products, Customers, Geography and other filtering or grouping attributes that can be used to filter a related table of numeric business facts.
If you want to struggle with DAX calculations and limit users’ options to interact with your data, pound your data a large, flat table, rather than individual dimension or look-up tables.
This question comes up every few years in SQL Server Reporting Services. Of course, in Power BI Paginated Reports, we have the same features. A couple of days ago, Karthik posted this question as a comment to my post titled Chapter 7 – Advanced Report Design:
I am working on a SSRS report where the grouping is done to group the records in to multiple tabs/worksheets. When the report is exported to excel, the worksheets has the default name (Sheet1, Sheet2, Sheet3,…). Here I want to override the default worksheet name with (Tab1, Tab2, Tab3, …). The number of tabs/worksheets varies each time depending on the parameter selection by the user. How to address this? any suggestions please.
There are a couple of requirements to set this up. First, you must drive report pagination using a Group and then set a Page Break on the Group. This figure shows that I am grouping my table data region on the Brand field and then setting a page break on that group. You can get to the Page Break settings by using the little arrow next to the group item in the Row Groups pane or you can use the properties window like I’m doing here.
Next, you want to give the page a name under the PageBreak properties. Expand PageBreak and then use the Expression builder in the PageName property to use the grouped field name in an expression. In this case, I am referencing the Brand field.
The result is shown here when I open the file resulting from exporting this report to Excel. You can see that the worksheet tabs are titled with each Brand value and each sheet contains that product information group by the corresponding brand.
That’s right! I’m starting a new blog site today. Just to be clear, it will not replace this one. I will continue to post here at SqlServerBi.blog about everything related to Business Intelligence best practices, Power BI, Paginated Reports, SQL and the Azure Data Platform.
The new blog is called Data On The Road and it is dedicated to connecting with the professional community while on the road as a working technology professional. In May of 2021, my wife and I decided we needed an adventure so we transitioned our living space into a new Tiffin Class A motorhome. We had it outfitted with a private office and tech so we could stay connected while working fulltime from the road. We call it the “Data Bus”.
We’re just getting started and taking short trips to practice and work-out the kinks – but it has been great so far. We hit the road for good in the Fall to go experience the US from the Southwest, the South, the East coast – and who knows where?
We’re still assembling the tech and working out the logistics to visit and interview friends and colleagues along the way. In the database and BI community, we’re a dedicated and social bunch so let’s talk about work and about fun! We might chat a little tech but the purpose of this blog – and this adventure – is to talk about work/life balance. In our jobs, we work hard and play hard – so we want to know what our community friends and leaders do when they need to get away. How do they balance their careers with other activities?
Please follow both of my blogs but head on over to DataOnTheRoad.blog and follow us on this new adventure!
Later in this post, you will find a 27 minute framed video including a series of interviews and discussions with a few of my colleagues. They provide perspectives and recent experiences related to this topic. The best approach to design and maintain large data models depends on different factors. However, many years of experience have proven that there are best practices to embrace and common mistakes to avoid.
I was just talking to a consulting client about the best approach to build a data model and he told me something very interesting about the way they were loading data into Power BI. He said “We don’t use facts and dimensions, we load all of our data into one huge table.” He said that their data model performs well and that it meets their reporting needs. It is a difficult point to argue, when something is working at the time although the design might not follow the accepted rules. Life is like that and there are plenty of analogies to make the point that a practice, even a real bad practice, might solve a problem for a period of time and under certain conditions. <analogy>You can drive a car at excessive speed to get to your destination faster. You might not get caught by the police on that day and you might not crash but eventually, if you make it a habit, this practice will catch up to you.</analogy> Data is like that. If you don’t play by the rules, you limit your options. Bending the rules lets you move faster and sometimes with less hassle. But, as the project scope expands – and after adding enough data or other complexities to the solution, it will not endure. The data model won’t perform well, won’t load the correct data or it just won’t be reliable.
This post will explore the realities of best practice design for large data models; some important considerations and trade-off decisions when working with both “big data” and “large data”.
One of the great challenges with Power BI is that it is so easy to build simple data models and reports quickly. But, often when those projects need to grow-up, they aren’t ready for enterprise use. This isn’t a problem with the tool, it is a problem with the way it is often used. If someone says “Hey, Paul, I need you to create a report quickly. Can you import data from these three sources into Power BI and create a quick dashboard report?” I might say “sure, let’s just do this:”
Merge records from multiple files and queries into a single table in the model.
Use in-line SQL queries in Power Query to join source tables together and apply transformation logic.
In reports, summarize numeric columns from a table in the data model, rather than defining DAX measures.
Reload the entire model when source data changes.
There is nothing inherently wrong or bad about using these quick-and-easy techniques in a smallish project, but if it needs to grow, scale and be future-proof, you may need to follow a longer path to achieve a more durable result.
The answer to a question we often ask at the beginning of a BI project: “what information do we need include in the data model?”, is often the same. I think Freddie Mercury said it best: “I want it all and I want it now!” More specifically, stakeholders might say that they want:
A lot of rows of detail or historical fact records, (lets say tens of millions or billions of rows.)
Many columns from the source table(s) so they can slice, filter or group values on anything imaginable.
Reports must perform well and return results quickly.
…to these requests, we might say “pick any two but you can’t have all three”. Maybe you can have all three to a certain point but not to an extreme.
Tall and Wide tables
A “tall” table contains many rows or records. A million might be a substantial number of fact table rows in a desktop Power BI data model, and row counts in the billions are not out of reach in an enterprise-scale solution built in either Power BI or Analysis Services. The VertiPaq analytic data processing engine (the data modeling engine behind SSAS, AAS, Power Pivot and Power BI) is very good at handling very tall tables when the columns are conservative, numeric data types; especially when they contain repeating, non-unique values to aid columnar compression. In rare cases, Power BI data models can even read and process PetaBytes of data! The point is that tall tables are very achievable.
Tables with many columns can be managed by the modeling engine but tall (many rows) and wide (many columns) are often not a good combination. The fact is that analytic reporting solutions shouldn’t include the whole kitchen sink. The problem often stems from developers not building data models based on on business requirements but simply exposing all the available data just in case someone might find it valuable. This is just not the way BI works. If you don’t need certain columns to meet the current report requirements, leave them out. Columns can usually be added later when they are needed. Additional tables can be also added until the model exceeds its design goals. Herein lies the need for balance.
The thing is that simple data models can be relatively easy to build and often don’t take up a lot of space. It may be more cost-effective in the long run to create another data model with some duplicate design elements, than to spend all of your time and energy trying to craft an all-encompassing data model.
Large Data and Big Data
By “Large Data”, we generally mean that tables contain a high number of rows. Conversely, “Big Data” refers to a set of technologies that were engineered to manage large volumes of data records using distributed storage and processing.
In the first wave of cloud computing, we saw the introduction of “Big Data” which was a new concept for data professionals. Technologies like Hadoop and Apache Spark use distributed computing to process larges sets of data in parallel, using highly scaled-out architecture. The term “Big Data” doesn’t just mean that there is a lot of data but it really means “distributed data processing” which can potentially be used to handle very large data sets. Like any architecture, there are strengths and weaknesses. Big Data implementations like Spark and Databricks can transform and process an enormous amount of information very quickly with portions of the data processed on distributed nodes.
Conversely, the VertiPaq engine at the center of a Power BI and SSAS tabular data model, is not a distributed architecture. Because VertiPaq keeps all of the data in memory on a single computing node, it can perform calculations extremely fast. For this to work, data must reside in contiguous memory on a single computer. So, Power BI is not Big Data but it can be used in conjunction with Big Data sources when you use these respective technologies the way they were designed and architected to work.
Unique problems require unique data model designs. Sometimes a Power BI report and accompanying data model just need to address a few business reporting objectives. As time goes by, additional requirements are introduced and the model grows. Along the way, certain data challenges might require additional data massaging or new tables to bridge tables and address special needs. Over time, a model can get complicated. This often raises the question: should we create one big, complex data model to address many requirements or multiple, less complicated data models? There are trade-off decisions either way. It is often possible to engineer larger, high-value models. However, you have to weight the cost to maintain a complex model with the relative simplicity and cost savings of less-complicated models. Sometimes starting over is cheaper than layering more messiness onto a data model with a shaky foundation.
Once considered an essential activity in the design of any data model, partitioning breaks a table into multiple storage units that can be managed and loaded separately. Partitions containing unchanged records don’t need to be reprocessed, and the smaller partitions containing new records or those that have changed, can be processed quickly and in parallel, so they don’t block other operations.
This video runs about 27 minutes and includes a series of interviews and discussions with some of my colleagues, with their thoughts and experiences on this topic.
This is going to be a very busy week for presentations Iam presenting with five sessions scheduled on the topic of data modeling!
Data modeling is the heart and foundation of a useful Power BI solution. If you can get the data model right, most everything else falls into place. Correct data modeling is relatively simple on the surface but can be challenging in practice. Evidence of the interest and importance of this topic is that I submit abstracts on a few different topics for these events and this is the one keeps getting selected.
Silicon Valley Microsoft Data Platform Meetup Tuesday, April 20 at 6:00 PM Pacific Time
Power BI Summit April 19-23 – multiple sessions and times
The session will be repeated three times for attendees in different time zones.
Supporting materials are available on the Presentations page on my blog here: Presentations | Paul Turley’s SQL Server BI Blog. In addition to to the material and demonstrations I present in these sessions, I wanted to share some additional thoughts which I have recorded in this short video:
Navigation between reports is the hallmark of an interactive reporting solution, enabling the ability to drill-through and see relevant details and contextual filtered information in a target report. Power BI and Paginated Reports both offer navigation capabilities suited for different purposes. Having the ability to navigate between reports can provide a tremendous amount of power and flexibility.
This is simple math: with two different report types, there are four different possible report navigation combinations. They are:
If you have worked with the native SSRS report drill-through using report actions, that capability is not yet baked into the product. It probably will be some day but likely not soon. That’s OK because this technique, using URL navigation, works quite well. There is one other twist which is that the URL parameters are a little different when comparing on-premises Paginated (SSRS) reports to Paginated reports in the Power BI service. The technique in in this recipe uses an on-prem target report. You can apply the same pattern in the cloud with sight modifications and I will post an example in the near future.
This image depicts the source Power BI report with links that navigate to the target paginated report, showing filtered details:
The mechanism to enable navigation is a report URL produced with a DAX measure. The code emits a URL with parameters that preserve the row and filter context of the selected item or record. This URL measure is exposed as a link in a table or matrix visual in the Power BI report, optional with a URL “link” icon:
This series of posts introduces the recipes in the Paginated Report Recipes eBook. In recipe #3, “Reusable Report Template”, Kathi Kellenberger demonstrates how to create a universal report template that you can use to create standardized reports that have consistent styling, branding and look-and-feel. This is an effective best practice to ensure that all reports meet organizational standards and conform to a standard style and format. Kathi’s approach provides tremendous flexibility by using parameters as variables to change several report styling elements. Thank you, Kathi, for contributing this recipe to the collection.
The Reusable Report Template recipe allows you to build Paginated reports and SQL Server Reporting Services reports in Visual Studio, starting with a new report template that allows you to easily switch-out a logo image, background, border and line colors and the report title in the header.
Back when I was doing a lot of SSRS project work, I used techniques similar to those described in this recipe to create “starter” report templates. At the beginning of a new project, I would create a standard template report and then create new reports from the template. I also used parameters and expressions to define dynamic styling properties. However Kathi has taken all of these tricks to the next level by building a comprehensive report template with all of these techniques built-in. At the center of the template is a custom parameter handler function that helps manage several styling properties in a single parameter and then maps them to various report settings.
Use the detailed instructions in this recipe to define your own standard report template with your corporate branding and style, and then use the advanced coding technique to swap-out styling property values. The report recipe is available here. The sample files and other resources related to this recipe are available here.
First and most importantly, I want to acknowledge and thank these contributing authors who wrote report recipes and working samples included in this eBook:
As I am working on posting the initial report recipes, I wanted to share some comments about the first two recipes. These recipes are step-by-step instructions that lead through the start-to-finish process using provided sample database, and can be used with SQL Server Reporting Services (SSRS), Paginated Report Server or Paginated Reports published to a Power BI Premium capacity workspace. Reports may be authored with Power BI Report Builder, Report Builder or SQL Server Data Tools/Visual Studio. Additional recipes will follow.
The eBook with all of the recipes will be accessible from the main menu here on the blog. Each recipe is published as a page rather than a post, so you don’t have to search and try to figure out when each one was originally posted. As of tonight, I have published eight of twelve recipes that have been completed thus far. I’ll get others posted in the near future. Please enjoy these and watch for more paginated report recipes here: Paginated Report Recipes eBook | Paul Turley’s SQL Server BI Blog
The first two recipes in the Paginated Report Recipe eBook are about applying alternate row and group shading. Dating back to the early days of printed reports, this style of reports were called “greenbar reports” because they were literally printed on continuous, pin-fed paper that had alternating shaded green and white rows to make it easier for someone reading the report to follow-along.
As the industry evolve from scrolling paper reports to laser-printed pages and then to on-screen reports, the greenbar effect was still useful. Recipe #1 introduces expression-based styling by changing the background color using the modulus (divisional remainder) of the row number in a table data region. If the row number is on odd number, set the background color to a shading color. Otherwise, set it to white or transparent.
In recipe #2, contributing author Kirill Perian takes this technique to the next level by maintaining the alternate shading across column and row groups in a matrix data region. The method is a little more involved than just using the row number. He uses a hidden column to track the column group and then an expression to reference that value to manage the background shading for each row.
The end results looks similar to the simple table report using in recipe #1 but the problem is a little more complex in a matrix report because of the column groups. The advantage of this technique is that it will work across row and column groups at multiple levels.