The December 2021 Power BI Desktop update introduced a long-awaited upgrade to the partitioning and Incremental Refresh feature set. The update introduces Hybrid Tables, a new Premium feature that combines the advantages of in-memory Import Mode storage with real-time DirectQuery data access; this is a big step forward for large model management and real-time analytic reporting.
Partitions allow us to break large tables in a data model into smaller, manageable storage units for a variety of reasons. Until recently, each table in the data model could be stored only in Import storage mode or as a DirectQuery table, which stored no data in the model. Generally, Import mode is fast and full-featured but large tables take up memory, while DirectQuery uses no memory for storage but is typically slower and limits many reporting features. Incremental Refresh can automatically break a table down so that only partitions containing new or changed records can be refreshed. There are two scenarios where combining Import and DirectQuery mode partitions might be optimal: 1. providing real-time access to recently inserted records, or: 2. proving access to infrequently accessed old records that take up a lot of storage space. This demonstration addresses the first scenario.
Hybrid tables, an extension to the Incremental Refresh feature set, can consist of both Import Mode and DirectQuery partitions, so that some data is cached in memory and some records are read directly from the data source as users interact with the data model. Setting up Incremental Refresh is the same as it was when introduced a few years ago, but the options have been enhanced.
Let’s take a look at the updated Incremental Refresh dialog for a fact table. I will demonstrate with three years of historic fact data so you can see the results. I’ve already setup the required query parameters and filter step needed to enable the feature, that I described in this post. Choosing the Incremental Refresh option from the ellipsis menu for the Online Sales table opens this dialog. There’s a lot going on in this window but pay attention to the following options that I have called-out in the graphic:
Enable automatic partitioning by switching on “Incrementally refresh this table”
Archive data starting 3 Years before refresh date
Incrementally refresh data starting 12 Months before refresh date
Enable Get the latest data in real time with DirectQuery
For simplicity, I have not set the Detect data changes option, which is something I typically would also do but we’re just keeping this demonstration focused on the topic of how partitions are created.
Note the graphic added to the bottom of the dialog window, which displays a graphical representation of the results. I will break this down into more detail using the following graphic. Let’s see the effect of these settings The graphic depicts the partitioning scheme for the Online Sales fact table in this data model. Bases on the settings in this dialog, the Power BI service will generate partitions using the RangeStart and RangeEnd query parameters to inject dates into the WHERE clause of each partition query. A few query examples are shown for demonstration purposes. The dark-colored boxes represent the Import Mode partitions that cache a copy of data in the model, and the light-colored box represents a DirectQuery partition that runs in real time when the model is used for reporting. Note that I have abbreviated things for illustration purposes, and there will actually be more monthly partitions than those represented here.
Referencing the earlier Incremental Refresh settings dialog, the Archive date starting option specifies three years, which generates three yearly partitions, each with a relative date range. The Incrementally refresh data starting option specifies 12 months, which generates 12 monthly partitions (only 3 are illustrated here). The example query shows a range of one month per partition query. Lastly, a DirectQuery partition uses a range of dates to include all records in the current month and possible future dates.
These partitions are created by the Power BI service the first time the dataset is processed, after the data model is deployed to a Premium Capacity or Premium Per User workspace.
Let’s take a look at the actual partitions. Using the XMLA endpoint, I can use the workspace URI address to connect to the service using SQL Server Management Studio and show the partitions for the Online Sales table. With 15 partitions, it takes longer initially, to process the dataset than it would if it were not partitioned. This is because the database engine only processes a few queries at a time. However, subsequent refresh cycles should be much faster since only a few partitions will need to be processed with additions and changes.
Connecting to the published dataset using Tabular Editor shows even more partition details. Here, you see the monthly partition for November 2021. The Start and End dates for this partition are 11/1/2021 and 12/1/2021 (US date format), respectively.
The last partition, which is in DirectQuery mode has a Start date of 12/1/2021 and an End date several years in the future. This range is a catch-all for records that could potentially have future dates.
What are the performance implications of a hybrid table?
How can you add an historic partition that uses DirectQuery mode?
I will address these questions in a later post. Please share your experience using the comments for this post.
The documentation on this feature is very brief right now, but here are some additional resources:
Last year I wrote this post about the new composite model feature in Power BI that enables datasets to be chained together using a featured called “DirectQuery for PBI Datasets and AS“. The prospect of creating a data model that virtually layers existing data models together without having to repeat the design, sounds like nothing less than Utopia. We can leverage existing datasets, with no duplicate models, no duplication of business logic, and no duplication of effort. The promise of this capability is that data models may be referenced from other data models without duplicating data. So, is this really possible?
The feature we are discussing in this post is not for beginners and it is in Preview, with some important limitations that should be considered in the overall solution architecture. Make sure you understand these trade-offs before using this feature.
When the new composite model feature (using DirectQuery for AS) was released to Preview last year, I gave it a thorough test as did others in the community. It was cool and I was able to whip up a few demos, but right from the gate, I could see that it wasn’t ready for production use. The product engineering teams have been working on this dream feature for years, and there are several dependencies in the Power BI service (Gen2 architecture), the VertiPaq engine, and the Power BI Desktop model designer that needed to be orchestrated. It is complex and will be slow to mature.
The most significant shortcomings, in my opinion, were that in the local model (where you add DirectQuery connections to existing, published data models), you couldn’t add objects with duplicate names. If any measures with the same name exist in the remote data model, you couldn’t add a reference to the remote model. If you could get past resolving the measure name conflicts, when connecting to a remote data model, you get every single table in the remote model; thus, compounding the duplicate object name issue. Cool technology with lots of promise but there were significant implementation blockers from the get go.
The product teams were notably quiet over the past year, with very little news about enhancements to this feature. With all the fanfare about the Preview launch last year, I was looking for more news about critical feature updates. There have been a lot of organization changes and feature reassignments at Microsoft this year, so maybe the work on this feature just hasn’t been visible externally. But alas, there is good news! As of the October Desktop update, the composite model design experience has improved significantly. Any duplicate measure names are now resolved by the designer by adding numbers to the end. You can now select only the tables you want to bring into the local model after connecting to a remote model.
New Concepts and Terminology
Gen2 composite models introduce some new concepts and terms that are important to understand. Before you can grasp these new composite model concepts, you must understand the fundamental data modeling architecture in the SSAS Tabular/VertiPaq modeling engine, which is utilized within Power BI.
Direct Connect and DirectQuery
The first thing is that when you create a new report connected to a published Power BI dataset, this uses “Live Connect” which is the default mode for Power BI reports connected to a published dataset. To enable a Gen2 composite model, this connection must be changed to use DirectQuery, using the XMLA endpoint, to connect to the remote published data model as an Analysis Services connection.
A data model that references another data model is referred to as a chain of models. Currently, up to three levels of chaining are supported. This means that if you have a data model that imports or connects to data sources, that published dataset can be referenced by another data model. The second model can then be published as a dataset. The latter dataset may reference multiple published datasets but it cannot be used as a DirectQuery source for a third data model. The product documentation uses the following illustration to demonstrate the limits of a three-level chain. In this example, the Sales data model is an Azure Analysis Services data model referenced as a remote dataset by the Sales and Budget dataset.
Note: The Sales AAS model in this example could just as well be a published Power BI dataset because the XMLA endpoint sees them as the same type of SSAS Tabular model connection.
The Sales and Budget dataset, in turn, is referenced as a remote dataset by the Sales and Budget Europe dataset. As illustrated by the Budget Excel source, any of these data models can have a variety of connectors used to import or connect to various data sources.
You can create relationships between the tables from different sources as long as they don’t violate the rules for existing relationships, either in the local data model or relationships that exist in the underlying remote data model. Since the local model developer may not have visibility to the entire schema for the remote model, this is a complexity to consider when using composite models. This is an example of why composite models are not for beginners.
Remote data model
From the perspective of the “local” data model that is open in the Desktop designer, when a published dataset is referenced by another data model, it is referred to as a Remote dataset.
Local data model
Any published dataset (within the limits of this feature) can be used in a remote connection. The data model that connects to a remote dataset is referred to as the Local dataset.
Relationships may be created between tables from different remote datasets. These relationships should generally behave like they do in standard data models but there are some limits and subtle behavioral differences. There are still features of Power BI and Azure Analysis Services features that are unsupported in Gen2 composite models, so use this with caution and consult the documentation for details: Using DirectQuery for datasets and Azure Analysis Services (preview) – Power BI | Microsoft Docs.
Remote table relationships
What are those funny icons on the relationship lines between local and remote tables? These denote that the relationship is “limited”, otherwise known as a “weak” relationship. This means, in simple terms, that since the relationship is defined using columns from tables in different DirectQuery sources, the VertiPaq engine cannot guarantee that the usual cardinality and key uniqueness rules can be enforced. These rules are inferred by the modelling engine but could potentially be violated and produce inconsistent results. By defining a relationship, as the model author, you are essentially stating that you expect the data to conform to cardinality and key-matching rules without the ability to force the model to behave that way. Alberto Ferrari wrote an extensive article explaining these relationship types in detail.
New Composite Model Features
I’m not going to walk through an entire demo because this would repeat the same steps as my previous post from December 2020, but I will get started (again). If you are interested in stepping through the process, I suggest following the instructions in that post. There are a few significant improvements from the previous design experience:
Duplicate measure names are automatically resolved
Individual tables can be selected when adding a new remote dataset connection
Tables can be added and removed after the initial design
Objects can be renamed after connecting
To start using this feature, make sure you have the DirectQuery for PBI datasets and AS switched on in the Power BI Desktop Options dialog. Restart Desktop to apply any changes.
The Power BI tenant must have support for XMLA endpoints enabled. This is managed in the Admin Portal under Tenant settings > Integration settings.
After all datasets that you intend to use in remote connections have been published and access is granted from the respective workspaces, open or create a new PBIX file in Power BI Desktop. Here are the two published datasets I’ll use in my new composite model:
Use Get Data and connect to a published Power BI dataset. After the live connection is made to the remote dataset, the message shown below will appear in the status bar. This is subtle and easy to miss.
Click the link titled Make changes to this model to update the data model and convert the connection to use DirectQuery for AS.
You will be prompted to confirm the change.
It is a good idea to make a backup of the PBIX before using a preview feature but there is little risk since this is a brand-new file for experimentation. However, keep this in mind for complex projects.
After making this change, using Get Data to connect additional datasets will result in DirectQuery connections. For each, you are prompted to select only the tables you want to include in the local data model.
If any objects in the remote model have the same name as those in the existing local model, those names will have a number appended to ensure uniqueness. This is true for tables and measures.
In the model diagram view, tables from the same remote model are displayed with a unique color in the heading. The default colors for the first two remote datasets are blue and purple, shown here. These colors are similar so watch carefully.
Just like a like a standard Import model, you can create relationships between any of the tables by dragging and dropping the key fields. Always double-check the relationship type and cardinality. Note that the relationship I just created between the Store and Geography 2 tables is incorrect (1:1 and bidirectional) which I will correct by editing the relationship, changing it to be one-to-many and a single-direction filter.
You cannot create a relationship between two tables (whether local or remote), if it conflicts with an existing relationship. This is bound to become a challenge in more complex composite models, but it is a reality in dimensional model design.
Objects can be added, removed and renamed
This is a huge improvement over the previous design experience. Tables can be removed one-by-one from the data model and additional tables from the same source can be added later by repeating the steps using the Get Data options. Tables, fields and measures can also be renamed. In short, I think we’re in a good place now, and this is a big improvement.
What About Performance?
There are many factors that can adversely affect the performance of reports, calculations and user interactions with a data model. Categorically, composite data models are bound to be less performant than a simple Import data model having all data residing within the same memory space. That’s just a fact. Experience has shown that DirectQuery has performance impacts just due to the nature of query mechanics, and interactions between different databases, and network latency. In many cases, performance will be a trade-off for the convenience of data model consolidation using DirectQuery and composite models in general. Optimization techniques like aggregations will help. Time will tell with this new capability. With this incredibly powerful new tool there will be many opportunities to implement poor design choices and to take Power BI to – and beyond – its limits.
There are several methods to tune and optimize models to run faster and perform well. I’m personally anxious to start using production-scale data with Gen2 composite models to test the water beyond small demo datasets. Please share your experience and I’ll continue to share mine.
You may not know that I have been travelling the country interviewing community leaders from the Microsoft data platform, on a new video blog: Data On The Road (dataontheroad.blog). Earlier this year, we travelled across the US in our motorhome, “The Data Bus” and stopped in several cities to chat with friends and colleagues who speak at conferences, write books, lead user groups and teach people about thriving in the data field. I had a chance to chat with folks like Bob Ward, Buck Woody, Adam Saxton, Kelly Kaye, Chris Finlan, Matthew Roche, Kellyn Pot’Vin Gorman, Leslie Andrews, Oz Du Soliel and many others. We also talk about how to work effectively from home and remotely, the importance of community and how these intense data experts unplug to balance work and life.