Use Paginated Reports to Create a Gantt Chart

A paginated report has variable height that can be scrolled vertically on the screen or split across multiple pages when printed or saved to different output formats. Power BI has become the go-to business tool for modern reporting, but sometimes we need information displayed in the more traditional paginated format. A colleague asked if a Gantt chart could be created, and I thought this was an excellent example of where Paginated Reports and SSRS are an ideal choice.

A Gantt chart is a running list of activities with the duration for each displayed as a horizontal bar depicting the beginning and ending day along a horizontal scale. The challenge is that this is not a standard chart type in either Power BI or SSRS/Paginated Reports. Furthermore, project planners may prefer to see activities as rows in the format of a printed page, as you can see in this example:

Let me just get this part out of the way before we proceed. I’ve been using SQL Server Reporting Services (now called “Paginated Reports” in the Power BI service) for most of my 20+ year career in BI and data analytics. It is a very flexible tool that can, at times, be tedious and time-consuming for a report developer to use beyond its basic features. I’ve used it, and seen it used in very creative ways, but some things take patience to work through the trial-and-error design process. Once you know what you are doing, it can be a wonderfully powerful and flexible design tool.

Source Data

The source for this report is a list of project tasks, each having a start date and an end date. For demonstration purposes, I used the Enter Data feature to populate a portable dataset, but you can use a database or any other supported data source. Since the data in my sample is stored as text, I do need to perform some extra data type conversion in some of the expressions. This wouldn't be necessary with strongly-typed data sources like SQL Server.
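For example, wherever an expression needs a true date value, the text can be wrapped in a conversion function. These are just sketches, assuming text-typed DateFrom and DateTo fields like the ones in my sample dataset:

=CDate(Fields!DateFrom.Value)

=FORMAT(CDate(Fields!DateTo.Value), "Short Date")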

A Gauge in a Table

Most advanced paginated reports are built by using a data region, like a table, matrix or rectangle, to repeat other report items in data groups. To achieve the Gantt chart effect, I use a table data region to repeat a linear gauge for each row in the dataset. The table is sorted by the Order column. You could also sort by the DateFrom column, but my data is stored as text, which doesn't naturally sort in date order.

I've added two linear gauges to the table: one in the details row of the table with the numeric scale hidden, and one in the header with the range (or bar) removed and the scale displayed. This image shows the report in design view after the following steps have been completed. You can look through the sample RDL file I've provided, which includes a completed report based on the steps outlined below.

Make sure you have the Properties pane displayed in Report Builder by checking the Properties box in the View ribbon.

To set the "window" of dates shown in the chart, I've created two date-type parameters named ChartRangeStart and ChartRangeEnd, and assigned default values, three months apart, that span the records in the source data.
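If you would rather have the parameter defaults track the current date instead of fixed values, expressions along these lines would work as the default value expressions; the three-month offset is just an assumption to match my sample data:

=Today()

=DateAdd("m", 3, Today())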

The gauge visual doesn't naturally work with dates, so we need a couple of calculated fields: StartNumber and EndNumber. These are created in the properties for the dataset using the following two expressions, respectively (no line returns; the text wraps):

=DATEDIFF("d", Parameters!ChartRangeStart.Value,
IIF( Fields!DateFrom.Value="", Parameters!ChartRangeStart.Value, Fields!DateFrom.Value )
)

=DATEDIFF("d", Parameters!ChartRangeStart.Value,
IIF( Fields!DateTo.Value="", Parameters!ChartRangeStart.Value, Fields!DateTo.Value )
)

For reference, this is the Dataset Properties dialog after adding these two new fields:

Choose Gauge from the Insert ribbon. Select a Bullet Graph from the Select Gauge Type dialog and then drop it onto a blank space on the report canvas.

When working with an intricate visual report item like a gauge, I make it a point to place it outside the data region and make it as big as possible while setting up all the properties. Then, I cut and paste the gauge into the table cell.

Mysteries of the Gauge Visual

This is where things get tricky. The Gauge, Databar, Sparkline and Indicator visualizations were all developed by Dundas Software. They had their own charting product and Microsoft purchased the code base to integrate with Reporting Services back in 2010. There is a trade-off when using these components; they erred on the side of giving us lots and lots of wonderful options to build complex charting visuals, rather than making them uber simple to use. You can spend lifetimes digging through these object collections and properties. Believe me – I’ve spent much of mine doing just that! Anyway, speaking from experience, check your OCD and don’t stay up too late at night trying to figure out every nuance because this can become a Pandora’s Box of knobs and levers that control all the intricate features of these visualizations.

To unravel the Gauge objects, start clicking around to select objects in the gauge visual designer. You will see that objects are organized into a hierarchy of collections within the Gauge Panel. The items in bold are those used in this report:

  • LinearGauges
    • Scales
      • Ranges
      • Pointers
      • CustomLabels
  • RadialGauges
  • Indicators
  • Labels

Objects in the linear gauge and radial gauge have corresponding names, like LinearGaugePanel1 and LinearRange1. You can modify or remove these objects by selecting them in the designer, or you can add or remove them in the properties pane using the ellipsis button for the collection of child objects.

For example, the Gauge Panel has a collection of Scales, and a Scale has a collection of Ranges and Pointers. Until you get the feel of things, you might want to make a second copy of the gauge. You can always use Ctrl-Z to undo a delete or property change if you get into trouble.

Set Up the Gauge

Delete all the Pointers and all but the first Range (it’s named LinearRange1). You can click to select each object in the designer. Confirm the selection by checking the object name in the top of the properties pane before you delete or modify anything.

The following image shows the properties of the LinearScale1. Modified and non-default properties are shown in bold:

This is the entire MaximumValue property expression:

=DATEDIFF("d", Parameters!ChartRangeStart.Value, Parameters!ChartRangeEnd.Value)

To choose the LinearRange1 object, either click to select the object in the designer or use the ellipsis (…) button in the properties panel to edit the Ranges collection, then update the properties shown here in bold:

For properties that represent a collection of objects, you’ll see an ellipsis button (a button with three dots) on the right side of the property value cell when that property is selected. Click the button to open a dialog window to add, remove and edit the objects in the collection.

This is the entire ToolTip expression (text wraps so don’t use line returns):

=Fields!Task.Value & " (" & CSTR(DATEDIFF("d", Fields!DateFrom.Value, Fields!DateTo.Value)) & " days) " & vbCrLf & FORMAT(Fields!DateFrom.Value, "Short Date") & " - " & FORMAT(Fields!DateTo.Value, "Short Date")
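The range property images presumably also include the StartValue and EndValue properties, which need to be bound to the two calculated fields created earlier so the bar is positioned correctly along the scale; something like:

=Fields!StartNumber.Value

=Fields!EndNumber.Value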

Using the Insert ribbon, drop down the Table button and choose Insert Table. Drop the table in the body of the report.

Expand the dataset and then drag and drop the Task field to the “Data” row in the first column of the table.

Select the gauge and then use Ctrl-X to cut it to the clipboard.

Select the second cell in the “Data” row of the table, and then use Ctrl-V to paste the gauge. Resize the second column and row height to adjust the gauge size and position.

Select and copy the gauge to the clipboard, using Ctrl-C.

Select the second cell in the “Header” row. Use Ctrl-V to paste a copy of the gauge into this cell.

In the new copy of the gauge, you will delete the Range and show the Scale. To do this, select the gray range bar and make sure that LinearRange1 is displayed at the top of the properties panel. Once you confirm that the Range object is selected, press the Delete key to remove it.

Select the LinearScale1 object in the new copy of the gauge.

Change the Hidden property to False.

Expand the LabelStyle group and change the UseFontPercent property to False. You can optionally expand the Font group and change the Font property if needed. This is one of the many properties you can tweak later if you want to adjust things to your liking.

To change the sort order of the table, select the table and set its sort expression. The default table name is Tablix1.
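Because my sample data is stored as text, the sort expression should convert the value so the rows sort in the intended order. A minimal example, assuming the Order field contains numeric text (use CDate on DateFrom if you prefer to sort by date):

=CInt(Fields!Order.Value)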

To select the entire table data region, click any cell in the table to show the gray row and column handles, and then click the gray box in the top-left corner, between the row and column handles.

The rest of the report design is typical. I've added textboxes to show the title and parameter selections. I've also added the option to hide and show additional columns using a parameter, as sketched below. These should be apparent in the sample RDL file I've provided. With variations on the design, you can enhance this report to add more details, and perhaps deal with weekends and non-workdays, etc. If you make such enhancements, please share them for the benefit of the community.
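As a rough sketch of the column show/hide technique, the Hidden property of each optional column can be set with an expression tied to a Boolean parameter. The parameter name here is hypothetical; check the sample RDL for the actual implementation:

=Not Parameters!ShowDetailColumns.Value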

Download Gantt Chart sample project here

Doing Power BI The Right Way – for Enterprise Reporting

I started a series of blog posts back in 2020 about best-practice guidelines for planning and designing enterprise reporting solutions with Power BI. To make the topics covered in this series of posts easier to find and follow, they are listed on this page: Doing Power BI The Right Way – for Enterprise Reporting | Paul Turley’s SQL Server BI Blog which you can access from the main menu on the blog. We have a few more topics to go so check back and subscribe for notifications.

Doing Power BI the Right Way: 8. Delivery options

Part of the series: Doing Power BI the Right Way

When you sign up for the Power BI service at PowerBI.com (this address redirects to App.PowerBI.com), you establish a tenant for your organization, hosted in the Azure cloud. Even if you set up a 90-day trial account, you have a tenant that you can upgrade later on. By default, all of the datasets, reports and other assets you publish to the workspaces in your tenant are in a shared capacity. This means that Microsoft will manage the storage and resource allocation among shared servers in their data centers. Larger organizations will benefit from a dedicated capacity "Premium" tenant. This means that Microsoft dedicates at least one virtual machine in the Azure data center, with guaranteed capacity, for your tenant. With Premium-per-User (PPU) licensing, you pay per user. Here are all the options in summary:

  1. Publish to shared capacity tenant, share reports with organization users who each have a Power BI Pro license.
  2. Publish to a dedicated capacity tenant, share with organization users who each have a Power BI Premium-per-User license.
  3. Publish to a Premium dedicated capacity tenant, share with users in the organization who do not require individual licenses to view reports.
  4. Publish to shared capacity and share publicly, with no security, to any Internet user using "Publish to Web" – for free.
  5. Publish reports to your tenant and then share visual content through a custom web application using the Power BI Embedded service.
  6. Publish reports to on-premises Power BI Report Server. Some feature limitations apply. Requires Power BI Premium license or SQL Server Enterprise w/SA + 1 Pro license.

With Premium licensing, any user in the organization can view reports that are published as an app. Since you are buying a fixed amount of capacity (e.g. CPU cores, memory and storage), you can have as many users as you like. If you need more capacity to handle more users and/or more reports & datasets, you can upgrade to a higher capacity.

The Power BI Embedded service utilizes Premium shared capacity and allows you to embed report content (a dashboard, a report page, or individual visuals) into a custom web application that you develop, using API code. Through the Embedded service, you can share Power BI report content with any users of your application, inside or outside of your organization.

Finally, Power BI Report Server is an on-premises hosting environment based on SQL Server Reporting Services (SSRS) that you can install on your in-house hardware or virtual machines. It gives you the ability to host Power BI reports alongside paginated reports in your own environment, but there are a number of technical restrictions and feature caveats, so do your homework and understand the trade-offs before you decide that this is the right option for your organization.

Costs and Capabilities

The following guide may help simplify the options. For comparison, these are current and approximate retail prices from Microsoft, which are subject to change. Some customers also receive discounted pricing or reduced prices for bundled services. For example, discounts are offered to organizations that are not-for-profit or educational.

Bottom Line

  • Some of the most advanced and attractive features are only available with Premium capacity and Premium-per-User licensing. This is the most complete and convenient option for delivering end-to-end Power BI solutions to users for an enterprise customer – at a cost of about $5,000/month or $20/user/month.
    Premium-only features include datasets over 1GB, auto-scaling and Paginated Reports.
  • For organizations with fewer than a few hundred users, it may be more cost-effective to purchase a Pro license for every user at a retail cost of about $10/user/month.
  • The Power BI Embedded service allows reports to be exposed securely so they can be integrated into a custom-developed application. This is usually attractive for ISVs who want to integrate reports and visuals into some kind of service offering to their own external customers. This option requires custom development and programming skills.
  • The publish-to-web option allows a single report developer to expose a report to the entire Internet without security restrictions, and this option is free.
  • Organizations that cannot use cloud services for compliance reasons can use Power BI Report Server to host reports on-prem and in-house.

When to Use Paginated Reports vs Interactive Power BI reports

Technology evolves, the industry changes and the way businesses use technology changes. The road that has brought us to the current state of reporting capabilities in the Microsoft data platform has been long and winding. Ten years ago, we were using SQL Server Reporting Services (SSRS) to create interactive, dashboard-like scorecard reports. It wasn't easy, and the report interactivity wasn't exactly "snappy" because the report server had to re-query data and re-render the report every time a link or chart series was clicked. SSRS was the best tool we had back in the day, but the techniques we used for advanced report development were arcane by today's standards. Over the years, I've written a few articles and posts about how to create reports with interactive KPI gauges, charts and scorecards using SSRS. Today we have a better tool for that style of reporting and analytics, but Reporting Services remains a powerful and flexible tool for a few different styles of reports. Now that SSRS has been integrated into the premium Power BI service as "Paginated Reports", this raises questions about how and when to use each tool – and how to use them together.
Related: Gauges in SSRS reports | Graphical Report Design

Power BI

Power BI makes the chore of creating highly visual and interactive reports much easier, and the user experience is far more modern and engaging. However, Power BI reports are purposefully intended to be viewed with each page in a single browser window, with all visuals in view. Power BI reports are not optimized to be printed or split across multiple pages. To a small degree, it is possible to export some data to Excel, PDF or PowerPoint, but this is not the design purpose of Power BI – and it probably never will be. By contrast, SSRS/Paginated Reports was purposefully designed to render reports as continuous, paper-sized pages, with breaks and totals. Paginated reports can be exported to a variety of formats in high fidelity.

Can <> Should

When should we use Power BI vs Paginated Reports for our reporting needs? You can use either tool to create different styles of reports for different purposes, but should you? Just because you CAN doesn’t mean you SHOULD.

  • Can we use Power BI to create operational or transactional style reports?
  • Can we create scorecards and dashboard-style reports with SSRS/Paginated Reports?
  • Should we use a Power BI dataset as a data source for a Paginated report?

The following diagram depicts the relative capabilities and design purpose for these two reporting tools. See the overlap? On the left, you see that Power BI is optimized and best used to create interactive, visual reports, scorecards and dashboard-style reports. Contrast this with Paginated Reports on the right side, which is best used for "list-type" reports that continuously flow across multiple pages. SSRS/Paginated Reports was architected for, and works ideally with, SQL queries, where filter values are passed into the query using parameters.

Let's break down a few of these important decision points.

Data Model vs SQL Queries

There are a few reporting use cases that would tip the scales in one direction. If the need is to run a query and dump the results into a long list of continuous report pages, that is a classic use case for Paginated Reports. Power BI relies on data organized and optimized in a semantic data model, with data shaped and purposefully structured to support interactivity and analytics. Although Power BI models can handle a lot of data if properly organized, transactional details are not a natural fit for Power BI reports, both because of the format and because of the way data must be modeled for reporting. Importing tables that are both tall and wide, with many columns, will tank Power BI performance and impede users' ability to explore and interact with analytic data.

By contrast, Paginated Reports rocks at this sort of thing. Need to generate a running list of ten thousand order line items, split into formatted pages with breaks, headers and footers? This is a natural fit for a paginated report. Do you need to export a table-style report to Excel with each page in a separate worksheet? Paginated Reports is a good tool to use for this sort of thing. Under the covers, Power BI communicates with a tabular data model using the DAX query language. Although elegant and efficient, DAX queries can be very verbose and difficult to read. When SSRS was enhanced with the option to query Power BI datasets, some effort was made to allow the Report Builder query designer to generate DAX query code. It is neither robust nor very easy to maintain. If you need to enhance a DAX-query-based report with any degree of sophistication, you will likely not have a good experience. SSRS/Paginated Reports was developed to support, and is best used with, native SQL queries.

There are many nuances to the choice between these two options. A semantic data model provides a layer of abstraction for measures that define calculations and business rules. A model supports self-service analysis and discovery. Row-level security rules ensure that users only see their own protected data. Running a SQL query just gets the job done for a finite set of requirements, and perhaps only for one report.

Interactivity

In a Paginated Report, if you want to give a user the ability to click a chart column or a cell in a table or matrix and then navigate or filter items in the report using that value as a parameter, this kind of interaction is possible, but it might take some code and development effort to get there. Even though SSRS and Power BI Paginated Reports are both defined as RDL files, the implementation of Paginated Reports in the Power BI service doesn't yet support the same native report navigation and drill-through functionality as SSRS on-premises. Report navigation can be achieved using parameterized expressions to build clickable web URL links, but this is a work-around solution at best. The product will evolve, and I think we can bet on seeing more and more parity between on-prem SSRS and Power BI Paginated Reports as the product continues to mature.
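As a rough sketch of that work-around, a textbox or chart Action set to "Go to URL" can build the link with an expression. The report path and parameter name below are placeholders for illustration, not values from an actual report:

="https://app.powerbi.com/groups/<workspace-id>/rdlreports/<report-id>?rp:Region=" & Fields!Region.Value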

Guidelines

  • Can you create a dashboard-style report with SSRS or Paginated Reports? – Yes, but it is complex and time-consuming.
  • Can a Paginated report use a published Power BI dataset or SSAS model as a data source, with filters and parameters? – Yes, but DAX queries generated with the query designer quickly get large, complicated and difficult to maintain. Even using a small number of parameters and fields can produce an unruly query.
  • Can the page dimensions of a Power BI report page be resized to accommodate more visuals than will fit on a standard browser page? – Yes, but several features won't work as intended, and performance will suffer with more visuals on a single page.
  • Can a Power BI report page be formatted to print on a sheet of paper, or exported with formatting and high fidelity to Excel or a PDF file? – Maybe to a point, but perhaps not very effectively.

The bottom line is that both of these tools can be used to create a variety of report styles and to support a range of features, but one is better suited than the other for different purposes. Before stretching the limits of any tool, make sure you are using the right reporting tool for the job and consider how you can use both Power BI interactive reports and Paginated Reports together, to meet your analytic and operational reporting needs.

Hybrid Tables, Incremental Refresh and Table Partitioning in Power BI

The December 2021 Power BI Desktop update introduced a long-awaited upgrade to the partitioning and Incremental Refresh feature set. The update introduces Hybrid Tables, a new Premium feature that combines the advantages of in-memory Import Mode storage with real-time DirectQuery data access; this is a big step forward for large model management and real-time analytic reporting.

Partitions allow us to break large tables in a data model into smaller, manageable storage units for a variety of reasons. Until recently, each table in the data model could be stored only in Import mode or as a DirectQuery table, which stores no data in the model. Generally, Import mode is fast and full-featured but large tables take up memory, while DirectQuery uses no memory for storage but is typically slower and limits many reporting features. Incremental Refresh can automatically break a table down so that only partitions containing new or changed records need to be refreshed. There are two scenarios where combining Import and DirectQuery mode partitions might be optimal: 1. providing real-time access to recently inserted records, or 2. providing access to infrequently accessed old records that take up a lot of storage space. This demonstration addresses the first scenario.

Hybrid tables, an extension to the Incremental Refresh feature set, can consist of both Import Mode and DirectQuery partitions, so that some data is cached in memory and some records are read directly from the data source as users interact with the data model. Setting up Incremental Refresh is the same as it was when introduced a few years ago, but the options have been enhanced.

Let's take a look at the updated Incremental Refresh dialog for a fact table. I will demonstrate with three years of historic fact data so you can see the results. I've already set up the required query parameters and filter step needed to enable the feature, which I described in this post. Choosing the Incremental Refresh option from the ellipsis menu for the Online Sales table opens this dialog. There's a lot going on in this window, but pay attention to the following options that I have called out in the graphic:

  • Enable automatic partitioning by switching on “Incrementally refresh this table”
  • Archive data starting 3 Years before refresh date
  • Incrementally refresh data starting 12 Months before refresh date
  • Enable Get the latest data in real time with DirectQuery

For simplicity, I have not set the Detect data changes option, which is something I typically would also do, but we're keeping this demonstration focused on the topic of how partitions are created.

Note the graphic added to the bottom of the dialog window, which displays a graphical representation of the results. I will break this down into more detail using the following graphic, which depicts the partitioning scheme for the Online Sales fact table in this data model. Based on the settings in this dialog, the Power BI service will generate partitions using the RangeStart and RangeEnd query parameters to inject dates into the WHERE clause of each partition query. A few query examples are shown for demonstration purposes. The dark-colored boxes represent the Import Mode partitions that cache a copy of data in the model, and the light-colored box represents a DirectQuery partition that runs in real time when the model is used for reporting. Note that I have abbreviated things for illustration purposes; there will actually be more monthly partitions than those represented here.

Referencing the earlier Incremental Refresh settings dialog, the Archive data starting option specifies three years, which generates three yearly partitions, each with a relative date range. The Incrementally refresh data starting option specifies 12 months, which generates 12 monthly partitions (only 3 are illustrated here). The example query shows a range of one month per partition query. Lastly, a DirectQuery partition uses a range of dates to include all records in the current month and possible future dates.

These partitions are created by the Power BI service the first time the dataset is processed, after the data model is deployed to a Premium Capacity or Premium Per User workspace.

Let's take a look at the actual partitions. Using the XMLA endpoint, I can use the workspace URI address to connect to the service using SQL Server Management Studio and show the partitions for the Online Sales table. With 15 partitions, it initially takes longer to process the dataset than it would if it were not partitioned. This is because the database engine only processes a few queries at a time. However, subsequent refresh cycles should be much faster since only a few partitions will need to be processed with additions and changes.
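If you haven't connected through the XMLA endpoint before, the workspace address you paste into the SSMS connection dialog follows this general pattern (the workspace name is a placeholder):

powerbi://api.powerbi.com/v1.0/myorg/Your Workspace Name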

Connecting to the published dataset using Tabular Editor shows even more partition details. Here, you see the monthly partition for November 2021. The Start and End dates for this partition are 11/1/2021 and 12/1/2021 (US date format), respectively.

The last partition, which is in DirectQuery mode, has a Start date of 12/1/2021 and an End date several years in the future. This range is a catch-all for records that could potentially have future dates.

This design raises a couple of questions:

  • What are the performance implications of a hybrid table?
  • How can you add an historic partition that uses DirectQuery mode?

I will address these questions in a later post. Please share your experience using the comments for this post.

The documentation on this feature is very brief right now, but here are some additional resources:

Hybrid tables | Microsoft Docs

Power BI Hybrid Tables with Synapse Analytics Serverless SQL Pools – Serverless SQL

Are We There Yet? …Composite Models Using DirectQuery Over Power BI Datasets

Last year I wrote this post about the new composite model feature in Power BI that enables datasets to be chained together using a feature called "DirectQuery for PBI Datasets and AS". The prospect of creating a data model that virtually layers existing data models together without having to repeat the design sounds like nothing less than Utopia. We can leverage existing datasets with no duplicate models, no duplication of business logic, and no duplication of effort. The promise of this capability is that data models may be referenced from other data models without duplicating data. So, is this really possible?

The feature we are discussing in this post is not for beginners and it is in Preview, with some important limitations that should be considered in the overall solution architecture. Make sure you understand these trade-offs before using this feature.

When the new composite model feature (using DirectQuery for AS) was released to Preview last year, I gave it a thorough test as did others in the community. It was cool and I was able to whip up a few demos, but right from the gate, I could see that it wasn’t ready for production use. The product engineering teams have been working on this dream feature for years, and there are several dependencies in the Power BI service (Gen2 architecture), the VertiPaq engine, and the Power BI Desktop model designer that needed to be orchestrated. It is complex and will be slow to mature.

The most significant shortcomings, in my opinion, were that in the local model (where you add DirectQuery connections to existing, published data models), you couldn't add objects with duplicate names. If any measures with the same name existed in the remote data model, you couldn't add a reference to the remote model. And if you could get past resolving the measure name conflicts, connecting to a remote data model brought in every single table in the remote model, thus compounding the duplicate object name issue. Cool technology with lots of promise, but there were significant implementation blockers from the get-go.

The product teams were notably quiet over the past year, with very little news about enhancements to this feature. With all the fanfare about the Preview launch last year, I was looking for more news about critical feature updates. There have been a lot of organization changes and feature reassignments at Microsoft this year, so maybe the work on this feature just hasn’t been visible externally. But alas, there is good news! As of the October Desktop update, the composite model design experience has improved significantly. Any duplicate measure names are now resolved by the designer by adding numbers to the end. You can now select only the tables you want to bring into the local model after connecting to a remote model.

New Concepts and Terminology

Gen2 composite models introduce some new concepts and terms that are important to understand. Before you can grasp these new composite model concepts, you must understand the fundamental data modeling architecture in the SSAS Tabular/VertiPaq modeling engine, which is utilized within Power BI.

Live Connect and DirectQuery

The first thing to understand is that when you create a new report connected to a published Power BI dataset, the report uses "Live Connect", which is the default mode for Power BI reports connected to a published dataset. To enable a Gen2 composite model, this connection must be changed to use DirectQuery, which uses the XMLA endpoint to connect to the remote published data model as an Analysis Services connection.

Chaining

A data model that references another data model is referred to as a chain of models. Currently, up to three levels of chaining are supported. This means that a data model that imports or connects to data sources can be published and then referenced as a remote dataset by a second data model, which can in turn be published and referenced by a third. The dataset at the end of that three-level chain may reference multiple published datasets, but it cannot itself be used as a DirectQuery source for another data model. The product documentation uses the following illustration to demonstrate the limits of a three-level chain. In this example, the Sales data model is an Azure Analysis Services data model referenced as a remote dataset by the Sales and Budget dataset.

Note: The Sales AAS model in this example could just as well be a published Power BI dataset because the XMLA endpoint sees them as the same type of SSAS Tabular model connection.

The Sales and Budget dataset, in turn, is referenced as a remote dataset by the Sales and Budget Europe dataset. As illustrated by the Budget Excel source, any of these data models can have a variety of connectors used to import or connect to various data sources.

You can create relationships between the tables from different sources as long as they don’t violate the rules for existing relationships, either in the local data model or relationships that exist in the underlying remote data model. Since the local model developer may not have visibility to the entire schema for the remote model, this is a complexity to consider when using composite models. This is an example of why composite models are not for beginners.

Remote data model

From the perspective of the “local” data model that is open in the Desktop designer, when a published dataset is referenced by another data model, it is referred to as a Remote dataset.

Local data model

Any published dataset (within the limits of this feature) can be used in a remote connection. The data model that connects to a remote dataset is referred to as the Local dataset.

Relationships may be created between tables from different remote datasets. These relationships should generally behave like they do in standard data models, but there are some limits and subtle behavioral differences. There are still Power BI and Azure Analysis Services features that are unsupported in Gen2 composite models, so use this with caution and consult the documentation for details: Using DirectQuery for datasets and Azure Analysis Services (preview) – Power BI | Microsoft Docs.

Remote table relationships

What are those funny icons on the relationship lines between local and remote tables? These denote that the relationship is “limited”, otherwise known as a “weak” relationship. This means, in simple terms, that since the relationship is defined using columns from tables in different DirectQuery sources, the VertiPaq engine cannot guarantee that the usual cardinality and key uniqueness rules can be enforced. These rules are inferred by the modelling engine but could potentially be violated and produce inconsistent results. By defining a relationship, as the model author, you are essentially stating that you expect the data to conform to cardinality and key-matching rules without the ability to force the model to behave that way. Alberto Ferrari wrote an extensive article explaining these relationship types in detail.

New Composite Model Features

I’m not going to walk through an entire demo because this would repeat the same steps as my previous post from December 2020, but I will get started (again). If you are interested in stepping through the process, I suggest following the instructions in that post. There are a few significant improvements from the previous design experience:

  • Duplicate measure names are automatically resolved
  • Individual tables can be selected when adding a new remote dataset connection
  • Tables can be added and removed after the initial design
  • Objects can be renamed after connecting

To start using this feature, make sure you have the DirectQuery for PBI datasets and AS preview feature switched on in the Power BI Desktop Options dialog. Restart Desktop to apply any changes.

The Power BI tenant must have support for XMLA endpoints enabled. This is managed in the Admin Portal under Tenant settings > Integration settings.

After all datasets that you intend to use in remote connections have been published and access is granted from the respective workspaces, open or create a new PBIX file in Power BI Desktop. Here are the two published datasets I’ll use in my new composite model:

Use Get Data and connect to a published Power BI dataset. After the live connection is made to the remote dataset, the message shown below will appear in the status bar. This is subtle and easy to miss.

Click the link titled Make changes to this model to update the data model and convert the connection to use DirectQuery for AS.

You will be prompted to confirm the change.

It is a good idea to make a backup of the PBIX before using a preview feature but there is little risk since this is a brand-new file for experimentation. However, keep this in mind for complex projects.

After making this change, using Get Data to connect additional datasets will result in DirectQuery connections. For each, you are prompted to select only the tables you want to include in the local data model.

If any objects in the remote model have the same name as those in the existing local model, those names will have a number appended to ensure uniqueness. This is true for tables and measures.

In the model diagram view, tables from the same remote model are displayed with a unique color in the heading. The default colors for the first two remote datasets are blue and purple, shown here. These colors are similar so watch carefully.

Just like a standard Import model, you can create relationships between any of the tables by dragging and dropping the key fields. Always double-check the relationship type and cardinality. Note that the relationship I just created between the Store and Geography 2 tables is incorrect (1:1 and bidirectional), which I will correct by editing the relationship, changing it to one-to-many with a single-direction filter.

You cannot create a relationship between two tables (whether local or remote), if it conflicts with an existing relationship. This is bound to become a challenge in more complex composite models, but it is a reality in dimensional model design.

Objects can be added, removed and renamed

This is a huge improvement over the previous design experience. Tables can be removed one-by-one from the data model and additional tables from the same source can be added later by repeating the steps using the Get Data options. Tables, fields and measures can also be renamed. In short, I think we’re in a good place now, and this is a big improvement.

What About Performance?

There are many factors that can adversely affect the performance of reports, calculations and user interactions with a data model. Categorically, composite data models are bound to be less performant than a simple Import data model with all data residing within the same memory space. That's just a fact. Experience has shown that DirectQuery has performance impacts due to the nature of query mechanics, interactions between different databases, and network latency. In many cases, performance will be a trade-off for the convenience of data model consolidation using DirectQuery and composite models in general. Optimization techniques like aggregations will help. Time will tell with this new capability. With this incredibly powerful new tool there will be many opportunities to implement poor design choices and to take Power BI to – and beyond – its limits.

There are several methods to tune and optimize models to run faster and perform well. I’m personally anxious to start using production-scale data with Gen2 composite models to test the water beyond small demo datasets. Please share your experience and I’ll continue to share mine.

Microsoft Data Community Interviews and Advice about Working Remotely

You may not know that I have been travelling the country interviewing community leaders from the Microsoft data platform for a new video blog: Data On The Road (dataontheroad.blog). Earlier this year, we travelled across the US in our motorhome, "The Data Bus", and stopped in several cities to chat with friends and colleagues who speak at conferences, write books, lead user groups and teach people about thriving in the data field. I had a chance to chat with folks like Bob Ward, Buck Woody, Adam Saxton, Kelly Kaye, Chris Finlan, Matthew Roche, Kellyn Pot'Vin Gorman, Leslie Andrews, Oz Du Soliel and many others. We also talk about how to work effectively from home and remotely, the importance of community, and how these intense data experts unplug to balance work and life.

Please check out DataOnTheRoad.blog and subscribe to see new videos, interviews and posts coming out in the near future.

Learn Databricks, ADF & Paginated Reports at the PASS Community Summit

Seven days and counting… For the past eighteen years, thousands of data tech professionals would travel to the PASS Summit conference held somewhere in the US, paying for airfare, hotel and conference admission. This year, the PASS organization (Professional Association for SQL Server) has been rebooted as the PASS Data Community, led by Red-Gate Software, and the virtual summit is free to attend online. Hundreds of speakers will deliver training sessions and informative lectures on a variety of important data tools and industry skills. This year, I am speaking about Paginated Reports in the Power BI service. I will reflect on the progression of SQL Server Reporting Services (SSRS) through past PASS presentations and then talk about how Paginated Reports integrates with the Power BI platform.

Register to attend the PASS Community summit

Two of my colleagues from 3Cloud are also presenting at the PASS Community Summit next week. There is just no good reason not to attend the summit this year, so register and make time to attend selected sessions. To reach the global audience, some of the keynotes and sessions are presented live. Others have recorded content followed by a live Q&A chat with the presenter.

May I remind you again that the online summit is FREE, so participate, volunteer and attend! There is so much to learn from world-renowned experts. PASS Community Summit speakers are the subject matter experts who write the books, teach the classes, provide expert-level consulting and actually develop many of these products and Azure services.

Data On The Road

As we are traveling in the Data Bus, I have had the opportunity to interview several data community leaders and conference speakers, including those highlighted in this post. Here, Leslie and Andie talk about their topics and passion for data and community. Watch all of the interviews and follow us as we visit conferences and data community leaders in the Data Bus.

Learn Azure Data Factory from Leslie Andrews

Becoming a Data Engineer with Azure Databricks, from Andie Letourneau

Guy In A Cube Power BI Livestream

Please subscribe to Data On The Road blog to see the community leader interviews as we travel across the country.

I just wanted to give a big, big Thank You to Adam Saxton and Patrick Leblanc at GuyInACube.com for inviting me to participate in the Power BI Livestream yesterday. It was a blast and an honor to hang with you and the Saxton clan (even the sheep and the great dane!). You can watch the replay of the Saturday morning YouTube livestream here:

I'm trying to minimize excessive cross-posting between my long-time blog, SQLServerBI.blog, and the new blog, DataOnTheRoad.blog, but we are still getting started and I just wanted to make sure you all got the word. Please subscribe to DataOnTheRoad.blog if you are interested in following the data community leader interviews along our journey in the Data Bus. Regular technical blog posts and videos will continue to be here on SqlServerBI.blog.

Here are a few quick shots from yesterday. Interviews with Adam will follow in the next few days.

After the livestream, Adam and I sat down and talked about the genesis and evolution of the Guy In A Cube effort. In our interview, he told me about how he got started in his 16-year career with Microsoft, and his love of data, community, gadgets, games and video. You can watch the interview in this post: An Interview from The Cube, with Adam Saxton – Data On The Road

Power BI Live Stream this Saturday

Only the most dedicated data professionals are willing to invest their personal time and spend any part of a weekend improving their skills. Having organized many SQL Saturday events, I know that there is a big difference between those who punch out on the proverbial time clock on Friday and those who love working with data so much that they will invest their personal time on a Saturday. One of the greatest examples of this kind of dedication is the group who participate in the weekly Power BI Live Stream from Guy In A Cube.

When Adam invited me to participate in the weekly Guy In A Cube Power BI livestream on Oct 23, I was elated. I am honored to join. If you are not already familiar with the Saturday morning live stream, you can join by subscribing to the Guy In A Cube YouTube channel. The channel is member-supported. Patrick and Adam do all of this outside of their day jobs at Microsoft and then member donations help cover their costs. I encourage you to support the channel with a donation but it is not a requirement. Even without supporting the channel, you can attend the hour-long livestream. Everyone can chat and submit questions during the first half and then the second half of the session is open for members to submit their questions.

So, why am I going to be “in the cube” rather than joining remotely? Well, Houston is one of the stops in our DataOnTheRoad journey in the Data Bus! I’m interviewing Adam and other data professionals as we tour the country, working on the road, on the way to the Orlando Live! conference in November.

PASS Data Community Summit & Orlando Live! Sessions

There are two conferences in November where I am thrilled to be presenting sessions. The second annual PASS Data Community Summit is on November 8-12. Like last year, this is an online event and IT IS FREE! I will be presenting a session titled Power BI Paginated Reports: the New Old Operational Reporting Platform. Join me on Thursday, Nov 11 at 12:30 PM Eastern Time.

You can register to attend at PASS Data Community Summit November 8-12 2021

The Orlando Live conference is an in-person event, and in case the name of the event didn't make it obvious, it will be in Orlando, Florida the week of November 15-19. This event is multiple co-located conferences in one, for attendees from multiple technical disciplines.

I will be presenting Reporting Services and Paginated Report Recipes and Doing Power BI The Right Way For Database Developers. Sessions are on November 17 at 8:00 AM and 4:00 PM Eastern, respectively.

We’re heading to Orlando in the Data Bus now and taking our time to get there. To follow our journey, visit DataOnTheRoad.blog.

A Developer’s Guide to Creating Bad Power BI Projects – Part 1

It occurred to me that we have put so much effort into promoting best practices and proper design that there is far less information about how to create bad reports and data models. In that light, the purpose of this article is to talk about what to do if you want things to go poorly and make sure your projects fail – if not immediately, then sometime in the future – and if not for you then for whoever inherits the work that you have done.

I welcome your comments and thoughts about the worst practices you have seen in Power BI. Post your ideas in the comments below. What examples have you encountered of sure-fire ways to make projects go badly?

Keep in mind that the rules of “good” and “bad” design might be different for small, informal projects vs larger-scale, formal projects where longevity and scale are more important. Arguably, bad practices for enterprise-scale, multi-user report projects might be just peachy keen for small, informal desktop reports for one user with limited data. This article is written from the perspective of formal, larger-scale projects.

So, you've taught yourself how to use Power BI! You've Googled some questions and let YouTube show you how to cobble together a little data from different places, wire up a few tables and drag-and-drop fields to make some attractive visual reports. It wasn't all that hard to do, right? You've demonstrated your new Power BI design chops and now all the managers in the company want more cool charts and reports. Since the first round of reports was so well received, the company bought licenses so everyone in the organization can use Power BI to run their business. Welcome to the big leagues, folks! Just import data from all the corporate systems, desktop Excel files and whatever web feeds you can find. If it's on the Internet it MUST be right! Munge all this data together and create dashboards to help the executives run the business!

This guide demonstrates common examples of how enterprise Power BI projects are bound to fail when self-taught, undisciplined techniques are used to build larger, enterprise solutions. Most, if not all, of the examples in this and the posts that follow are ones that my colleagues and I encounter when we go in to fix failed Power BI projects.

Bad practice #1: Keep PBIX files on your local computer rather than shared storage

Many Power BI projects evolve through stages of prototypes, so developers might create multiple PBIX files. After some trial-and-error, solution developers often save off multiple files with specially-encoded names including things like the date, their initials and other information, each one with different features and techniques. One example might be: FinancialAccountingReport_20210615_BobR_TestDisconnectedSlicer and another one might be FinancialAccountingReport_20210618_BobR_FixedForecastMeasureButNowItRunsSuperSlow. Reminds me of when I was eight. We would write notes with cyphers that only the writer and recipient would understand. There is nothing wrong with making a few backup copies of your work in a convenient place where you will find it, but someone has to keep track of it all and recover work when there is a staff change.

Without a deliberate effort to exercise discipline, delete or archive the old files, and store the most current copies in a central location that gets backed up and synced to the cloud, you are sure to have a mess on your hands. Worse than that, if the project ever needs to be turned over to someone else, they wouldn't have a clue about how to continue the work. I always ask myself what would happen if I had to hand over my work to another developer tomorrow. Would they have what they need to continue, and would they know where to pick things up? Some transition documentation is good to have in this case, but make sure the master files are in a safe location. Better yet, keep them in a source code repository like Azure DevOps or GitHub, or in OneDrive or SharePoint with automatic file versioning.

Oh, wait… this article is about creating bad projects! Bad practice is to just leave files on your local drive. If you want things to go badly, that’s what you should do. When you leave this mess for someone else, they will have no idea about what you did or how to continue the work.

Bad practice #2: Keep the only copy of a report and dataset published in the online workspace to download when you need it

Today, we’re all about using the cloud to safely store files. This is a good practice. You can store all your PBIX and other important files in a OneDrive folder where they are backed up to the cloud and the files are even versioned so you can revert back to an earlier version if you do something silly and need to recover. However, relying on the option to store your one-and-only copy of a report or data model file in a Power BI workspace is optimistic at best.

There are several conditions that will disable the ability to download a PBIX file. The thing is that enterprise-scale data models need to take advantage of features that are not supported in Power BI Desktop (and probably never will be). This is by design. Desktop is primarily a self-service tool, and most of the features we use in enterprise-scale projects require the model definition to be saved either to a JSON model definition file (e.g. "Model.BIM") or to a PBIX file with an internal model version number unusable by Power BI Desktop. I've heard from numerous novice developers who were accustomed to publishing their latest PBIX file to the service, only to discover that after enabling some advanced feature, they could no longer download the file.

If you want to shoot yourself in the foot, do not keep a master copy of your model and report files (which you are managing separately, right?) in OneDrive and roll the dice that you will be able to download a copy when you need to edit the model or report in Desktop.

Bad practice #3: Import source data files from your local file system and then publish the report to the service

If your source data files are on your local computer, then after you publish the data model to the service as a dataset, you will not be able to refresh the dataset from the service unless you take a few more steps. There are a few options to remedy this situation, such as:

  1. Move the source file folder to a shared OneDrive folder or SharePoint library, and then use that connector to refresh the dataset from the service.
  2. Move the source file folder to a network share and then set up the enterprise gateway to import and refresh the published dataset from the service.
  3. Install and set up a personal gateway to refresh files from your local computer.

Option #3 is a sure-fire way to run into problems down the road. This might work for a small project managed by a single developer. But if you have source files on your laptop, which happens to be asleep or shut off during a scheduled refresh event, that's not gonna work, and the refresh will fail.

Another common gotcha is that if you import source files that are stored in a OneDrive folder and use a locally synced folder as the source, the connection is to the local file system (like "C:\Users\Paul\OneDrive\…") and not to the OneDrive service in the cloud. In this case, the data refresh will try to read from the local folder, which is inaccessible from the Power BI service.

Bad practice #4: Use hand-written SQL queries to import large database tables

SQL is the tool of choice for most database professionals, so it would stand to reason that you should write a SQL query when connecting to a relational database (like SQL Server or Oracle), right? Sure, if you want things to be slow and inefficient. Best practice is to connect your queries in Power BI to a database table or view, and not to use hand-written SQL statements, so that Power Query can employ a method called "query folding". This means that the Power Query processing steps, where possible, will be translated into SQL or the native query language for the data provider. There are advanced techniques to promote query folding over a SQL statement, but these are the exception to the rule. If you need SQL logic to join, filter, group or sort data from multiple tables before it hits Power Query, write a view and store it in the source database.

Simple answer: If you want things to go badly, write SQL queries in your Power Query queries using Power BI Desktop.

Bad practice #5: Use DirectQuery for “real-time” analytics

Power BI includes an awesome capability called "DirectQuery" that enables certain data source engines to handle query operations as users interact with report visuals, rather than using the VertiPaq in-memory analytic engine to process queries. DirectQuery doesn't use the VertiPaq storage engine, which is really the heart and soul of Power BI and SSAS Tabular data models. With DQ, records are read live from the source. The downside to DQ is huge. Most interactive report visuals can take several seconds to load, and some DAX functions simply can't be used due to resource overhead and performance cost.

Database professionals often consider DQ to be a cure-all solution because it builds on familiar database technologies. Everything is managed centrally, there is no record duplication in the data model, and users see the current state of a record when they interact with reports. That's all fine and dandy, but compared to Import mode, DirectQuery performance can be several orders of magnitude slower.

Import mode is fast because data sits in memory, compressed by column. DQ is slow because queries must be translated and pushed down to the data store, and then results must be sent back. DirectQuery is a powerful feature when used strategically to address a few specific problems. It is a good solution for massive fact tables that would otherwise use a lot of memory. Just filter large tables and don't do a lot of grouping and aggregation over a large set of records.

In a composite data model, with most tables using Import mode and one or a few detail tables using DirectQuery, you can achieve a “best-of-both-worlds” solution by allowing users to drill to details against a filtered set of records.

If you want your visual, analytic reports to be really slow, use DirectQuery for all the tables.

Bad practice #6: Combine source data and load one big table into the Power BI data model

People who approach data problems in Power BI the same way they do in Excel tend to combine source tables into one large, spreadsheet-like table. That approach creates a seemingly easy-to-navigate grid of data with a lot of columns, but you can lose a lot of functionality and create a very inefficient model this way. The VertiPaq tabular modeling engine was architected around the premise of “star schema” dimensional models.

It’s not that Power BI won’t work with flattened data, or that you “have to” build a star schema with separate dimension tables for every attribute. Power BI simply works better when data is modeled as related fact and dimension tables. Several DAX functions expect records to be filtered through a relationship. One of the most common examples is a date dimension table, used to filter a fact table to perform time-series calculations. The same behavior applies to Products, Customers, Geography and other filtering or grouping attributes that can be used to filter a related table of numeric business facts.
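To make that concrete, here is a minimal sketch of what the source tables behind a small star schema might look like (hypothetical names, SQL Server syntax). In the Power BI model, DimDate[DateKey] and DimProduct[ProductKey] would each relate to the matching key column on FactSales, so those dimensions can filter the fact table:

-- Hypothetical star schema: two dimension tables and one fact table.
CREATE TABLE dbo.DimDate (
    DateKey       int      NOT NULL PRIMARY KEY,   -- e.g. 20240131
    [Date]        date     NOT NULL,
    [Year]        smallint NOT NULL,
    [MonthNumber] tinyint  NOT NULL
);

CREATE TABLE dbo.DimProduct (
    ProductKey   int           NOT NULL PRIMARY KEY,
    ProductName  nvarchar(100) NOT NULL,
    Category     nvarchar(50)  NOT NULL
);

CREATE TABLE dbo.FactSales (
    DateKey     int           NOT NULL REFERENCES dbo.DimDate (DateKey),
    ProductKey  int           NOT NULL REFERENCES dbo.DimProduct (ProductKey),
    Quantity    int           NOT NULL,
    SalesAmount decimal(18,2) NOT NULL
);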

If you want to struggle with DAX calculations and limit users’ options to interact with your data, pound your data into one large, flat table rather than modeling it as separate fact and dimension (look-up) tables.

How to Name Worksheets in a Paginated Report Exported to Excel

This question comes up every few years in SQL Server Reporting Services. Of course, in Power BI Paginated Reports, we have the same features. A couple of days ago, Karthik posted this question as a comment to my post titled Chapter 7 – Advanced Report Design:

I am working on a SSRS report where the grouping is done to group the records in to multiple tabs/worksheets. When the report is exported to excel, the worksheets has the default name (Sheet1, Sheet2, Sheet3,…). Here I want to override the default worksheet name with (Tab1, Tab2, Tab3, …). The number of tabs/worksheets varies each time depending on the parameter selection by the user. How to address this? any suggestions please.

There are a couple of requirements to set this up. First, you must drive report pagination using a Group and then set a Page Break on the Group. This figure shows that I am grouping my table data region on the Brand field and then setting a page break on that group. You can get to the Page Break settings by using the little arrow next to the group item in the Row Groups pane or you can use the properties window like I’m doing here.

Next, give the page a name under the PageBreak properties. Expand PageBreak, then use the Expression builder in the PageName property to reference the grouped field in an expression. In this case, I am referencing the Brand field.
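Assuming the dataset field is named Brand, the PageName expression can be as simple as:

=Fields!Brand.Value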

The result is shown here when I open the file after exporting the report to Excel. You can see that the worksheet tabs are named with each Brand value, and each sheet contains the product information grouped by the corresponding brand.

New Blog: Data On The Road

That’s right! I’m starting a new blog site today. Just to be clear, it will not replace this one. I will continue to post here at SqlServerBi.blog about everything related to Business Intelligence best practices, Power BI, Paginated Reports, SQL and the Azure Data Platform.

The new blog is called Data On The Road and it is dedicated to connecting with the professional community while on the road as a working technology professional. In May of 2021, my wife and I decided we needed an adventure, so we transitioned our living space into a new Tiffin Class A motorhome. We had it outfitted with a private office and tech so we could stay connected while working full-time from the road. We call it the “Data Bus”.

We’re just getting started, taking short trips to practice and work out the kinks – but it has been great so far. We’ll hit the road for good in the fall to experience the US from the Southwest, the South, the East coast – and who knows where else?

We’re still assembling the tech and working out the logistics to visit and interview friends and colleagues along the way. In the database and BI community, we’re a dedicated and social bunch so let’s talk about work and about fun! We might chat a little tech but the purpose of this blog – and this adventure – is to talk about work/life balance. In our jobs, we work hard and play hard – so we want to know what our community friends and leaders do when they need to get away. How do they balance their careers with other activities?

Please follow both of my blogs but head on over to DataOnTheRoad.blog and follow us on this new adventure!

Doing Power BI the Right Way: 10. Designing and Managing Large Datasets

Part of the series: Doing Power BI the Right Way

Later in this post, you will find a 27-minute video featuring a series of interviews and discussions with a few of my colleagues. They provide perspectives and recent experiences related to this topic. The best approach to designing and maintaining large data models depends on several factors, but many years of experience have proven that there are best practices to embrace and common mistakes to avoid.

I was just talking to a consulting client about the best approach to build a data model, and he told me something very interesting about the way they were loading data into Power BI. He said, “We don’t use facts and dimensions, we load all of our data into one huge table.” He said that their data model performs well and that it meets their reporting needs. That is a difficult point to argue when something is working at the time, even though the design might not follow the accepted rules. Life is like that, and there are plenty of analogies to make the point that a practice, even a really bad practice, might solve a problem for a period of time and under certain conditions. You can drive a car at excessive speed to get to your destination faster; you might not get caught by the police that day and you might not crash, but if you make it a habit, the practice will eventually catch up with you. Data is like that. If you don’t play by the rules, you limit your options. Bending the rules lets you move faster and sometimes with less hassle, but as the project scope expands – after adding enough data or other complexities to the solution – it will not endure. The data model won’t perform well, won’t load the correct data, or it just won’t be reliable.

This post explores the realities of best-practice design for large data models, along with some important considerations and trade-off decisions when working with both “big data” and “large data”.

One of the great challenges with Power BI is that it is so easy to build simple data models and reports quickly. But often, when those projects need to grow up, they aren’t ready for enterprise use. This isn’t a problem with the tool; it is a problem with the way the tool is often used. If someone says, “Hey, Paul, I need you to create a report quickly. Can you import data from these three sources into Power BI and create a quick dashboard report?” I might say, “Sure, let’s just do this:”

  • Merge records from multiple files and queries into a single table in the model.
  • Use in-line SQL queries in Power Query to join source tables together and apply transformation logic.
  • In reports, summarize numeric columns from a table in the data model, rather than defining DAX measures.
  • Reload the entire model when source data changes.

There is nothing inherently wrong or bad about using these quick-and-easy techniques in a smallish project, but if it needs to grow, scale and be future-proof, you may need to follow a longer path to achieve a more durable result.

The answer to a question we often ask at the beginning of a BI project, “What information do we need to include in the data model?”, is usually the same. I think Freddie Mercury said it best: “I want it all and I want it now!” More specifically, stakeholders might say that they want:

  1. A lot of detail rows or historical fact records (let’s say tens of millions or billions of rows).
  2. Many columns from the source table(s) so they can slice, filter or group values on anything imaginable.
  3. Reports must perform well and return results quickly.

…to these requests, we might say “pick any two, but you can’t have all three”. Maybe you can have all three up to a point, but not to an extreme.

Tall and Wide Tables

A “tall” table contains many rows or records. A million rows might be a substantial number for a fact table in a desktop Power BI data model, and row counts in the billions are not out of reach in an enterprise-scale solution built in either Power BI or Analysis Services. The VertiPaq analytic data processing engine (the data modeling engine behind SSAS, AAS, Power Pivot and Power BI) is very good at handling very tall tables when the columns use conservative, numeric data types, especially when they contain repeating, non-unique values that aid columnar compression. In rare cases, Power BI data models can even read and process petabytes of data! The point is that tall tables are very achievable.

Tables with many columns can be managed by the modeling engine, but tall (many rows) and wide (many columns) are often not a good combination. The fact is that analytic reporting solutions shouldn’t include the whole kitchen sink. The problem often stems from developers not building data models based on business requirements, but simply exposing all the available data just in case someone might find it valuable. That is just not the way BI works. If you don’t need certain columns to meet the current report requirements, leave them out. Columns can usually be added later when they are needed, and additional tables can also be added, at least until the model outgrows its design goals. Herein lies the need for balance.

The thing is, simple data models are relatively easy to build and often don’t take up a lot of space. It may be more cost-effective in the long run to create another data model with some duplicate design elements than to spend all of your time and energy trying to craft one all-encompassing data model.

Large Data and Big Data

By “Large Data”, we generally mean tables that contain a high number of rows. By contrast, “Big Data” refers to a set of technologies that were engineered to manage large volumes of data records using distributed storage and processing.

Big Data

In the first wave of cloud computing, we saw the introduction of “Big Data”, which was a new concept for data professionals. Technologies like Hadoop and Apache Spark use distributed computing to process large sets of data in parallel, using a highly scaled-out architecture. The term “Big Data” doesn’t just mean that there is a lot of data; it really means “distributed data processing”, which can potentially be used to handle very large data sets. Like any architecture, there are strengths and weaknesses. Big Data implementations like Spark and Databricks can transform and process an enormous amount of information very quickly, with portions of the data processed on distributed nodes.

Conversely, the VertiPaq engine at the center of a Power BI or SSAS tabular data model is not a distributed architecture. Because VertiPaq keeps all of the data in memory on a single computing node, it can perform calculations extremely fast, but for this to work, the data must reside in contiguous memory on a single computer. So Power BI is not Big Data, but it can be used in conjunction with Big Data sources when you use these respective technologies the way they were designed and architected to work.

Complex Models

Unique problems require unique data model designs. Sometimes a Power BI report and its accompanying data model just need to address a few business reporting objectives. As time goes by, additional requirements are introduced and the model grows. Along the way, certain data challenges might require additional data massaging, or new bridging tables to address special needs. Over time, a model can get complicated. This often raises the question: should we create one big, complex data model to address many requirements, or multiple, less complicated data models? There are trade-off decisions either way. It is often possible to engineer larger, high-value models, but you have to weigh the cost of maintaining a complex model against the relative simplicity and cost savings of less-complicated models. Sometimes starting over is cheaper than layering more messiness onto a data model with a shaky foundation.

Partitioning

Once considered an essential activity in the design of any data model, partitioning breaks a table into multiple storage units that can be managed and loaded separately. Partitions containing unchanged records don’t need to be reprocessed, and the smaller partitions containing new or changed records can be processed quickly and in parallel, so they don’t block other operations.

Video Interviews

This video runs about 27 minutes and includes a series of interviews and discussions with some of my colleagues, who share their thoughts and experiences on this topic.