Part of the series: Doing Power BI the Right Way
When starting a Power BI project, you have many choices to make. Decisions like how to source your data; where and how to create queries to cleanse, transform and reshape the data; where and how to create calculations; and the nuances of modeling are just the tip of the iceberg. Most Power BI solutions progress through a few stages in their lifespan, which include:
- Creating a simple prototype to explore ideas and to get feedback from users and business leaders
- Building a working proof-of-concept to frame the solution
- Designing a foundational data model to build upon
- Completing a production-scale solution
- Iterative enhancements & releases
The purpose of this post is to provide some guidance to help you design Power BI solutions that can survive the journey through these stages with as little “throw-away” design as possible. There will always be prototyping and redesign in any project but if you follow proven design patterns and lessons learned from prior experience, you’ll have a much better chance of building a reporting solution that will endure.
Oh, I should mention this… the difference between item #1 (the simple prototype) and #2 (working proof-of-concept) is that you need to throw away the first one – and anything else that doesn’t lay a foundation that you can build on top of. This is a hard lesson but one that will pay off by helping our teams, sponsors and business stakeholders understand that we can proceed after laying down solid bricks (based on well-defined functional and technical requirements) to build the rest of the solution.
If you are using Power BI to create reporting solutions for business users, your circumstances are likely similar to one of these scenarios:
- You are a self-service report designer creating a small report solution that will grow and evolve over time.
- You work in a software development group in an IT organization with a formal application lifecycle management discipline.
- You work in a business unit or small company with an informal software process – and they need your guidance to be successful.
If you are a self-service report designer creating a small report solution that will grow and evolve over time, you likely have a data source that wasn’t specifically curated for reporting and analysis. The data will need to be massaged and shaped before you can design insightful, interactive visuals. That’s cool because we have great tools built into Power BI Desktop to do that sort of thing. You might have some ideas about how to present the data but you’ll probably try different things and see where the journey takes you. These projects tend to have a lot of business impact right out of the gate. Business report users will be delighted with what you can do in a short time with Power BI and they will want more. You’ll get excited and expand the project to include more data or more dimension to the data you have. It is almost inevitable that you will experience two common challenges:
- Scope creep
- Data quality issues
These are not necessarily “problems” per se but realities to acknowledge. Building a solid, foundational data model is challenging under these circumstances. The good news is that Power BI lets you explore options quickly and allows you to try different approaches while you sort out the formal business and technical requirements. The typical outcome is to start over with a new project where you will be well-prepared to use a cleansed data source and design to a well-defined set of requirements within the new project scope. An experienced BI practitioner will set this expectation with the business and either carefully evolve certain elements into the final project and toss the rest, or use the first attempt as a throw-away prototype and start over. Just know that this is a natural part of the process.
If you work in a software development group in an IT organization with a formal application lifecycle management discipline, you are probably accustomed to team development using a code repository, version control and differencing tools to manage a shared code base. The main components of Power BI came from a long heritage of IT-centric development so, at its core, objects (like tables, fields and measures) have properties and metadata descriptors that can be versioned, differenced, merged and scripted. But what has made Power BI so successful is that it is primarily designed for self-service reporting. Power BI Desktop is a beautifully streamlined product that packages data transformation queries, model tables, data and reports into a tidy package, but it is not an IT development tool and doesn’t have these capabilities built in, nor will it. That’s not what Power BI Desktop was designed for. The good news is that the community has rallied to develop tools to meet the needs of IT developers who work in teams to develop, test and deploy formal, enterprise-scale BI reporting solutions. Power BI, like the rest of the Microsoft BI platform, can be scripted and automated to create custom deployments.
Before Power BI was created, earlier versions of Microsoft’s Business Intelligence components existed as separate products. BI reporting projects took many months to design, develop and deploy. By contrast, today you can open up Power BI Desktop and create a quick “report” (PBIX file). In this form, the file contains data transformation queries, an in-memory data model with measures and other calculations, and report pages that include the interactive visuals. This design typically meets the needs of small and intermediate-sized projects.
One of the first questions to consider is whether the solution you intend to create with Power BI is a quick informal project or a formal project, or may become a formal project in the future. Informal projects don’t necessarily have to conform to all the design best practices. They are limited in scope and typically meet a very specific and short-term need. Formal projects are built on best practice design standards to endure a certain degree of expanding scope and are designed for longevity. They have a business sponsor or champion who drives the business requirements and at least one developer who is responsible for designing a solution that meets those requirements. Formal projects are typically tested, signed off and then deployed in accordance with an established development discipline or application lifecycle process so they can be supported by the business. If your organization has a process for delivering and supporting company-sanctioned applications and reports, then you will need a roadmap to transition from an informal project to a solution that the business is prepared to support. A few common questions that come up in these discussions are:
- How is the source data procured and maintained?
- Who is responsible for maintaining that data?
- Who from the business defines the reporting requirements and signs off on Power BI reports meeting those requirements?
- Who owns the development of the Power BI solution (data model and reports)?
- Who is the developer’s backup if they become unavailable?
- Are the requirements and project assets sufficiently documented to support such a transition?
- What are the security requirements?
- Which users will be given access to reports or dataset(s) for ad hoc analysis, and how?
- Are users assigned report and dataset permissions directly or through group membership?
- Should users have conditional or restricted access to data within the Power BI dataset?
- What is the process for requesting new reporting features, adding features to the existing solution, testing and deploying updates?
Personal BI projects, where a data analyst imports data, creates reports and then publishes a report for their own use, serve an important need in business. These reports can easily be promoted to a departmental or business area solution by sharing the report with others. As the scope of the project continues to expand, there are several important considerations that define the threshold between an informal project, designed for quick development and convenience, and a formal project, intended for longevity and long-term business support. For each project component, I start with a typical informal project scenario and then provide recommended practices to plan for futureproofing.
Planning for separation
Designing transformation queries, data model and reports in one PBIX file is convenient and offers quick troubleshooting, debugging and redesign. However, it also limits development to one person at a time and promotes a monolithic solution. Separating the data model from the report files helps to promote self-service and focused report design. The certified or promoted dataset is deployed, secured and managed as a separate object. Reports connect to the published Power BI dataset using a live connection, and any number of reports can connect to the same dataset.
This pattern supports both fully IT-managed projects with multiple developers and self-service report design using certified and promoted datasets. After the data model design is stabilized and locked down, report designers simply focus on report creation and data analysis by connecting to the secured dataset. Any tools and products capable of connecting to Power BI or SQL Server Analysis Services (SSAS/AAS) can be used for reporting and ad hoc analysis.
Spreadsheets and files extracted from reports, or manually manipulated using Excel or other line-of-business applications, have limited utility as reliable data sources. Manually preparing spreadsheets and source data files is only feasible for small sets of data and one-time or manual data loads. An analyst user can easily remove, rename or add a column or formula to a sheet that will break or negate the import process.
Futureproofing: Data from source systems can be loaded into a managed database or data lake before using Power Query to load the data model. Data from files (if carefully protected and maintained) may be transformed, cleansed and validated before it is ready to be loaded. Using “brute force” transformation steps to replace values and cleanse records one step at a time can be slow and difficult to manage as the complexity and volume increase. Reduce the number and complexity of query steps and apply the following recommendations for Power Query design optimization.
Use parameters in Power Query to manage any variable information such as database servers, database names (e.g. DEV database, QA database & PROD database) and file paths.
Planning for data volume
In smaller, informal projects we typically design for a limited set of data and a fixed time frame. If manageable, all production data can be loaded into the working PBIX file on a developer’s desktop.
Futureproofing: When working with large data volumes, a subset of data should be loaded into the desktop dataset for development. Query parameters are used to expand the number of records (usually using a date range) loaded after the dataset is deployed for production use. Groups and ranges of data records, often partitioned by months or years, are defined and stored separately so a large volume of records doesn’t need to be reloaded along with new records.
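To make the date-range idea concrete, here is a minimal Python sketch of the logic only (it is not Power Query code). The function name, the `order_date` column and the sample rows are all hypothetical; in a real model the same filter would be a Power Query step driven by two date parameters.

```python
from datetime import date

def filter_by_range(rows, range_start, range_end):
    """Keep rows whose order_date falls in [range_start, range_end)."""
    return [r for r in rows if range_start <= r["order_date"] < range_end]

# Hypothetical source rows standing in for a large fact table.
sales = [
    {"order_date": date(2019, 12, 31), "amount": 100},
    {"order_date": date(2020, 1, 15), "amount": 250},
    {"order_date": date(2020, 2, 1), "amount": 75},
]

# Narrow range for the development desktop: one month of data.
dev_rows = filter_by_range(sales, date(2020, 1, 1), date(2020, 2, 1))

# Wider range after deployment: same query, different parameter values.
prod_rows = filter_by_range(sales, date(2019, 1, 1), date(2021, 1, 1))
```

The point of the design is that only the parameter values change between development and production; the query logic stays identical.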
Power Query design practices
Using the robust Power Query design user interface (ribbon and menu options), you can build sophisticated and complex transformations that work well with moderate data volumes. With a small number of query steps, default names like “Rename Columns” and “Change Data Type” are fine, but as the number of steps grows it can be difficult to trace them and remember where and why you performed each step.
Futureproofing: Tables and fields should be meticulously renamed to use standard friendly names, categorized and optimized with appropriate data types. Unused columns are removed to save space and improve performance. Spend extra time to get the field names right before creating dependent steps. (See Chris Webb: Names, Tables, Columns & Measures In Power BI.)
Consolidate redundant query steps – like those used to rename columns, remove columns and change data types – into a single step in the query.
Attribute fields needed for grouping and filtering should be moved to separate dimension tables. Dimension unique keys are defined with corresponding fact table keys.
All numeric columns used for aggregation (which share the same level of grain) should be extracted to fact tables, with corresponding dimension keys.
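As a rough illustration of splitting a flat table into a dimension table and a fact table, here is a small Python sketch. It is not how Power Query does it internally; the function `split_star`, the column names and the sample rows are hypothetical, and the surrogate keys are simply assigned in order of first appearance.

```python
def split_star(flat_rows, attribute_cols, measure_cols, key_name):
    """Split flat rows into (dimension_rows, fact_rows) joined by key_name."""
    dim_keys = {}    # attribute tuple -> surrogate key
    dimension = []   # one row per distinct attribute combination
    fact = []        # one row per source row: key plus numeric measures
    for row in flat_rows:
        attrs = tuple(row[c] for c in attribute_cols)
        if attrs not in dim_keys:
            dim_keys[attrs] = len(dim_keys) + 1
            dimension.append({key_name: dim_keys[attrs],
                              **dict(zip(attribute_cols, attrs))})
        fact.append({key_name: dim_keys[attrs],
                     **{c: row[c] for c in measure_cols}})
    return dimension, fact

# Hypothetical flat extract mixing attributes and measures.
flat = [
    {"Product": "Bike", "Category": "Cycling", "Quantity": 1, "Amount": 500.0},
    {"Product": "Helmet", "Category": "Safety", "Quantity": 2, "Amount": 80.0},
    {"Product": "Bike", "Category": "Cycling", "Quantity": 3, "Amount": 1500.0},
]

product_dim, sales_fact = split_star(
    flat, ["Product", "Category"], ["Quantity", "Amount"], "ProductKey"
)
```

Here the text attributes are stored once per product in `product_dim`, while `sales_fact` keeps only the key and the numeric columns at the same grain.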
Row-level calculated columns should be defined either in database views or Power Query tables rather than DAX calculated columns.
Promote query folding by using database tables, views or table-valued user-defined functions in the database rather than writing SQL statements in Power Query. Test and optimize queries to support query folding by checking to see if “View Native Query” is an available right-click menu option on each query step. Power Query also doesn’t work well with stored procedures.
Data loading & refresh management
For large fact tables, design them to be “tall” rather than “wide” by removing unneeded columns or moving them into dimension tables. Tall tables should only include columns that compress efficiently, such as numeric values and short, non-distinct text values. Avoid storing a unique key column (one with a distinct value on every row) in a fact table, because it compresses poorly.
If large tables take too long to load or cause the PBIX file to exceed a reasonable size for development (no more than 500 MB), use parameters to limit the range of records. To manage the table size once deployed to the service, do one of the following:
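One informal way to spot columns that are unlikely to compress well is to compare each column's distinct count to the row count. The Python sketch below is only an illustration of that check; the helper name `column_cardinality` and the sample rows are hypothetical, and a ratio near 1.0 is merely a hint that a column behaves like a unique key.

```python
def column_cardinality(rows):
    """Return {column: distinct_count / row_count} for a list of dict rows."""
    if not rows:
        return {}
    total = len(rows)
    return {col: len({r[col] for r in rows}) / total for col in rows[0]}

# Hypothetical fact rows: TransactionID is unique, Status repeats.
fact_rows = [
    {"TransactionID": 1001, "Status": "Open", "Quantity": 2},
    {"TransactionID": 1002, "Status": "Closed", "Quantity": 2},
    {"TransactionID": 1003, "Status": "Open", "Quantity": 2},
    {"TransactionID": 1004, "Status": "Closed", "Quantity": 2},
]

ratios = column_cardinality(fact_rows)
# Columns with a ratio of 1.0 have a distinct value on every row.
unique_like = [col for col, ratio in ratios.items() if ratio == 1.0]
```

In practice you would run a check like this against a representative sample of the source table before deciding which columns to keep.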
- Use start and end date range parameters to filter a Date or Date/Time type column in the table. Use this to limit the working set on the development desktop and then to expand the range of records in the service.
- Similar to the first option, use a pair of Date/Time parameters with a range filter on a Date/Time type column and follow the product documentation to implement incremental refresh. This will automatically partition the table records on specified calendar date parts (e.g. days, months, quarters or years) and then only load changes.
- Devise a table partitioning strategy and implement partitions using Tabular Editor. Use PowerShell or the XMLA read/write endpoint to schedule and execute TMSL or XMLA script to manage partition processing.
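For the custom partitioning option, the sketch below shows one possible way to derive monthly partition boundaries and build a TMSL `refresh` command as JSON. It is a minimal Python illustration, not a deployment script: the database name `SalesModel`, the table name `Sales` and the partition naming scheme are hypothetical, and it assumes partitions with those names already exist in the model (for example, created with Tabular Editor). Sending the script to the XMLA endpoint is left out.

```python
import json
from datetime import date

def month_partitions(start, end):
    """Yield (name, first_day, next_first_day) for each month in [start, end)."""
    current = date(start.year, start.month, 1)
    while current < end:
        # First day of the following month, rolling over the year in December.
        nxt = date(current.year + (current.month == 12), current.month % 12 + 1, 1)
        yield (f"Sales {current:%Y-%m}", current, nxt)
        current = nxt

def refresh_command(database, table, partition_names, refresh_type="full"):
    """Build a TMSL refresh command targeting specific partitions."""
    return {
        "refresh": {
            "type": refresh_type,
            "objects": [
                {"database": database, "table": table, "partition": name}
                for name in partition_names
            ],
        }
    }

parts = list(month_partitions(date(2019, 11, 1), date(2020, 2, 1)))

# Reprocess only the most recent month; older partitions stay untouched.
command = refresh_command("SalesModel", "Sales", [parts[-1][0]])
script = json.dumps(command, indent=2)
```

The resulting `script` string is the kind of TMSL text you would hand to PowerShell or another client that talks to the XMLA read/write endpoint.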
Many modelling techniques that perform well when tested on small data sets will slow down with large tables. These include relationships on bridging tables, many-to-many relationships, bi-directional relationship filters and multi-measure layered calculations.
Avoid using SWITCH statements over already complex DAX expressions or referenced measures. Avoid using iterator functions (like SUMX) with complex row-by-row predicate logic. Separating calculation logic into separate measures is a useful design pattern to segregate business logic but it can create performance bottlenecks. Identify slow measures and consider building streamlined measures using alternate patterns when bottlenecks are discovered.
Import mode tables are generally faster and more flexible than DirectQuery mode tables. However, DirectQuery is a good option for very tall transactional tables where simple aggregations are used, rather than complex DAX measures. If Import and DirectQuery mode tables are both used in a composite data model, consider creating a summary table using Import mode and a detail table using DirectQuery, and then define aggregations to cache certain query results in memory (to compensate for the normal performance degradation resulting from the query translation in DirectQuery).
There is so much more to discuss at this point but I’ll need to address additional topics in later posts in this series titled “Doing Power BI the Right Way”. Topics include:
- Service vs on-premises
- Interactive vs operational reports
- Spreadsheet reporting
- Iteration vs redesign: When to start over
- Designing for scale
Power Query or Power BI Dataflows
Multi-tier and advanced self-service solutions can be architected by further separating the transformation queries from the data model using Power BI dataflows. These are server-hosted Power Query/M queries that are both designed and managed in the Power BI service.
Dataflows expose several integrations with the Microsoft Azure cloud services, such as the Common Data Service and Azure Data Lake Storage Gen2. Dataflows also have integrations with AutoML machine learning models.
The advantages and disadvantages of dataflows should be carefully considered before jumping in head-first. Power Query in Power BI Desktop is a fast, rich query development environment that provides nearly immediate feedback. By contrast, the browser-based dataflows designer is similar in function but slower and has a less-robust query design feature set. Some developers choose to develop Power Query in Desktop and then port their M query code to dataflows as needed.
Stick around. There is a lot more coming in this series in the weeks ahead:
- Futureproofing Power BI solutions
- Preparing source data for Power BI
- Choosing the best option to shape and transform Power BI data
- Power Query design best practices
- Power Query in dataflows or Power BI Desktop
- Data modeling essentials and best practices in Power BI and AS tabular
- Validating data model results
- Planning for separation – data models and reports
- Power BI delivery options
- Choosing the right report type: analytic or paginated
- Designing and managing large datasets in Power BI
- Multi-developer and lifecycle management for Power BI
- Certified reports, certified datasets & the self-service mindset
To wrap up this edition of the series, I will leave the rest of my outline points as teasers for subsequent posts. These are random thoughts left over from the outline, so please feel free to comment with your thoughts about what to include in these later editions.
- Define the technical audience
- Establish solution ownership
- Continuation plan (what if the owner were to leave?)
- Who will maintain & manage the solution after deployment?
- User access
- Version updates
Know and understand the business audience
- Report/Dashboard Consumer
- Skilled Data Modeler
- Analyst, Data Scientist
- IT BI Developer
- Planning for capacity
- Certifying reports and datasets: Gaining business trust
- Validating results
- Team collaboration
- Project file sharing
- Promote continuous improvement
- Version management
- Managing deployment
- Power Query and Dataflows
- Promote and support repeatable design
- Templates and layouts
- Planning for security
- Access to reports & dashboards
- Publish to web
- Embedding to a business portal
- Embedding for external access
- Access to dataset(s) for ad hoc analysis and future report development
- Conditional data access / role-based access / row-level security