Using Power Query “M” To Encode Text As Numbers

I worked through a brain-teaser on a consulting project today that I thought I’d share in case it was useful for someone else in the community.  We needed to convert application user names into an encoded format that would preserve case sensitive comparison.  Here’s the story… A client of mine is using Power BI Desktop to munge data from several different source systems to create analytic reports.

Two-Phase BI  Projects

I’m going to step out of the frame just a moment to make a soapbox speech:  I’m a believer in two-phase Business Intelligence project design.  What that means in a few words is that we rapidly work through a quick design, building a functional pilot or proof-of-concept to produce some reports that demonstrate the capability of the solution.  This gets stakeholders and folks funding the project on-board so we can get the support necessary to schedule and budget the more formal, production-scale long-term business solution.  Part of the negotiation is that we might use self-service BI tools to bend or even break the rules of proper design the first time through.  We agree to learn what we can from this experience, salvage what we can from the first phase project and then we adhere to proper design rules, using what we learned to build a production-ready solution in Phase Two.

Our project is in Phase One and we’re cutting corners all over the place to get reports done and ready to show our stakeholders.  Today I learned that the user login names stored in one of the source systems, which we will use to uniquely identify system users, allows different users to be setup using the same combinations of letters as long as the upper and lower case don’t match.  I had to ask the business user to repeat that and I had heard it right the first time.  If there were two users named “Bob Smith” that were setup with login user names of “BOBSMITH” and “BobSmith”, that was perfectly acceptable per the rules enforced in the application.  No right-minded application developer on this planet or any other should have let that happen but since their dink-wad software produces this data, we have to use it as it is.  In the Phase Two (production-ready) solution we will generate surrogate keys to define uniqueness but in this version, created with Power BI Desktop, I have to figure out how to make the same user name strings, with different upper and lower-case combinations, participate in relationships and serve as table key identifiers.

SNAGHTML1158f792

Wouldn’t it be nice if I could convert each UserName string to a numeric representation of each character (which would be different for each upper or lower case letter).  I knew that to convert each character one-at-a-time, I would need to bust off each string into a list of characters.  Let’s see…  that’s probably done with a List object but what method and where do I find the answer?

It’s Off To The Web, Batman!

Yes, I Googled it (I actually used Bing) and found several good resources.  Most official docs online weren’t very helpful.  I have a paper copy of Ken Puls book where he mentions List.Splitter, which seemed promising.  I have an e-copy of Chris Webb’s book – somewhere – and I know he eats and breathes this kinda stuff.  Running low on options, I came across Reza Rad’s December, 2017 blog post and found Mecca.  Reza has an extensive post about parsing and manipulating lists. He helped me understand the mechanics of the List.Accumulate function, which is really powerful.  Reza provides several good examples of List manipulation; pulling lists apart and putting them back together.  This post didn’t entirely address my scenario but did give me a foundation to figure the rest out on my own.  The post is here. It was educational and sent me in the right direction.  But, the sample code didn’t resolve my issue entirely.  It did, however get me thinking about the problem a certain way and I figured it out.  HOT DANG!

So Here’s The Deal

The first step was to tear each string down into a List object.  At that point, you have a collection of characters to have your way with.

I created a calculated column and entered something like this:

=Text.ToList( [UserName] )

image

If you were to add this column in the query design and then scroll on over to the new column, you’d see that it shows up as a List object placeholder, just all waiting for you to click the magic link that navigates to the list of all the characters in the column string.

image

We don’t want to do this.

Beep Beep Beep…. Backing up The Bus

Removing the navigation step and looking at the column of List object placeholders…  I want to modify the M code for this step to do the following:

  1. Parse the items in the list (each character in the UserName field)
  2. For each character, convert it to a number
  3. Iterate through the list and concatenate the numbers into a new string of numerals

To enumerate over the elements of a list and put the list members back into some kind of presentable package (like a single string or a number), we can use the Accumulate method.

The Accumulator is a little machine with a crank handle on the side.  Every turn of the handle spits out on of the element values, using the current variable.  You can do whatever you want with the object in the current variable, but if you want to put it back into the machine for next turn, you should combine it with the state variable, which represents the previous value (when the handle was cranked the last time).

Here’s my final desired result:

image

In a nutshell, List.Accumulate contains two internal variables that can be used to iterate over the elements of a list (sort of like an array) and assemble a new value.

The state variable holds the temporary value that you can build on each each iteration, and the current variable represents the value of the current element.  With an example, this will be clear.

The final code takes the output from “Text.ToList” and builds a List object from the characters in the UserName field on that row.

Next, List.Accumulate iterates over each character where my code uses “Character.ToNumber” over the current character to convert it to numeric form.

Adding this custom column…

image

…generates this M code in the query:

= Table.AddColumn(#”Reordered Columns”, “Encoded UserName 1”, each List.Accumulate(
Text.ToList([UserName])
, “”
, (state, current)=>
state
&
Number.ToText(
Character.ToNumber(current), “000”
)
)

Just like magic, now I have a unique numeric column representing the distinct upper and lower-case characters in these strings, that I can reliably be used as a key and join operator.

Bad Data Happens

As I said earlier, in a solution where we can manage the data governance rules, perhaps we could prevent these mixed-case user names from being created.  However, in this project, they did and we needed to use them.

Paul Turley

Paul Turley

Microsoft Data Platform MVP, Principal Consultant for 3Cloud Solutions Specializing in Business Intelligence, SQL Server solutions, Power BI, Analysis Services & Reporting Services.

10 thoughts on “Using Power Query “M” To Encode Text As Numbers

  1. Extremely helpful, Paul. Thank you.

    I needed to do this transformation in to queries, so I moved it into a function like this, which can then be called from both.

    let
    Source = (t as text ) =>
    let
    textAsList = Text.ToList(t),
    converted = List.Accumulate(textAsList, “”, (state, current) => state & Number.ToText(Character.ToNumber(current), “000”))
    in
    converted
    in
    Source

  2. For a better understdaning, the “state” is not the previous value but it’s the summarize of your accumulation function.
    In other words, it’s working as a bucket wich you fill up with your “current” value.
    eg.:
    = List.Accumulate({“1”, “A”, “^”}, “”, (state, current) => state & current & state)
    Step 1 => 1
    Step 2 => 1A1
    Step 3 => 1A1^1A1

  3. Maybe a pipe character delimiter | would be a good addition, to absolutely ensure uniqueness? Otherwise you can’t say hand-on-heart that your encoded usernames are unique.. (Although you could say that they are almost certainly unique)

  4. No. There seems to be some evidence that it is part of the data mashup engine internals but I couldn’t find a function that returns a hash value. The Table.Partition function includes an optional hash parameter and the JoinAlgorithm class has a handful of undocumented “hash”-type properties. Admittedly my solution is functional but incomplete as it generates very long numeral strings. As a work-around, you could also use SQL or R. I’m interested in the community’s input on this topic.

  5. Interesting. From a compression perspective, will this make any text string smaller?
    I take it that there is not a hash function in M?

Leave a Reply

Discover more from Paul Turley's SQL Server BI Blog

Subscribe now to keep reading and get access to the full archive.

Continue reading