Part 2 – Analysis Services Multidimensional: “A Duplicate Attribute Key Has Been Found When Processing”
Introduction – Part 2
Part 1 of this post focused on the first category of how the Analysis Services Multidimensional (MD) duplicate attribute key error can arise. It reflects the perspective of an atomic attribute – an attribute having no attribute relationships other than with the dimension key attribute.
This post focuses on the second category of this error, which can arise when an attribute does have attribute relationships besides the (required) one with the dimension key attribute.
As is well known, creating attribute relationships is a best practice in Analysis Services MD for improving query performance. The most common reason attribute relationships are created is to support a natural hierarchy – so your data model has to have one for this to arise. The next most common reason is to support attribute properties, such as a sort order – i.e. when the sort order of one attribute is modeled as based on the value of another attribute. Let it be clear that we are here speaking of attribute relationships within a dimension, not the relationship of a dimension to a measure group in a cube.
To explore the various manifestations of the issue, we will continue with the same software prerequisites mentioned in Part 1, as well as the cube project created there. The SQL artifacts are not required. Remember that since the SQL artifacts we will create are in tempdb, they must be recreated if you reboot your computer or otherwise cycle the SQL Server service. Depending on your configuration, you may also need to adjust tempdb permissions to enable Analysis Services to read from it. Alternatively, you can alter the DDL scripts to create objects in another database.
Case 1 – Attribute Relationships not used in a hierarchy
Execute the following on your SQL Server instance:
create table tempdb..MD_Dupe_Rel_Case1 ( GameSK int not null, GameName varchar(20) not null, NumPlayers int not null, GameSubCategory varchar(20) not null, GameCategory varchar(20) not null ) go insert tempdb..MD_Dupe_Rel_Case1 values (1, 'Central Park', 1, 'Place', 'ElectroMechanical'), (2, 'Sittin'' Pretty', 1, 'Carnival', 'ElectroMechanical'), (3, 'Fireball', 4, 'Mythological', 'ElectroMechanical'), (4, 'Twilight Zone', 4, 'Place', 'SolidState'), (5, 'Fun House', 4, 'Horror', 'SolidState'), (6, 'Speakeasy', 2, 'Miscellaneous', 'SolidState') go
Next, in the Multidimensional project from Part 1, import tempdb..MD_Dupe_Rel_Case1 into the DSV, then create a new dimension “Games1″ that models the attributes in the above dimension table. Define GameSK as the dimension Key attribute, and for each other attribute define only the KeyColumn property. Visual Studio will automatically create for each attribute a relationship between the attribute and the GameSK attribute, since every attribute must relate, directly or indirectly, to the dimension key attribute.
Your work should look like the following – note particularly the relationship pane highlighted, where all attributes relate directly to the dimension key attribute GameSK:
You may have noticed that there exists a hierarchy that we have not yet modeled
in the dimension source data, GameCategory->GameSubCategory->NumPlayers->GameName. Before we do model it, take a look at the SQL queries Analysis Services generates to populate this dimension. This is done by expanding each attribute on the Processing Progress dialogue until the “SQL queries” level is exposed. We will see that for each attribute other than the dimension key attribute GamesSK, the query is just a SELECT DISTINCT of the attribute’s key – for example:
SELECT DISTINCT [dbo_MD_Dupe_Rel_Case1].[GameSubCategory] AS [dbo_MD_Dupe_Rel_Case1GameSubCategory0_0] FROM [dbo].[MD_Dupe_Rel_Case1] AS [dbo_MD_Dupe_Rel_Case1]
For GamesSK, it is a SELECT DISTINCT of all the attributes including GamesSK:
SELECT DISTINCT [dbo_MD_Dupe_Rel_Case1].[GameSK] AS [dbo_MD_Dupe_Rel_Case1GameSK0_0], [dbo_MD_Dupe_Rel_Case1].[GameName] AS [dbo_MD_Dupe_Rel_Case1GameName0_1], [dbo_MD_Dupe_Rel_Case1].[NumPlayers] AS [dbo_MD_Dupe_Rel_Case1NumPlayers0_2], [dbo_MD_Dupe_Rel_Case1].[GameCategory] AS [dbo_MD_Dupe_Rel_Case1GameCategory0_3], [dbo_MD_Dupe_Rel_Case1].[GameSubCategory] AS [dbo_MD_Dupe_Rel_Case1GameSubCategory0_4] FROM [dbo].[MD_Dupe_Rel_Case1] AS [dbo_MD_Dupe_Rel_Case1]
Now let’s model the hierarchy (leaving it named the default, “Hierarchy”, in the following) without creating attribute relationships between the levels, and re-process. We will then have the following. Select the “Hierarchy” hierarchy on the Browser tab and verify that it navigates as expected:
Some books and documentation refer to such a hierarchy – one with no relationships defined – as unnatural.
This exercise shows that explicit attribute relationships are not required between the levels of a hierarchy in order to model and browse it – they are only a best practice for browsing performance. If you examine the SQL generated now, you will find it is the same in each case as before. Now we can see how Analysis Services understands the hierarchical structure – it is because of the SELECT DISTINCT of all attributes that it does for the dimension key attribute GamesSK. Note: because it does a DISTINCT on all attributes involved in the hierarchy, including the dimension key, there is no possibility of a duplicate key error, no matter what the data.
Case 2 – Attribute Relationships are used along with a hierarchy, but Attribute keys are not changed
Let’s see what happens when we do model explicit attribute relationships to support the hierarchy, as is recommended. In the Attribute Relationships pane, edit the relationships such that you now have:
Some Analysis Services books and documentation refer to such a hierarchy – one with supporting relationships defined – as natural.
Fully process the dimension. What happens ? We now get the duplicate attribute key error, specifically the following:
Errors in the OLAP storage engine: A duplicate attribute key has been found when processing: Table: ‘dbo_MD_Dupe_Rel_Case1′, Column: ‘GameSubCategory’, Value: ‘Place’. The attribute is ‘Game Sub Category’.
The error message provides clues but lacks clarity about the true cause of the problem. What is the problem anyway, and why doesn’t it occur, with the same data and functional hierarchy, when intra-level relationships don’t exist ?
For insight, once again we’ll look at the SQL queries generated by the latest configuration. You will find that (with default processing settings), processing stops upon the first error, and that happened on the Game Sub Category attribute, so let’s look at its query:
SELECT DISTINCT [dbo_MD_Dupe_Rel_Case1].[GameSubCategory] AS [dbo_MD_Dupe_Rel_Case1GameSubCategory0_0], [dbo_MD_Dupe_Rel_Case1].[GameCategory] AS [dbo_MD_Dupe_Rel_Case1GameCategory0_1] FROM [dbo].[MD_Dupe_Rel_Case1] AS [dbo_MD_Dupe_Rel_Case1]
If we run this manually in tempdb we get (after reformatting in Excel):
You can see that with attribute relationships, the query for a given non-dimension-key attribute is now not itself alone, but also includes its parent attribute as dictated by the attribute relationship – in this case Game Category. We see that the “Place” value of Game Sub Category shows up twice now – once under each value of Game Category. Reasonably or not, Analysis Services calls this a duplicate key error, even though the cause is quite different from those explained in Part 1.
This makes it clear that defining attribute relationships imposes a restriction on the data which does not exist otherwise – namely that a given child member key value – “Place” here – can only have one parent member key value. Here “Place” has two parents. In effect this means that to gain the query performance benefits of attribute relationships, the data model must obey stricter constraints than without them. This is non-obvious and certainly not explained clearly by the resulting error, official documentation, or any book I have seen. Relationships are typically explained as if they are purely metadata embellishments that Analysis Services can leverage for performance, without mentioning that using them imposes constraints on your data as well.
Case 3 – Attribute Relationships are used in a hierarchy, and Attribute keys are changed
Fixing the problem requires first determining that you don’t have a data quality problem – if you do, fixing the data should fix the duplicate key issue. If your data is correct, then the fix is to restructure the key of the child attribute that gets the duplicate key error in such a way that the SELECT DISTINCT for its key and the key of its parent no longer finds the same child key under more than one parent – or more generally, ancestor – key.
The usual way to do this is to change the key structure of the child attribute to include both itself and the keys of each ancestor level within the hierarchy, as described by the attribute relationships. With Analysis Services MD there are at least three options for accomplishing this:
- Analysis Services MD allows modeling an attribute’s key using more than one column from the dimension table. In the present case, the key for Game Sub Category could be defined as using the columns GameSubCategory and GameCategory, instead of just GameSubCategory.
- Add a calculated column to the dimension table in the DSV exposing an appropriate key through an expression – perhaps a concatenation of other columns. In this approach the attribute key would then be based on this single column.
- If the dimension table in the DSV is actually a view (as it should be per another best practice), the equivalent of #2 could be done in the view.
The first two options have the advantage of not requiring any change in the data source, which may not be feasible or allowed. Between these two, #1 is to be preferred because it makes explicit what is going on when you view the attribute properties, rather than burying the detail in the DSV. Option #3 also hides “what’s really going on”, but might still be the most elegant approach overall. It pushes Analysis Services details into the database (not a good thing in general), but if a view layer specifically for Analysis Services has been created (as it should have been), then it is reasonable to encapsulate such details there. In the end it is a judgement call.
In most cases, using any of these methods will also require explicitly defining some appropriate column in the source table as the source for the NameColumn property, since the new key value will in most cases not be suitable for display as the member value as it was before (with an important exception noted later).
Let’s pursue option #1. Open the properties pane of the Game Sub Category attribute and scroll down to the KeyColumns property highlighted below:
Notice how both the Key and Name columns are GameSubCategory. We will leave the NameColumn as is. Click the ellipsis to the right on the KeyColumn property to open the dialogue where you select the column(s) to use. Use the horizontal and vertical arrows to achieve the following, and click OK:
Now, fully reprocess the dimension. What happens ? We seem to have fixed one problem – the Game Sub Category attribute now processes – only to expose another – the Num Players attribute now gets the error:
Errors in the OLAP storage engine: A duplicate attribute key has been found when processing: Table: ‘dbo_MD_Dupe_Rel_Case1′, Column: ‘NumPlayers’, Value: ‘1’. The attribute is ‘Num Players’.
Before we investigate the new error, let’s check the SQL that is now generated for the Game Sub Category attribute:
SELECT DISTINCT [dbo_MD_Dupe_Rel_Case1].[GameCategory] AS [dbo_MD_Dupe_Rel_Case1GameCategory0_0], [dbo_MD_Dupe_Rel_Case1].[GameSubCategory] AS [dbo_MD_Dupe_Rel_Case1GameSubCategory0_1] FROM [dbo].[MD_Dupe_Rel_Case1] AS [dbo_MD_Dupe_Rel_Case1]
This is the same query as before we changed the key (except that for some reason Analysis Services reverses the column order). Changing the key tells Analysis Services how to interpret the result, such that it no longer considers there to be a duplicate key error.
Now let’s check the SQL behind the new error on the query for the Num Players attribute:
SELECT DISTINCT [dbo_MD_Dupe_Rel_Case1].[NumPlayers] AS [dbo_MD_Dupe_Rel_Case1NumPlayers0_0], [dbo_MD_Dupe_Rel_Case1].[GameCategory] AS [dbo_MD_Dupe_Rel_Case1GameCategory0_1], [dbo_MD_Dupe_Rel_Case1].[GameSubCategory] AS [dbo_MD_Dupe_Rel_Case1GameSubCategory0_2] FROM [dbo].[MD_Dupe_Rel_Case1] AS [dbo_MD_Dupe_Rel_Case1]
Note that now three attributes are involved. They represent the complete ancestry of Num Players as defined by the relationships (as did the earlier queries for Game Sub Category). After rearranging the column order to better show the order of the levels, adding an ORDER BY clause to sort the data from the bottom up, and running it, here is the resulting data:
See the problems ? Num Player values 1 and 4 each have multiple Game Sub Category parents. The fix now is to change the KeyColumn property of Num Players to use its own column plus those of its parent and grandparent:
Once this is done the dimension processes successfully. Notice that the selected columns need not all be the same data type – Num Players is an integer.
You might be wondering, why did we not have to continue this key-revision in the next level down – Game Name ? Usually you would need to do so for successful processing. In the present case we did not need to because Game Name is the leaf-level of the hierarchy, and the game names happen to be unique across all ancestry in our data. Another way of putting this is that the leaf level of the hierarchy – Game Name – has the same grain as the dimension key GameSK. In the real world this is often not the case – for example, over time, pinball machines with the same name but different ancestry are not unusual. For reliability, the key of Game Name should be changed to include its ancestors also, even if not strictly necessary with our current data.
The Special Case of a Calendar Dimension
I mentioned earlier that usually when you construct a compound key to avoid the duplicate key scenario, you’ll also want to define the NameColumn property to be a different column than the one or more used for the KeyColumns property (rather than using the same column as it initially defaults to). If you defined a compound key using the KeyColumns property given earlier as option #1, you would have to pick some single column as the NameColumn property. If you used option #2 or #3 you would be exposing a new column for KeyColumns, and could potentially also use it for the NameColumn property, but usually you would not want to. For example, with our Game Sub Category case, the key value would become something like (for one case) “Place-Electromechanical”, while you would want just “Place” to be displayed. In all these cases you could use the GameSubCategory column for the NameColumn.
Calendar dimensions are usually an exception to this. The classic starting point involves having Year, Month and Quarter attributes, with Month members “January”,…,”December” and Quarter members “Qtr1″,…,”Qtr4″. If the Month->Quarter->Year hierarchy is created and not backed by attribute relationships (call this Choice A), it will work functionally but may not provide optimal query performance. As soon as you define the related relationships (Choice B), you will get duplicate key errors on the Month and Quarter attributes, for reasons that should now be clear.
You could fix this using option #1 described earlier by adding the Year column to the Quarter column for the Quarter key, and adding both Year and Quarter to the Month attribute’s key. However, in this particular case it would be more common to use option #2 or #3, defining a new column for each of the Month and Quarter attributes. The month column would be populated like “January – YYYY”, where YYYY is the corresponding year (available from another column), and the quarter column similarly, e.g. “Qtr1 – YYYY”. In these cases the new columns can be used for both the KeyColumns and NameColumn properties, since these formats are usually not only fine for display but preferred. Why preferred ? Because this perfectly disambiguates the values when they appear in a report or pivot table. If you use Choice A this will not be the case.
Detect All Duplicate Keys At Once
I deliberately contrived the examples such that after fixing the first case, another one is discovered. This assumes the default error configuration for processing, where it stops upon the first error. It is possible to change this configuration so that it does not stop upon error, but continues through all attributes – this way you can discover all cases at once. See Part 1 for how to do this.
Duplicate Attribute Key Error and Analysis Services Tabular
The short answer is that this situation cannot occur in Tabular, where the possibility is not even a concept. Tabular has no notion of attributes (though the columns of a table appear similar – more similar than they really are), and still less a notion of attribute relationships – it has table relationships. In effect, a hierarchy in Tabular is analogous to an unnatural hierarchy in MD, as defined earlier.
As mentioned, Tabular columns really are not the same as attributes in MD, much as it appears they are in a pivot table. Tabular does not allow multi-column relationships between tables. It also does not support the idea of one column being the “key” of something and a different column being the display value of that something. These facts have an important bearing on the sort of data model you can have in Tabular.
Proactively Detecting Duplicate Key Error Cases
When first developing a cube or making significant changes, it might be desirable to be able to analyze the source data for any desired dimension hierarchies where relationships will be used, to see where one of the above techniques will be required, rather than waiting for Analysis Services to tell us via the duplicate key error. SQL queries to do this could easily be created in a data-driven fashion if a small amount of metadata were created to describe the desired hierarchies. As an example, here is a query that discloses the cases covered in this post:
with GameSubCategoryCheck as ( select distinct 'GameSubCategory' as Attribute, a.[GameSubCategory] + '-' + a.[GameCategory] as Example from [dbo].[MD_Dupe_Rel_Case1] a inner join (select [GameSubCategory] from (select [GameSubCategory], [GameCategory] from [dbo].[MD_Dupe_Rel_Case1] ) x group by [GameSubCategory] having count(*) > 1 ) b on a.[GameSubCategory] = b.[GameSubCategory] ), NumPlayersCheck as ( select distinct 'NumPlayers' as Attribute, cast(a.[NumPlayers] as varchar) + '-' + a.[GameSubCategory] + '-' + a.[GameCategory] Example from [dbo].[MD_Dupe_Rel_Case1] a inner join (select [NumPlayers] from (select [NumPlayers], [GameSubCategory] from [dbo].[MD_Dupe_Rel_Case1] ) x group by [NumPlayers] having count(*) > 1 ) b on a.[NumPlayers] = b.[NumPlayers] )
select * from GameSubCategoryCheck union all select * from NumPlayersCheck
Here is the result:
Here are the key points of this post:
- Hierarchies may be defined and browsed without defining attribute relationships between the levels. This is not recommended, but it works, and is more forgiving of the data than if relationships are created.
- Besides the causes described in Part 1, the duplicate attribute key error may also be caused by attribute relationships. Typically such relationships are created between the levels of a hierarchy, but the error, if it occurs, is driven by the relationship, not the existence of a hierarchy. In other words, if relationships which surface the error are created but there is no corresponding hierarchy, the error(s) will still occur.
- When the error occurs, assuming it is not due to a data integrity problem, the fix is to change the key structure of the affected attribute, using one of the several approaches described in this post.
- Analysis Services Tabular will not raise the duplicate key error, nor indeed any of the errors MD can raise about the data. This is not because it is more forgiving, but because its modus operandi is completely different.
Here are the key points of Part 1:
The duplicate attribute key error will occur for an attribute meeting the all of the following conditions:
- its KeyColumns property is a column in the source which is nullable.
- this KeyColumns column becomes a WChar data type
- the NullProcessing property is set to “Automatic” (the default)
- the NameColumn property is either not defined (shows in BIDS/SSDT as “(none)”), or is explicitly defined as the same column as the KeyColumn property. In both cases the Name values come from the Key column.
- a system-generated Unknown member is not defined for the dimension.
- the Key/Name column contains both nulls and blank/empty strings.
- With so many requirements for the error to occur it might seem like it would be rare, but in practice it is very common.