I was fortunate to attend a training about Cortana Intelligence Suite and SQL Data warehouse at Microsoft Paris. The training covered a series of modules on the field of Data Science + Azure SQL Data Warehousing.
As a BI specialist, I became interested in the SQL Data Warehousing part and plunged into the fantastic world of Azure Data Warehouse (ADW) in the last days.
In this article, I would like to talk about two concepts in Azure SQL Data warehouse : MPP & distribution. These concepts define how your data is distributed and processed in parallel:
1 - Massively Parallel Processing (MPP)
Let's start with the general architecture of Azure SQL Data warehouse.
Conceptually, you have a control node on which all applications and connections interact, each interacts with a multitude of compute nodes.
The control node retrieves the input request and then analyzes it before sending it to the compute nodes. The compute nodes execute the query on their databases and return the results to the control node that gathers these results.
The data is stored in Azure Blob storage and is not attached to the compute nodes. That's why you can do the scale out, scale in or even suspend you ADW quickly without losing any data.
ADW splits data between 60 databases. All the time, regardless of what you do. It is a constant.
Knowing that you can change the number of compute nodes indirectly by requesting more Data Warehouse Unit (DWU) on your instance of ADW.
Databases ADWmutually share in the compute nodes. It is quite easy, now that you know there are 60, to deduce the number of compute nodes from the dedicated data storage unit (DWU) using the following table:
|DWU||# Compute Nodes||# DB per node|
2 - Distribution
Note that the data load in ADW is stored in 60 databases. What data is stored in which database?
Normally, with a simple SELECT query on a table and distributed data uniformly, you should not worry about it, right? The query will be sent to the compute nodes, they will query on each database, and the result will be merged together by the control node.
However, once you start joining data from multiple tables,ADWwill have to make Data Movementin other words, it will swing data around one database to another to join the data. It can not be avoided in general, but you should try to minimize it for better performance.
The location of the data is controlled by the distribution attribute of your tables. By default, tables are distributed in round robin: the data goes first to the 1 database then 2, then to 3 ...
You can control where your data is going by using the hash distribution method. With this method, you can specify when creating your table, that you want to use the algorithm of hash and the column to use. This ensures that rows of data with the same hash column value will be in the same table. However, it does not guarantee that a column value of two hashes will end up in the same database.
So, let's look at a simple example of a table distributed in round-robin:
CREATE TABLE [dbo] .DimProduct (ProductID INT NOT NULL, ProductName VARCHAR (50) NOT NULL, Price DECIMAL (5,2) NOT NULL, CategoryID INT NOT NULL) WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = ROUND_ROBIN)
And now with a hash algorithm:
CREATE TABLE [dbo] .DimProduct (ProductID INT NOT NULL, ProductName VARCHAR (50) NOT NULL, Price DECIMAL (5,2) NOT NULL, CategoryID INT NOT NULL) WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = HASH (CategoryID))
In this part, I specified that the hash taken from the CategoryID column. Thus, all the products of the same category will be stored in the same database.
So what did I gain by ensuring that products of the same categories are stored in the same DB ...?
If we want to get the sum of the number of products per category, we can now do it without data movement because we are sure that the rows for a given category will all be in the same database.
In addition, if I want to attach data from another table to Category ID, this join may occur "locally" if the other table also has a hash distribution on the category ID. You need to think about the type of queries you are going to have and also make sure that the data will be evenly distributed.
It is recommended to use distribution control on columns thatui are not updated (the hash column can not be updated) and rdistribute data evenly, avoiding data bias in order toto imitate the movement of data.
Roughly speaking, I tried to explain the different ways your data is distributed aroundAzure Data Warehouse (ADW).
I hope these explanations will be useful to you on your projects and This gives you a clear picture of how compute nodes access your data and how you can control their distribution.