Do you need to create an Azure Data Lake (Gen2)? In this post I show you a strategy to keep it secure, granting different access to different groups, users, and Service Principals (SPs) in each layer.
The best-known model is the Bronze, Silver, and Gold layer model. Raw files are uploaded to the Bronze layer; files that have been cleaned and filtered are saved in the Silver layer; and aggregated data for final business use is saved in the Gold layer.
However, you should not create this Azure Data Lake and its layers without a security strategy. You need a governance plan to prevent teams, people, or processes from viewing data they should not have access to, let alone holding write or delete permissions.
Also, in my own experience, it is much easier to apply this strategy from the start than when you already have a Data Lake with millions of files.
Key concepts
To understand how to apply this model, we first need to know some concepts and resources we will use:
- Service Principals (SP): To access resources that are secured by an Azure AD tenant, the entity that requires access must be represented by a security principal. This requirement is true for both users (user principal) and applications (service principal).
- Active Directory groups (AD groups): A collection of Active Directory objects. A group can include users, Service Principals, other groups, and other AD objects.
- Role-Based Access Control (RBAC): A system that provides fine-grained access management of Azure resources.
- Access Control List (ACL): You can associate a security principal (AD user, AD group, SP, …) with an access level for files and directories. Each association is captured as an entry in an access control list (ACL), and every file and directory in your Data Lake has one. When a security principal attempts an operation on a file or directory, an ACL check determines whether that principal (user, group, service principal, or managed identity) has the required permission level to perform it. A short sketch of inspecting an ACL follows this list.
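As an example, this minimal Python sketch reads a directory's ACL, assuming the azure-identity and azure-storage-file-datalake packages; the account and path names are hypothetical:

```python
# Minimal sketch: inspect the ACL on a Data Lake directory.
# "mydatalake" and "raw"/"sales" are hypothetical names.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://mydatalake.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
directory = service.get_file_system_client("raw").get_directory_client("sales")

# Each comma-separated entry associates a security principal with
# rwx-style permissions, e.g. "user::rwx,group::r-x,other::---"
print(directory.get_access_control()["acl"])
```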
Designing data layers
At the beginning, we talked about the Bronze, Silver, and Gold layer model. It is a good model, but we will proceed with the following layers:
- Raw.
- Standardized (Optional).
- Curated.
- Application.
- Sandbox.
Raw
The main goal is to ingest data into Raw as quickly and as efficiently as possible; therefore, neither the data nor its format is changed. This layer handles duplicates and different versions of the same data. Access should be restricted, since the data here has not been anonymized yet.
Standardized
Remember that this layer is optional. Its main goal is to improve the performance of the data transfer from Raw to Curated, so here we could convert files to a more read-efficient format or add partitions. The data itself should not change either, and consequently access should be restricted too.
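As an illustration of that conversion, here is a minimal pandas sketch (it assumes the pyarrow package is installed; the file paths and partition column are hypothetical):

```python
# Minimal sketch: convert a raw CSV into partitioned Parquet, a more
# read-efficient format for the downstream move to Curated.
# Paths and the partition column are hypothetical.
import pandas as pd

df = pd.read_csv("raw/sales/2024-01-01.csv")
df.to_parquet(
    "standardized/sales",        # written as a directory of partitions
    partition_cols=["country"],  # hypothetical partition column
)
```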
Curated
The goal of this layer is to hold clean data with some transformations applied, so it contains consumable data that may be stored in files or tables. Sensitive data should be anonymized here, since access is opened up to some consumers.
Application
The main goal of this layer is to read from the Curated layer and apply any needed business logic to the data, for instance to build a star schema.
Sandbox
This layer is optional too. It is the space where data analysts and data scientists can create their own tables and do research.
There are several ways of accessing the different layers. First, access by account key and shared access signature (SAS) must be disabled. If either of these two methods is enabled, it will bypass the security system we are trying to implement: they grant access to the whole Data Lake and do not allow us to assign permissions to specific paths.
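As a sketch, disabling Shared Key authorization could be done with the azure-mgmt-storage package. Since account and service SAS tokens are signed with the account keys, this blocks them as well; the subscription, resource group, and account names below are placeholders:

```python
# Minimal sketch: turn off Shared Key authorization on the storage account.
# Account/service SAS tokens are signed with the account keys, so this
# disables them too; only Azure AD authorization remains.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountUpdateParameters

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")
client.storage_accounts.update(
    "my-resource-group",  # placeholder resource group
    "mydatalake",         # placeholder storage account
    StorageAccountUpdateParameters(allow_shared_key_access=False),
)
```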
Accessing by RBAC
With those two methods disabled, we are still able to access the Data Lake through role-based access control (RBAC). We could choose this option and grant the roles below to some Active Directory groups, but we would have the same problem: the permissions apply to the whole account, not to specific paths.
This is important, because if a security principal (user, group, service principal) holds one of these roles, it will bypass all the ACL security we are about to set up.
These roles are:
- Storage Blob Data Owner
- Storage Blob Data Contributor
- Storage Blob Data Reader
However, the admin(s) (which could be a service principal) do need the Storage Blob Data Owner role, because it is required to manage the Access Control Lists (ACLs).
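A sketch of that role assignment with the azure-mgmt-authorization package might look like this (the subscription, scope, and the admin's object ID are placeholders):

```python
# Minimal sketch: grant Storage Blob Data Owner to an admin service
# principal at the storage-account scope. IDs and names are placeholders.
import uuid
from azure.identity import DefaultAzureCredential
from azure.mgmt.authorization import AuthorizationManagementClient
from azure.mgmt.authorization.models import RoleAssignmentCreateParameters

subscription_id = "<subscription-id>"
scope = (
    f"/subscriptions/{subscription_id}/resourceGroups/my-resource-group"
    "/providers/Microsoft.Storage/storageAccounts/mydatalake"
)
client = AuthorizationManagementClient(DefaultAzureCredential(), subscription_id)

# Look up the built-in role by name instead of hard-coding its GUID.
role = next(
    client.role_definitions.list(
        scope, filter="roleName eq 'Storage Blob Data Owner'"
    )
)
client.role_assignments.create(
    scope,
    str(uuid.uuid4()),  # each role assignment needs a fresh GUID as its name
    RoleAssignmentCreateParameters(
        role_definition_id=role.id,
        principal_id="<admin-sp-object-id>",  # placeholder object ID
    ),
)
```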
Accessing by ACL
Once access key and shared access signature access are disabled, and the Storage Blob Data Owner role is assigned to the admin(s), we can set the ACLs.
We should create two AD groups for each layer: one with read access and the other with write access. The naming could be something like this:
group-<layer>-<environment>-read
group-<layer>-<environment>-write
For instance, group-raw-production-write should contain all the AD users, Service Principals, and other AD groups (for example, work teams) that need write permissions on that layer. The same goes for group-raw-production-read, but it should only contain the security principals that need read-only access.
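Granting those two groups their permissions on a directory of the Raw layer could look like this minimal sketch (the group object IDs and the directory name are placeholders):

```python
# Minimal sketch: give the read group r-x and the write group rwx on a
# directory of the "raw" layer. Object IDs and names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://mydatalake.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
directory = service.get_file_system_client("raw").get_directory_client("sales")

read_group = "<object-id-of-group-raw-production-read>"
write_group = "<object-id-of-group-raw-production-write>"

# Readers need "x" on directories to traverse them; writers get full rwx.
directory.set_access_control(
    acl=(
        "user::rwx,group::r-x,other::---,"
        f"group:{read_group}:r-x,group:{write_group}:rwx"
    )
)
```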
Next, we need to configure the default ACLs. They are not the same as the access ACLs we have talked about so far: default ACLs are the entries that new subfiles/subfolders of the folder where we configure them will inherit.
However, configuring them does not mean these permissions are applied to the subfiles/subfolders that already existed in that folder. I recommend reading this link to better understand the difference:
[Differences between Access ACLs and Default ACLs]
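Continuing the sketch above, default ACL entries are set with the default: prefix, and a recursive update pushes the same permissions to items that already exist (again, IDs and names are placeholders):

```python
# Minimal sketch: add "default:" entries so newly created children inherit
# the group's permissions, then apply the entries recursively so existing
# children get them too. IDs and names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://mydatalake.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
directory = service.get_file_system_client("raw").get_directory_client("sales")

read_group = "<object-id-of-group-raw-production-read>"
acl = (
    "user::rwx,group::r-x,other::---,"
    f"group:{read_group}:r-x,"
    f"default:group:{read_group}:r-x"  # inherited by newly created children
)
directory.set_access_control(acl=acl)

# Default ACLs do not apply retroactively, so update pre-existing
# subfolders/subfiles explicitly.
directory.update_access_control_recursive(acl=acl)
```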
This security system is organized by layers, but you can also create and configure additional read/write groups for a specific path if you need them.
I hope this governance strategy is clear. I know it is dense, and I have tried not to leave anything out, so do not hesitate to contact me via email or LinkedIn if you have questions or suggestions.