
Redshift Data Masking in AWS

Amazon Web Services Redshift, or simply AWS Redshift, is a fast, fully managed, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all your data using your existing business intelligence tools.

Redshift turns out to be a great choice if your database is overloaded by OLAP (Online Analytical Processing) workloads.

Amazon Redshift is designed for OLAP, which allows you to easily combine multiple complex queries to provide answers.

Relational, or SQL, databases are typically row-based; Redshift, however, is a column-based database.

Example

Let's say you have an Excel spreadsheet containing a table of countries, products and sales, with several rows. A row-based database would store, in row 1, the first country in the table together with its product and number of sales. Row 2 would hold the second country with its product and sales, and so on.

In a column-based database this is done differently: all the countries are stored together, all the products are stored together, and all the sales figures are stored together.

This is the key difference between row store and column store databases.
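The difference can be sketched in a few lines of Python. This is only an illustration of the two layouts, not how Redshift lays out data internally; the sample table and the helper names `to_row_store` and `to_column_store` are made up for this example.

```python
# Hypothetical sample table: (country, product, sales)
table = [
    ("France", "Laptop", 120),
    ("Spain", "Phone", 95),
    ("Italy", "Tablet", 60),
]


def to_row_store(records):
    """Flatten record by record: the fields of one row sit next to each other."""
    return [value for record in records for value in record]


def to_column_store(records):
    """Flatten column by column: all values of one column sit next to each other."""
    return [value for column in zip(*records) for value in column]


row_layout = to_row_store(table)
column_layout = to_column_store(table)

print(row_layout)
print(column_layout)
```

In the row layout, a query that only needs the sales figures still has to step over every country and product; in the column layout, the three sales values are adjacent.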

In this case, the data for each column is stored sequentially on disk, so fewer reads are needed to fetch all the values of a column. It also allows the data to be compressed more efficiently.

Columnar data can be compressed much more easily because all the values in a column share the same data type. All the countries are stored in one sequential block, so it is much easier to compress them.

In a row-based store, neighboring values have different data types, so compression is much harder.
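To make this concrete, here is a minimal sketch that uses run-length encoding as the compression scheme. Run-length encoding is just one simple technique picked for illustration, and the sample table and the `run_length_encode` helper are assumptions for this example, not Redshift internals.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Hypothetical sample table: (country, product, sales)
table = [
    ("France", "Laptop", 120),
    ("France", "Laptop", 80),
    ("France", "Phone", 95),
    ("Spain", "Phone", 95),
    ("Spain", "Phone", 40),
    ("Spain", "Tablet", 60),
]

# Column store: the country values are adjacent to each other.
country_column = [record[0] for record in table]
# Row store: country, product and sales values are interleaved.
row_stream = [value for record in table for value in record]

encoded_column = run_length_encode(country_column)
encoded_rows = run_length_encode(row_stream)

print(encoded_column)
print(len(encoded_rows))
```

The six country values collapse into two pairs, while the interleaved row stream contains no consecutive repeats, so run-length encoding cannot shrink it at all.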

The compression scheme for Redshift is automatically chosen for you, so you don't need to worry about that.

An AWS Redshift data warehouse is a collection of computing resources called nodes, and these nodes are organized into a group called a cluster.

Each cluster runs an Amazon Redshift engine that contains one or more databases.

The AWS engine is also able to do data masking. It can recognize certain types of data (Social Security numbers, credit card numbers, dates of birth, phone numbers, etc.) and automatically mask them by putting them in a format that is still valuable from an analytics standpoint. Because the data is masked, the sensitive values themselves are not stored in Redshift.
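The general idea of masking data while keeping a useful format can be sketched as follows. This is a standalone illustration in Python, not Redshift's own masking mechanism; the patterns, the choice to keep the last four digits, and the `mask_sensitive` helper are all assumptions made for this example.

```python
import re

# Assumed formats: US-style SSN (123-45-6789) and 16-digit card numbers
# written in groups of four separated by dashes or spaces.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-(\d{4})\b")
CARD_PATTERN = re.compile(r"\b(?:\d{4}[- ]){3}(\d{4})\b")


def mask_sensitive(text):
    """Replace all but the last four digits of SSNs and card numbers."""
    text = SSN_PATTERN.sub(r"XXX-XX-\1", text)
    text = CARD_PATTERN.sub(r"XXXX-XXXX-XXXX-\1", text)
    return text


masked = mask_sensitive("SSN 123-45-6789, card 4111-1111-1111-1111")
print(masked)
```

Keeping the last four digits is one common compromise: the masked value can still be grouped or matched loosely in reports, but the full number is no longer present.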

When you launch AWS Redshift for the first time, you can start with a single node and, as you grow, add more nodes to take advantage of massively parallel processing (MPP).

You can then operate a multi-node cluster, which means that you have a leader node that manages all the client connections, and compute nodes that store data and perform queries and computations.
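The division of labor between the leader and the compute nodes can be sketched with a toy example. It is a simplified model under stated assumptions, not Redshift's actual execution engine: threads stand in for compute nodes, rows are spread round-robin, and the function names and sample data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def distribute(rows, node_count):
    """Leader step: spread rows across compute nodes round-robin."""
    slices = [[] for _ in range(node_count)]
    for index, row in enumerate(rows):
        slices[index % node_count].append(row)
    return slices


def compute_partial_sum(node_rows):
    """Compute-node step: aggregate only the rows stored on this node."""
    return sum(sales for _country, sales in node_rows)


def leader_total_sales(rows, node_count=3):
    """Leader step: fan the work out, then combine the partial results."""
    slices = distribute(rows, node_count)
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(compute_partial_sum, slices))
    return sum(partials)


# Hypothetical sample data: (country, sales)
sales_rows = [
    ("France", 120),
    ("Spain", 95),
    ("Italy", 60),
    ("France", 80),
    ("Spain", 40),
]

total = leader_total_sales(sales_rows)
print(total)
```

Each node only touches its own slice of the data, which is why adding nodes lets the same query run over more data in parallel.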

Limitations

Business intelligence is generally not viewed by management as business critical: after an outage it is something you would want brought back up quickly, but other applications would need to be brought up first.

However, you can restore snapshots of your Redshift databases to other Availability Zones.