
24 July 2023

How AWS Glue Data Quality Helps You Achieve Compliance For Your Data Lake With Confidence


The recent creation of data lakes by thousands of organizations has also created a new problem: data swamps. Teams often skip key practices such as data quality because of time or cost constraints, and setting up a data quality framework can quickly become a huge project that needs critical features like alerting and scaling.

AWS recently announced the general availability of AWS Glue Data Quality, which aims to deliver high-quality data across your data lakes and existing pipelines. This post explores its capabilities and benefits to show how it changes the way businesses ensure data accuracy and reliability.

What is AWS Glue Data Quality?

AWS Glue Data Quality is a powerful data profiling and data quality solution designed to assist data engineers, data scientists, and analysts in understanding and enhancing data quality across their AWS environments. This new feature complements AWS Glue’s existing functionality and further simplifies the process of validating, cleaning, and transforming data.

Key Features and Capabilities

Use Cases

Rules and alerts setup

Dataset

I decided to create my own dataset using the awesome Python library Faker:

from faker import Faker
import csv

fake = Faker()

# Create CSV with header
with open("person.csv", "w", newline="") as csvfile:
    fieldnames = ["name", "birthdate", "ssn", "sex", "mail"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()

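    # Generate 100 rows. "ssn" is deliberately left out of each row:
    # DictWriter fills missing fields with an empty string (restval="").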
    for _ in range(100):
        p = fake.profile()
        writer.writerow(
            {
                "name": p["name"],
                "birthdate": p["birthdate"],
                "sex": p["sex"],
                "mail": p["mail"],
            }
        )

After generating the CSV file and uploading it to S3, I loaded the data into the Glue Catalog using a regular CSV crawler. For more information on how to use Glue crawlers, I recommend reading the documentation.

[Image: glue-table]

Rules

Now that we have our table, we can create rules through the “Data quality” tab directly in the Glue UI. Glue can also generate a set of recommended rules based on your data, which is extremely useful when a table has a lot of columns. A rule is written in the Data Quality Definition Language (DQDL), which defines what will be tested on your table; more details are available in the DQDL documentation.

Here is what we are going to use for our Person table in a healthcare context:

Rules = [
    (IsComplete "name") and (Uniqueness "name" > 0.95),
    (IsComplete "birthdate") and (ColumnLength "birthdate" = 10),
    ColumnLength "ssn" = 0,
    IsComplete "sex",
    IsComplete "mail",
    CustomSql "select count(*) from primary where date(birthdate) > current_date" = 0
]

Every column except ssn is checked with the IsComplete rule to make sure that it is not NULL. The ssn column, on the other hand, must stay empty, which is what ColumnLength "ssn" = 0 enforces. The CustomSql rule is very powerful and allows us to run any query on our dataset. Here it is used to make sure that birthdate is not a future date.
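Before uploading the file, it can be handy to approximate the same checks locally. The snippet below is only a rough sketch in plain Python, not how Glue evaluates DQDL: the function name check_person_rows, the check names, and the two sample rows are made up for illustration, and Uniqueness is read here as the share of values that appear exactly once, which is an assumption about the rule's definition.

```python
import csv
import io
from collections import Counter
from datetime import date


def check_person_rows(rows, today=None):
    """Approximate the DQDL ruleset locally; returns {check_name: passed}."""
    rows = list(rows)
    today = today or date.today()

    def complete(col):
        # IsComplete: no missing (empty) value in the column
        return all(r.get(col) not in (None, "") for r in rows)

    def all_length(col, n):
        # ColumnLength = n: every value has exactly n characters
        return all(len(r.get(col) or "") == n for r in rows)

    # Uniqueness, assumed to mean the share of values that appear exactly once
    counts = Counter(r.get("name") for r in rows)
    once = sum(1 for c in counts.values() if c == 1)
    unique_share = once / len(rows) if rows else 1.0

    def born_in_future(r):
        try:
            return date.fromisoformat(r["birthdate"]) > today
        except (KeyError, TypeError, ValueError):
            return False  # unparsable dates are not flagged by this check

    return {
        "name": complete("name") and unique_share > 0.95,
        "birthdate": complete("birthdate") and all_length("birthdate", 10),
        "ssn_empty": all_length("ssn", 0),
        "sex": complete("sex"),
        "mail": complete("mail"),
        "no_future_birthdate": not any(born_in_future(r) for r in rows),
    }


# Hypothetical sample: the second row carries an SSN and a future birthdate.
SAMPLE = """name,birthdate,ssn,sex,mail
Alice Martin,1984-03-12,,F,alice@example.com
Bob Stone,2999-01-01,123-45-6789,M,bob@example.com
"""

results = check_person_rows(
    csv.DictReader(io.StringIO(SAMPLE)), today=date(2023, 7, 24)
)
failed = sorted(name for name, ok in results.items() if not ok)
print(failed)
```

To run it against the generated file, pass csv.DictReader(open("person.csv", newline="")) instead of the in-memory sample.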

After running the rule, we can see that two rules failed:

[Image: glue-table]

This is expected, as I deliberately added an SSN and a future date of birth to the dataset to make it fail the data quality checks. After cleaning up the data and rerunning the CSV crawler, everything is green, and we can see the quality score increase to 100% on the data quality chart:

[Image: glue-table]

Alerts

When running the rules, I recommend using Amazon Simple Storage Service (Amazon S3) to store the output of the run. A simple EventBridge rule can then trigger a Lambda function whenever a file is created at the expected S3 location. The Lambda function parses the JSON output and performs any kind of alerting (email, PagerDuty, etc.) that fits your organization's alerting system.
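The Lambda side of this flow could look roughly like the sketch below. Treat it as a starting point under stated assumptions: the bucket and key are read from the standard EventBridge "Object Created" event for S3, but the shape of the result file (a RuleResults list with Name and Result fields, mirroring the GetDataQualityResult API response) is an assumption, so inspect a real output file and adjust. The function names, the fetch and notify parameters, and the sample bucket and key are hypothetical.

```python
import json


def s3_location(event):
    """Extract (bucket, key) from an EventBridge "Object Created" event for S3."""
    detail = event["detail"]
    return detail["bucket"]["name"], detail["object"]["key"]


def failed_rules(result):
    """List the rules that did not pass.

    ASSUMPTION: the result file holds a "RuleResults" list whose items carry
    "Name" and "Result" (PASS / FAIL / ERROR). Verify against a real file.
    """
    return [
        rule.get("Name", "<unnamed>")
        for rule in result.get("RuleResults", [])
        if rule.get("Result") != "PASS"
    ]


def fetch_from_s3(bucket, key):
    """Download the result file; boto3 is imported lazily for local testing."""
    import boto3  # provided by the Lambda Python runtime

    return boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()


def handler(event, context=None, fetch=None, notify=print):
    """Lambda entry point: parse the run output and alert on failing rules."""
    bucket, key = s3_location(event)
    body = (fetch or fetch_from_s3)(bucket, key)
    failures = failed_rules(json.loads(body))
    if failures:
        # Swap notify for an email, PagerDuty, or chat integration.
        notify(f"Data quality failures in s3://{bucket}/{key}: {failures}")
    return failures


# Local dry run with a fake event and a fake result file (no AWS calls).
sample_event = {
    "detail": {
        "bucket": {"name": "my-dq-results"},
        "object": {"key": "runs/result.json"},
    }
}
sample_body = json.dumps(
    {
        "RuleResults": [
            {"Name": "Rule_1", "Result": "PASS"},
            {"Name": "Rule_3", "Result": "FAIL"},
        ]
    }
)
messages = []
failures = handler(
    sample_event, fetch=lambda bucket, key: sample_body, notify=messages.append
)
print(failures, messages)
```

Injecting fetch and notify keeps the parsing logic testable without AWS credentials; in Lambda, the defaults download the object from S3 and print to the function's logs.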

For a more detailed implementation, I suggest reading this series of blogs from the AWS team.

Conclusion

AWS Glue Data Quality introduces a new era of data integrity, offering organizations a comprehensive solution to validate, cleanse, and maintain high-quality data in their AWS environments. By empowering data professionals to make informed decisions based on reliable data, AWS Glue Data Quality becomes an indispensable tool in driving successful data-driven strategies for businesses across diverse industries.

Need help with your existing data lake? Or do you need help using AWS Glue? Ippon can help! Send us a line at contact@ippon.tech.