Are you a data engineer or seeking to become one? This blog post is for you. This is the first entry of a series of articles about skills you’ll need in your everyday life as a data engineer. So let’s begin with the first and, in my opinion, the most useful tool in your technical tool belt, SQL.
Listed as one of the top technologies on data engineer job listings in January 2020, Structured Query Language (SQL) is a standardized programming language used to manage relational databases (not exclusively) and perform various operations on the data in them. In our case, we would use it to communicate with the data storage system (which could be a large number of systems like data lakes, databases, or data warehouses) for, mainly, storing and querying data.
Data cleansing and enrichment processes need to combine, filter, aggregate, and select different sets to answer questions we have. That translates into knowing how to perform different types of JOINs and WHERE clauses to simple SELECT statements. First and foremost, we need to get up to speed with JOINs.
Let’s imagine for a minute that we have customers and orders tables, which are related by the customers_id field:
You want all orders but also want to know the name of the customer who ordered.
SELECT a.order_id , a.customer_id , b.customer_name FROM example.orders AS a INNER JOIN example.customers AS b ON a.customer_id = b.customer_id;
LEFT “OUTER” JOIN
In this case, you want all orders, even if they have a customer associated or not (let’s not dig into data integrity issues here for the sake of simplicity of this example).
SELECT o.order_id , o.customer_id , c.customer_name FROM example.orders AS o LEFT [OUTER] JOIN example.customers AS c ON o.customer_id = c.customer_id;
Some database engines do not require you to specify OUTER.
RIGHT “OUTER” JOIN
All customers with or without any orders.
SELECT o.order_id , o.customer_id , c.customer_name FROM example.orders AS o RIGHT [OUTER] JOIN example.customers AS c ON o.customer_id = c.customer_id;
Order is important for LEFT and RIGHT join. This query will bring us all example.customers rows and whatever matches with example.orders.
LEFT ANTI SEMI JOIN and LEFT OUTER JOIN (with exclusion)
Let’s say you want all customers who haven’t ordered anything yet.
SELECT c.customer_id , c.customer_name FROM example.customers AS c WHERE NOT EXISTS (SELECT o.customer_id FROM example.orders AS o WHERE c.customer_id=o.customer_id)
SELECT c.customer_id , c.customer_name FROM example.customers AS c LEFT [OUTER] JOIN example.orders AS o ON c.customer_id = o.customer_id WHERE o.customer_id IS NULL;
Always do performance tests on your queries and subqueries. Join performance depends on table structure, indexes, and generated query plans.
Cross join returns the Cartesian product of both tables, meaning it will bring all possible combinations between all the elements of both sets.
SELECT c.customer_id , c.customer_name , o.order_id FROM example.orders AS o CROSS JOIN example.customers AS c
There are other ways to combine datasets, but both query parts need to have the same structure:
There are a lot of interesting functions to transform and add properties to your dataset and take advantage of our DBMS capabilities.
Window functions are very useful if you want to run a calculation on a set of rows that are related in some way (ie. you want a running total of the order total price for a customer using order_date as the reference column for time).
SELECT a.order_id , a.customer_id , SUM(a.order_total) OVER (PARTITION BY a.customer_id ORDER BY a.order_date) AS running_order_total_price FROM example.orders AS a;
Another example could be if you want a separate column that indicates the date of the customer’s previous order.
SELECT a.order_id , a.customer_id , a.order_date , LAG(a.order_date, 1) OVER (PARTITION BY a.customer_id ORDER BY a.order_date) AS previous_order_date FROM example.orders AS a;
With SQL, you can also work with complex data types like arrays and JSON objects. You’ll want to take into account that syntaxis and implementations of these functions may vary between different database engines or big data environments, such as SQL Server, Oracle, PostgreSQL, MySQL, Redshift, Hive, BigQuery, and so on.
CTE (Common Table Expression)
One great feature you can find in many SQL systems is CTEs (Common Table Expressions). This is very useful if you want to simplify complex subqueries and joins like recursive relationships grouped on the same dataset (think hierarchical data stored on a table related to itself).
Basically, CTE is a temporary result set that you can reference with another SELECT, INSERT, UPDATE, or DELETE statement. You can significantly increase readability and simplification by deconstructing queries into simple blocks and reusing them. Let me show you what I am talking about.
I want to extract some columns from the customers table from customers who had orders in 2021, with discounts equal to or greater than $100, and who have a registered address in the U.S.
Query in ordinary form:
SELECT DISTINCT c.customer_name , c.customer_id , c.customer_address , c.customer_phone FROM example.orders AS o INNER JOIN example.customer AS c ON o.customer_id = c.customer_id WHERE o.order_date BETWEEN DATE '2021-01-01' AND DATE '2021-12-31' AND (c.customer_address LIKE '%US%' OR c.customer_address LIKE '%USA%' OR c.customer_address LIKE '%United States%') AND o.order_discount >= 100
WITH customers_in_USA AS ( SELECT c.customer_name , c.customer_id , c.customer_address , c.customer_phone FROM example.customer AS c WHERE c.customer_address LIKE '%US%' OR c.customer_address LIKE '%USA%' OR c.customer_address LIKE '%United States%' ), orders_from_2021 AS ( SELECT DISTINCT customer_id FROM example.orders AS o WHERE o.order_date BETWEEN DATE '2021-01-01' AND DATE '2021-12-31' AND o.order_discount >= 100 ) SELECT a.customer_name , a.customer_id , a.customer_address , a.customer_phone FROM customers_in_USA AS a INNER JOIN orders_from_2021 AS b ON a.customer_id = b.customer_id;
In case you want to create a recursive query (a query that references itself), CTEs can be a real ally in this battle. Let’s say we have a reference program for our customers that is actually stored on our customer’s table, but we want to know who referenced who…well, let’s take a look:
WITH referred_customers ( SELECT customer_id , customer_name , referred_by_id , CAST(NULL AS STRING) AS referred_by_name FROM example.customers WHERE referred_by_id IS NULL UNION ALL SELECT c.customer_id , c.customer_name , c.referred_by_id , r.customer_name AS referred_by_name FROM example.customers AS c INNER JOIN referred_customers AS r ON c.referred_by_id = r.customer_id ) SELECT customer_id , customer_name , referred_by_id , referred_by_name FROM referred_customers;
There are plenty of things that we can call Advanced SQL, but here is a brief list to consider:
After you have mastered a lot of SQL concepts and functions, you might like to write efficient queries to have great response times and make yourself proud while deploying your code on production environments.
This may not always be easy, but with time and patience, you can become proficient. Here are some recommendations to help you get started:
- Only use columns you need and avoid using SELECT *
- Avoid using operations and functions on joins and where clauses
- Avoid using DISTINCT queries
- More guidelines
Always remember, your query performance will depend on table structures, indexes, data types, and query plans (all SQL systems have query optimizers, but behavior may vary from vendor to vendor).
I hope this post gives you an idea about where to start and the key elements behind writing queries on SQL, as this will become your bread and butter as a data engineer. My intention is to give you a glance of what you would need and tear down the first technical barrier to your learning experience. Thanks for reading and see you on the next one.