Here, we will mostly talk about the google cloud platform components along with the on premise data engineering tools. A data engineer has to be well equipped with SQL and programming skills. In this blog, we will mostly discuss the important topics from industry data engineering prospective. Let's begin with the SQL concepts.
SQL Concepts
SQL concepts are very much important in data engineering as you have to interact with SQL databases, data warehouses like HIVE, BigQuery etc. in your day to day activity. SQL concepts are also required in spark coding, data modelling and creating data pipelines. in this article, we will discuss some important SQL concepts and useful SQL queries which will be useful in your day to day work and also in your interviews.
some basic concepts: one must have some basic understanding about databases, tables, rows, columns, schema, Acid properties, primary key and foreign key to develop an idea about the fundamentals of SQL. However, in the industry, you are required to work on advance tech stacks using the SQL fundamentals.
important concepts:
Normalization & Denormalization :
Normalization is a technique which reduces data redundancy and data inconsistency from tables and it reduces Insertion, Update and Deletion Anomalies. We can not put all the data of an organization into one table. If we do so, it will create data redundancy and data inconsistency, memory and performance overhead. In short, we should decompose the organization's data into multiple tables where each table represents a single idea.
First Normal Form (1NF) - A relation should not contain composite or multi valued attribute to be in first normal form. 1NF is trivial. Usually the data in the organization are already in 1NF.
Second Normal Form (2NF) - To be in second normal form, a relation must be in first normal form and it must not contain any partial dependency. In other words, every non primary key attribute should be fully functionally dependent on the primary key. First normal form usually increases data redundancy. 2NF removes the data redundancy and removes the partial dependencies.
Third Normal form (3NF) - A table said to be in 3NF if it does not have partial dependency (i.e. it should be in second normal form) and does not not have any transitive dependency. A transitive dependency is a situation where a non prime attribute uniquely finds another non prime attribute. lets say we have a table having three attributes A, B and C where A being a prime attribute uniquely identifies B and B being a non prime attribute uniquely identifies C, then there's a transitive occurs between B and C.
pros of normalization - normalization eliminates the duplicate data. Overall size of the database is reduced. performance is better. Normalized tables are fine tuned and are easier to maintain.
cons of normalization - Tables need to be joined to fetch all the relevant information about a certain entity. The joins are always a tedious task.
denormalization:
database denormalization is the combination of normalized tables into one. This is done to speed up the operation on the table.
pros of denormalization- read and write queries will be faster. aggregate opertaions will be easier and number of tables will be smaller.
cons of denormalization - size of the query processing increases. data may be incosistent. cost of updating and inserting id=s higher.
CASE Statements:
when you need to implement conditions in SQL, you need SQL CASE statements. it is like the if-else statements in programming. once a condition is true it returns the result and if no conditions met, it returns value from else.
suppose, you need to create a field called quantity_value_indicator and the field value will be min if quantity lesser than 100, mid if quantity is between 100 - 10000 and max for the rest.
SELECT OrderID, ProductID, Quantity,
CASE
WHEN Quantity < 100 THEN 'Min'
WHEN Quantity > 100 and Quantity < 10000 THEN 'Mid'
ELSE 'Max'
END AS quantity_value_indicator FROM Orders;
SQL Aggregate functions: Whenever we need to find a scalar value from a set of values in table, we have to use aggregate functions. some aggregate functions examples are below. we can use group by clause to aggregate the set of values.
AVG() returns the average of the specified values.
SUM() calculates the sum of all values in the set.
MAX() and MIN() return the maximum and minimum value, respectively.
COUNT() returns the total number of values in the set.
SELECT order_id, item, sum(amount) as amount_sum FROM Orders group by item;
SQL Window Functions - Window functions operate on a set of rows called a window frame. Window functions return a single value for each set from the query. windows functions are very important while implementing transformation in big data pipelines. window functions can be defined using over() clause.
Suppose, we have a customer table and we want to find out city wise average age and the results to be printed in the first name ascending.
SELECT customer_id, first_name, AVG(age) OVER (PARTITION BY country order by first_name) AS avg_age FROM Customers
JOINS : SQL Join statement is used to join rows from two or more tables based on a common field between them. There are different types of joins:
INNER JOIN, LEFT JOIN, RIGHT JOIN, FULL JOIN
lets say we have two tables. employees and dept. we will do all the 4 joins and see the results.
Inner join - returns matching records from both the tables
Left Join or Left Outer Join - Returns all records from the left table and matching records from the right table.
Right Join or Right Outer Join - Returns all records from the right table, and the matching records from the left table
FULL (OUTER) JOIN - full outer join is a combination of the results of both left and right outer joins and returns all the rows from the tables.
Note - mySql does not support full outer join. The above result was fetched doing an union between left and right join.
select * from employee e left join dept d on e.id = d.dept_id union select * from employee e right join dept d on e.id = d.dept_id
Above are some basic, yet important concepts of SQL. We will see more in queries and use cases in the bigquery section