What is Docker? How Does it Work?

what is docker

Docker has become the de facto standard for container-based implementations. From small-scale deployments to large-scale enterprise applications, Docker serves as the base for container-based orchestration.

Docker gained so much popularity and adoption in the DevOps community in a short time because it was built for portability and designed for modern microservice architectures.

In this blog, you will learn:

  1. Container evolution and the underlying concepts of Linux containers
  2. What a container really is and which Linux features make it work
  3. The difference between a process, a container, and a VM
  4. What Docker is, and why it is so useful and different from other container technologies
  5. Docker's core architecture and its key components

The idea here is to get your basics right so that you understand what Docker really is and how it works.

Evolution of Containers

If you think containerization is a new technology, it is not. Google has been using its own container technology in its infrastructure for years.

The concept of containers started way back in the 2000s. In fact, the roots go back to 1979, when chroot introduced the concept of changing the root directory of a process.

Here is a list of container-based projects that started in or after 2000.

2000: FreeBSD jails introduced the container concept.
2003: The Linux-VServer project was released with the concept of OS-level virtualization.
2005: Solaris Zones, an OS-level virtualization project, was introduced.
2007: Google released a paper on Generic Process Containers.
2008: The initial release of LXC containers.
2011: Cloud Foundry announced Warden.
2013: lmctfy ("Let Me Contain That For You") was open-sourced by Google.
2013: The Docker project was announced by dotCloud.
2014: rkt (Rocket) was announced by CoreOS.
2016: A Windows container preview was released as part of Windows Server 2016.

What is a Linux container (LXC)?

Before diving directly into Docker concepts, you first need to understand what a Linux container is.

In a typical virtualized environment, one or more virtual machines run on top of a physical server using a hypervisor like Xen, Hyper-V, etc.

Containers, on the other hand, run on top of the operating system's kernel. You can call it OS-level virtualization. Before getting into the underlying container concepts, you need to understand two key Linux concepts.

  1. Userspace: All the code required to run user programs (applications, processes) is called userspace. When you initiate a program action, for example, creating a file, the process in userspace makes a system call to the kernel space.
  2. Kernel Space: This is the heart of the operating system, where you have the kernel code which interacts with the system hardware, storage, etc.

A container is a Process

When you start an application, for example, an Nginx web server, you are actually starting a process. A process itself is a self-contained set of instructions with limited isolation.

What if we could isolate a process with only the files and configuration required for it to run and operate? That is essentially what a container does.

A container is basically a process with enough isolation of userspace components so that it gives a feeling of a separate operating system.

container process

The parent container process may have child processes, so you can say a container is also a group of processes.

For example, when you start an Nginx service, it starts a parent Nginx process. The parent process then spawns child processes like the cache manager, cache loader, and workers.

nginx child processes
Image Src: https://www.nginx.com

So when you start an Nginx container, you are starting a master Nginx process in its isolated environment.

I will show you this practically in the below sections.

Each container has its isolated userspace, and you can run multiple containers on a single host.

Does that mean a container has the whole OS?

No. As opposed to a VM with its own kernel, a container just contains the required files related to a specific distro and uses the shared host kernel.

More interestingly, you can run containers based on different Linux distros on a single host, all sharing the same kernel space.

containers on a host

For example, you can run RHEL, CentOS, and SUSE based containers on an Ubuntu server. This is possible because, for all Linux distros, only the userspace is different; the kernel space is the same.

Underlying Concept of Linux Containers

The following image gives you a visual representation of Linux containers.

what is a linux container

Containers are isolated in a host using the two Linux kernel features called namespaces and control groups.

A real-world analogy would be an apartment building. Even though it's a single big building, each condo/flat is isolated for an individual household, with its own identity and metered water, gas, and electricity. Concrete, steel structures, and other construction materials establish this isolation. You do not have visibility into other homes unless they let you in.

Similarly, you can relate this to a single host containing multiple containers. To isolate containers with their own CPU, memory, IP address, mount points, and processes, you need two Linux kernel features: namespaces and control groups.

namespaces controlgroups

Linux Namespaces

A container is all about having a well-isolated environment to run a service (Process). To achieve that level of isolation, a container should have its own file system, IP address, mount points, process IDs, etc. You can achieve this using the Linux Namespaces.

Namespaces are what allow containers to have their own mount points, users, IP addresses, process management, etc. Essentially, namespaces set boundaries for containers.

Following are the key namespaces in Linux

  1. pid namespace: Responsible for isolating the process (PID: Process ID).
  2. net namespace: It manages network interfaces (NET: Networking).
  3. ipc namespace: It manages access to IPC resources (IPC: InterProcess Communication).
  4. mnt namespace: Responsible for managing the filesystem mount points (MNT: Mount).
  5. uts namespace: Isolates kernel and version identifiers. (UTS: Unix Timesharing System).
  6. usr namespace: Isolates user IDs. In simple words, it isolates the user IDs between the host and the container.
  7. cgroup namespace: Isolates the control group information from the container process.

Using these namespaces, a container can have its own network interfaces, IP address, etc. Each container will have its own set of namespaces, and the processes running inside a namespace will not have any privileges outside of it.

Interestingly, you can list the namespaces in a Linux machine using the lsns command.

listing namespaces in linux
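On a Linux machine you can also inspect the namespaces of the current process directly under `/proc`, without any extra tools. A minimal sketch (Linux-only; the exact set of entries varies by kernel version):

```python
import os
import sys

# Each process's namespaces are exposed as symlinks under /proc/<pid>/ns;
# "self" refers to the current process.
if sys.platform == "linux":
    for name in sorted(os.listdir("/proc/self/ns")):
        # Each entry resolves to an identifier like "pid:[4026531836]",
        # shared by every process that lives in the same namespace.
        print(name, "->", os.readlink(f"/proc/self/ns/{name}"))
```

Two processes inside the same container resolve to the same identifiers, while a containerized process sees different identifiers from the host's processes.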

Linux Control groups

When we start a service, we don't normally specify any memory or CPU limits. We leave it to the kernel to prioritize and allocate resources for the services.

However, you can explicitly set CPU and memory limits for your services using a Linux kernel feature called cgroups. It is not a straightforward approach; you need some extra configuration and tweaking to make it work.

Since you can run multiple containers inside a host, there should be a mechanism to limit resource usage, device access, etc. Here is where control groups come into the picture.

The resources used by a container are managed by Linux control groups. You can restrict the CPU, memory, network, and IO resources of a container using Linux control groups.

So what happens if I don’t limit the CPU & Memory resource of a container?

Well, a single container might end up using all the host resources, causing other containers to crash because of resource starvation.

Tools like Docker abstract away all the complex backend configuration and let you specify these resource limits with simple parameters.

What is Docker?

Docker is a popular open-source project written in Go and originally developed by dotCloud (a PaaS company).

It is basically a container engine that uses the Linux Kernel features like namespaces and control groups to create containers on top of an operating system.

Meaning, all the container concepts and functionalities we learned about in the LXC section are made very simple by Docker. Just by executing a few Docker commands and parameters, you can have containers up and running.

You might ask how Docker is different from a Linux container (LXC), as all the concepts and implementations look similar.

Docker was initially built on top of Linux containers (LXC). Later, Docker replaced LXC with its own container runtime, libcontainer (now part of runc).

Well, apart from being a container technology, Docker has well-defined wrapper components that make packaging applications easy. Before Docker, it was not easy to run containers. In other words, Docker does all the work of decoupling your application from the infrastructure by packing all the application's system requirements into a container.

For example, if you have a Java JAR file, you can run it on any server that has Java installed. In the same way, once you package a container with the required applications using Docker, you can run it on any other host that has Docker installed.

Difference Between Docker & Container

Docker is a technology or a tool developed to manage container implementations efficiently.

So, can I run a container without Docker?

Yes, of course. You can use LXC to run containers on Linux servers.

Things you should know about Docker:

  1. Docker is not LXC
  2. Docker is not a Virtual Machine Solution.
  3. Docker is not a configuration management system and is not a replacement for Chef, Puppet, Ansible, etc.
  4. Docker is not a Platform-as-a-Service technology.

What Makes Docker So Great?

Docker has an efficient workflow for moving an application from the developer's laptop to the test environment and on to production. You will understand more about it when you look at a practical example of packaging an application into a Docker image.

Do you know that starting a docker container takes less than a second?

It is incredibly fast, and it can run on any host with a compatible Linux kernel. (It supports Windows as well.)

Note: you cannot run a Windows container on a Linux host, because Windows containers require the Windows kernel. You can read more about Windows containers separately.

Docker uses a copy-on-write union file system for its image storage. Whenever changes are made to a container, only the changes are written to disk, using the copy-on-write model.

With copy-on-write, you get optimized, shared storage layers for all your containers.

Docker Adoption Statistics

Here is the Google Trends data on Docker. You can see that it has been an exploding topic over the last five years.

docker trends

Here is a survey result from Datadog which shows the rise in Docker adoption.

docker adoption survey
source: datadog.com

Docker Core Architecture

In the following sections, we will look at the Docker architecture and its associated components. We will also look at how each component works together to make Docker work.

Docker's architecture has changed a few times since its inception. When this article was first released, Docker was built on top of LXC.

Here are some notable architectural changes that happened to Docker:

  1. Docker moved from LXC to libcontainer in 2014.
  2. runc – a CLI for spinning up containers that follows the OCI specifications.
  3. containerd – Docker separated its container management component into containerd in 2016.

OCI: The Open Container Initiative is an open industry standard for container formats and runtimes.

When Docker was initially launched, it had a monolithic architecture. Now it is separated into the following three components.

  1. Docker Engine (dockerd)
  2. docker-containerd (containerd)
  3. docker-runc (runc)

Docker and other big organizations decided to contribute to a common container runtime and management layer. Hence, containerd is now part of the Cloud Native Computing Foundation, and runc is maintained under the Open Container Initiative, with contributors from many organizations.

Note: When you install Docker, all these components get installed. You don't have to install them separately. For explanation purposes, we are showing them as different components.

docker core architecture

Now let's have a look at each Docker component.

Docker Engine

Docker Engine is composed of the Docker daemon, an API interface, and the Docker CLI. The Docker daemon (dockerd) runs continuously as the dockerd systemd service. It is responsible for building Docker images.

To manage images and run containers, dockerd calls the docker-containerd APIs.

docker engine components flow

docker-containerd (containerd)

containerd is another system daemon service that is responsible for downloading Docker images and running them as containers. It exposes its API to receive instructions from the dockerd service.


docker-runc (runc)

runc is the container runtime, which is responsible for creating the namespaces and cgroups required for a container. It then runs the container commands inside those namespaces. The runc runtime is implemented as per the OCI specification.

To understand more about container runtimes read this excellent 3 part blog post series.

How Does Docker Work?

We have seen the core components of Docker. But to build, ship, share, and run Docker containers, other components are involved as well.

Let’s look at the key Docker components in a Docker ecosystem.

Docker Components

Docker is composed of the following components:

  1. Docker Daemon (dockerd)
  2. Docker Client
  3. Docker Images
  4. Docker Registries
  5. Docker Containers

Here is the official high-level docker architecture diagram that shows the common Docker workflow.

docker architecture
image source: docs.docker.com

Docker Daemon

Docker has a client-server architecture. Docker Daemon (dockerd) or server is responsible for all the actions that are related to containers.

The daemon receives commands from the Docker client through the CLI or REST API. The Docker client can be on the same host as the daemon, or on any other host.

By default, the Docker daemon listens on the docker.sock UNIX socket. If you have a use case that requires accessing the Docker API remotely, you need to expose it over a host port. One such use case is running Docker containers as Jenkins agents.

If you want to run Docker inside Docker, you can mount docker.sock from the host machine.

Docker Images

Images are the basic building blocks of Docker. You need an image to run a Docker container. Images contain the OS libraries, dependencies, and tools to run an application.

Images can be prebuilt with application dependencies for creating containers. For example, if you want to run an Nginx web server as a Ubuntu container, you need to create a Docker image with the Nginx binary and all the OS libraries required to run Nginx.

Docker has a concept of a Dockerfile that is used for building images. A Dockerfile is basically a text file that contains one instruction (command) per line.

Here is an example of a Dockerfile.

example Dockerfile reference
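A minimal sketch of a Dockerfile for the Nginx-on-Ubuntu example might look like this (the base image tag and package names are illustrative):

```Dockerfile
# Start from an official Ubuntu base (parent) image.
FROM ubuntu:20.04

# Each instruction below adds a layer to the image.
RUN apt-get update && apt-get install -y nginx

# Document the port the container listens on.
EXPOSE 80

# The process to start when a container is created from this image.
CMD ["nginx", "-g", "daemon off;"]
```

Building this with `docker build` produces an image whose layers correspond to these instructions.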

A Docker image is organized in a layered fashion. Every instruction in a Dockerfile is added as a layer in the image. When a container is started from the image, a writable layer is created on top of the image layers for that container.

Every image is created from a base image.

For example, you can use an Ubuntu base image and create another image with the Nginx application in it. A base image can be a parent image or an image built from a parent image. Check out this Docker article to learn more about it.

You might ask where this base image (parent image) comes from. There are Docker utilities to create the initial parent base image. Basically, they take the required OS libraries and bake them into a base image. You don't have to do this yourself, because official base images are available for all the common Linux distros.

docker image layers

The top layer of an image is writable and is used by the running container. The other layers in the image are read-only.

docker image writable layer

Docker registry

It is a repository for Docker images. Using the Docker registry, you can share images. It acts as a central repository for the Docker images.

A registry can be public or private. Docker Inc provides a hosted registry service called Docker Hub. It allows you to upload and download images from a central location.

Note: By default, when you install Docker, it pulls images from the public Docker Hub unless you specify a custom registry in the Docker settings.

If your repository is public, all your images can be accessed by other Docker Hub users. You can also create private repositories in Docker Hub.

Docker Hub works somewhat like Git: you can build images locally on your laptop, commit them, and then push them to Docker Hub.

Tip: When using Docker in enterprise networks/projects, set up your own Docker registry instead of using the public Docker Hub. All major cloud providers offer their own container registry services.

Docker Container

A container is the execution environment for Docker. Containers are created from images; a container is the writable layer on top of an image.

If you try to relate image layers and a container, here is how it looks for an Ubuntu-based image.

You can package your applications in a container, commit it, and make it a golden image to build more containers from it.

Containers can be started, stopped, committed, and terminated. If you terminate a container without committing it, all the container changes will be lost.

Ideally, containers are treated as immutable objects, and it is not recommended to make changes to a running container. Make changes to a running container only for testing purposes.

Two or more containers can be linked together to form a tiered application architecture. However, hosting highly scalable applications with Docker has been made easy with the advent of container orchestration tools like Kubernetes.

You Might Like: List of Containers Orchestration Tools

Why Are Containers Better Than VMs?

Containers have some key advantages over VMs. Let's take a look at those.

Resource Utilisation & Cost

  1. You can use VMs to run your applications independently, which means one service per VM. But VMs can still be underutilized, and resizing a VM is not an easy task for a production application.
  2. Containers, on the other hand, can run with very minimal CPU and memory requirements. Also, you can even run multiple containers inside a VM for application segregation. Plus, resizing a container takes seconds.

Provisioning & Deployment

  1. Provisioning a VM and deploying applications on it might take minutes to hours depending on the workflow involved. Even rollback takes time.
  2. But you can deploy a container in seconds and roll it back in seconds as well.

Drift Management

  1. Drift management in VMs is not easy. You need full-fledged automation and processes in place to make sure all environments are similar. Following immutable deployment models avoids drift in VM environments.
  2. When it comes to containers, once the image gets baked, it will be the same in all environments. For any changes, you need to make the changes in the dev environment and re-bake the container image.

Recommended Course: Docker Mastery: The Complete Toolset From a Docker Captain


What is the difference between containerd & runc?

containerd is responsible for managing containers, and runc is responsible for running them (creating namespaces and cgroups, and running commands inside the container) based on inputs from containerd.

What is the difference between the Docker engine & Docker daemon?

Docker Engine is composed of the Docker daemon, the REST interface, and the Docker CLI. The Docker daemon is the systemd dockerd service, responsible for building Docker images and sending instructions to the containerd runtime.


Conclusion

The best feature of Docker is collaboration.

Docker images can be pushed to a repository and can be pulled down to any other host to run containers from that image.

Moreover, Docker Hub has thousands of images created by users, and you can pull those images to your hosts based on your application requirements. Docker is also widely used with container orchestration tools like Kubernetes.

If you want to run Docker for production workloads, make sure you follow the recommended practices for Docker images.

You can get started by installing Docker and trying out the basic operations.


SQL For Data Science: A Comprehensive Beginners Guide

SQL For Data Science

In this blog, I am going to talk about the importance of SQL for data science, some statistics, and key concepts that you need to be aware of as a beginner in pursuit of a data scientist career path.

I have also covered some of the key SQL practices that have to be followed when working with data.

You will also learn the fundamentals of SQL and working with data, so that you can begin analyzing it for data science purposes.

SQL – Data Science Statistics

Data science has emerged as one of the most popular fields of the 21st century, because there is a pressing need to analyze and construct insights from data. Industries transform raw data into finished data products.

Doing so requires several essential tools to churn the raw data. Together with Python and R, SQL is now considered one of the most in-demand skills in data science.

As per KDnuggets, SQL is the third most in-demand skill for data scientist jobs.

importance of SQL skill for data scientist

SQL isn't going anywhere. It is more popular amongst data scientists and data engineers than R or Python, and the fact that SQL remains a language of choice is significant.

In the chart below, from Stack Overflow's 2020 Developer Survey, you can see that SQL eclipses both Python and R in popularity.

SQL demand as per developers survey

What’s SQL?

Structured Query Language (SQL) is a database language used to create, retrieve, and manipulate data in a Relational Database Management System (RDBMS) such as MySQL, Oracle, SQL Server, PostgreSQL, etc., which stores data in relational database tables.

A relational database table consists of columns and rows, where the columns describe the properties of the stored data, and the rows hold the actual data entries.

As the name suggests, SQL comes into the picture when we have structured data, e.g., in the form of tables.

On the other hand, the databases that are not well structured or not relational, and therefore do not use SQL are known as NoSQL databases. Examples of such databases are MongoDB, DynamoDB, Cassandra, etc.

Some interesting facts about SQL that you should know

  • One of the interesting facts about SQL is that it's made up of descriptive words. Most of the commands used in SQL are reasonably easy to understand compared to many other computer languages, which makes SQL really simple to read and learn. For example, to select a column NAME from the STUDENT table, you can write the SQL command as SELECT NAME FROM STUDENT;
  • SQL statement keywords are not case sensitive. It means that select name from student; is equivalent to SELECT NAME FROM STUDENT; (identifiers and string data, however, can be case sensitive depending on the database).
  • Although there is an ISO standard for SQL, most implementations differ slightly in syntax, so you may encounter queries that work in SQL Server but do not work in MySQL.
  • SQL is a non-procedural language, which means you can't write a complete application using SQL alone, but you can interact and communicate with data. This makes it a relatively simple but clear language.
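The first two points can be seen in practice with Python's built-in sqlite3 module (the STUDENT table and its rows here are hypothetical):

```python
import sqlite3

# An in-memory database with a tiny hypothetical STUDENT table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE STUDENT (ROLL_NO INTEGER, NAME TEXT)")
conn.execute("INSERT INTO STUDENT VALUES (1, 'RAM'), (2, 'SITA')")

# The keywords read like plain English...
upper = conn.execute("SELECT NAME FROM STUDENT").fetchall()

# ...and their case does not matter: this is the same query.
lower = conn.execute("select name from student").fetchall()

print(upper == lower)  # True
```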

Why SQL for Data Science?

  • According to research, more than 2.5 quintillion bytes of data are produced every day. Storing such vast amounts of data requires databases.
  • Industries now give more importance to the value of data. Data can be used in various ways to analyze bigger business problems, solve them, make predictions, and understand customer needs.
  • Data can be manipulated directly with SQL, and this is one of the main advantages of using SQL, because it can speed up workflow execution.

Before diving deep into SQL, beginners should get some prior knowledge of the relational model, its important terminology, and keys in the relational model.

Using SQL for Data Science

So now, the question is, how do you bring SQL together for use in data science?

Every data science project starts with the aim of finding a solution to a problem.

These are a few approaches and standard practices followed when beginning with a new data science problem that requires SQL.

As a beginner, it’s essential to know how to work through a problem from beginning to end.

Here we are going to discuss the key practices that have to be followed before starting to write SQL queries for any data science-related projects.

  1. Understanding Data
  2. Understanding the Business Use Case
  3. Data Profiling
  4. Troubleshooting
  5. Data Formatting
  6. Review Process
standard practices for using SQL in data science

Let’s understand each stage in detail.

Data Understanding

So the very first step is understanding the data, and it is the most crucial one. That is why you need to spend enough time with modeling diagrams and discussing the associations in the data, because knowing the data is key to writing successful queries.

It is well worth taking the time to get to know the data as much as possible before interpreting it. It's essential to know the associations and dependencies among the data.

That drives us to our second step, which is the business understanding.

Business Understanding

As we start to get familiar with the data, we move on to questions regarding the business problem that we are striving to solve.

Usually, you will be moving back and forth between studying the data and consulting a subject-matter specialist who knows the core business problem and is striving to resolve it.

This is crucial in being capable of wrapping your head around the problem.

One of the things to be aware of is the unspoken need. There may be a business problem; for example, one wants to predict whether or not a customer is likely to buy a product.

That looks pretty straightforward. But as you dig into the data more, you may start to get questions like: which customers? Which products? Some of these unspoken points are specific logical exclusions.

If you can get this and understand a problem, query writing will just become more comfortable. If you genuinely know the data and understand the problem, then writing the queries is like filling in the blanks.

Profiling Data

When we are trying to understand the data, we will start profiling the data. This is where we perform descriptive statistics.

It's also a significant opportunity to identify any data quality issues before diving into the analysis. It's a good approach to perform this profiling step before finalizing any of the data that we are extracting.

To catch problems, we need a plan for exactly which data components are required.

We need to understand the data we are going after, guided by the profiling we have done.

Whenever we retrieve data, the query begins with a SELECT statement, so we will be using SELECT and FROM.
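Profiling usually boils down to a handful of aggregate queries. A sketch using sqlite3 (the SALES table and values are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SALES (ITEM TEXT, PRICE REAL)")
conn.executemany("INSERT INTO SALES VALUES (?, ?)",
                 [("pen", 2.0), ("book", 10.0), ("pen", None)])

# Descriptive statistics in one pass: total rows, how many prices are
# actually present (NULLs are excluded), and min/avg/max of those prices.
row = conn.execute("""
    SELECT COUNT(*), COUNT(PRICE), MIN(PRICE), AVG(PRICE), MAX(PRICE)
    FROM SALES
""").fetchone()
print(row)  # (3, 2, 2.0, 6.0, 10.0)
```

Comparing COUNT(*) with COUNT(PRICE) is a quick way to spot missing values before any analysis.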

Start with Select

Remember, you're always going to start with the SELECT statement. That's the great thing about SQL: it's consistent in that way.

It is recommended to start simple, particularly if you’re new to the data. Begin with just one table, add more data, add in another table, check the result, and then go back from there.

If you're using subqueries, remember to always start with the innermost query and then work outward as you build. That leads to the next point: test along the way.
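Building a subquery from the innermost query outward might look like this (the SALES table is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SALES (ITEM TEXT, PRICE REAL)")
conn.executemany("INSERT INTO SALES VALUES (?, ?)",
                 [("pen", 2.0), ("book", 10.0), ("bag", 6.0)])

# Step 1: run and check the innermost query on its own.
avg_price = conn.execute("SELECT AVG(PRICE) FROM SALES").fetchone()[0]
print(avg_price)  # 6.0

# Step 2: only once you trust that result, wrap it as a subquery.
above_avg = conn.execute("""
    SELECT ITEM FROM SALES
    WHERE PRICE > (SELECT AVG(PRICE) FROM SALES)
""").fetchall()
print(above_avg)  # [('book',)]
```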

Test & Troubleshoot

Don't wait to test the query until you have combined various sources and have all your calculations done. Think of the query as little building blocks.

If we write a calculation of something's average selling price, look at how many values we are getting back for that calculation from the table, and make sure that number appears right.

Then combine the result with different tables and examine that. If we understand the data, we can move a little faster, but make sure that the order of manipulation is accurate.

This is important because it's simple to get results back, but getting back the correct results that we require is a little bit harder.

When troubleshooting, always start short and simple, and slowly rebuild the query to see where things went wrong. It's essential to start with the basic things first.

Gradually build back up until you find where things have gone wrong.
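The building-block approach above can be sketched as follows (the SALES and PRODUCTS tables are hypothetical): check the row count feeding a calculation before trusting the calculated value, then add the join and check again.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SALES (PRODUCT_ID INTEGER, PRICE REAL)")
conn.execute("CREATE TABLE PRODUCTS (PRODUCT_ID INTEGER, NAME TEXT)")
conn.executemany("INSERT INTO SALES VALUES (?, ?)",
                 [(1, 8.0), (1, 12.0), (2, 5.0)])
conn.executemany("INSERT INTO PRODUCTS VALUES (?, ?)",
                 [(1, "book"), (2, "pen")])

# Building block 1: how many rows feed the average? Does that look right?
print(conn.execute("SELECT COUNT(*) FROM SALES").fetchone()[0])  # 3

# Building block 2: the average itself, on its own.
print(conn.execute("SELECT AVG(PRICE) FROM SALES").fetchone()[0])

# Only now combine with another table; a bad join condition would
# silently change the row count, so examine the result again.
rows = conn.execute("""
    SELECT P.NAME, AVG(S.PRICE)
    FROM SALES S JOIN PRODUCTS P ON S.PRODUCT_ID = P.PRODUCT_ID
    GROUP BY P.NAME
""").fetchall()
print(rows)
```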

Format & Comment

When writing the query, the next thing to look at is making sure you format it accurately and comment it correctly.
Clean code says a lot about the coder.

Ensure that it’s easy to read the query, use recommended indentation, comment strategically where we need to do so, etc.

We never know when we will need to revisit the query, or hand it over to someone else who needs to update it from there.

Just keep the code clean. Format the comments where necessary and strategically.


Review Process

Now make sure that we have reviewed what we have done so far. A lot of times, what happens is we write the query, use it for our model, and look at various stats and so on.

Then we need to go back and change that query. Always make sure to review the query to see if anything has changed. Has the data changed? Are business rules different? Do we require to update the date indicators?

And be cautious when you're going back and reusing old queries.

These approaches take us through a problem from start to end, and it all begins with understanding the data and the problem.

Make sure to spend the time there. Make sure to spend the time thinking about what we are doing before we start writing the queries.

It's going to save you time in the long run. Then go through and understand the data by profiling it. Make sure to test enough, and keep the code clean with proper comments.

DBA vs Data Scientist

Many people use SQL in their jobs: backend developers, QA engineers, system engineers, data analysts, and obviously data scientists. But the role we want to compare against the data scientist is the DBA (Database Administrator).

The DBA is responsible for managing the whole database and supervising it. However, the data scientist is typically a user of that database. The DBA will be responsible for granting permissions to people and deciding who has access to what data. They’re often responsible for managing the tables and creating them. 

DBAs and data scientists are related to each other in such a way that they both use SQL to interpret the data, to query it, and retrieve it. They both write quite complex queries, but the main difference is that the data scientist is the end-user. However, the DBA is the one who manages it, directs it, and maintains the database as a whole.

Relational Model in DBMS

Now, let's start with the practical basics to understand the underlying concepts of the relational model.

The relational model illustrates how data is stored in relational databases. The data is stored in the form of relations, or tables. Suppose a table STUDENT with attributes (characteristics) ROLL_NO, NAME, PHONE, and AGE, as shown in the following table.



Important Terminologies

  • Relation Schema: A relation schema describes the name of the relation along with its attributes, e.g., STUDENT (ROLL_NO, NAME, PHONE, AGE) is the relation schema for STUDENT. If a schema has more than one relation, it is termed a relational schema.
  • Attribute: Attributes are the properties that define a relation, e.g., ROLL_NO and NAME.
  • Tuple: Each row in the relation is called a tuple. The above relation contains 4 tuples.
  • Column: A column denotes the set of values for a particular attribute, e.g., the column ROLL_NO extracted from the relation STUDENT.
  • Degree: The number of attributes in the relation is called the degree of the relation. The STUDENT relation defined above has degree 4.
  • Cardinality: The number of tuples in a relation is called its cardinality. The STUDENT relation defined above has cardinality 4.
  • NULL Values: A value that is unknown or not available (NA) is called a NULL value. It is usually displayed as a blank, e.g., the PHONE of the STUDENT with ROLL_NO 2 is NULL.

Keys in Relational Model

Let’s consider two tables named STUDENT and STUDENT_COURSE.




  • Candidate Key: A minimal set of attributes that can uniquely identify a tuple is known as a candidate key. For example, ROLL_NO and PHONE in the STUDENT relation.
  • Super Key: Any set of attributes that can uniquely identify a tuple is called a super key. For example, ROLL_NO, (ROLL_NO, NAME), etc.
  • Primary Key: The primary key is the candidate key chosen to uniquely identify tuples in a relation. There can be more than one candidate key, of which one is chosen as the primary key. For example, ROLL_NO and PHONE are both candidate keys for the STUDENT relation, but ROLL_NO can be chosen as the primary key (only one out of many candidate keys).
  • Alternate Key: A candidate key other than the primary key is known as an alternate key. For example, with ROLL_NO as the primary key, PHONE becomes an alternate key.
  • Foreign Key: A foreign key connects two tables: it is a field in one table that refers to the primary key of another table. For example, ROLL_NO in STUDENT_COURSE is a foreign key referring to ROLL_NO in the STUDENT relation.
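As a quick check of how these keys behave in practice, here is a sketch using Python's built-in sqlite3 module. The table layouts follow the STUDENT / STUDENT_COURSE example above; the sample rows and the phone number are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# ROLL_NO is the primary key; PHONE (UNIQUE) is the other candidate key, i.e. the alternate key.
conn.execute("""CREATE TABLE STUDENT (
    ROLL_NO INTEGER PRIMARY KEY,
    NAME    TEXT NOT NULL,
    PHONE   TEXT UNIQUE)""")

# ROLL_NO here is a foreign key referring to STUDENT.ROLL_NO.
conn.execute("""CREATE TABLE STUDENT_COURSE (
    ROLL_NO   INTEGER REFERENCES STUDENT(ROLL_NO),
    COURSE_ID INTEGER)""")

conn.execute("INSERT INTO STUDENT VALUES (1, 'RAM', '9455123451')")
conn.execute("INSERT INTO STUDENT_COURSE VALUES (1, 101)")  # OK: student 1 exists

try:
    conn.execute("INSERT INTO STUDENT_COURSE VALUES (9, 101)")  # no student with ROLL_NO 9
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```

The foreign key stops us from enrolling a student who does not exist in the STUDENT table.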


Make sure that you have an SQL database IDE installed to execute the SQL queries used in this article. If you don’t, you can also use an online IDE. Here are links to popular online SQL IDEs:

  1. MySQL
  2. SQLite, MariaDB, PostgreSQL, MS SQL
  3. Oracle

Getting Started with SQL

In the 21st century, data is critical, and everyone is collecting it all the time. The people who can use and communicate with data, who analyze it critically and draw insight from it to make better decisions, are the ones who shape the world we live in.

They are data scientists. But to be a good data scientist, one needs to know how to retrieve and work with data, and for that, one needs to be well versed in SQL, the standard language for communicating with databases.

Creating Tables

As discussed before, the data scientist isn’t normally the one in charge of the whole database; that is usually left to the DBA or some other administrator. However, data scientists often have permission to create their own tables, so it’s necessary to have a basic knowledge of how this works.

The CREATE TABLE statement is used to create tables in SQL. We know that a table comprises rows and columns, so whenever we create a table we have to tell SQL the names of the columns, the data type to be stored in each column, constraint names, and so on. Let’s take a look at how to use the CREATE TABLE statement to create tables in SQL.


CREATE TABLE table_name (
column1 data_type(size) constraint_name,
column2 data_type(size) constraint_name,
column3 data_type(size) constraint_name
);

table_name: Name of the table to be created.
column1: Name of the first column and so on.
data_type: Type of data that can be stored in the field.
size: Size of the data that can be stored in the particular column. For example, if a column's data_type is varchar and its size is 20, that column can store a string of at most 20 characters.
constraint_name: Name of the constraint, for example NOT NULL, UNIQUE, PRIMARY KEY, etc.


The query below creates a table named STUDENT with four columns: ROLL_NO, NAME, SUBJECT, and AGE.

CREATE TABLE STUDENT (
ROLL_NO               int(5)            PRIMARY KEY,
NAME                  varchar(30)       NOT NULL,
SUBJECT               varchar(30),
AGE                   int(2)            NOT NULL
);

The above query creates a table named STUDENT. The ROLL_NO attribute is of type int and can store an integer number of up to 5 digits. The two columns NAME and SUBJECT are of type varchar and can store characters; the size 30 in brackets means these two attributes can contain at most 30 characters. The last column, AGE, is of type int and can store an integer number of up to 2 digits.


  1. An error will be returned if one tries to insert a row without a value for a required column. It is important to know that the NULL value in SQL is different from a zero value. The NULL value represents a missing value, and it has one of three different interpretations:
    • The value unknown (value exists but is not known)
    • Value not available (exists but is purposely withheld)
    • Attribute not applicable (undefined for this tuple)
  2. Primary keys cannot be null; a primary key must always have a value.
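These two rules can be verified directly. A minimal sketch with Python's sqlite3, reusing the STUDENT schema from above (the sample rows are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE STUDENT (
    ROLL_NO INTEGER PRIMARY KEY,
    NAME    TEXT NOT NULL,
    SUBJECT TEXT,
    AGE     INTEGER NOT NULL)""")

# A NULL in the nullable SUBJECT column is accepted...
conn.execute("INSERT INTO STUDENT VALUES (1, 'RAM', NULL, 22)")

# ...but a NULL in a NOT NULL column (NAME) is rejected with an error.
try:
    conn.execute("INSERT INTO STUDENT VALUES (2, NULL, 'DBMS', 21)")
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```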

Adding/Inserting Data to The Table

The INSERT INTO statement of SQL is used to insert a new row in a table.


INSERT INTO table_name VALUES (value1, value2, value3,…);
table_name: name of the table.
value1, value2,.. : value of first column, second column,… for the new record



Now the table STUDENT is filled with 5 rows and 4 columns.
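The same INSERT INTO pattern can be run end to end with Python's sqlite3. The five rows below are assumed sample data standing in for the article's screenshot:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE STUDENT (
    ROLL_NO INTEGER PRIMARY KEY,
    NAME    TEXT NOT NULL,
    SUBJECT TEXT,
    AGE     INTEGER NOT NULL)""")

rows = [
    (1, "RAM",   "DBMS", 22),
    (2, "ABHI",  "CN",   21),
    (3, "RAHUL", "OS",   30),
    (4, "RAJ",   None,   22),
    (5, "AMAN",  "DBMS", 23),
]
# Parameter placeholders (?) are safer than string-formatting values into SQL.
conn.executemany("INSERT INTO STUDENT VALUES (?, ?, ?, ?)", rows)

print(conn.execute("SELECT COUNT(*) FROM STUDENT").fetchone()[0])  # → 5
```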

Retrieving Data

SELECT is the most commonly used statement in SQL. It is used to retrieve data from a database: one can retrieve either the whole table or only the rows matching some specified rules.

To fetch all the fields from the table STUDENT


SELECT * FROM STUDENT;





Query to fetch the fields ROLL_NO, NAME, and AGE from the table STUDENT:


SELECT ROLL_NO, NAME, AGE FROM STUDENT;





Adding Comments to SQL

Adding comments to your code is necessary not only because it makes your queries easier to follow, both for you and for others, but also because it helps with troubleshooting and makes the code easier to share.

For single-line comments in SQL, use “--” (double hyphen) at the beginning of the line. For multi-line comments, use “/* your comment */”: everything between ‘/*’ and ‘*/’ is treated as a comment.


-- This is a single line comment 
SELECT * FROM Customers;

/* This is a multi 
line comment */ 
SELECT * FROM Customers;

Filtering, Sorting and Calculating Data with SQL

So far you’ve been working with only a few thousand records, but many databases hold far more. Eventually, you might work with databases containing millions or even tens of millions of records. Clearly, you’re never going to look at all of that data at once, so SQL gives us various methods to pare down and sort data so you can quickly get the results you want. Let’s discuss that in this module.

Basics of Filtering with SQL

Filtering is very important because it lets us narrow down the data we want to retrieve. We also filter while exploring, to get an idea of the data we need to analyze as part of our model.

Why Filter?

  1. Be precise about the data that we want to retrieve
  2. Decrease the number of records we retrieve
  3. Reduce the strain on the client application
  4. Increase query performance
  5. Governance limitations

WHERE Clause

The WHERE clause is used for retrieving filtered data in a result set. It retrieves data according to particular criteria and can also filter data by matching patterns.


SELECT column1,column2 
FROM TableName
WHERE column_name operator value;

column1 , column2: Fields in the table
TableName: Name of the table 
column_name: Name of the field
operator: Operation to be performed for filtering
value: The exact value to get related data in result

List of operators that can be used with where clause:

  • > : Greater than
  • >= : Greater than or equal to
  • < : Less than
  • <= : Less than or equal to
  • = : Equal to
  • <> : Not equal to
  • BETWEEN : Within an inclusive range
  • LIKE : Search for a pattern
  • IN : To specify multiple possible values for a column



Example 1: To fetch records of students with age equal to 22:

SELECT * FROM STUDENT 
WHERE Age = 22;



Example 2: To fetch records of students where ROLL_NO is between 2 and 4 (inclusive):

SELECT * FROM STUDENT 
WHERE ROLL_NO BETWEEN 2 AND 4;


Example 3: To fetch NAME and SUBJECT of students where Age is 22 or 30:

SELECT NAME, SUBJECT FROM STUDENT 
WHERE Age IN (22, 30);
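The three filtering examples above can be run together via Python's sqlite3; the sample rows are assumptions matching the earlier STUDENT table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE STUDENT (
    ROLL_NO INTEGER PRIMARY KEY, NAME TEXT, SUBJECT TEXT, AGE INTEGER)""")
conn.executemany("INSERT INTO STUDENT VALUES (?, ?, ?, ?)", [
    (1, "RAM", "DBMS", 22), (2, "ABHI", "CN", 21), (3, "RAHUL", "OS", 30),
    (4, "RAJ", None, 22), (5, "AMAN", "DBMS", 23)])

# Example 1: exact match with =
age22 = conn.execute("SELECT NAME FROM STUDENT WHERE AGE = 22").fetchall()
# Example 2: inclusive range with BETWEEN
middle = conn.execute(
    "SELECT NAME FROM STUDENT WHERE ROLL_NO BETWEEN 2 AND 4").fetchall()
# Example 3: set membership with IN
picked = conn.execute(
    "SELECT NAME, SUBJECT FROM STUDENT WHERE AGE IN (22, 30)").fetchall()

print([n for (n,) in age22])   # students aged exactly 22
print([n for (n,) in middle])  # roll numbers 2 through 4
print(picked)                  # names and subjects for ages 22 or 30
```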



Wildcards in SQL

What are wildcards?

  1. Wildcards are special characters used to match parts of a value.
  2. Search patterns are made from literal text, wildcard characters, or a combination of both.
  3. Wildcards are used with the LIKE operator.
  4. They work only with strings; they cannot be used with non-text data types.
  5. They are helpful for data scientists exploring string variables.

There are four basic wildcard operators:

  • % : Substitutes for zero or more characters
  • _ : Substitutes for exactly one character (Note: not supported by DB2)
  • [range_of_characters] : Matches any character from the set or range defined inside the brackets (Note: does not work with SQLite)
  • [^range_of_characters] or [!range_of_characters] : Matches any character not in the set or range specified inside the brackets


SELECT column1, column2 
FROM TableName 
WHERE column 
LIKE 'wildcard_pattern'; 

column1, column2: fields in the table 
TableName: name of the table 
column: name of the field used for filtering data



Example 1: To fetch records from the STUDENT table with NAME ending with the letter ‘A’:

SELECT * FROM STUDENT 
WHERE NAME LIKE '%A';


Example 2: To fetch records from the STUDENT table with SUBJECT starting with ‘DB’, followed by any characters:

SELECT * FROM STUDENT 
WHERE SUBJECT LIKE 'DB%';


Example 3: To fetch records from the STUDENT table with NAME containing the letters ‘a’, ‘b’, or ‘c’:

SELECT * FROM STUDENT 
WHERE NAME LIKE '%[A-C]%';



Example 4: To fetch records from the STUDENT table with ADDRESS not containing the letters ‘a’, ‘b’, or ‘c’:

SELECT * FROM STUDENT 
WHERE ADDRESS LIKE '%[^A-C]%';



Downsides of Wildcards:

  1. Statements with wildcards take longer to execute, especially when the wildcard appears at the beginning of the search pattern, because indexes cannot be used.
  2. It is better to use another operator, if possible: =, <, >=, etc.
  3. The placement of wildcards is important.
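A quick sketch of the % and _ wildcards with Python's sqlite3 (note that SQLite supports only these two; the bracket ranges above need a database such as SQL Server). The names are assumed sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE STUDENT (ROLL_NO INTEGER PRIMARY KEY, NAME TEXT)")
conn.executemany("INSERT INTO STUDENT VALUES (?, ?)",
                 [(1, "RAM"), (2, "ABHI"), (3, "RAHUL"), (4, "RAJ"), (5, "AMAN")])

starts_r = conn.execute("SELECT NAME FROM STUDENT WHERE NAME LIKE 'R%'").fetchall()
ends_m   = conn.execute("SELECT NAME FROM STUDENT WHERE NAME LIKE '%M'").fetchall()
a_middle = conn.execute("SELECT NAME FROM STUDENT WHERE NAME LIKE '_A_'").fetchall()

print([n for (n,) in starts_r])  # 'R' followed by zero or more characters
print([n for (n,) in ends_m])    # anything ending in 'M'
print([n for (n,) in a_middle])  # exactly three characters with 'A' in the middle
```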

Sorting with ORDER BY

To sort data with SQL, one can use the ORDER BY clause. Before looking at its syntax, let’s first understand why we sort data.

Why sort data?

  1. Sorting data appropriately is very helpful when inspecting it; otherwise, data is returned in an order that makes it harder to interpret.
  2. Updates and deletes can change this order.
  3. The sequence of retrieved data cannot be assumed if an order was not specified.
  4. Sorting data logically helps keep the information you want on top.

The ORDER BY statement is used to sort the retrieved data in ascending or descending order according to one or more columns.

  • By default ORDER BY sorts data in ascending order.
  • We can use the keyword DESC to sort data in descending order and the keyword ASC to sort in ascending order.


SELECT column1, column2 
FROM TableName 
ORDER BY column1 ASC|DESC, column2 ASC|DESC;

TableName: name of the table.
column_name: name of the column by which the data is going to be arranged. 
ASC: to sort the data in ascending order. 
DESC: to sort the data in descending order. 
| : use either ASC or DESC to sort in ascending or descending order



Example 1: Sort according to a single column: In the following example, let’s retrieve all data from the table STUDENT and sort it in descending order by the column AGE.

SELECT * FROM STUDENT 
ORDER BY AGE DESC;




Example 2: Sort according to multiple columns: In this example, let’s retrieve all data from the table STUDENT and sort it first in ascending order by the column AGE, and then in descending order by the column ROLL_NO.

SELECT * FROM STUDENT 
ORDER BY AGE ASC, ROLL_NO DESC;
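The two sorts can be checked with Python's sqlite3; the rows are assumed sample data, and note how the tie between the two 22-year-olds is broken by ROLL_NO descending:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE STUDENT (ROLL_NO INTEGER PRIMARY KEY, NAME TEXT, AGE INTEGER)")
conn.executemany("INSERT INTO STUDENT VALUES (?, ?, ?)",
                 [(1, "RAM", 22), (2, "ABHI", 21), (3, "RAHUL", 30),
                  (4, "RAJ", 22), (5, "AMAN", 23)])

# Example 1: single column, descending — oldest student comes first.
by_age_desc = conn.execute("SELECT NAME FROM STUDENT ORDER BY AGE DESC").fetchall()

# Example 2: AGE ascending, ties broken by ROLL_NO descending.
multi = conn.execute(
    "SELECT NAME FROM STUDENT ORDER BY AGE ASC, ROLL_NO DESC").fetchall()
print([n for (n,) in multi])  # → ['ABHI', 'RAJ', 'RAM', 'AMAN', 'RAHUL']
```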




Arithmetic Operations

One can perform various arithmetic operations on the data stored in tables using arithmetic operators:

  • + (Addition)
  • - (Subtraction)
  • * (Multiplication)
  • / (Division)
  • % (Modulus)




Example 1: Let’s perform an addition operation on the data; the operands can be a single column or multiple columns. Let’s add 10 to the column AGE:

SELECT AGE, AGE + 10 AS "AGE + 10" 
FROM STUDENT;



Example 2: Let’s perform an addition of 2 columns:

SELECT ROLL_NO + AGE AS "ROLL_NO + AGE" 
FROM STUDENT;




Similarly, you can try the other arithmetic operations.
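A one-row check of both arithmetic examples using Python's sqlite3 (the sample row is an assumption):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE STUDENT (ROLL_NO INTEGER PRIMARY KEY, NAME TEXT, AGE INTEGER)")
conn.execute("INSERT INTO STUDENT VALUES (1, 'RAM', 22)")

# Compute AGE + 10 and ROLL_NO + AGE in the SELECT list, aliased with AS.
row = conn.execute(
    'SELECT AGE, AGE + 10 AS "AGE + 10", ROLL_NO + AGE AS "ROLL_NO + AGE" '
    "FROM STUDENT").fetchone()
print(row)  # → (22, 32, 23)
```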

Aggregate Functions

Aggregate functions are used for all kinds of things: they are useful for finding the highest or lowest values, the total number of records, the average value, and so on. In descriptive statistics, where we are getting to know and understand our data, you will use these aggregate functions a lot. Here are some important aggregate functions:

  • AVG( ) : Averages a column of values
  • COUNT( ) : Counts the number of values
  • MIN( ) : Finds the minimum value
  • MAX( ) : Finds the maximum value
  • SUM( ) : Sums the column values



AVG( ): Returns the average of the values in a numeric column.


SELECT AVG(column_name) FROM table_name;





MAX( ): The MAX( ) function returns the maximum value of the selected column.


SELECT MAX(column_name) FROM table_name;





SUM( ): The SUM( ) function returns the sum of all the values of the selected column.


SELECT SUM(column_name) FROM table_name;





Similarly, you can try the other aggregate functions.
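All five aggregates can be computed in one query; a sketch with Python's sqlite3 over the assumed sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE STUDENT (ROLL_NO INTEGER PRIMARY KEY, NAME TEXT, AGE INTEGER)")
conn.executemany("INSERT INTO STUDENT VALUES (?, ?, ?)",
                 [(1, "RAM", 22), (2, "ABHI", 21), (3, "RAHUL", 30),
                  (4, "RAJ", 22), (5, "AMAN", 23)])

# One row of summary statistics for the AGE column.
avg_age, min_age, max_age, total, n = conn.execute(
    "SELECT AVG(AGE), MIN(AGE), MAX(AGE), SUM(AGE), COUNT(*) FROM STUDENT").fetchone()
print(avg_age, min_age, max_age, total, n)  # → 23.6 21 30 118 5
```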

Grouping Data with SQL

The GROUP BY statement in SQL is used to arrange similar data into groups, usually together with aggregate functions; e.g., if a particular column has the same value in several rows, those rows are placed in one group.


  • GROUP BY clause is applied with the SELECT statement.
  • In the query, the GROUP BY clause is placed after the WHERE clause.
  • In the query, the GROUP BY clause is placed before the ORDER BY clause if used any.


SELECT column1, function_name(column2)
FROM table_name
WHERE condition
GROUP BY column1, column2
ORDER BY column1, column2;

function_name: Name of the function used for example, SUM() , AVG().
table_name: Name of the table.
condition: Condition used.

Let’s consider a different table, Employee, in this case.



Example: Group By single column: Grouping by a single column means placing all the rows with the same value of that particular column in one group. Consider the query shown below (assuming the Employee table has NAME and SALARY columns):

SELECT NAME, SUM(SALARY) 
FROM Employee 
GROUP BY NAME;




You know that the WHERE clause is used to place conditions on columns, but what if you want to place conditions on groups?

This is where the HAVING clause comes in. You can use the HAVING clause to decide which groups will be part of the final result set. Also, you cannot use aggregate functions like SUM(), COUNT(), etc. with the WHERE clause, so you have to use the HAVING clause if you want to use any of these functions in your conditions.


SELECT column1, function_name(column2)
FROM table_name
WHERE condition
GROUP BY column1, column2
HAVING condition
ORDER BY column1, column2;

function_name: Name of the function used for example, SUM() , AVG().
table_name: Name of the table.
condition: Condition used.
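GROUP BY and HAVING can be demonstrated together with Python's sqlite3; here the STUDENT rows are assumed sample data, and HAVING keeps only subjects taken by more than one student:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE STUDENT (ROLL_NO INTEGER PRIMARY KEY, NAME TEXT, SUBJECT TEXT)")
conn.executemany("INSERT INTO STUDENT VALUES (?, ?, ?)",
                 [(1, "RAM", "DBMS"), (2, "ABHI", "CN"), (3, "RAHUL", "OS"),
                  (4, "RAJ", "DBMS"), (5, "AMAN", "DBMS")])

# GROUP BY: one row per subject with a count of students.
per_subject = conn.execute(
    "SELECT SUBJECT, COUNT(*) FROM STUDENT GROUP BY SUBJECT").fetchall()

# HAVING filters on the aggregated groups, which WHERE cannot do.
popular = conn.execute(
    "SELECT SUBJECT, COUNT(*) FROM STUDENT "
    "GROUP BY SUBJECT HAVING COUNT(*) > 1").fetchall()
print(popular)  # → [('DBMS', 3)]
```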





SQL for Various Data Science Languages

Because of the popularity and versatility of SQL, it is also used in many big data applications. Below is a brief description, with a few resources, of how SQL is used with common big data and data science languages.

SQL for R

The R language was developed by statisticians to help statisticians and developers work with data faster and more efficiently.

It is an open-source programming language and generally comes with the Command-line interface. It is available across extensively used platforms like Windows, Linux, and macOS. 

R is an essential tool for Data Science. It is highly popular and is the first choice of many statisticians and data scientists.

For those studying R and who may be well-versed in SQL, the sqldf package provides a mechanism to handle R data frames using SQL. 

Also, for experienced R programmers, sqldf can be a useful tool for data manipulation.


  1. SQLDF Package
  2. Documentation
  3. Examples

SQL for Python

Python is an open-source, interpreted, high-level language. It supports object-oriented programming.

It is one of the best languages used by data scientists for various data science projects/applications. Python provides a lot of functionalities to deal with mathematics and statistics.

It provides excellent libraries to deal with data science applications. python-sql is a library for writing SQL queries in a pythonic way.


  1. Python-SQL Package Documentation

SQL with Hadoop

Apache Hive is a data warehouse tool that implements an SQL-like interface between the user and the Hadoop Distributed File System (HDFS).

It is a software project that provides data query and analysis: reading, writing, and managing large-scale datasets stored in distributed storage, queried using SQL syntax. 

It is not developed for Online Transaction Processing (OLTP) workloads.


  1. Hive Overview
  2. Documentation

SQL for Spark

Apache Spark is an open-source distributed general-purpose framework. 

Spark is structured around Spark Core, the engine that handles scheduling and the RDD abstraction and connects Spark to the underlying storage (an RDBMS, S3, or Elasticsearch). 

Numerous libraries work on top of Spark Core, including Spark SQL, which lets us execute SQL-like commands on distributed data sets; GraphX for graph queries; and streaming, which allows for the input of continuously streaming log data.


  1. Overview
  2. Documentation


What is the best way to learn SQL for data science?

There are thousands of SQL courses online, but most don’t teach you to use SQL in the real world. The problem is that real-world SQL doesn’t look like what the courses teach. However, you can follow the resources linked above to learn SQL for data science.

What is the role of SQL in data science?

SQL code is more effective, produces easily reproducible scripts, and keeps you closer to the data. It is very beneficial when working with large data sets in a data science project.

What elements of SQL do data scientists need to know?

Aspiring data scientists must have the following SQL skills:
1. Knowledge of the relational database model
2. Knowledge of the SQL commands
3. SQL indexing
4. SQL joins
5. Primary and foreign keys
6. Subqueries
7. Creating tables and retrieving data from tables

What types of SQL databases are best for data science?

In general, out of all the relational databases, MySQL remains the most popular database for organizations. However, PostgreSQL is also a choice for some implementations. When it comes to the cloud, there are managed database solutions optimized for data science workloads.

Which SQL certification is best for a data analyst/scientist?

There are no data-science-specific SQL certifications. However, you can get one from online course portals like Coursera, Udacity, Udemy, etc.
