About Me

Mumbai, Maharashtra, India
He has more than 7.6 years of experience in software development, spent mostly on web and desktop application development, and has sound knowledge of various database concepts. You can reach him at viki.keshari@gmail.com https://www.linkedin.com/in/vikrammahapatra/ https://twitter.com/VikramMahapatra http://www.facebook.com/viki.keshari

Sunday, October 13, 2024

Running Airflow on your local system using docker-compose

Steps to Perform in a Linux Shell

    1. docker --version
    2. docker-compose --version
    3. cd docker_tutorial/
    4. cd airflow-docker/
    5. curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.10.2/docker-compose.yaml'
       (reference: https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html)
    6. code .

After VS Code is open

Set the Volumes:
  1. Open the terminal 
  2. docker --version
  3. cd sources/
  4. mkdir ./dags ./logs ./config ./plugins
  5. ls
  6. cd ..
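
On Linux it also helps to set AIRFLOW_UID before starting anything: the airflow-init service in the compose file below warns when it is missing, and files created in the mounted folders would otherwise be owned by root. A minimal .env next to docker-compose.yaml (as the official quick-start suggests) takes care of it:

echo -e "AIRFLOW_UID=$(id -u)" > .env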

Create and Run the Airflow containers
docker-compose up airflow-init
docker-compose up
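
Once the services are up you can sanity-check the deployment; the port and the default login below come straight from the compose file (webserver on 8080, admin user airflow/airflow):

docker ps        # every service should eventually report (healthy)
# then open http://localhost:8080 and log in with airflow / airflow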


docker-compose.yaml

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.
#

# Basic Airflow cluster configuration for CeleryExecutor with Redis and PostgreSQL.
#
# WARNING: This configuration is for local development. Do not use it in a production deployment.
#
# This configuration supports basic configuration using environment variables or an .env file
# The following variables are supported:
#
# AIRFLOW_IMAGE_NAME           - Docker image name used to run Airflow.
#                                Default: apache/airflow:2.10.2
# AIRFLOW_UID                  - User ID in Airflow containers
#                                Default: 50000
# AIRFLOW_PROJ_DIR             - Base path to which all the files will be volumed.
#                                Default: .
# Those configurations are useful mostly in case of standalone testing/running Airflow in test/try-out mode
#
# _AIRFLOW_WWW_USER_USERNAME   - Username for the administrator account (if requested).
#                                Default: airflow
# _AIRFLOW_WWW_USER_PASSWORD   - Password for the administrator account (if requested).
#                                Default: airflow
# _PIP_ADDITIONAL_REQUIREMENTS - Additional PIP requirements to add when starting all containers.
#                                Use this option ONLY for quick checks. Installing requirements at container
#                                startup is done EVERY TIME the service is started.
#                                A better way is to build a custom image or extend the official image
#                                as described in https://airflow.apache.org/docs/docker-stack/build.html.
#                                Default: ''
#
# Feel free to modify this file to suit your needs.
---
x-airflow-common:
  &airflow-common
  # In order to add custom dependencies or upgrade provider packages you can use your extended image.
  # Comment the image line, place your Dockerfile in the directory where you placed the docker-compose.yaml
  # and uncomment the "build" line below, then run `docker-compose build` to build the images.
  image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:2.10.2}
  # build: .
  environment:
    &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor
    AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__BROKER_URL: redis://:@redis:6379/0
    AIRFLOW__CORE__FERNET_KEY: ''
    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
    AIRFLOW__CORE__LOAD_EXAMPLES: 'true'
    AIRFLOW__API__AUTH_BACKENDS: 'airflow.api.auth.backend.basic_auth,airflow.api.auth.backend.session'
    # yamllint disable rule:line-length
    # Use simple http server on scheduler for health checks
    # See https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/check-health.html#scheduler-health-check-server
    # yamllint enable rule:line-length
    AIRFLOW__SCHEDULER__ENABLE_HEALTH_CHECK: 'true'
    # WARNING: Use _PIP_ADDITIONAL_REQUIREMENTS option ONLY for quick checks
    # for other purpose (development, test and especially production usage) build/extend Airflow image.
    _PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:-}
    # The following line can be used to set a custom config file, stored in the local config folder
    # If you want to use it, uncomment it and replace airflow.cfg with the name of your config file
    # AIRFLOW_CONFIG: '/opt/airflow/config/airflow.cfg'
  volumes:
    - ${AIRFLOW_PROJ_DIR:-.}/dags:/opt/airflow/dags
    - ${AIRFLOW_PROJ_DIR:-.}/logs:/opt/airflow/logs
    - ${AIRFLOW_PROJ_DIR:-.}/config:/opt/airflow/config
    - ${AIRFLOW_PROJ_DIR:-.}/plugins:/opt/airflow/plugins
  user: "${AIRFLOW_UID:-50000}:0"
  depends_on:
    &airflow-common-depends-on
    redis:
      condition: service_healthy
    postgres:
      condition: service_healthy

services:
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - postgres-db-volume:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 10s
      retries: 5
      start_period: 5s
    restart: always

  redis:
    # Redis is limited to 7.2-bookworm due to licensing change
    # https://redis.io/blog/redis-adopts-dual-source-available-licensing/
    image: redis:7.2-bookworm
    expose:
      - 6379
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 30s
      retries: 50
      start_period: 30s
    restart: always

  airflow-webserver:
    <<: *airflow-common
    command: webserver
    ports:
      - "8080:8080"
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-scheduler:
    <<: *airflow-common
    command: scheduler
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8974/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-worker:
    <<: *airflow-common
    command: celery worker
    healthcheck:
      # yamllint disable rule:line-length
      test:
        - "CMD-SHELL"
        - 'celery --app airflow.providers.celery.executors.celery_executor.app inspect ping -d "celery@$${HOSTNAME}" || celery --app airflow.executors.celery_executor.app inspect ping -d "celery@$${HOSTNAME}"'
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s
    environment:
      <<: *airflow-common-env
      # Required to handle warm shutdown of the celery workers properly
      # See https://airflow.apache.org/docs/docker-stack/entrypoint.html#signal-propagation
      DUMB_INIT_SETSID: "0"
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-triggerer:
    <<: *airflow-common
    command: triggerer
    healthcheck:
      test: ["CMD-SHELL", 'airflow jobs check --job-type TriggererJob --hostname "$${HOSTNAME}"']
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-init:
    <<: *airflow-common
    entrypoint: /bin/bash
    # yamllint disable rule:line-length
    command:
      - -c
      - |
        if [[ -z "${AIRFLOW_UID}" ]]; then
          echo
          echo -e "\033[1;33mWARNING!!!: AIRFLOW_UID not set!\e[0m"
          echo "If you are on Linux, you SHOULD follow the instructions below to set "
          echo "AIRFLOW_UID environment variable, otherwise files will be owned by root."
          echo "For other operating systems you can get rid of the warning with manually created .env file:"
          echo "    See: https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html#setting-the-right-airflow-user"
          echo
        fi
        one_meg=1048576
        mem_available=$$(($$(getconf _PHYS_PAGES) * $$(getconf PAGE_SIZE) / one_meg))
        cpus_available=$$(grep -cE 'cpu[0-9]+' /proc/stat)
        disk_available=$$(df / | tail -1 | awk '{print $$4}')
        warning_resources="false"
        if (( mem_available < 4000 )) ; then
          echo
          echo -e "\033[1;33mWARNING!!!: Not enough memory available for Docker.\e[0m"
          echo "At least 4GB of memory required. You have $$(numfmt --to iec $$((mem_available * one_meg)))"
          echo
          warning_resources="true"
        fi
        if (( cpus_available < 2 )); then
          echo
          echo -e "\033[1;33mWARNING!!!: Not enough CPUS available for Docker.\e[0m"
          echo "At least 2 CPUs recommended. You have $${cpus_available}"
          echo
          warning_resources="true"
        fi
        if (( disk_available < one_meg * 10 )); then
          echo
          echo -e "\033[1;33mWARNING!!!: Not enough Disk space available for Docker.\e[0m"
          echo "At least 10 GBs recommended. You have $$(numfmt --to iec $$((disk_available * 1024 )))"
          echo
          warning_resources="true"
        fi
        if [[ $${warning_resources} == "true" ]]; then
          echo
          echo -e "\033[1;33mWARNING!!!: You have not enough resources to run Airflow (see above)!\e[0m"
          echo "Please follow the instructions to increase amount of resources available:"
          echo "   https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html#before-you-begin"
          echo
        fi
        mkdir -p /sources/logs /sources/dags /sources/plugins
        chown -R "${AIRFLOW_UID}:0" /sources/{logs,dags,plugins}
        exec /entrypoint airflow version
    # yamllint enable rule:line-length
    environment:
      <<: *airflow-common-env
      _AIRFLOW_DB_MIGRATE: 'true'
      _AIRFLOW_WWW_USER_CREATE: 'true'
      _AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow}
      _AIRFLOW_WWW_USER_PASSWORD: ${_AIRFLOW_WWW_USER_PASSWORD:-airflow}
      _PIP_ADDITIONAL_REQUIREMENTS: ''
    user: "0:0"
    volumes:
      - ${AIRFLOW_PROJ_DIR:-.}:/sources

  airflow-cli:
    <<: *airflow-common
    profiles:
      - debug
    environment:
      <<: *airflow-common-env
      CONNECTION_CHECK_MAX_COUNT: "0"
    # Workaround for entrypoint issue. See: https://github.com/apache/airflow/issues/16252
    command:
      - bash
      - -c
      - airflow

  # You can enable flower by adding the "--profile flower" option, e.g. docker-compose --profile flower up,
  # or by explicitly targeting it on the command line, e.g. docker-compose up flower.
  # See: https://docs.docker.com/compose/profiles/
  flower:
    <<: *airflow-common
    command: celery flower
    profiles:
      - flower
    ports:
      - "5555:5555"
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:5555/"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

volumes:
  postgres-db-volume:
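
When you are done experimenting, the stack can be stopped and cleaned up with standard docker-compose commands (--volumes also removes the postgres-db-volume defined above):

docker-compose down                      # stop the containers
docker-compose down --volumes --rmi all  # full cleanup: containers, DB volume, images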


Pushing the code to the repo

Create a new repository on the command line:
echo "# apache_arflow" >> README.md
git init
git add README.md
git commit -m "first commit"
git branch -M main
git remote add origin https://github.com/tariniteam/apache_arflow.git
git push -u origin main

…or push an existing repository from the command line

git remote add origin https://github.com/tariniteam/apache_arflow.git
git branch -M main
git push -u origin main



Friday, September 27, 2024

Docker Post-6: Step-by-step guide to creating your first Dockerfile with a Python script

Docker makes it easy to package and deploy applications. If you're new to Docker, here's a simple guide to creating your first Dockerfile, which runs a Python script (`hello.py`). This post covers setting up the Dockerfile, creating a basic Python script, and handling dependencies through a `requirements.txt` file.

1. Create the Python script `hello.py`

This script is very simple and prints out a message:
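
The script itself is not shown in the surviving post; a minimal version consistent with the description would be:

# hello.py
print("Hello from my first Docker container!")

A matching Dockerfile is sketched below; the base-image tag and the pip step are assumptions, not the post's original file:

# Dockerfile
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY hello.py .
CMD ["python", "hello.py"]

requirements.txt can stay empty for this script. Build and run with:

docker build -t hello-python .
docker run --rm hello-python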

Sunday, September 15, 2024

Post 5: Mounting Multiple Directories to a Single Container Directory Using Docker Run and Symlink



In this post, we will walk through the steps of mounting multiple directories to a single container directory using Docker and symbolic links (symlinks). This can be particularly useful when you need to access the same directory from different paths within your containerized environment.

Step 1: Create the Root Directory
First, you need to create a root directory that will be shared across the symlinks.

For example, let's create a directory named `root_dir1`:

mkdir root_dir1 


Step 2: Create Symlink Directories

Now that we have the root directory, we'll create multiple symlink directories that will all point to the same root directory. Below are instructions for both **Linux Ubuntu** and **Windows PowerShell** users.

For Linux Ubuntu Users:
Use the `ln -s` command to create symbolic links to the root directory:
ln -s /mnt/c/Users/vikik/Projects/PyProjects/root_dir1 sym1
ln -s /mnt/c/Users/vikik/Projects/PyProjects/root_dir1 sym2
ln -s /mnt/c/Users/vikik/Projects/PyProjects/root_dir1 sym3


For Windows PowerShell Users:
Use the `New-Item` command to create symbolic links in PowerShell:

New-Item -ItemType SymbolicLink -Path "sym1" -Target "root_dir1"
New-Item -ItemType SymbolicLink -Path "sym2" -Target "root_dir1"
New-Item -ItemType SymbolicLink -Path "sym3" -Target "root_dir1"

In both cases, `sym1`, `sym2`, and `sym3` will all point to the `root_dir1` directory.

Step 3: Create and Run a Python Docker Container

Now, let's create a Python container and mount the root directory (`root_dir1`) to the container's `/app` path.

Command to Run the Container:

docker run -d --name my-python-container -v C:\Users\vikik\Projects\PyProjects\root_dir1:/app python sleep infinity

This command runs a Python container in detached mode with the root directory mounted to the `/app` path inside the container.

Access the Container:
To access the container, execute the following:
docker exec -it my-python-container /bin/bash

Once inside the container, update the package manager and install the `nano` text editor:

apt update
apt install nano

Now, create a Python script inside the container:
nano test.py

Step 4: Test the Symlink Folders

Any file you create in the root directory (such as `test.py`) will be accessible in all the symlink folders (`sym1`, `sym2`, and `sym3`) since they all point to the same directory.

For example, if you add code to `test.py` in the `/app` directory, it will also appear in `sym1`, `sym2`, and `sym3`.

This setup allows you to mount multiple directories to the same container directory using Docker, while maintaining flexibility with symlinks.
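
To see this from the host (assuming the paths used above), any file created inside the container under /app shows up through every symlink:

ls root_dir1      # test.py
ls sym1           # test.py (the same file, seen through the symlink)
cat sym2/test.py  # same content again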

Conclusion

By following these steps, you can successfully create multiple symlink directories that all point to the same root directory inside a Docker container. This approach is helpful when managing multiple access points to a shared directory within a containerized environment.


Monday, January 3, 2022

SQL Query: Immediate Food Delivery solved using CTE


QUESTION: Table: Delivery

+-----------------------------+---------+
| Column Name                 | Type    |
+-----------------------------+---------+
| delivery_id                 | int     |
| customer_id                 | int     |
| order_date                  | date    |
| customer_pref_delivery_date | date    |
+-----------------------------+---------+
delivery_id is the primary key of this table.
The table holds information about food delivery to customers that make orders at some date and specify a preferred delivery date (on the same order date or after it).


If the preferred delivery date of the customer is the same as the order date then the order is called immediate otherwise it's called scheduled.

The first order of a customer is the order with the earliest order date that customer made. It is guaranteed that a customer has exactly one first order.

Write an SQL query to find the percentage of immediate orders in the first orders of all customers, rounded to 2 decimal places.

The query result format is in the following example:

Delivery table:
+-------------+-------------+------------+-----------------------------+
| delivery_id | customer_id | order_date | customer_pref_delivery_date |
+-------------+-------------+------------+-----------------------------+
| 1           | 1           | 2019-08-01 | 2019-08-02                  |
| 2           | 2           | 2019-08-02 | 2019-08-02                  |
| 3           | 1           | 2019-08-11 | 2019-08-12                  |
| 4           | 3           | 2019-08-24 | 2019-08-24                  |
| 5           | 3           | 2019-08-21 | 2019-08-22                  |
| 6           | 2           | 2019-08-11 | 2019-08-13                  |
| 7           | 4           | 2019-08-09 | 2019-08-09                  |
+-------------+-------------+------------+-----------------------------+

Result table:
+----------------------+
| immediate_percentage |
+----------------------+
| 50.00                |
+----------------------+
The customer id 1 has a first order with delivery id 1 and it is scheduled.
The customer id 2 has a first order with delivery id 2 and it is immediate.
The customer id 3 has a first order with delivery id 5 and it is scheduled.
The customer id 4 has a first order with delivery id 7 and it is immediate.
Hence, half the customers have immediate first orders.
SOLUTION:
  • Step1) Find the first order of each customer using the dense_rank function
  • Step2) Get the immediate orders from the first-order set
  • Step3) Get the percentage of immediate orders out of the total first orders (a compact single-pass variant is sketched after the query)
;with FirstOrderCTE as
(select
  delivery_id, customer_id, order_date,
  customer_pref_delivery_date,
  dense_rank() over (partition by customer_id order by order_date asc) as rnk
 from Delivery
)
,immediate_order as
(select count(*) as immediate_order_count
 from FirstOrderCTE
 where rnk = 1 and order_date = customer_pref_delivery_date
)
,percent_of_immediate_order as
(select
  cast(((select immediate_order_count from immediate_order limit 1) /
        cast(count(*) as numeric(8,2))) * 100 as numeric(8,2)) as immediate_percentage
 from FirstOrderCTE where rnk = 1
)
select immediate_percentage from percent_of_immediate_order
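
As an aside, the same rank-then-filter idea collapses into a single pass; this compact variant (a sketch in the same PostgreSQL-style SQL) computes the percentage directly over the first orders:

with FirstOrder as
(select *,
        dense_rank() over (partition by customer_id order by order_date) as rnk
 from Delivery
)
select round(100.0 * sum(case when order_date = customer_pref_delivery_date
                              then 1 else 0 end) / count(*), 2) as immediate_percentage
from FirstOrder
where rnk = 1;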

Saturday, January 1, 2022

SQL Query: Department Top Three Salaries using Window Function

QUESTION: Table: Employee

+--------------+---------+
| Column Name  | Type    |
+--------------+---------+
| Id           | int     |
| Name         | varchar |
| Salary       | int     |
| DepartmentId | int     |
+--------------+---------+
Id is the primary key for this table.
Each row contains the ID, name, salary, and department of one employee.


Table: Department

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| Id          | int     |
| Name        | varchar |
+-------------+---------+
Id is the primary key for this table.
Each row contains the ID and the name of one department.


A company's executives are interested in seeing who earns the most money in each of the company's departments. A high earner in a department is an employee who has a salary in the top three unique salaries for that department.

Write an SQL query to find the employees who are high earners in each of the departments.

Return the result table in any order.

The query result format is in the following example:



Employee table:
+----+-------+--------+--------------+
| Id | Name  | Salary | DepartmentId |
+----+-------+--------+--------------+
| 1  | Joe   | 85000  | 1            |
| 2  | Henry | 80000  | 2            |
| 3  | Sam   | 60000  | 2            |
| 4  | Max   | 90000  | 1            |
| 5  | Janet | 69000  | 1            |
| 6  | Randy | 85000  | 1            |
| 7  | Will  | 70000  | 1            |
+----+-------+--------+--------------+

Department table:
+----+-------+
| Id | Name  |
+----+-------+
| 1  | IT    |
| 2  | Sales |
+----+-------+

Result table:
+------------+----------+--------+
| Department | Employee | Salary |
+------------+----------+--------+
| IT         | Max      | 90000  |
| IT         | Joe      | 85000  |
| IT         | Randy    | 85000  |
| IT         | Will     | 70000  |
| Sales      | Henry    | 80000  |
| Sales      | Sam      | 60000  |
+------------+----------+--------+

In the IT department:
- Max earns the highest unique salary
- Both Randy and Joe earn the second-highest unique salary
- Will earns the third-highest unique salary

In the Sales department:
- Henry earns the highest salary
- Sam earns the second-highest salary
- There is no third-highest salary as there are only two employees
SOLUTION:
  • Step1) Use dense_rank to rank salaries within each department, ordered by salary descending
  • Step2) Filter the above result where the dense rank is less than or equal to 3

;with EmpCTE as
(select d.Name as Department, e.Name as Employee, e.Salary as Salary,
        dense_rank() over (partition by e.DepartmentId order by e.Salary desc) as drno
 from Employee e
 inner join Department d on e.DepartmentId = d.Id
)
select Department, Employee, Salary
from EmpCTE
where drno <= 3

SQL Query: Maximum Transaction Each Day

QUESTION: Table: Transactions

+----------------+----------+
| Column Name    | Type     |
+----------------+----------+
| transaction_id | int      |
| day            | datetime |
| amount         | int      |
+----------------+----------+
transaction_id is the primary key for this table.
Each row contains information about one transaction.


Write an SQL query to report the IDs of the transactions with the maximum amount on their respective day. If in one day there are multiple such transactions, return all of them.

Return the result table in ascending order by transaction_id.

The query result format is in the following example:



Transactions table:
+----------------+--------------------+--------+
| transaction_id | day                | amount |
+----------------+--------------------+--------+
| 8              | 2021-4-3 15:57:28  | 57     |
| 9              | 2021-4-28 08:47:25 | 21     |
| 1              | 2021-4-29 13:28:30 | 58     |
| 5              | 2021-4-28 16:39:59 | 40     |
| 6              | 2021-4-29 23:39:28 | 58     |
+----------------+--------------------+--------+

Result table:
+----------------+
| transaction_id |
+----------------+
| 1              |
| 5              |
| 6              |
| 8              |
+----------------+
"2021-4-3"  --> We have one transaction with ID 8, so we add 8 to the result table.
"2021-4-28" --> We have two transactions with IDs 5 and 9. The transaction with ID 5 has an amount of 40, while the transaction with ID 9 has an amount of 21. We only include the transaction with ID 5 as it has the maximum amount this day.
"2021-4-29" --> We have two transactions with IDs 1 and 6. Both transactions have the same amount of 58, so we include both in the result table.
We order the result table by transaction_id after collecting these IDs.
SOLUTION:
  • step1) Get each day's maximum transaction amount
  • step2) Join the Transactions table with the above CTE on date and amount

;with highestAmt as
(select date(day) as day, max(amount) as amount
 from Transactions
 group by date(day)
)
,resultCte as
(select t.transaction_id
 from Transactions t
 inner join highestAmt h
   on date(t.day) = h.day and t.amount = h.amount
)
select transaction_id from resultCte order by transaction_id asc
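
A window-function variant of the same idea (a sketch): rank transactions within each day by amount and keep rank 1, which naturally includes ties:

select transaction_id
from (select transaction_id,
             rank() over (partition by date(day) order by amount desc) as rnk
      from Transactions) t
where rnk = 1
order by transaction_id;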


SQL Query: Banned Account Problem

QUESTION: Table: LogInfo

+-------------+----------+
| Column Name | Type     |
+-------------+----------+
| account_id  | int      |
| ip_address  | int      |
| login       | datetime |
| logout      | datetime |
+-------------+----------+
There is no primary key for this table, and it may contain duplicates.
The table contains information about the login and logout dates of Leetflex accounts. It also contains the IP address from which the account logged in and out.
It is guaranteed that the logout time is after the login time.


Write an SQL query to find the account_id of the accounts that should be banned from Leetflex. An account should be banned if it was logged in at some moment from two different IP addresses.

Return the result table in any order.

The query result format is in the following example:



LogInfo table:
+------------+------------+---------------------+---------------------+
| account_id | ip_address | login               | logout              |
+------------+------------+---------------------+---------------------+
| 1          | 1          | 2021-02-01 09:00:00 | 2021-02-01 09:30:00 |
| 1          | 2          | 2021-02-01 08:00:00 | 2021-02-01 11:30:00 |
| 2          | 6          | 2021-02-01 20:30:00 | 2021-02-01 22:00:00 |
| 2          | 7          | 2021-02-02 20:30:00 | 2021-02-02 22:00:00 |
| 3          | 9          | 2021-02-01 16:00:00 | 2021-02-01 16:59:59 |
| 3          | 13         | 2021-02-01 17:00:00 | 2021-02-01 17:59:59 |
| 4          | 10         | 2021-02-01 16:00:00 | 2021-02-01 17:00:00 |
| 4          | 11         | 2021-02-01 17:00:00 | 2021-02-01 17:59:59 |
+------------+------------+---------------------+---------------------+

Result table:
+------------+
| account_id |
+------------+
| 1          |
| 4          |
+------------+
Account ID 1 --> The account was active from "2021-02-01 09:00:00" to "2021-02-01 09:30:00" with two different IP addresses (1 and 2). It should be banned.
Account ID 2 --> The account was active from two different addresses (6, 7) but at two different times.
Account ID 3 --> The account was active from two different addresses (9, 13) on the same day but they do not intersect at any moment.
Account ID 4 --> The account was active from "2021-02-01 17:00:00" to "2021-02-01 17:00:00" with two different IP addresses (10 and 11). It should be banned.
SOLUTION:
  • step1) Create a self join with the condition
    • l1.account_id = l2.account_id and l1.ip_address <> l2.ip_address
  • Step2) Filter the data on the basis of
    • l2.login <= l1.logout and l2.login >= l1.login

select distinct l1.account_id
from LogInfo l1
inner join LogInfo l2
  on l1.account_id = l2.account_id and l1.ip_address <> l2.ip_address
where l2.login <= l1.logout and l2.login >= l1.login

Friday, December 31, 2021

SQL Query: Biggest Window Between Visits

QUESTION: Table: UserVisits

+-------------+------+
| Column Name | Type |
+-------------+------+
| user_id     | int  |
| visit_date  | date |
+-------------+------+
This table does not have a primary key.
This table contains logs of the dates that users visited a certain retailer.


Assume today's date is '2021-1-1'.

Write an SQL query that will, for each user_id, find out the largest window of days between each visit and the one right after it (or today if you are considering the last visit).

Return the result table ordered by user_id.

The query result format is in the following example:



UserVisits table:
+---------+------------+
| user_id | visit_date |
+---------+------------+
| 1       | 2020-11-28 |
| 1       | 2020-10-20 |
| 1       | 2020-12-3  |
| 2       | 2020-10-5  |
| 2       | 2020-12-9  |
| 3       | 2020-11-11 |
+---------+------------+
Result table:
+---------+---------------+
| user_id | biggest_window|
+---------+---------------+
| 1       | 39            |
| 2       | 65            |
| 3       | 51            |
+---------+---------------+
For the first user, the windows in question are between dates:
    - 2020-10-20 and 2020-11-28 with a total of 39 days. 
    - 2020-11-28 and 2020-12-3 with a total of 5 days. 
    - 2020-12-3 and 2021-1-1 with a total of 29 days.
Making the biggest window the one with 39 days.
For the second user, the windows in question are between dates:
    - 2020-10-5 and 2020-12-9 with a total of 65 days.
    - 2020-12-9 and 2021-1-1 with a total of 23 days.
Making the biggest window the one with 65 days.
For the third user, the only window in question is between dates 2020-11-11 and 2021-1-1 with a total of 51 days.
Solution: 
  • Step1) First generate the row number over a partition by user_id
  • Step2) Then simulate LEAD via a self-join to get the next visit date (a direct LEAD-based version is sketched after the query)
  • Step3) Find the difference between the next visit date and the current visit date
  • Step4) Get the max day difference per user_id

with UserVisitCTE as
(select row_number() over (partition by user_id order by visit_date) as rno, *
 from UserVisits
)
,UserVisitLeadCTE as
(select f.user_id,
        (COALESCE(s.visit_date, '2021-01-01') - f.visit_date) as days_window
 from UserVisitCTE f
 left join UserVisitCTE s
   on f.rno + 1 = s.rno and f.user_id = s.user_id
)
select user_id, max(days_window) as biggest_window
from UserVisitLeadCTE
group by user_id
order by user_id
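
And the LEAD-based version mentioned in Step 2 (a sketch); it avoids the row_number self-join entirely:

with NextVisit as
(select user_id,
        COALESCE(lead(visit_date) over (partition by user_id order by visit_date),
                 '2021-01-01') - visit_date as days_window
 from UserVisits
)
select user_id, max(days_window) as biggest_window
from NextVisit
group by user_id
order by user_id;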

SQL Query: Second Degree Follower

QUESTION: In Facebook, there is a follow table with two columns: followee, follower.

Please write a SQL query to get the number of followers of each follower, if he/she has any.

For example:

+-------------+------------+
| followee    | follower   |
+-------------+------------+
|     A       |     B      |
|     B       |     C      |
|     B       |     D      |
|     D       |     E      |
+-------------+------------+
Should output:
+-------------+------------+
| follower    | num        |
+-------------+------------+
|     B       |  2         |
|     D       |  1         |
+-------------+------------+
Explanation:
Both B and D exist in the follower list; as followees, B's followers are C and D, and D's follower is E. A does not appear in the follower list.
Note:
A followee would not follow himself/herself in any case.
Please display the result in the follower's alphabetical order.
SOLUTION: 
  • Step1) implement self join on follow table
  • Step2) join follower of left table to the followee of right table using inner join
  • Step3) group by left table follower id and take the count
select f.follower, count(distinct fr.follower) as num
from follow f
inner join follow fr
  on f.follower = fr.followee
group by f.follower
order by f.follower


SQL Query: Exchange Seats Problem

Mary is a teacher in a middle school and she has a table seat storing students' names and their corresponding seat ids.

The column id is a continuous increment.

Mary wants to change seats for the adjacent students.

Can you write a SQL query to output the result for Mary?



+---------+---------+
|    id   | student |
+---------+---------+
|    1    | Abbot   |
|    2    | Doris   |
|    3    | Emerson |
|    4    | Green   |
|    5    | Jeames  |
+---------+---------+
For the sample input, the output is:

+---------+---------+
|    id   | student |
+---------+---------+
|    1    | Doris   |
|    2    | Abbot   |
|    3    | Green   |
|    4    | Emerson |
|    5    | Jeames  |
+---------+---------+
Note:

If the number of students is odd, there is no need to change the last one's seat.
Solution:

  • Step1) We get the even id set of student
  • Step2) we get the odd id set of student
  • Step3) get the even id with odd student name
  • step4) get the odd id with even student name
  • step5) union and sort step3 and step4

;with even as
(select * from seat s1 where s1.id % 2 = 0)
,odd as
(select * from seat s2 where s2.id % 2 = 1)
,comb as
(
 select e.id, COALESCE(o.student, e.student) as student
 from even e left join odd o
   on e.id = o.id + 1
 union all
 select o.id, COALESCE(e.student, o.student) as student
 from odd o left join even e
   on o.id + 1 = e.id
)
select * from comb order by id asc


Tuesday, December 28, 2021

SQL Query: Count Student Number in Departments

A university uses 2 data tables, student and department, to store data about its students and the departments associated with each major.

Write a query to print the respective department name and number of students majoring in each department for all departments in the department table (even ones with no current students).

Sort your results by descending number of students; if two or more departments have the same number of students, then sort those departments alphabetically by department name.

The student table is described as follows:

| Column Name  | Type      |
|--------------|-----------|
| student_id   | Integer   |
| student_name | String    |
| gender       | Character |
| dept_id      | Integer   |
 where student_id is the student's ID number, student_name is the student's name, gender is their gender, and dept_id is the department ID associated with their declared major.

And the department table is described as below:

| Column Name | Type    |
|-------------|---------|
| dept_id     | Integer |
| dept_name   | String  |
 where dept_id is the department's ID number and dept_name is the department name.

Here is an example input:
student table:

| student_id | student_name | gender | dept_id |
|------------|--------------|--------|---------|
| 1          | Jack         | M      | 1       |
| 2          | Jane         | F      | 1       |
| 3          | Mark         | M      | 2       |
department table:

| dept_id | dept_name   |
|---------|-------------|
| 1       | Engineering |
| 2       | Science     |
| 3       | Law         |
The Output should be:

| dept_name   | student_number |
|-------------|----------------|
| Engineering | 2              |
| Science     | 1              |
| Law         | 0              |
Solution:
select d.dept_name, count(s.dept_id) as student_number
from department d
left join student s
  on d.dept_id = s.dept_id
group by d.dept_name
order by count(s.dept_id) desc, d.dept_name asc

SQL Query: Winning Candidate

Table: Candidate
 +-----+---------+
 | id  | Name    |
 +-----+---------+
 | 1   | A       |
 | 2   | B       |
 | 3   | C       |
 | 4   | D       |
 | 5   | E       |
 +-----+---------+  
Table: Vote
 +-----+--------------+
 | id  | CandidateId  |
 +-----+--------------+
 | 1   |     2        |
 | 2   |     4        |
 | 3   |     3        |
 | 4   |     2        |
 | 5   |     5        |
 +-----+--------------+
id is the auto-increment primary key,
CandidateId is the id appeared in Candidate table.
Write a SQL query to find the name of the winning candidate; the above example will return the winner B.

 +------+
 |  Name |
 +------+
 | B    |
 +------+
Notes:

You may assume there is no tie, in other words there will be only one winning candidate.
Solution:


PostgreSQL / MySQL:

select c.Name from Candidate c 
inner join vote v
on c.id = v.CandidateId
group by c.Name
order by count(*) desc
limit 1

T-SQL:

select top 1 c.Name  from Candidate c 
inner join vote v
on c.id = v.CandidateId
group by c.Name
order by count(*) desc



SQL Query: Rising Temperature Days

Table: Weather

+---------------+---------+
| Column Name   | Type    |
+---------------+---------+
| id            | int     |
| recordDate    | date    |
| temperature   | int     |
+---------------+---------+
id is the primary key for this table.
This table contains information about the temperature in a certain day.


Write an SQL query to find the ids of all dates with a higher temperature compared to the previous date (yesterday).

Return the result table in any order.

The query result format is in the following example:

Weather
+----+------------+-------------+
| id | recordDate | Temperature |
+----+------------+-------------+
| 1  | 2015-01-01 | 10          |
| 2  | 2015-01-02 | 25          |
| 3  | 2015-01-03 | 20          |
| 4  | 2015-01-04 | 30          |
+----+------------+-------------+

Result table:
+----+
| id |
+----+
| 2  |
| 4  |
Solution:
-- join on recordDate rather than id, since ids are not guaranteed to
-- follow date order (PostgreSQL-style date arithmetic; use DATE_ADD in MySQL)
select distinct w2.id
from Weather w1
inner join Weather w2
  on w2.recordDate = w1.recordDate + 1
where w1.Temperature < w2.Temperature

Sunday, December 26, 2021

LeetcodeSQL: 180. Consecutive Numbers [implementing Lag and Lead]

Table: Logs

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| id          | int     |
| num         | varchar |
+-------------+---------+
id is the primary key for this table.

 

Write an SQL query to find all numbers that appear at least three times consecutively.

Return the result table in any order.

The query result format is in the following example.

 

Example 1:

Input: 
Logs table:
+----+-----+
| id | num |
+----+-----+
| 1  | 1   |
| 2  | 1   |
| 3  | 1   |
| 4  | 2   |
| 5  | 1   |
| 6  | 2   |
| 7  | 2   |
+----+-----+
Output: 
+-----------------+
| ConsecutiveNums |
+-----------------+
| 1               |
+-----------------+
Explanation: 1 is the only number that appears consecutively for at least three times.
Solution:
select distinct l0.num as ConsecutiveNums
from Logs l0 
inner join Logs l1 on l1.id - 1 = l0.id 
inner join Logs l2 on l1.id + 1 = l2.id
where l0.num = l1.num and l1.num = l2.num
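
As the title suggests, the same check can also be written with LAG and LEAD instead of self-joins (a sketch):

select distinct num as ConsecutiveNums
from (select num,
             lag(num)  over (order by id) as prev_num,
             lead(num) over (order by id) as next_num
      from Logs) t
where num = prev_num and num = next_num;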

Misc SQL: Create a monthly allowance report and fill NULL for months where an employee has not received any allowance

Question: Generate a month-wise allowance report for each employee, keeping NULL in the allowance column for months where the employee has not received any allowance.

emp_id  emp_name  month_number  allowance
E001    Aayansh   1             1000
E001    Aayansh   2             3000
E002    Rishika   3             2000
E002    Rishika   5             4000

Output should be :

month_number  emp_id  emp_name  allowance
1             E001    Aayansh   1000
2             E001    Aayansh   3000
3             E001    Aayansh   null
4             E001    Aayansh   null
5             E001    Aayansh   null
6             E001    Aayansh   null
7             E001    Aayansh   null
8             E001    Aayansh   null
9             E001    Aayansh   null
10            E001    Aayansh   null
11            E001    Aayansh   null
12            E001    Aayansh   null
(the rows for E002 follow the same pattern)

Solution: 

Steps:
1) Generate a month-number derived CTE
2) Cross join MonthCTE with the EmpAllowance table and take the distinct records
3) Left join the cross-join result from step 2 to the EmpAllowance table and take the allowance from EmpAllowance

Query:

;with MonthCTE AS
(select 1 as month_number 
 union all 
 select month_number + 1 as month_number from MonthCTE 
 where month_number < 12
 ),
ReportTemplateCTE as 
(select DISTINCT m.month_number, e.emp_id, e.emp_name from MonthCTE m cross join EmpAllowance e)
 
select r.*, e.allowance from ReportTemplateCTE r left join EmpAllowance e
on r.month_number = e.month_number and r.emp_id = e.emp_id
order by r.emp_id
 


Misc: Generate duplicate rows based on quantity - CTE

Ques: You have below records:

item_name     item_qty
Keyboard          3
Mouse             5

You need to duplicate each record depending upon the value of item_qty.

Sol:

This can be done using a recursive CTE:


;With DupCreationCTE(item_name, item_qty, item_counter) as
(
  select item_name, item_qty, 1 as item_counter from DupCreation
  union All
  select item_name, item_qty, item_counter + 1 as item_counter
  from DupCreationCTE where item_counter < item_qty
)
SELECT * from DupCreationCTE order by item_name, item_counter asc


Output:

item_name   item_qty  item_counter
"Keyboard"  3         1
"Keyboard"  3         2
"Keyboard"  3         3
"Mouse"     5         1
"Mouse"     5         2
"Mouse"     5         3
"Mouse"     5         4
"Mouse"     5         5

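As an aside, on PostgreSQL the same expansion can be done without recursion using generate_series (a sketch, assuming the same DupCreation table):

select d.item_name, d.item_qty, g.item_counter
from DupCreation d
cross join lateral generate_series(1, d.item_qty) as g(item_counter)
order by d.item_name, g.item_counter;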


Thursday, December 23, 2021

176. Second Highest Salary

Table: Employee

+-------------+------+
| Column Name | Type |
+-------------+------+
| id          | int  |
| salary      | int  |
+-------------+------+
id is the primary key column for this table.
Each row of this table contains information about the salary of an employee.

 

Write an SQL query to report the second highest salary from the Employee table. If there is no second highest salary, the query should report null.

The query result format is in the following example.

 

Example 1:

Input: 
Employee table:
+----+--------+
| id | salary |
+----+--------+
| 1  | 100    |
| 2  | 200    |
| 3  | 300    |
+----+--------+
Output: 
+---------------------+
| SecondHighestSalary |
+---------------------+
| 200                 |
+---------------------+

Example 2:

Input: 
Employee table:
+----+--------+
| id | salary |
+----+--------+
| 1  | 100    |
+----+--------+
Output: 
+---------------------+
| SecondHighestSalary |
+---------------------+
| null                |
+---------------------+
Solution:
-- dense_rank handles duplicate top salaries; the outer scalar subquery
-- returns NULL when there is no second highest salary
select (
    select distinct salary
    from (
        select salary, dense_rank() over (order by salary desc) as rno
        from Employee
    ) a
    where a.rno = 2
) as SecondHighestSalary

175. Combine Two Tables

Table: Person

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| personId    | int     |
| lastName    | varchar |
| firstName   | varchar |
+-------------+---------+
personId is the primary key column for this table.
This table contains information about the ID of some persons and their first and last names.

 

Table: Address

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| addressId   | int     |
| personId    | int     |
| city        | varchar |
| state       | varchar |
+-------------+---------+
addressId is the primary key column for this table.
Each row of this table contains information about the city and state of one person with ID = PersonId.

 

Write an SQL query to report the first name, last name, city, and state of each person in the Person table. If the address of a personId is not present in the Address table, report null instead.

Return the result table in any order.

The query result format is in the following example.

 

Example 1:

Input: 
Person table:
+----------+----------+-----------+
| personId | lastName | firstName |
+----------+----------+-----------+
| 1        | Wang     | Allen     |
| 2        | Alice    | Bob       |
+----------+----------+-----------+
Address table:
+-----------+----------+---------------+------------+
| addressId | personId | city          | state      |
+-----------+----------+---------------+------------+
| 1         | 2        | New York City | New York   |
| 2         | 3        | Leetcode      | California |
+-----------+----------+---------------+------------+
Output: 
+-----------+----------+---------------+----------+
| firstName | lastName | city          | state    |
+-----------+----------+---------------+----------+
| Allen     | Wang     | Null          | Null     |
| Bob       | Alice    | New York City | New York |
+-----------+----------+---------------+----------+
Explanation: 
There is no address in the address table for the personId = 1 so we return null in their city and state.
addressId = 1 contains information about the address of personId = 2.
Solution:
select p.firstname, p.lastname, a.city, a.state 
from Person p left join Address a 
on p.personid = a.personId

Sunday, June 7, 2020

Python Trick: Alternative to if-else/Case statement

Let's look at the code snippet below:

def add_number(a, b):
    print(a + b)

def multiply_number(a, b):
    print(a * b)

def division_number(a, b):
    print(a / b)

result = '30'

if result == '10':
    add_number(10, 20)
elif result == '20':
    multiply_number(12, 2)
elif result == '30':
    division_number(36, 3)


Here we have three methods, and which one is called depends on the value of result: if result is '10', add_number is called; if it is '20', multiply_number is called; and so on.

So you can see that above we have written a multi-line if-elif ladder to call a method based on the result value.

Now let's see the code snippet below. Here we declare a dictionary with the result values as keys and the associated methods as values.

result_dict = {
    '10': add_number,
    '20': multiply_number,
    '30': division_number
}

result = '10'
result_dict[result](1, 2)

The result value is stored in the result variable, which is used as the key into the dictionary; the lookup returns the associated method, which is then called with its arguments. So a single line replaces the whole if-elif ladder.
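
One caveat with dictionary dispatch: indexing with an unknown key raises a KeyError. dict.get lets you supply a fallback handler; the default below is a hypothetical example, not part of the original snippet.

def unknown_result(a, b):
    # fallback when no handler is registered for the key
    print("no handler for this result value")

result_dict.get(result, unknown_result)(1, 2)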


Enjoy the pythonic way :)
