CREATE VIEW statement

Constructs a virtual table with no physical data, based on the result set of a SQL query. ALTER VIEW and DROP VIEW only change metadata.

Syntax

CREATE [ OR REPLACE ] [ TEMPORARY ] VIEW [ IF NOT EXISTS ] view_name
    [ column_list ]
    [ COMMENT view_comment ]
    [ TBLPROPERTIES clause ]
    AS query

column_list
   ( { column_alias [ COMMENT column_comment ] } [, ...] )

Examples

%sql
CREATE OR REPLACE TEMPORARY VIEW demo_tmp2(name, value) AS
VALUES
  ("Yi", 1),
  ("Ali", 2),
  ("Selina", 3)

TEMP can be used in place of TEMPORARY.

%sql
CREATE OR REPLACE TEMP VIEW demo_tmp1(name, value) AS
VALUES
  ("Yi", 1),
  ("Ali", 2),
  ("Selina", 3)

SQL Declare Variable

SET command

Assign a value in SQL with the SET command, then read it back in SQL.

%sql
SET value = 2;
%sql
SELECT ${hiveconf:value} 
%sql
SELECT ${hiveconf:value} AS value
%sql
SET LastChangeDate = current_date()
%sql
Select ${hiveconf:LastChangeDate}
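
The ${hiveconf:...} references above are replaced in the query text before the statement is parsed, and SET stores its value as text. The Python sketch below models only that idea. The names substitute and conf are hypothetical, and the regular expression is a simplification assumed for illustration, not Spark's actual implementation.

```python
import re

# Matches ${name} and ${hiveconf:name}; a simplified pattern assumed for illustration.
_REFERENCE = re.compile(r"\$\{(?:hiveconf:)?([^}]+)\}")


def substitute(sql: str, conf: dict) -> str:
    """Replace each ${...} reference with the text stored under that key."""
    return _REFERENCE.sub(lambda match: conf[match.group(1)], sql)


# SET keeps the right-hand side as text, so an expression is pasted in unevaluated.
conf = {"value": "2", "LastChangeDate": "current_date()"}

expanded_value = substitute("SELECT ${hiveconf:value} AS value", conf)
expanded_date = substitute("Select ${hiveconf:LastChangeDate}", conf)
```

Under this model the second query becomes a call to current_date() that is evaluated when the SELECT runs, not when SET ran.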

spark.conf.set() command

Assign a value in Python with spark.conf.set(), then read it in SQL.

%python 
spark.conf.set("ab.name", "jack") 
spark.conf.set("AB.name", "jack") 

Note that the Python cell sets the value under both ab and AB, so that SQL can read it as either ab or AB.

%sql
SELECT '${AB.name}' AS name

Basic SQL

  1. List databases and tables
  2. Read files with SQL
  3. Load data from a CSV file into a Delta table
  4. Create a temp view from a CSV file
  5. Create an external table that points to a CSV file
  6. CREATE VIEW statement
  7. CREATE TABLE statement

List databases and tables

List databases

%sql
SHOW DATABASES;

List the tables in a database

%sql
USE default;
SHOW TABLES;

Read files with SQL

%sql
SELECT * FROM delta.`${DA.paths.datasets}/nyctaxi-with-zipcodes/data`;
SELECT * FROM text.`dbfs:/databricks-datasets/Rdatasets/data-001/datasets.csv`;
SELECT * FROM csv.`dbfs:/databricks-datasets/Rdatasets/data-001/datasets.csv`;

Read a file as text with SQL

Wrap the path in backticks (`).

%sql
SELECT * FROM text.`dbfs:/databricks-datasets/Rdatasets/data-001/datasets.csv`

Read a file as CSV with SQL

%sql
SELECT * FROM csv.`dbfs:/databricks-datasets/Rdatasets/data-001/datasets.csv`

Load data from a CSV file into a Delta table

Create a table named table1, taking the column names from the query above.

%sql
/*Table creation with schema*/
CREATE OR REPLACE TABLE table1 (
  Package string,
  Item string,
  Title string,
  csv string,
  doc string
);

Run SHOW TABLES again and table1 now appears.

Try SHOW CREATE TABLE.

%sql
SHOW CREATE TABLE table1;

Copy the data from the CSV file into the table.

%sql
/*Copying dbfs csv data into table*/
COPY INTO table1
  FROM "dbfs:/databricks-datasets/Rdatasets/data-001/datasets.csv"
  FILEFORMAT = csv
  FORMAT_OPTIONS('header'='true','inferSchema'='True');

Query the data.

%sql
SELECT * FROM table1

Create a temp view from a CSV file

%sql
CREATE TEMPORARY VIEW view1 USING CSV OPTIONS (
  path "/databricks-datasets/Rdatasets/data-001/datasets.csv",
  header "true"
)
%sql
CREATE TEMPORARY VIEW diamonds USING CSV OPTIONS (
  path "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv",
  header "true"
)

Run SHOW TABLES and you will see that the views are marked as temporary.

Create an external table that points to a CSV file

%sql
CREATE TABLE table2 USING CSV 
OPTIONS ('header' = 'true')
LOCATION '/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv'

The file path can be given either with LOCATION or in OPTIONS (path).

%sql
DROP TABLE IF EXISTS diamonds;

CREATE TABLE diamonds USING CSV OPTIONS (
  path "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv",
  header "true"
)

In OPTIONS, the = sign is optional.

%sql
DROP TABLE IF EXISTS diamonds;

CREATE TABLE diamonds USING CSV OPTIONS (
  path = "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv",
  header = "true"
)

CREATE VIEW statement

CREATE OR REPLACE TEMP VIEW demo_tmp_vw(name, value) AS VALUES
  ("Yi", 1),
  ("Ali", 2),
  ("Selina", 3);

CREATE TEMPORARY VIEW diamonds USING CSV OPTIONS (
  path "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv",
  header "true"
);

CREATE TABLE statement

CREATE OR REPLACE TABLE table1 (
  Package string,
  Item string,
  Title string,
  csv string,
  doc string
);

CREATE TABLE diamonds USING CSV OPTIONS (
  path = "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv",
  header = "true"
);

CREATE TABLE diamonds USING CSV 
OPTIONS ('header' = 'true')
LOCATION '/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv';

OPTIMIZE

Optimizes the layout of Delta Lake data. Optionally optimize a subset of data or colocate data by column. If you do not specify colocation, bin-packing optimization is performed.

Syntax

OPTIMIZE table_name [WHERE predicate]
  [ZORDER BY (col_name1 [, ...] ) ]

Parameters

  • table_name – Identifies an existing Delta table. The name must not include a temporal specification.
  • WHERE – Optimize the subset of rows matching the given partition predicate. Only filters involving partition key attributes are supported.
  • ZORDER BY – Colocate column information in the same set of files. Co-locality is used by Delta Lake data-skipping algorithms to dramatically reduce the amount of data that needs to be read. You can specify multiple columns for ZORDER BY as a comma-separated list. However, the effectiveness of the locality drops with each additional column.

Examples

OPTIMIZE delta.`/data/events`

OPTIMIZE events

OPTIMIZE events WHERE date >= '2022-11-18'

OPTIMIZE events
WHERE date >= current_timestamp() - INTERVAL 1 day
ZORDER BY (eventType)

VACUUM

VACUUM table_name [RETAIN num HOURS] [DRY RUN]

Parameters

  • table_name – Identifies an existing Delta table. The name must not include a temporal specification.
  • RETAIN num HOURS – The retention threshold.
  • DRY RUN – Return a list of up to 1000 files to be deleted.

List the files that would be deleted

VACUUM table_name RETAIN 720 HOURS DRY RUN

Delete the files

VACUUM table_name RETAIN 720 HOURS

Retention period    RETAIN value
1 day               24 HOURS
1 week              168 HOURS
2 weeks             336 HOURS
30 days             720 HOURS
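
The day-to-hour conversions listed above are plain multiplication by 24. As a minimal sketch, the hypothetical helpers below (retain_hours and vacuum_statement are not part of any Databricks API) build the statement text from a retention period given in days.

```python
def retain_hours(days: int) -> int:
    """Convert a retention period in days to the value used by RETAIN num HOURS."""
    if days < 0:
        raise ValueError("retention period must not be negative")
    return days * 24


def vacuum_statement(table_name: str, days: int, dry_run: bool = False) -> str:
    """Build the VACUUM statement text; running it is left to the caller."""
    statement = f"VACUUM {table_name} RETAIN {retain_hours(days)} HOURS"
    return statement + " DRY RUN" if dry_run else statement


preview = vacuum_statement("table_name", 30, dry_run=True)
```

Building the DRY RUN form first makes it easy to review the file list before running the same statement without DRY RUN.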

Configure data retention for time travel

To time travel to a previous version, you must retain both the log and the data files for that version.

The data files backing a Delta table are never deleted automatically; data files are deleted only when you run VACUUM. VACUUM does not delete Delta log files; log files are automatically cleaned up after checkpoints are written.

By default you can time travel to a Delta table up to 30 days old unless you have:

  • Run VACUUM on your Delta table.
  • Changed the data or log file retention periods using the following table properties:
    • delta.logRetentionDuration = "interval <interval>": controls how long the history for a table is kept. The default is interval 30 days. Each time a checkpoint is written, Databricks automatically cleans up log entries older than the retention interval. If you set this config to a large enough value, many log entries are retained. This should not impact performance as operations against the log are constant time. Operations on history are parallel but will become more expensive as the log size increases.
    • delta.deletedFileRetentionDuration = "interval <interval>": controls how long ago a file must have been deleted before being a candidate for VACUUM. The default is interval 7 days. To access 30 days of historical data even if you run VACUUM on the Delta table, set delta.deletedFileRetentionDuration = "interval 30 days". This setting may cause your storage costs to go up.

If you run VACUUM with the defaults, data files that are no longer referenced by the table and are older than 7 days are deleted.

To keep data files for longer than 7 days without having to specify RETAIN num HOURS on every run, set delta.deletedFileRetentionDuration first and then run VACUUM.

SET and UNSET TBLPROPERTIES

ALTER TABLE table_name
   { RENAME TO clause |
     ADD COLUMN clause |
     ALTER COLUMN clause |
     DROP COLUMN clause |
     RENAME COLUMN clause |
     ADD CONSTRAINT clause |
     DROP CONSTRAINT clause |
     ADD PARTITION clause |
     DROP PARTITION clause |
     RENAME PARTITION clause |
     RECOVER PARTITIONS clause |
     SET TBLPROPERTIES clause |
     UNSET TBLPROPERTIES clause |
     SET SERDE clause |
     SET LOCATION clause |
     SET OWNER TO clause }

Example

ALTER TABLE dbx.tab1 SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 30 days');
ALTER TABLE dbx.tab1 SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 30 days');

ALTER TABLE dbx.tab1 UNSET TBLPROPERTIES ('delta.deletedFileRetentionDuration');

Retail Revenue & Supply Chain – Databricks SQL

Analyze key retail and supply chain performance indicators for a fictional enterprise.

  1. Counter – Overall Customer Count
  2. Counter – TPCH – Number Suppliers
  3. Map – National Revenue Map
  4. Bar – National Revenue Trends
  5. Table – Customer Value
  6. Line – Order Revenue

1. Counter – Overall Customer Count

SELECT
  COUNT(distinct(c_custkey))
FROM
  `samples`.`tpch`.`customer`

2. Counter – TPCH – Number Suppliers

SELECT
  COUNT(distinct(s_suppkey)) AS num_suppliers
FROM
  `samples`.`tpch`.`supplier`

3. Map – National Revenue Map

SELECT
    initcap(n_name) AS `Nation`, 
    SUM(l_extendedprice * (1 - l_discount) * (length(n_name)/100)) AS revenue
FROM
    `samples`.`tpch`.`customer`,
    `samples`.`tpch`.`orders`,
    `samples`.`tpch`.`lineitem`,
    `samples`.`tpch`.`supplier`,
    `samples`.`tpch`.`nation`,
    `samples`.`tpch`.`region`
WHERE
    c_custkey = o_custkey
    AND l_orderkey = o_orderkey
    AND l_suppkey = s_suppkey
    AND c_nationkey = s_nationkey
    AND s_nationkey = n_nationkey
    AND n_regionkey = r_regionkey
GROUP BY
    INITCAP(n_name)
ORDER BY
    revenue DESC;

4. Bar – National Revenue Trends

SELECT
    year(o_orderdate) AS year,
    n_name AS nation,
    sum(l_extendedprice * (1 - l_discount) * (((length(n_name))/100) + (year(o_orderdate)-1993)/100)) AS revenue
FROM
    `samples`.`tpch`.`customer`,
    `samples`.`tpch`.`orders`,
    `samples`.`tpch`.`lineitem`,
    `samples`.`tpch`.`supplier`,
    `samples`.`tpch`.`nation`,
    `samples`.`tpch`.`region`
WHERE
    c_custkey = o_custkey
    AND l_orderkey = o_orderkey
    AND l_suppkey = s_suppkey
    AND c_nationkey = s_nationkey
    AND s_nationkey = n_nationkey
    AND n_regionkey = r_regionkey
    AND n_name in ('ARGENTINA', 'UNITED KINGDOM', 'FRANCE','BRAZIL', 'CHINA', 'UNITED STATES', 'JAPAN', 'JORDAN')
    AND o_orderdate >= DATE '1994-01-01'
GROUP BY
    1,2
ORDER BY
    nation ASC LIMIT 1000;

5. Table – Customer Value

SELECT
  customer_id AS `Customer ID #`,
  concat(
    '<div class="bg-',
    CASE
      WHEN total_revenue BETWEEN 0
      AND 1500000 THEN 'success'
      WHEN total_revenue BETWEEN 1500001
      AND 3000000 THEN 'warning'
      WHEN total_revenue BETWEEN 3000001
      AND 5000000 THEN 'danger'
      ELSE 'danger'
    END,
    '  text-center"> $',
    format_number(total_revenue, 0),
    '</div>'
  ) AS `Total Customer Revenue`
FROM
  (
    SELECT
      o_custkey AS customer_id,
      sum(o_totalprice) as total_revenue
    FROM
      `samples`.`tpch`.`orders`
    GROUP BY
      1
    HAVING
      total_revenue > 0
  )
ORDER BY
  1
LIMIT
  400

6. Line – Order Revenue

SELECT
  o_orderdate AS Date,
  o_orderpriority AS Priority,
  sum(o_totalprice) AS `Total Price`
FROM
  `samples`.`tpch`.`orders`
WHERE
  o_orderdate > '1994-01-01'
  AND o_orderdate < '1994-01-31'
GROUP BY
  1,
  2
ORDER BY
  1,
  2

NYC Taxi Trip Analysis – Databricks SQL

Explore NYC taxi rides over a one month time frame.

  1. The nyctaxi.trips table
  2. Counter – Total Trips
  3. Table – Route Revenues
  4. Chart – Pickup Hour Distribution
  5. Scatter – Daily Fare to Distance Analysis

1. The nyctaxi.trips table

SHOW CREATE TABLE samples.nyctaxi.trips

CREATE TABLE samples.nyctaxi.trips (
  tpep_pickup_datetime TIMESTAMP,
  tpep_dropoff_datetime TIMESTAMP,
  trip_distance DOUBLE,
  fare_amount DOUBLE,
  pickup_zip INT,
  dropoff_zip INT
) USING delta LOCATION 'dbfs:/databricks-datasets/nyctaxi-with-zipcodes/subsampled'

SELECT * 
FROM samples.nyctaxi.trips 
LIMIT 5

#  tpep_pickup_datetime     tpep_dropoff_datetime    trip_distance  fare_amount  pickup_zip  dropoff_zip
1  2016-02-14 16:52:13.000  2016-02-14 17:16:04.000  4.94           19.00        10282       10171
2  2016-02-04 18:44:19.000  2016-02-04 18:46:00.000  0.28           3.50         10110       10110
3  2016-02-17 17:13:57.000  2016-02-17 17:17:55.000  0.70           5.00         10103       10023
4  2016-02-18 10:36:07.000  2016-02-18 10:41:45.000  0.80           6.00         10022       10017
5  2016-02-22 14:14:41.000  2016-02-22 14:31:52.000  4.51           17.00        10110       10282

2. Counter – Total Trips

USE CATALOG SAMPLES;
SELECT
  count(*) as total_trips
FROM
  `samples`.`nyctaxi`.`trips`
WHERE
  tpep_pickup_datetime BETWEEN TIMESTAMP '{{ pickup_date.start }}'
  AND TIMESTAMP '{{ pickup_date.end }}'
  AND pickup_zip IN ({{ pickupzip }})

Counter

3. Table – Route Revenues

USE CATALOG SAMPLES;
SELECT
  T.route as `Route`,
  T.frequency as `Route Frequency`,
  T.total_fare as `Total Fares`
FROM
  (
    SELECT
      concat(pickup_zip, '-', dropoff_zip) AS route,
      count(*) as frequency,
      SUM(fare_amount) as total_fare
    FROM
      `samples`.`nyctaxi`.`trips`
    WHERE
      tpep_pickup_datetime BETWEEN TIMESTAMP '{{ pickup_date.start }}'
      AND TIMESTAMP '{{ pickup_date.end }}'
      AND pickup_zip IN ({{ pickupzip }})
    GROUP BY
      1
  ) T
ORDER BY
  1 ASC
LIMIT
  200

Table

4. Chart – Pickup Hour Distribution

USE CATALOG SAMPLES;
SELECT
  CASE
    WHEN T.pickup_hour = 0 THEN '00:00'
    WHEN T.pickup_hour = 1 THEN '01:00'
    WHEN T.pickup_hour = 2 THEN '02:00'
    WHEN T.pickup_hour = 3 THEN '03:00'
    WHEN T.pickup_hour = 4 THEN '04:00'
    WHEN T.pickup_hour = 5 THEN '05:00'
    WHEN T.pickup_hour = 6 THEN '06:00'
    WHEN T.pickup_hour = 7 THEN '07:00'
    WHEN T.pickup_hour = 8 THEN '08:00'
    WHEN T.pickup_hour = 9 THEN '09:00'
    WHEN T.pickup_hour = 10 THEN '10:00'
    WHEN T.pickup_hour = 11 THEN '11:00'
    WHEN T.pickup_hour = 12 THEN '12:00'
    WHEN T.pickup_hour = 13 THEN '13:00'
    WHEN T.pickup_hour = 14 THEN '14:00'
    WHEN T.pickup_hour = 15 THEN '15:00'
    WHEN T.pickup_hour = 16 THEN '16:00'
    WHEN T.pickup_hour = 17 THEN '17:00'
    WHEN T.pickup_hour = 18 THEN '18:00'
    WHEN T.pickup_hour = 19 THEN '19:00'
    WHEN T.pickup_hour = 20 THEN '20:00'
    WHEN T.pickup_hour = 21 THEN '21:00'
    WHEN T.pickup_hour = 22 THEN '22:00'
    WHEN T.pickup_hour = 23 THEN '23:00'
    ELSE 'N/A'
  END AS `Pickup Hour`,
  T.num AS `Number of Rides`
FROM
  (
    SELECT
      hour(tpep_pickup_datetime) AS pickup_hour,
      COUNT(*) AS num
    FROM
      `samples`.`nyctaxi`.`trips`
    WHERE
      tpep_pickup_datetime BETWEEN TIMESTAMP '{{ pickup_date.start }}'
      AND TIMESTAMP '{{ pickup_date.end }}'
      AND pickup_zip IN ({{ pickupzip }})
    GROUP BY
      1
  ) T

Chart

5. Scatter – Daily Fare to Distance Analysis

USE CATALOG SAMPLES;
SELECT
  T.weekday,
  CASE
    WHEN T.weekday = 1 THEN 'Sunday'
    WHEN T.weekday = 2 THEN 'Monday'
    WHEN T.weekday = 3 THEN 'Tuesday'
    WHEN T.weekday = 4 THEN 'Wednesday'
    WHEN T.weekday = 5 THEN 'Thursday'
    WHEN T.weekday = 6 THEN 'Friday'
    WHEN T.weekday = 7 THEN 'Saturday'
    ELSE 'N/A'
  END AS day_of_week,
  T.fare_amount,
  T.trip_distance
FROM
  (
    SELECT
      dayofweek(tpep_pickup_datetime) as weekday,
      *
    FROM
      `samples`.`nyctaxi`.`trips`
    WHERE
      (
        pickup_zip in ({{ pickupzip }})
        OR pickup_zip in (10018)
      )
      AND tpep_pickup_datetime BETWEEN TIMESTAMP '{{ pickup_date.start }}'
      AND TIMESTAMP '{{ pickup_date.end }}'
      AND trip_distance < 10
  ) T
ORDER BY
  T.weekday
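
The CASE expression above relies on dayofweek() numbering Sunday as 1 and Saturday as 7. Python's isoweekday() numbers Monday as 1 and Sunday as 7, so reproducing the same labels outside SQL needs a small conversion. The sketch below is one way to do it; spark_dayofweek, day_name and DAY_NAMES are hypothetical names, and the sample timestamp is the first row of the trips table shown earlier.

```python
from datetime import datetime

# Index 0 corresponds to dayofweek() = 1 (Sunday).
DAY_NAMES = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]


def spark_dayofweek(ts: datetime) -> int:
    """Map isoweekday() (Mon=1..Sun=7) to the Sunday=1..Saturday=7 numbering."""
    return ts.isoweekday() % 7 + 1


def day_name(ts: datetime) -> str:
    """Return the label that the CASE expression assigns to this timestamp."""
    return DAY_NAMES[spark_dayofweek(ts) - 1]


pickup = datetime(2016, 2, 14, 16, 52, 13)
weekday_number = spark_dayofweek(pickup)
weekday_label = day_name(pickup)
```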

Scatter

Databricks – Date and Time

  1. current_date function
  2. now function
  3. timestampadd function
  4. date_format function
  5. to_date function
  6. from_unixtime function
  7. to_unix_timestamp function

current_date function (Databricks SQL)

Returns the current date at the start of query evaluation.

current_date()

Arguments

This function takes no arguments.

Returns

A DATE.

The parentheses are optional.

Examples

> SELECT current_date()
 2022-08-23
> SELECT current_date;
 2022-08-23

now function (Databricks SQL)

Returns the current timestamp at the start of query evaluation.

now()

Arguments

This function takes no arguments.

Returns

A TIMESTAMP.

Examples

> SELECT now()
 2022-08-23T04:57:51.871+0000
> SELECT current_timestamp()
> SELECT current_timestamp

timestampadd function (Databricks SQL)

Adds value units to a timestamp expr.

timestampadd(unit, value, expr)

unit
 { MICROSECOND |
   MILLISECOND |
   SECOND |
   MINUTE |
   HOUR |
   DAY | DAYOFYEAR |
   WEEK |
   MONTH |
   QUARTER |
   YEAR }

Returns

A TIMESTAMP.

> SELECT timestampadd(MICROSECOND, 5, TIMESTAMP'2022-02-28 00:00:00');
 2022-02-28 00:00:00.000005

-- March 31. 2022 minus 1 month yields February 28. 2022
> SELECT timestampadd(MONTH, -1, TIMESTAMP'2022-03-31 00:00:00');
 2022-02-28 00:00:00.000000

> SELECT timestampadd(HOUR, +7, current_timestamp())
> SELECT timestampadd(HOUR, (+ 7), current_timestamp())
 2023-03-10T11:07:18.513+0000
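
The MONTH example above shows that when the target month is shorter, the day is clamped to the last day of that month. The Python sketch below reproduces only that clamping rule; add_months is a hypothetical helper written for illustration, not the Databricks implementation.

```python
import calendar
from datetime import datetime


def add_months(ts: datetime, months: int) -> datetime:
    """Shift ts by a number of months, clamping the day to the month's last day."""
    # Work with a zero-based month index so divmod handles year rollover.
    total = ts.year * 12 + (ts.month - 1) + months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    last_day = calendar.monthrange(year, month)[1]
    return ts.replace(year=year, month=month, day=min(ts.day, last_day))


shifted = add_months(datetime(2022, 3, 31), -1)
```

Here shifted is 2022-02-28 00:00:00, matching the example output.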

date_format function (Databricks SQL)

Converts a timestamp to a string in the format fmt.

date_format(expr, fmt)

Returns

A STRING.

See Datetime patterns for details on valid formats.

> SELECT date_format('2016-04-08', 'y');
 2016

> SELECT date_format(TIMESTAMPADD(HOUR, +7, current_timestamp()),'yyyyMMdd') AS dt
 20230310

to_date function (Databricks SQL)

Returns expr cast to a date using an optional formatting.

to_date(expr [, fmt] )

Returns

A DATE.

> SELECT to_date('2022-08-24 07:00:00');
 2022-08-24
> SELECT to_date('2022-08-24', 'yyyy-MM-dd');
 2022-08-24

from_unixtime function (Databricks SQL)

Returns unixTime in fmt.

from_unixtime(unixTime [, fmt])

Arguments

  • unixTime: A BIGINT expression representing seconds elapsed since 1969-12-31 at 16:00:00 (UTC-8), which is the same instant as the Unix epoch, 1970-01-01 at 00:00:00 UTC.
  • fmt: An optional STRING expression with a valid format.

Returns

A STRING.

See Datetime patterns (Databricks SQL) for valid formats. The ‘yyyy-MM-dd HH:mm:ss’ pattern is used if omitted.

Examples

> SELECT from_unixtime(0);
 1970-01-01 00:00:00
> SELECT from_unixtime(0, 'yyyy-MM-dd HH:mm:ss');
 1970-01-01 00:00:00
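
The two epoch dates that appear in the argument description are the same instant seen from different time zones. The Python sketch below illustrates this with the standard library. The function name mirrors the SQL one for readability only, and the fixed UTC-8 offset is an assumption standing in for US Pacific standard time.

```python
from datetime import datetime, timedelta, timezone


def from_unixtime(unix_time: int, tz: timezone = timezone.utc) -> str:
    """Format seconds since the epoch as 'yyyy-MM-dd HH:mm:ss' in the given zone."""
    return datetime.fromtimestamp(unix_time, tz=tz).strftime("%Y-%m-%d %H:%M:%S")


utc_text = from_unixtime(0)
pacific_text = from_unixtime(0, timezone(timedelta(hours=-8)))
```

With a UTC session time zone the result matches the examples above; with a UTC-8 zone the same value 0 renders as 1969-12-31 16:00:00.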

to_unix_timestamp function (Databricks SQL)

Returns the timestamp in expr as a UNIX timestamp.

to_unix_timestamp(expr [, fmt] )

Arguments

  • expr: A STRING expression representing a timestamp.
  • fmt: An optional format STRING expression.

Returns

A BIGINT.

If fmt is supplied, it must conform with Datetime patterns (Databricks SQL).

If fmt is not supplied, the function is a synonym for cast(expr AS TIMESTAMP).

If fmt is malformed or its application does not result in a well formed timestamp, the function raises an error.

Examples

> SELECT to_unix_timestamp(current_date())
 1661299200
> SELECT to_unix_timestamp('2022-08-24', 'yyyy-MM-dd')
 1661299200
> SELECT to_unix_timestamp(current_timestamp())
 1661328640
> SELECT  to_unix_timestamp('2022-08-24 08:11:00', 'yyyy-MM-dd HH:mm:ss')
 1661328660
> SELECT to_unix_timestamp(current_timestamp()) - to_unix_timestamp(current_date())
 29230
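
The example values above can be checked outside Databricks. The sketch below assumes the session time zone is UTC, which is what the sample outputs imply. The function name mirrors the SQL one for readability and takes Python strptime patterns rather than Databricks datetime patterns.

```python
from datetime import datetime, timezone


def to_unix_timestamp(text: str, fmt: str = "%Y-%m-%d %H:%M:%S") -> int:
    """Parse text with a strptime pattern and return seconds since the epoch (UTC)."""
    parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


midnight = to_unix_timestamp("2022-08-24", "%Y-%m-%d")
later = to_unix_timestamp("2022-08-24 08:11:00")

# Subtracting the two gives the seconds elapsed since midnight (8 h 11 min).
seconds_since_midnight = later - midnight
```

The last subtraction follows the same pattern as the final SQL example, which returns the number of seconds elapsed since the start of the current day.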

Databricks SQL – Basic SQL

  1. Retrieving Data
  2. Column Expressions
  3. Updating Data
  4. Subqueries
  5. Joins
  6. Aggregations

1. Retrieving Data

SELECT

This simple command retrieves data from a table. The “*” represents “Select All,” so the command is selecting all data from the table.

However, note that only 1,000 rows were retrieved. Databricks SQL defaults to only retrieving 1,000 rows from a table. If you wish to retrieve more, deselect the checkbox “LIMIT 1000”.

USE demo;
SELECT * FROM customers;

SELECT … AS

By adding the AS keyword, we can change the name of the column in the results.

Note that the column customer_name has been renamed Customer

USE demo;
SELECT customer_name AS Customer
FROM customers;

DISTINCT

If we add the DISTINCT keyword, we can ensure that we do not repeat data in the table.

There are more than 1,000 records that have a state in the state field. But, we only see 51 results because there are only 51 distinct state names.

USE demo;
SELECT DISTINCT state FROM customers;

WHERE

The WHERE keyword allows us to filter the data.

We are selecting from the customers table, but we are limiting the results to those customers who have a loyalty_segment of 3.

USE demo;
SELECT * FROM customers WHERE loyalty_segment = 3;

GROUP BY

We can run a simple COUNT aggregation by adding count() and GROUP BY to our query.

GROUP BY is normally paired with an aggregate function such as count(). We will discuss more aggregations later on.

USE demo;
SELECT loyalty_segment, count(loyalty_segment)
FROM customers
GROUP BY loyalty_segment;

ORDER BY

By adding ORDER BY to the query we just ran, we can place the results in a specific order.

ORDER BY defaults to ascending order. We can change the order to descending by adding DESC after the column name in the ORDER BY clause.

USE demo;
SELECT loyalty_segment, count(loyalty_segment)
FROM customers
GROUP BY loyalty_segment
ORDER BY loyalty_segment;

2. Column Expressions

Mathematical Expressions of Two Columns

In our queries, we can run calculations on the data in our tables. This can range from simple mathematical calculations to more complex computations involving built-in functions.

The results show that the Calculated Discount, the one we generated using Column Expressions, matches the Discounted Price.

USE demo;
SELECT sales_price - sales_price * promo_disc AS Calculated_Discount,
discounted_price AS Discounted_Price 
FROM promo_prices;

Built-In Functions — String Column Manipulation

There are many, many Built-In Functions. We are going to talk about just a handful, so you can get a feel for how they work.

We are going to use a built-in function called lower(). This function takes a string expression and returns the same expression with all characters changed to lowercase. Let’s have a look.

USE demo;
SELECT lower(city) AS City 
FROM customers;

Although the letters are now all lowercase, they are not the way they need to be. We want to have the first letter of each word capitalized.

USE demo;
SELECT initcap(city) AS City 
FROM customers;

Date Functions

We want to use a function to make the date more human-readable. Let’s use from_unixtime().

The date looks better, but let’s adjust the formatting. Formatting options for many of the date and time functions are listed under Datetime patterns in the Databricks SQL documentation.

USE demo;
SELECT from_unixtime(promo_began, 'd MMM, y') AS Beginning_Date 
FROM promo_prices;

Date Calculations

In this code, we are using the function current_date() to get today’s date. We are then nesting from_unixtime() inside to_date in order to convert promo_began to a date object. We can then run the calculation.

USE demo;
SELECT current_date() - to_date(from_unixtime(promo_began)) FROM promo_prices;

CASE … WHEN

Often, it is important for us to use conditional logic in our queries. CASE … WHEN provides us this ability.

This statement allows us to change numeric values that represent loyalty segments into human-readable strings. It is certainly true that this association would more-likely occur using a join on two tables, but we can still see the logic behind CASE … WHEN

USE demo;
SELECT customer_name, loyalty_segment,
    CASE 
        WHEN loyalty_segment = 0 THEN 'Rare'
        WHEN loyalty_segment = 1 THEN 'Occasional'
        WHEN loyalty_segment = 2 THEN 'Frequent'
        WHEN loyalty_segment = 3 THEN 'Daily'
    END AS Loyalty 
FROM customers;

3. Updating Data

UPDATE

Let’s make those changes.

The UPDATE does exactly what it sounds like: It updates the table based on the criteria specified.

USE demo;
UPDATE customers SET city = initcap(lower(city));

SELECT city FROM customers;

INSERT INTO

In addition to updating data, we can insert new data into the table.

INSERT INTO is a command for inserting data into a table.

USE demo;
INSERT INTO loyalty_segments 
    (loyalty_segment_id, loyalty_segment_description, unit_threshold, valid_from, valid_to)
VALUES
    (4, 'level_4', 100, current_date(), Null);

SELECT * FROM loyalty_segments;

INSERT TABLE

INSERT TABLE is a command for inserting entire tables into other tables. There are two tables suppliers and source_suppliers that currently have the exact same data.

After selecting from the table again, we note that the number of rows has doubled. This is because INSERT TABLE inserts all data in the source table, whether or not there are duplicates.

USE demo;
INSERT INTO suppliers TABLE source_suppliers;

SELECT * FROM suppliers;

INSERT OVERWRITE

If we want to completely replace the contents of a table, we can use INSERT OVERWRITE.

After running INSERT OVERWRITE and then retrieving a count(*) from the table, we see that we are back to the original count of rows in the table. INSERT OVERWRITE has replaced all the rows.

USE demo;
INSERT OVERWRITE suppliers TABLE source_suppliers;
SELECT * FROM suppliers;
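
The difference between appending and overwriting can be tried on a small scale with Python's built-in sqlite3 module, used here as a stand-in for Databricks SQL. SQLite has neither INSERT INTO ... TABLE nor INSERT OVERWRITE, so this sketch uses INSERT ... SELECT for the append and a DELETE followed by INSERT ... SELECT in one transaction for the overwrite. The two-row supplier data is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE suppliers (supplier_id INTEGER, name TEXT);
CREATE TABLE source_suppliers (supplier_id INTEGER, name TEXT);
INSERT INTO source_suppliers VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO suppliers SELECT * FROM source_suppliers;
""")


def row_count(table: str) -> int:
    """Return the number of rows in the named table."""
    return conn.execute(f"SELECT count(*) FROM {table}").fetchone()[0]


original_count = row_count("suppliers")

# Append: every source row is inserted again, duplicates included.
conn.execute("INSERT INTO suppliers SELECT * FROM source_suppliers")
appended_count = row_count("suppliers")

# Overwrite emulation: clear the table and reload it in a single transaction.
with conn:
    conn.execute("DELETE FROM suppliers")
    conn.execute("INSERT INTO suppliers SELECT * FROM source_suppliers")
overwritten_count = row_count("suppliers")
```

The row count doubles after the append and returns to the original value after the overwrite, as described above.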

4. Subqueries

Let’s create two new tables.

These two commands use subqueries to SELECT from the customers table using specific criteria. The results are then fed into two CREATE OR REPLACE TABLE statements. Incidentally, this type of statement is often called a CTAS statement, for CREATE TABLE … AS SELECT.

USE demo;
CREATE OR REPLACE TABLE high_loyalty_customers AS
    SELECT * FROM customers WHERE loyalty_segment = 3;
CREATE OR REPLACE TABLE low_loyalty_customers AS
    SELECT * FROM customers WHERE loyalty_segment = 1;

5. Joins

We are now going to run a couple of JOIN queries. The first is the most common JOIN, an INNER JOIN. Since INNER JOIN is the default, we can just write JOIN.

In this statement, we are joining the customers table and the loyalty_segments table. When the loyalty_segment from the customers table matches the loyalty_segment_id from the loyalty_segments table, the rows are combined. We are then able to view customer_name from customers together with loyalty_segment_description and unit_threshold from loyalty_segments.

USE demo;
SELECT
    customer_name,
    loyalty_segment_description,
    unit_threshold
FROM customers
INNER JOIN loyalty_segments
ON customers.loyalty_segment = loyalty_segments.loyalty_segment_id;

CROSS JOIN

Even though the CROSS JOIN isn’t used very often, I wanted to demonstrate it.

First of all, note the use of UNION ALL. All this does is combine the results of all three queries, so we can view them all in one results set. The Customers row shows the count of rows in the customers table. Likewise, the Sales row shows the count of the sales table. Crossed shows the number of rows after performing the CROSS JOIN.

USE demo;
SELECT "Sales", count(*) FROM sales
UNION ALL
SELECT "Customers", count(*) FROM customers
UNION ALL
SELECT "Crossed", count(*) FROM customers
  CROSS JOIN sales;
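
A CROSS JOIN pairs every row of one table with every row of the other, so the Crossed count equals the product of the two table counts. The sketch below checks this with Python's built-in sqlite3 module as a stand-in for Databricks SQL. The tiny customers and sales tables are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER);
CREATE TABLE sales (sale_id INTEGER);
INSERT INTO customers VALUES (1), (2), (3);
INSERT INTO sales VALUES (10), (20);
""")

rows = conn.execute("""
SELECT 'Sales', count(*) FROM sales
UNION ALL
SELECT 'Customers', count(*) FROM customers
UNION ALL
SELECT 'Crossed', count(*) FROM customers CROSS JOIN sales
""").fetchall()

# Turn the (label, count) rows into a dictionary for easy lookup.
counts = dict(rows)
```

With 3 customers and 2 sales the crossed count is 6, which is why CROSS JOIN results grow quickly on real tables.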

6. Aggregations

Now, let’s move into aggregations. There are many aggregating functions you can use in your queries. Here are just a handful.

Again, we are viewing the results of a handful of queries using a UNION ALL.

USE demo;
SELECT "Sum" Function_Name, sum(units_purchased) AS Value
FROM customers 
WHERE state = 'CA'
UNION ALL
SELECT "Min", min(discounted_price) AS Lowest_Discounted_Price 
FROM promo_prices
UNION ALL
SELECT "Max", max(discounted_price) AS Highest_Discounted_Price 
FROM promo_prices
UNION ALL
SELECT "Avg", avg(total_price) AS Mean_Total_Price 
FROM sales
UNION ALL
SELECT "Standard Deviation", std(total_price) AS SD_Total_Price 
FROM sales
UNION ALL
SELECT "Variance", variance(total_price) AS Variance_Total_Price
FROM sales;

Databricks SQL – Delta Commands

SELECT on Delta Tables

So far, the SQL commands we have used are generic to most flavors of SQL. In the next few queries, we are going to look at commands that are specific to using SELECT on Delta tables.

Delta tables keep a log of changes that we can view by running the command below.

After running DESCRIBE HISTORY, we can see that we are on version number 0 and we can see a timestamp of when this change was made.

USE demo;
DESCRIBE HISTORY customers;

SELECT on Delta Tables — Updating the Table

We are going to make a change to the table.

The code uses an UPDATE statement to make a change to the table. We will be discussing UPDATE later on. For now, we just need to understand that a change was made to the table. We also reran our DESCRIBE HISTORY command, and note that we have a new version in the log, with a new timestamp.

USE demo;
UPDATE customers SET loyalty_segment = 10 WHERE loyalty_segment = 0;
DESCRIBE HISTORY customers;

SELECT on Delta Tables — VERSION AS OF

We can now use a special predicate for use with Delta tables: VERSION AS OF

By using VERSION AS OF, we can SELECT from specific versions of the table. This feature of Delta tables is called “Time Travel,” and is very powerful.

We can also use TIMESTAMP AS OF to SELECT based on a table’s state on a specific date, and you can find more information in the documentation.

USE demo;
SELECT loyalty_segment FROM customers VERSION AS OF 1;

MERGE INTO

Certainly, there are times when we want to insert new data but ensure we don’t re-insert matched data. This is where we use MERGE INTO. MERGE INTO will merge two tables together, but you specify in which column to look for matched data and what to do when a match is found. Let’s run the code and examine the command in more detail.

USE demo;
MERGE INTO suppliers
    USING source_suppliers
    ON suppliers.SUPPLIER_ID = source_suppliers.SUPPLIER_ID
    WHEN NOT MATCHED THEN INSERT *;
SELECT count(*) FROM suppliers;
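
The effect of WHEN NOT MATCHED THEN INSERT * can be sketched with Python's built-in sqlite3 module. SQLite has no MERGE INTO, so this example swaps in INSERT ... SELECT ... WHERE NOT EXISTS, which inserts only source rows whose key is absent from the target. It is an emulation of this one clause, not of MERGE INTO in general, and the supplier rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE suppliers (supplier_id INTEGER, name TEXT);
CREATE TABLE source_suppliers (supplier_id INTEGER, name TEXT);
INSERT INTO suppliers VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO source_suppliers VALUES (2, 'Globex'), (3, 'Initech');
""")

# Insert only the source rows that have no matching supplier_id in the target.
conn.execute("""
INSERT INTO suppliers
SELECT s.supplier_id, s.name
FROM source_suppliers AS s
WHERE NOT EXISTS (
    SELECT 1 FROM suppliers AS t WHERE t.supplier_id = s.supplier_id
)
""")

merged_ids = [
    row[0]
    for row in conn.execute("SELECT supplier_id FROM suppliers ORDER BY supplier_id")
]
```

Supplier 2 exists in both tables and is not inserted again, so the target ends up with three rows instead of four.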