Databricks SQL – Basic SQL

  1. Retrieving Data
  2. Column Expressions
  3. Updating Data
  4. Subqueries
  5. Joins
  6. Aggregations

1. Retrieving Data

SELECT

This simple command retrieves data from a table. The “*” represents “select all,” so the command selects all columns and rows from the table.

However, note that only 1,000 rows were retrieved. Databricks SQL defaults to only retrieving 1,000 rows from a table. If you wish to retrieve more, deselect the checkbox “LIMIT 1000”.

USE demo;
SELECT * FROM customers;
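
If we want to control the number of rows returned from the query itself rather than through the UI checkbox, we can add an explicit LIMIT clause; a minimal sketch:

USE demo;
-- Return at most 10 rows from the customers table.
SELECT * FROM customers LIMIT 10;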

SELECT … AS

By adding the AS keyword, we can change the name of the column in the results.

Note that the column customer_name has been renamed Customer

USE demo;
SELECT customer_name AS Customer
FROM customers;

DISTINCT

If we add the DISTINCT keyword, we can ensure that we do not repeat values in the results.

There are more than 1,000 records that have a state in the state field, but we only see 51 results because there are only 51 distinct state names.

USE demo;
SELECT DISTINCT state FROM customers;

WHERE

The WHERE keyword allows us to filter the data.

We are selecting from the customers table, but we are limiting the results to those customers who have a loyalty_segment of 3.

USE demo;
SELECT * FROM customers WHERE loyalty_segment = 3;

GROUP BY

We can run a simple COUNT aggregation by adding count() and GROUP BY to our query.

GROUP BY requires an aggregating function. We will discuss more aggregations later on.

USE demo;
SELECT loyalty_segment, count(loyalty_segment)
FROM customers
GROUP BY loyalty_segment;

ORDER BY

By adding ORDER BY to the query we just ran, we can place the results in a specific order.

ORDER BY defaults to ordering in ascending order. We can change the order to descending by adding DESC after the ORDER BY clause.

USE demo;
SELECT loyalty_segment, count(loyalty_segment)
FROM customers
GROUP BY loyalty_segment
ORDER BY loyalty_segment;
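
For example, adding DESC sorts the same results from the highest segment to the lowest; a small variation on the query above:

USE demo;
SELECT loyalty_segment, count(loyalty_segment)
FROM customers
GROUP BY loyalty_segment
ORDER BY loyalty_segment DESC;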

2. Column Expressions

Mathematical Expressions of Two Columns

In our queries, we can run calculations on the data in our tables. This can range from simple mathematical calculations to more complex computations involving built-in functions.

The results show that the Calculated Discount, the one we generated using Column Expressions, matches the Discounted Price.

USE demo;
SELECT sales_price - sales_price * promo_disc AS Calculated_Discount,
discounted_price AS Discounted_Price 
FROM promo_prices;

Built-In Functions — String Column Manipulation

There are many, many Built-In Functions. We are going to talk about just a handful, so you can get a feel for how they work.

We are going to use a built-in function called lower(). This function takes a string expression and returns the same expression with all characters changed to lowercase. Let’s have a look.

USE demo;
SELECT lower(city) AS City 
FROM customers;

Although the letters are now all lowercase, they are not the way they need to be. We want to have the first letter of each word capitalized.

USE demo;
SELECT initcap(city) AS City 
FROM customers;

Date Functions

We want to use a function to make the date more human-readable. Let’s use from_unixtime().

The date looks better, but let’s adjust the formatting. Formatting options for many of the date and time functions are documented in the Databricks SQL datetime-pattern reference.

USE demo;
SELECT from_unixtime(promo_began, 'd MMM, y') AS Beginning_Date 
FROM promo_prices;

Date Calculations

In this code, we are using the function current_date() to get today’s date. We then nest from_unixtime() inside to_date() in order to convert promo_began to a date object. We can then run the calculation.

USE demo;
SELECT current_date() - to_date(from_unixtime(promo_began)) FROM promo_prices;
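
As an alternative sketch, the built-in datediff() function expresses the same “days between” calculation more explicitly:

USE demo;
-- datediff(endDate, startDate) returns the number of days between the two dates.
SELECT datediff(current_date(), to_date(from_unixtime(promo_began))) AS Days_Since_Promo_Began
FROM promo_prices;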

CASE … WHEN

Often, it is important for us to use conditional logic in our queries. CASE … WHEN provides us this ability.

This statement allows us to change numeric values that represent loyalty segments into human-readable strings. This association would more likely be handled with a join between two tables, but we can still see the logic behind CASE … WHEN.

USE demo;
SELECT customer_name, loyalty_segment,
    CASE 
        WHEN loyalty_segment = 0 THEN 'Rare'
        WHEN loyalty_segment = 1 THEN 'Occasional'
        WHEN loyalty_segment = 2 THEN 'Frequent'
        WHEN loyalty_segment = 3 THEN 'Daily'
    END AS Loyalty 
FROM customers;

3. Updating Data

UPDATE

Let’s make the capitalization changes we previewed with lower() and initcap() permanent in the table.

The UPDATE does exactly what it sounds like: It updates the table based on the criteria specified.

USE demo;
UPDATE customers SET city = initcap(lower(city));

SELECT city FROM customers;

INSERT INTO

In addition to updating data, we can insert new data into the table.

INSERT INTO is a command for inserting data into a table.

USE demo;
INSERT INTO loyalty_segments 
    (loyalty_segment_id, loyalty_segment_description, unit_threshold, valid_from, valid_to)
VALUES
    (4, 'level_4', 100, current_date(), Null);

SELECT * FROM loyalty_segments;

INSERT TABLE

INSERT TABLE is a command for inserting entire tables into other tables. There are two tables, suppliers and source_suppliers, that currently contain exactly the same data.

After selecting from the table again, we note that the number of rows has doubled. This is because INSERT TABLE inserts all data in the source table, whether or not there are duplicates.

USE demo;
INSERT INTO suppliers TABLE source_suppliers;

SELECT * FROM suppliers;

INSERT OVERWRITE

If we want to completely replace the contents of a table, we can use INSERT OVERWRITE.

After running INSERT OVERWRITE and then retrieving a count(*) from the table, we see that we are back to the original count of rows in the table. INSERT OVERWRITE has replaced all the rows.

USE demo;
INSERT OVERWRITE suppliers TABLE source_suppliers;
SELECT * FROM suppliers;
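
To check the row count mentioned above, we can run a quick count after the overwrite:

USE demo;
SELECT count(*) FROM suppliers;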

4. Subqueries

Let’s create two new tables.

These two commands use subqueries to SELECT from the customers table using specific criteria. The results are then fed into CREATE OR REPLACE TABLE statements. Incidentally, this type of statement is often called a CTAS statement, for CREATE TABLE … AS SELECT.

USE demo;
CREATE OR REPLACE TABLE high_loyalty_customers AS
    SELECT * FROM customers WHERE loyalty_segment = 3;
CREATE OR REPLACE TABLE low_loyalty_customers AS
    SELECT * FROM customers WHERE loyalty_segment = 1;

5. Joins

We are now going to run a couple of JOIN queries. The first is the most common JOIN, an INNER JOIN. Since INNER JOIN is the default, we can just write JOIN.

In this statement, we are joining the customers and loyalty_segments tables. When the loyalty_segment from the customers table matches the loyalty_segment_id from the loyalty_segments table, the rows are combined. We are then able to view the customer_name, loyalty_segment_description, and unit_threshold from both tables.

USE demo;
SELECT
    customer_name,
    loyalty_segment_description,
    unit_threshold
FROM customers
INNER JOIN loyalty_segments
ON customers.loyalty_segment = loyalty_segments.loyalty_segment_id;

CROSS JOIN

Even though the CROSS JOIN isn’t used very often, I wanted to demonstrate it.

First of all, note the use of UNION ALL. All this does is combine the results of all three queries, so we can view them all in one results set. The Customers row shows the count of rows in the customers table. Likewise, the Sales row shows the count of the sales table. Crossed shows the number of rows after performing the CROSS JOIN.

USE demo;
SELECT "Sales", count(*) FROM sales
UNION ALL
SELECT "Customers", count(*) FROM customers
UNION ALL
SELECT "Crossed", count(*) FROM customers
  CROSS JOIN sales;

6. Aggregations

Now, let’s move into aggregations. There are many aggregating functions you can use in your queries. Here are just a handful.

Again, we are viewing the results of a handful of queries using a UNION ALL.

USE demo;
SELECT "Sum" Function_Name, sum(units_purchased) AS Value
FROM customers 
WHERE state = 'CA'
UNION ALL
SELECT "Min", min(discounted_price) AS Lowest_Discounted_Price 
FROM promo_prices
UNION ALL
SELECT "Max", max(discounted_price) AS Highest_Discounted_Price 
FROM promo_prices
UNION ALL
SELECT "Avg", avg(total_price) AS Mean_Total_Price 
FROM sales
UNION ALL
SELECT "Standard Deviation", std(total_price) AS SD_Total_Price 
FROM sales
UNION ALL
SELECT "Variance", variance(total_price) AS Variance_Total_Price
FROM sales;

Databricks SQL – Delta Commands

SELECT on Delta Tables

So far, the SQL commands we have used are generic to most flavors of SQL. In the next few queries, we are going to look at commands that are specific to using SELECT on Delta tables.

Delta tables keep a log of changes that we can view by running the command below.

After running DESCRIBE HISTORY, we can see that we are on version number 0 and we can see a timestamp of when this change was made.

USE demo;
DESCRIBE HISTORY customers;

SELECT on Delta Tables — Updating the Table

We are going to make a change to the table.

The code uses an UPDATE statement to make a change to the table. We will be discussing UPDATE later on. For now, we just need to understand that a change was made to the table. We also reran our DESCRIBE HISTORY command, and note that we have a new version in the log, with a new timestamp.

USE demo;
UPDATE customers SET loyalty_segment = 10 WHERE loyalty_segment = 0;
DESCRIBE HISTORY customers;

SELECT on Delta Tables — VERSION AS OF

We can now use a special predicate for use with Delta tables: VERSION AS OF.

By using VERSION AS OF, we can SELECT from specific versions of the table. This feature of Delta tables is called “Time Travel,” and is very powerful.

We can also use TIMESTAMP AS OF to SELECT based on a table’s state on a specific date, and you can find more information in the documentation.

USE demo;
SELECT loyalty_segment FROM customers VERSION AS OF 1;
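
For example, a TIMESTAMP AS OF query might look like the following; the timestamp shown is only a placeholder, so substitute one that appears in your table's DESCRIBE HISTORY output:

USE demo;
SELECT loyalty_segment FROM customers TIMESTAMP AS OF '2022-08-01 00:00:00';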

MERGE INTO

Certainly, there are times when we want to insert new data but ensure we don’t re-insert matched data. This is where we use MERGE INTO. MERGE INTO will merge two tables together, but you specify in which column to look for matched data and what to do when a match is found. Let’s run the code and examine the command in more detail.

USE demo;
MERGE INTO suppliers
    USING source_suppliers
    ON suppliers.SUPPLIER_ID = source_suppliers.SUPPLIER_ID
    WHEN NOT MATCHED THEN INSERT *;
SELECT count(*) FROM suppliers;

SQL warehouses

What is a Databricks SQL warehouse?

This article introduces SQL warehouses (formerly SQL endpoints) and describes how to work with them using the Databricks SQL UI. A SQL warehouse is a compute resource that lets you run SQL commands on data objects within Databricks SQL. Compute resources are infrastructure resources that provide processing capabilities in the cloud.

SQL endpoints name changed to SQL warehouses

Databricks changed the name from SQL endpoint to SQL warehouse because it is more than just an API entry point for running SQL commands. A SQL warehouse is a compute resource for all your data warehousing needs, an integral part of the Lakehouse Platform.

Other compute resource types include Azure Databricks clusters. To work with SQL warehouses using the API, see SQL Warehouses APIs 2.0.

Cluster size

The table in this section maps SQL warehouse cluster sizes to Azure Databricks cluster driver size and worker counts. The driver size only applies to Classic SQL warehouses.

Cluster size | Instance type for driver (applies only to Classic warehouses) | Worker count
2X-Small     | Standard_E8ds_v4  | 1
X-Small      | Standard_E8ds_v4  | 2
Small        | Standard_E16ds_v4 | 4
Medium       | Standard_E32ds_v4 | 8
Large        | Standard_E32ds_v4 | 16
X-Large      | Standard_E64ds_v4 | 32
2X-Large     | Standard_E64ds_v4 | 64
3X-Large     | Standard_E64ds_v4 | 128
4X-Large     | Standard_E64ds_v4 | 256

The instance size of all workers is Standard_E8ds_v4.

Each driver and worker has eight 128 GB Standard LRS managed disks attached. Attached disks are charged hourly.

Keyboard shortcuts

Keyboard shortcuts

You can use PowerToys Keyboard Manager to map Windows shortcuts to the shortcuts you use on a Mac.

Operation                                           | Mac                           | Windows
Copy                                                | Command+C                     | Ctrl+C
Cut                                                 | Command+X                     | Ctrl+X
Paste                                               | Command+V                     | Ctrl+V
Undo                                                | Command+Z                     | Ctrl+Z
Save                                                | Command+S                     | Ctrl+S
Open                                                | Command+O                     | Ctrl+O
Lock computer                                       | Command+Control+Q             | WindowsKey+L
Show desktop                                        | Command+F3                    | WindowsKey+D
Open file browser                                   | Command+N                     | WindowsKey+E
Minimize windows                                    | Command+M                     | WindowsKey+M
Search                                              | Command+Space                 | WindowsKey
Close active window                                 | Command+W                     | Control+W
Switch current task                                 | Command+Tab                   | Alt+Tab
Maximize a window to full screen                    | Control+Command+F             | WindowsKey+Up
Save screen (screenshot)                            | Command+Shift+3               | WindowsKey+Shift+S
Save window                                         | Command+Shift+4               | WindowsKey+Shift+S
View item information or properties                 | Command+I                     | Alt+Enter
Select all items                                    | Command+A                     | Ctrl+A
Select more than one item in a list (noncontiguous) | Command, then click each item | Control, then click each item
Type special characters                             | Option+character key          | Alt+character key

Trackpad shortcuts

Some of these shortcuts require a “Precision Trackpad”, such as the trackpad on Surface devices and some other third-party laptops.

Trackpad options are configurable on both platforms.

Operation                            | Mac                          | Windows
Scroll                               | Two finger vertical swipe    | Two finger vertical swipe
Zoom                                 | Two finger pinch in and out  | Two finger pinch in and out
Swipe back and forward between views | Two finger sideways swipe    | Two finger sideways swipe
Switch virtual workspaces            | Four fingers sideways swipe  | Four fingers sideways swipe
Display currently open apps          | Four fingers upward swipe    | Three fingers upward swipe
Switch between apps                  | N/A                          | Slow three finger sideways swipe
Go to desktop                        | Spread out four fingers      | Three finger swipe downwards
Open Cortana / Action center         | Two finger slide from right  | Three finger tap
Open extra information               | Three finger tap             | N/A
Show launchpad / start an app        | Pinch with four fingers      | Tap with four fingers

Python cryptography 

Installation

pip install cryptography

Introduction

cryptography includes both high level recipes and low level interfaces to common cryptographic algorithms such as symmetric ciphers, message digests, and key derivation functions. For example, to encrypt something with cryptography’s high level symmetric encryption recipe:

from cryptography.fernet import Fernet
# Put this somewhere safe!
key = Fernet.generate_key()
print(key.hex())
f = Fernet(key)
token = f.encrypt(b"A really secret message. Not for prying eyes.")
print(token)
# b'...'
print(f.decrypt(token))
# b'A really secret message. Not for prying eyes.'
print(token.hex())

Layout

cryptography is broadly divided into two levels. One level provides safe cryptographic recipes that require little to no configuration. These are easy to use and don’t require developers to make many decisions.

The other level is low-level cryptographic primitives. These are often dangerous and can be used incorrectly. They require making decisions and having an in-depth knowledge of the cryptographic concepts at work. Because of the potential danger in working at this level, this is referred to as the “hazardous materials” or “hazmat” layer. These live in the cryptography.hazmat package, and their documentation will always contain an admonition at the top.

We recommend using the recipes layer whenever possible, and falling back to the hazmat layer only when necessary.
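
As a small illustration of what the hazmat layer looks like, here is a sketch that computes a SHA-256 digest with a low-level primitive; the recipes layer remains the better choice whenever it covers your use case:

from cryptography.hazmat.primitives import hashes

# Low-level ("hazmat") primitive: compute a SHA-256 digest directly.
digest = hashes.Hash(hashes.SHA256())
digest.update(b"hello world")
print(digest.finalize().hex())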

Example

Generate a key and store it in the file secret.key.

from cryptography.fernet import Fernet

# Use Fernet to generate the key file.
key = Fernet.generate_key()

# Store the file to disk to be accessed for en/de:crypting later.
with open('secret.key', 'wb') as new_key_file:
    new_key_file.write(key)
print(key)
# print(key.decode('utf-8'))
print(key.hex())

Load the key from the secret.key file and use it to encrypt a message.

from cryptography.fernet import Fernet
# Load the private key from a file.
with open('secret.key', 'rb') as my_private_key:
    key = my_private_key.read()

msg = "Into the valley of death, rode the 600."
# Encode this as bytes to feed into the algorithm.
# (Refer to Encoding types above).
msg = msg.encode()
# print(msg)

# Instantiate the object with your key.
f = Fernet(key)
# Pass your bytes type message into encrypt.
ciphertext = f.encrypt(msg)
print(ciphertext)

Load the key from the secret.key file and use it to decrypt a message.

from cryptography.fernet import Fernet
# Load the private key from a file.
with open('secret.key', 'rb') as my_private_key:
    key = my_private_key.read()

# Instantiate Fernet on the recipient's system.
f = Fernet(key)

ciphertext = b'gAAAAABi9LSeFJbUUaMH3Ra-xDJkj_U_1xbs2dpzhoHQT2mTCSLIYQYBMvqYIJtrkzu1jI-IyTQCiLGCvKZfcB09Fq_wfUuAVZneIn3mg3Nz3QV4vUvT3N5xk0Iu1VfpGmdWIkrUET8_'

# Decrypt the message.
cleartext = f.decrypt(ciphertext)
# Decode the bytes back into a string.
cleartext = cleartext.decode()
print(cleartext)

Using passwords with Fernet

import base64
import os
from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC
password = b"password"
salt = os.urandom(16)  # random salt for the key derivation function
print(salt.hex())
kdf = PBKDF2HMAC(
    algorithm=hashes.SHA256(),
    length=32,
    salt=salt,
    iterations=390000,
)
key = base64.urlsafe_b64encode(kdf.derive(password))
f = Fernet(key)
token = f.encrypt(b"Secret message!")
print(token)
# b'...'
print(f.decrypt(token))
# b'Secret message!'

In this scheme, the salt has to be stored in a retrievable location in order to derive the same key from the password in the future.

The iteration count used should be adjusted to be as high as your server can tolerate. A good default is at least 480,000 iterations, which is what Django recommends as of July 2022.
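
A minimal sketch of that idea, keeping the salt on disk next to the token so the same key can be derived again later (the file names here are only illustrative):

import base64
import os
from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC

password = b"password"

# Encrypt: derive a key from the password and store the salt alongside the token.
salt = os.urandom(16)
kdf = PBKDF2HMAC(algorithm=hashes.SHA256(), length=32, salt=salt, iterations=480000)
key = base64.urlsafe_b64encode(kdf.derive(password))
token = Fernet(key).encrypt(b"Secret message!")
with open('salt.bin', 'wb') as f:   # illustrative file name
    f.write(salt)
with open('token.bin', 'wb') as f:  # illustrative file name
    f.write(token)

# Decrypt later: re-derive the same key from the stored salt and the password.
with open('salt.bin', 'rb') as f:
    salt = f.read()
kdf = PBKDF2HMAC(algorithm=hashes.SHA256(), length=32, salt=salt, iterations=480000)
key = base64.urlsafe_b64encode(kdf.derive(password))
with open('token.bin', 'rb') as f:
    token = f.read()
print(Fernet(key).decrypt(token))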

Implementation

Fernet is built on top of a number of standard cryptographic primitives. Specifically it uses:

  • AES in CBC mode with a 128-bit key for encryption; using PKCS7 padding.
  • HMAC using SHA256 for authentication.
  • Initialization vectors are generated using os.urandom().

The CBC mode requires an IV, and this IV is generated by os.urandom(). (Python 3 Fernet encrypting same message different ways – Stack Overflow)

Python ECDSA and ECDH

This library provides Elliptic Curve Diffie-Hellman (ECDH) for key agreement and the Elliptic Curve Digital Signature Algorithm (ECDSA) for signing and verifying.

This is an easy-to-use implementation of ECC (Elliptic Curve Cryptography) with support for ECDSA (Elliptic Curve Digital Signature Algorithm), EdDSA (Edwards-curve Digital Signature Algorithm) and ECDH (Elliptic Curve Diffie-Hellman), implemented purely in Python, released under the MIT license. With this library, you can quickly create key pairs (signing key and verifying key), sign messages, and verify the signatures. You can also agree on a shared secret key based on exchanged public keys. The keys and signatures are very short, making them easy to handle and incorporate into other protocols.

NOTE: This library should not be used in production settings, see Security for more details.

Installation

pip install ecdsa

Usage

You start by creating a SigningKey. You can use this to sign data, by passing in data as a byte string and getting back the signature (also a byte string). You can also ask a SigningKey to give you the corresponding VerifyingKey. The VerifyingKey can be used to verify a signature, by passing it both the data string and the signature byte string: it either returns True or raises BadSignatureError.

from ecdsa import SigningKey
sk = SigningKey.generate() # uses NIST192p
vk = sk.verifying_key
signature = sk.sign(b"message")
assert vk.verify(signature, b"message")

from ecdsa import SigningKey
sk = SigningKey.generate()  # uses NIST192p
vk = sk.verifying_key
bytes_text = bytes("The first sample string|The second sample string", 'utf-8')
signature = sk.sign(bytes_text)
assert vk.verify(signature, bytes_text)

Each SigningKey/VerifyingKey is associated with a specific curve, like NIST192p (the default one). Longer curves are more secure, but take longer to use, and result in longer keys and signatures.

from ecdsa import SigningKey, NIST384p
sk = SigningKey.generate(curve=NIST384p)
vk = sk.verifying_key
signature = sk.sign(b"message")
assert vk.verify(signature, b"message")

The SigningKey can be serialized into several different formats: the shortest is to call s=sk.to_string(), and then re-create it with SigningKey.from_string(s, curve). This short form does not record the curve, so you must be sure to pass to from_string() the same curve you used for the original key. The short form of a NIST192p-based signing key is just 24 bytes long. If a point encoding is invalid or it does not lie on the specified curve, from_string() will raise MalformedPointError.

from ecdsa import SigningKey, NIST384p
sk = SigningKey.generate(curve=NIST384p)
sk_string = sk.to_string()
sk2 = SigningKey.from_string(sk_string, curve=NIST384p)
print(sk_string.hex())
print(sk2.to_string().hex())
from ecdsa import SigningKey, NIST256p
sk = SigningKey.generate(curve=NIST256p)
vk = sk.verifying_key
signature = sk.sign(b"message")
assert vk.verify(signature, b"message")
print("OK")

sk_string = sk.to_string()
sk_hex = sk_string.hex()
print(sk_hex)

sig_hex = signature.hex()
print(sig_hex)

Note: while the methods are called to_string() the type they return is actually bytes, the “string” part is leftover from Python 2.

sk.to_pem() and sk.to_der() will serialize the signing key into the same formats that OpenSSL uses. The PEM file looks like the familiar ASCII-armored "-----BEGIN EC PRIVATE KEY-----" base64-encoded format, and the DER format is a shorter binary form of the same data. SigningKey.from_pem()/.from_der() will undo this serialization. These formats include the curve name, so you do not need to pass in a curve identifier to the deserializer. In case the file is malformed from_der() and from_pem() will raise UnexpectedDER or MalformedPointError.

from ecdsa import SigningKey, NIST384p
sk = SigningKey.generate(curve=NIST384p)
sk_pem = sk.to_pem()
sk2 = SigningKey.from_pem(sk_pem)
# sk and sk2 are the same key

Likewise, the VerifyingKey can be serialized in the same way: vk.to_string()/VerifyingKey.from_string(), to_pem()/from_pem(), and to_der()/from_der(). The same curve= argument is needed for VerifyingKey.from_string().

from ecdsa import SigningKey, VerifyingKey, NIST384p
sk = SigningKey.generate(curve=NIST384p)
vk = sk.verifying_key
vk_string = vk.to_string()
vk2 = VerifyingKey.from_string(vk_string, curve=NIST384p)
# vk and vk2 are the same key

from ecdsa import SigningKey, VerifyingKey, NIST384p
sk = SigningKey.generate(curve=NIST384p)
vk = sk.verifying_key
vk_pem = vk.to_pem()
vk2 = VerifyingKey.from_pem(vk_pem)
# vk and vk2 are the same key

There are a couple of different ways to compute a signature. Fundamentally, ECDSA takes a number that represents the data being signed, and returns a pair of numbers that represent the signature. The hashfunc= argument to sk.sign() and vk.verify() is used to turn an arbitrary string into a fixed-length digest, which is then turned into a number that ECDSA can sign, and both sign and verify must use the same approach. The default value is hashlib.sha1, but if you use NIST256p or a longer curve, you can use hashlib.sha256 instead.
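
For example, a sketch pairing NIST256p with SHA-256 (both the signer and the verifier must pass the same hashfunc):

import hashlib
from ecdsa import SigningKey, NIST256p

sk = SigningKey.generate(curve=NIST256p)
vk = sk.verifying_key
# Sign and verify with SHA-256 instead of the default SHA-1.
signature = sk.sign(b"message", hashfunc=hashlib.sha256)
assert vk.verify(signature, b"message", hashfunc=hashlib.sha256)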

Databricks SQL – Ingest Data

Ingest Data Using Databricks SQL

We can ingest data into Databricks SQL in a variety of ways. In this first example, we are going to start with a .csv file that exists in an object store and finish with a Delta table that contains all the data from that .csv file.

There are several things to note in these commands:

  • Since the file we are ingesting is a .csv file, we state USING csv
  • We are setting three options:
    • path – the path to the object store
    • header – whether or not the .csv file contains a header row
    • inferSchema – whether or not Databricks should infer the schema from the contents of the file
  • We are creating a second table, with default settings, from the contents of the first table, using a CTAS statement (CREATE TABLE AS SELECT).

The reason we are creating a second table that is a copy of the first table is because we want the resulting table to be in Delta format, which gives us the most options. Because the second table uses default options, it will be a Delta table.

USE demo;
DROP TABLE IF EXISTS web_events_csv;
CREATE TABLE web_events_csv 
    USING csv 
    OPTIONS (
        path='wasb://courseware@dbacademy.blob.core.windows.net/data-analysis-with-databricks/v01/web_events/web-events.csv',
        header="true",
        inferSchema="true"
    );
DROP TABLE IF EXISTS web_events;
CREATE OR REPLACE TABLE web_events AS
    SELECT * FROM web_events_csv;

The LOCATION Keyword

We can ingest data in-place using the LOCATION keyword to create an external/unmanaged table. There is a dataset, called Sales, that is currently in an object store. Running the command below creates an external table that is associated with this dataset.

The table’s data is stored in the external location, but the table itself is registered in the metastore. We can query the data just like any other table in the schema.

USE demo;
DROP TABLE IF EXISTS external_table;
CREATE TABLE external_table
    LOCATION 'wasbs://courseware@dbacademy.blob.core.windows.net/data-analysis-with-databricks/v01/sales';
SELECT * FROM external_table;

COPY INTO

We use COPY INTO to load data from a file location into a Delta table. COPY INTO is a re-triable and idempotent operation, so files in the source location that have already been loaded are skipped.

The first command creates an empty Delta table. Note that you must specify a schema when creating an empty Delta table.

The second command copies data from an object store location into the gym_logs table. Note that the file type for the files in the object store location is specified as “JSON”. The last part of the COPY INTO command is a file name, a list of file names, or a directory of files.

USE demo;
DROP TABLE IF EXISTS gym_logs;
CREATE TABLE gym_logs (first_timestamp DOUBLE, gym Long, last_timestamp DOUBLE, mac STRING);
COPY INTO gym_logs 
    FROM 'wasbs://courseware@dbacademy.blob.core.windows.net/data-analysis-with-databricks/v01/gym-logs'
    FILEFORMAT = JSON
    FILES = ('20191201_2.json');

We can run the same COPY INTO command as above and add a second one. The first one will be skipped because the file has already been loaded.

USE demo;
COPY INTO gym_logs 
    FROM 'wasbs://courseware@dbacademy.blob.core.windows.net/data-analysis-with-databricks/v01/gym-logs'
    FILEFORMAT = JSON
    FILES = ('20191201_2.json');
COPY INTO gym_logs
    FROM 'wasbs://courseware@dbacademy.blob.core.windows.net/data-analysis-with-databricks/v01/gym-logs'
    FILEFORMAT = JSON
    FILES = ('20191201_3.json');

Do a count on the table you just populated using COPY INTO. Then, rerun the COPY INTO code above, and do another count. Notice that COPY INTO does not re-load data from files that have already been ingested using COPY INTO.
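
A quick count looks like this:

USE demo;
SELECT count(*) FROM gym_logs;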

Next, let’s add another COPY INTO command, but this time, we will use the PATTERN option. This allows us to load any file that fits a specific pattern.

USE demo;
COPY INTO gym_logs 
    FROM 'wasbs://courseware@dbacademy.blob.core.windows.net/data-analysis-with-databricks/v01/gym-logs'
    FILEFORMAT = JSON
    FILES = ('20191201_2.json');
COPY INTO gym_logs
    FROM 'wasbs://courseware@dbacademy.blob.core.windows.net/data-analysis-with-databricks/v01/gym-logs'
    FILEFORMAT = JSON
    FILES = ('20191201_3.json');
COPY INTO gym_logs 
    FROM 'wasbs://courseware@dbacademy.blob.core.windows.net/data-analysis-with-databricks/v01/gym-logs'
    FILEFORMAT = JSON
    PATTERN = '20191201_[0-9].json';

Privileges

Data object owners and Databricks administrators can grant and revoke a variety of privileges on securable objects. These objects include functions, files, tables, views, and more. These privileges can be granted using SQL or using the Data Explorer.

This statement returns the privileges granted on this schema.

SHOW GRANT ON SCHEMA demo;

GRANT

If we wish to grant privileges to a user, we can run GRANT commands for the specific privileges we wish to grant. The privileges available include USAGE, SELECT, CREATE, READ FILES, and more.

We need to start by granting USAGE to a user. We can then grant other privileges. If we want to grant all privileges, we can use GRANT ALL PRIVILEGES.

GRANT USAGE ON SCHEMA demo TO `phaisarn`;
GRANT SELECT ON SCHEMA demo TO `phaisarn`;

Examples

> GRANT CREATE ON SCHEMA <schema-name> TO `alf@melmak.et`;

> GRANT ALL PRIVILEGES ON TABLE forecasts TO finance;

> GRANT SELECT ON TABLE sample_data TO USERS;

REVOKE

We can revoke privileges on securable objects in the exact same way.

We don’t want to actually run the command because we don’t want to try to revoke our own privileges, but here is the command:

REVOKE ALL PRIVILEGES ON SCHEMA `schema_name` FROM `user_name`;

Examples

> REVOKE ALL PRIVILEGES ON SCHEMA default FROM `alf@melmak.et`;

> REVOKE SELECT ON TABLE t FROM aliens;

Data Explorer

While we can certainly grant and revoke privileges using SQL, we can also use the Data Explorer.

  1. Click “Data” in the sidebar menu.
  2. If needed, select your schema in the dropdown
  3. Select the table, “gym_logs” from the list
  4. Click “Permissions”
  5. Use the “Grant” and “Revoke” buttons to change permission settings.

Databricks SQL – CTEs

Common Table Expressions (CTEs)

Here is an example of a CTE. We are going to query the high_sales view, but we are going to filter based on the total_price being below 20000. We are calling that CTE sales_below_20000 and then immediately querying the DISTINCT customer_name from it.

USE demo;
WITH sales_below_20000 AS
    (SELECT *
     FROM high_sales
     WHERE total_price < 20000)
SELECT DISTINCT customer_name FROM sales_below_20000;

Common table expression (CTE) (Databricks SQL)

Defines a temporary result set that you can reference possibly multiple times within the scope of a SQL statement. A CTE is used mainly in a SELECT statement.

WITH common_table_expression [, ...]

common_table_expression
  view_identifier [ ( column_identifier [, ...] ) ] [ AS ] ( query )

Examples

-- CTE with multiple column aliases
> WITH t(x, y) AS (SELECT 1, 2)
  SELECT * FROM t WHERE x = 1 AND y = 2;
   1   2

-- CTE in CTE definition
> WITH t AS (
    WITH t2 AS (SELECT 1)
    SELECT * FROM t2)
  SELECT * FROM t;
   1

-- CTE in subquery
> SELECT max(c) FROM (
    WITH t(c) AS (SELECT 1)
    SELECT * FROM t);
      1

-- CTE in subquery expression
> SELECT (WITH t AS (SELECT 1)
          SELECT * FROM t);
                1

-- CTE in CREATE VIEW statement
> CREATE VIEW v AS
    WITH t(a, b, c, d) AS (SELECT 1, 2, 3, 4)
    SELECT * FROM t;
> SELECT * FROM v;
   1   2   3   4

-- CTE names are scoped
> WITH t  AS (SELECT 1),
       t2 AS (
        WITH t AS (SELECT 2)
        SELECT * FROM t)
SELECT * FROM t2;
   2
WITH column_date AS (
  SELECT date_format(TIMESTAMPADD(HOUR, +7, current_timestamp()),'yyyyMMdd') AS T0
    ,    date_format(TIMESTAMPADD(HOUR, +7-24, current_timestamp()),'yyyyMMdd') AS T1
    ,    date_format(TIMESTAMPADD(HOUR, +7-48, current_timestamp()),'yyyyMMdd') AS T2
    ,    date_format(TIMESTAMPADD(HOUR, +7-72, current_timestamp()),'yyyyMMdd') AS T3
)
SELECT *
FROM   column_date;

---------------------------------------------
|   T0     |    T1    |   T2     |   T3     |
---------------------------------------------
| 20230310 | 20230309 | 20230308 | 20230307 |

Databricks SQL – Views

USE demo;
CREATE OR REPLACE VIEW high_sales AS
    SELECT * FROM delta.`wasbs://courseware@dbacademy.blob.core.windows.net/data-analysis-with-databricks/v01/sales` 
        WHERE total_price > 10000;
SELECT * FROM high_sales;

CREATE VIEW (Databricks SQL)

CREATE [ OR REPLACE ] [ TEMPORARY ] VIEW [ IF NOT EXISTS ] view_name
    [ column_list ]
    [ COMMENT view_comment ]
    [ TBLPROPERTIES clause ]
    AS query

column_list
   ( { column_alias [ COMMENT column_comment ] } [, ...] )

Parameters

  • OR REPLACE – If a view of the same name already exists, it is replaced. To replace an existing view you must be its owner.
  • TEMPORARY – TEMPORARY views are visible only to the session that created them and are dropped when the session ends.
  • IF NOT EXISTS – Creates the view only if it does not exist. If a view by this name already exists, the CREATE VIEW statement is ignored. You may specify at most one of IF NOT EXISTS or OR REPLACE.
  • view_name – The name of the newly created view. A temporary view’s name must not be qualified. The fully qualified view name must be unique.
  • column_list – Optionally labels the columns in the query result of the view. If you provide a column list the number of column aliases must match the number of expressions in the query. In case no column list is specified aliases are derived from the body of the view.
    • column_alias – The column aliases must be unique.
    • column_comment – An optional STRING literal describing the column alias.
  • view_comment – An optional STRING literal providing a view-level comment.
  • TBLPROPERTIES – Optionally sets one or more user defined properties.
  • AS query – A query that constructs the view from base tables or other views.

Examples

-- Create or replace view for `experienced_employee` with comments.
> CREATE OR REPLACE VIEW experienced_employee
    (id COMMENT 'Unique identification number', Name)
    COMMENT 'View for experienced employees'
    AS SELECT id, name
         FROM all_employee
        WHERE working_years > 5;

-- Create a temporary view `subscribed_movies`.
> CREATE TEMPORARY VIEW subscribed_movies
    AS SELECT mo.member_id, mb.full_name, mo.movie_title
         FROM movies AS mo
         INNER JOIN members AS mb
            ON mo.member_id = mb.id;

Databricks SQL – Tables

Note that I have “USING DELTA” at the end of the CREATE statement. This is optional because Delta is the default table type.

USE demo;
CREATE OR REPLACE TABLE managed_table (width INT, length INT, height INT) USING DELTA;
INSERT INTO managed_table VALUES (3, 2, 1);
SELECT * FROM managed_table;
USE demo;
DESCRIBE EXTENDED managed_table;
DESCRIBE TABLE EXTENDED managed_table;
USE demo;
SHOW CREATE TABLE managed_table;
USE demo;
DESCRIBE HISTORY managed_table;
USE demo;
SELECT *
FROM managed_table
VERSION AS OF 0;
RESTORE TABLE employee TO TIMESTAMP AS OF '2022-08-02 00:00:00';
RESTORE TABLE employee TO VERSION AS OF 1;
USE demo;
DROP TABLE IF EXISTS managed_table;

CREATE TABLE [USING] (Databricks SQL)

Defines a managed or external table, optionally using a data source.

{ { [CREATE OR] REPLACE TABLE | CREATE TABLE [ IF NOT EXISTS ] }
  table_name
  [ column_specification ] [ USING data_source ]
  [ table_clauses ]
  [ AS query ] }

column_specification
  ( { column_identifier column_type [ NOT NULL ]
      [ GENERATED ALWAYS AS ( expr ) |
        GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY [ ( [ START WITH start ] [ INCREMENT BY step ] ) ] ]
      [ COMMENT column_comment ] } [, ...] )

table_clauses
  { OPTIONS clause |
    PARTITIONED BY clause |
    clustered_by_clause |
    LOCATION path [ WITH ( CREDENTIAL credential_name ) ] |
    COMMENT table_comment |
    TBLPROPERTIES clause } [...]

clustered_by_clause
  { CLUSTERED BY ( cluster_column [, ...] )
    [ SORTED BY ( { sort_column [ ASC | DESC ] } [, ...] ) ]
    INTO num_buckets BUCKETS }

Examples

-- Creates a Delta table
> CREATE TABLE student (id INT, name STRING, age INT);

-- Use data from another table
> CREATE TABLE student_copy AS SELECT * FROM student;

-- Creates a CSV table from an external directory
> CREATE TABLE student USING CSV LOCATION '/mnt/csv_files';

-- Specify table comment and properties
> CREATE TABLE student (id INT, name STRING, age INT)
    COMMENT 'this is a comment'
    TBLPROPERTIES ('foo'='bar');

-- Specify table comment and properties with different clauses order
> CREATE TABLE student (id INT, name STRING, age INT)
    TBLPROPERTIES ('foo'='bar')
    COMMENT 'this is a comment';

-- Create partitioned table
> CREATE TABLE student (id INT, name STRING, age INT)
    PARTITIONED BY (age);

-- Create a table with a generated column
> CREATE TABLE rectangles(a INT, b INT,
                          area INT GENERATED ALWAYS AS (a * b));

DESCRIBE TABLE (Databricks SQL)

Returns the basic metadata information of a table. The metadata information includes column name, column type and column comment. Optionally you can specify a partition spec or column name to return the metadata pertaining to a partition or column respectively.

{ DESC | DESCRIBE } [ TABLE ] [ EXTENDED | FORMATTED ] table_name { [ PARTITION clause ] | [ column_name ] }

SHOW CREATE TABLE (Databricks SQL)

Returns the CREATE TABLE statement or CREATE VIEW statement that was used to create a given table or view.

SHOW CREATE TABLE { table_name | view_name }

Examples

> CREATE TABLE test (c INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    TBLPROPERTIES ('prop1' = 'value1', 'prop2' = 'value2');

> SHOW CREATE TABLE test;
                                       createtab_stmt
 ----------------------------------------------------
 CREATE TABLE `default`.`test` (`c` INT)
 USING text
 TBLPROPERTIES (
   'transient_lastDdlTime' = '1586269021',
   'prop1' = 'value1',
   'prop2' = 'value2')

DESCRIBE HISTORY (Databricks SQL)

You can retrieve information on the operations, user, timestamp, and so on for each write to a Delta table by running the history command. The operations are returned in reverse chronological order. By default table history is retained for 30 days.

View the versions and timestamps of a Delta table.

DESCRIBE HISTORY table_name

Examples

DESCRIBE HISTORY '/data/events/'          -- get the full history of the table

DESCRIBE HISTORY delta.`/data/events/`

DESCRIBE HISTORY '/data/events/' LIMIT 1  -- get the last operation only

DESCRIBE HISTORY eventsTable

from delta.tables import *

deltaTable = DeltaTable.forPath(spark, pathToTable)

fullHistoryDF = deltaTable.history()    # get the full history of the table

lastOperationDF = deltaTable.history(1) # get the last operation

RESTORE (Delta Lake on Databricks)

Restores a Delta table to an earlier state. Restoring to an earlier version number or a timestamp is supported.

RESTORE [ TABLE ] table_name [ TO ] time_travel_version

time_travel_version
 { TIMESTAMP AS OF timestamp_expression |
   VERSION AS OF version }
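
For example, to roll the managed_table from earlier back to its first version (assuming that version is still present in the table history):

RESTORE TABLE managed_table TO VERSION AS OF 0;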

DROP TABLE (Databricks SQL)

DROP TABLE [ IF EXISTS ] table_name

Examples

-- Assumes a table named `employeetable` exists.
> DROP TABLE employeetable;

-- Assumes a table named `employeetable` exists in the `userdb` schema
> DROP TABLE userdb.employeetable;

-- Assumes a table named `employeetable` does not exist.
-- Throws exception
> DROP TABLE employeetable;
  Error: Table or view not found: employeetable;

-- Assumes a table named `employeetable` does not exist,Try with IF EXISTS
-- this time it will not throw exception
> DROP TABLE IF EXISTS employeetable;

Dropping External Tables Does Not Delete Data

Even though we dropped the table, we are still able to query the data directly in the filesystem because it still exists in the object store.

Now, this is one of the coolest features of Databricks SQL. We’ve talked about how the use of schemas and tables is just an organizational construct. The data files located in this location can be queried directly, even though they are not part of a table or schema. We use tables and schemas simply to organize data in a way familiar to you.

SELECT * FROM delta.`wasbs://courseware@dbacademy.blob.core.windows.net/data-analysis-with-databricks/v01/sales`;