SQL warehouses

What is a Databricks SQL warehouse?

This article introduces SQL warehouses (formerly SQL endpoints) and describes how to work with them using the Databricks SQL UI. A SQL warehouse is a compute resource that lets you run SQL commands on data objects within Databricks SQL. Compute resources are infrastructure resources that provide processing capabilities in the cloud.

SQL endpoints name changed to SQL warehouses

Databricks changed the name from SQL endpoint to SQL warehouse because it is more than just an API entry point for running SQL commands. A SQL warehouse is a compute resource for all your data warehousing needs, an integral part of the Lakehouse Platform.

Other compute resource types include Azure Databricks clusters. To work with SQL warehouses using the API, see SQL Warehouses APIs 2.0.

Cluster size

The table in this section maps SQL warehouse cluster sizes to Azure Databricks cluster driver size and worker counts. The driver size only applies to Classic SQL warehouses.

| Cluster size | Instance type for driver (applies only to Classic warehouses) | Worker count |
|--------------|----------------------------------------------------------------|--------------|
| 2X-Small     | Standard_E8ds_v4                                               | 1            |
| X-Small      | Standard_E8ds_v4                                               | 2            |
| Small        | Standard_E16ds_v4                                              | 4            |
| Medium       | Standard_E32ds_v4                                              | 8            |
| Large        | Standard_E32ds_v4                                              | 16           |
| X-Large      | Standard_E64ds_v4                                              | 32           |
| 2X-Large     | Standard_E64ds_v4                                              | 64           |
| 3X-Large     | Standard_E64ds_v4                                              | 128          |
| 4X-Large     | Standard_E64ds_v4                                              | 256          |

The instance size of all workers is Standard_E8ds_v4.

Each driver and worker has eight 128 GB Standard LRS managed disks attached. Attached disks are charged hourly.

Databricks SQL – Ingest Data

Ingest Data Using Databricks SQL

We can ingest data into Databricks SQL in a variety of ways. In this first example, we are going to start with a .csv file that exists in an object store and finish with a Delta table that contains all the data from that .csv file.

There are several things to note in these commands:

  • Since the file we are ingesting is a .csv file, we state USING csv
  • We are setting three options:
    • path – the path to the object store
    • header – whether or not the .csv file contains a header row
    • inferSchema – whether or not Databricks should infer the schema from the contents of the file
  • We are creating a second table, with default settings, from the contents of the first table using a CTAS statement (CREATE TABLE AS SELECT).

We create a second table as a copy of the first because we want the resulting table to be in Delta format, which gives us the most options. Because the second table uses default options, it is created as a Delta table.

USE demo;
DROP TABLE IF EXISTS web_events_csv;
CREATE TABLE web_events_csv 
    USING csv 
    OPTIONS (
        path='wasbs://courseware@dbacademy.blob.core.windows.net/data-analysis-with-databricks/v01/web_events/web-events.csv',
        header="true",
        inferSchema="true"
    );
DROP TABLE IF EXISTS web_events;
CREATE OR REPLACE TABLE web_events AS
    SELECT * FROM web_events_csv;

The LOCATION Keyword

We can ingest data in-place using the LOCATION keyword to create an external/unmanaged table. There is a dataset, called Sales, that is currently in an object store. Running the command below creates an external table that is associated with this dataset.

The table’s data is stored in the external location, but the table itself is registered in the metastore. We can query the data just like any other table in the schema.

USE demo;
DROP TABLE IF EXISTS external_table;
CREATE TABLE external_table
    LOCATION 'wasbs://courseware@dbacademy.blob.core.windows.net/data-analysis-with-databricks/v01/sales';
SELECT * FROM external_table;

COPY INTO

We use COPY INTO to load data from a file location into a Delta table. COPY INTO is a retriable and idempotent operation, so files in the source location that have already been loaded are skipped.

The first command creates an empty Delta table. Note that you must specify a schema when creating an empty Delta table.

The second command copies data from an object store location into the gym_logs table. Note that the file format for the files in the object store location is specified as JSON. The last part of the COPY INTO command is a file name, a list of file names, or a directory of files.

USE demo;
DROP TABLE IF EXISTS gym_logs;
CREATE TABLE gym_logs (first_timestamp DOUBLE, gym LONG, last_timestamp DOUBLE, mac STRING);
COPY INTO gym_logs 
    FROM 'wasbs://courseware@dbacademy.blob.core.windows.net/data-analysis-with-databricks/v01/gym-logs'
    FILEFORMAT = JSON
    FILES = ('20191201_2.json');

We can run the same COPY INTO command as above and add a second one. The first one will be skipped because the file has already been loaded.

USE demo;
COPY INTO gym_logs 
    FROM 'wasbs://courseware@dbacademy.blob.core.windows.net/data-analysis-with-databricks/v01/gym-logs'
    FILEFORMAT = JSON
    FILES = ('20191201_2.json');
COPY INTO gym_logs
    FROM 'wasbs://courseware@dbacademy.blob.core.windows.net/data-analysis-with-databricks/v01/gym-logs'
    FILEFORMAT = JSON
    FILES = ('20191201_3.json');

Do a count on the table you just populated using COPY INTO. Then rerun the COPY INTO code above and do another count. Notice that the count does not change: COPY INTO does not ingest data from files that have already been loaded with COPY INTO.
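A quick way to see this, assuming the demo schema and the gym_logs table created above:

USE demo;
-- Count the rows loaded so far
SELECT COUNT(*) AS row_count FROM gym_logs;
-- Rerun the earlier load; the already-ingested file is skipped
COPY INTO gym_logs
    FROM 'wasbs://courseware@dbacademy.blob.core.windows.net/data-analysis-with-databricks/v01/gym-logs'
    FILEFORMAT = JSON
    FILES = ('20191201_2.json');
-- The count is unchanged
SELECT COUNT(*) AS row_count FROM gym_logs;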

Next, let’s add another COPY INTO command, but this time, we will use the PATTERN option. This allows us to load any file that fits a specific pattern.

USE demo;
COPY INTO gym_logs 
    FROM 'wasbs://courseware@dbacademy.blob.core.windows.net/data-analysis-with-databricks/v01/gym-logs'
    FILEFORMAT = JSON
    FILES = ('20191201_2.json');
COPY INTO gym_logs
    FROM 'wasbs://courseware@dbacademy.blob.core.windows.net/data-analysis-with-databricks/v01/gym-logs'
    FILEFORMAT = JSON
    FILES = ('20191201_3.json');
COPY INTO gym_logs 
    FROM 'wasbs://courseware@dbacademy.blob.core.windows.net/data-analysis-with-databricks/v01/gym-logs'
    FILEFORMAT = JSON
    PATTERN = '20191201_[0-9].json';

Privileges

Data object owners and Databricks administrators can grant and revoke a variety of privileges on securable objects. These objects include functions, files, tables, views, and more. These privileges can be granted using SQL or using the Data Explorer.

This statement returns the privileges granted on this schema.

SHOW GRANT ON SCHEMA demo;

GRANT

If we wish to grant privileges to a user, we can run GRANT commands for the specific privileges we wish to grant. The privileges available include USAGE, SELECT, CREATE, READ FILES, and more.

We need to start by granting USAGE to a user. We can then grant other privileges. If we want to grant all privileges, we can use GRANT ALL PRIVILEGES.

GRANT USAGE ON SCHEMA demo TO `phaisarn`;
GRANT SELECT ON SCHEMA demo TO `phaisarn`;

Examples

> GRANT CREATE ON SCHEMA <schema-name> TO `alf@melmak.et`;

> GRANT ALL PRIVILEGES ON TABLE forecasts TO finance;

> GRANT SELECT ON TABLE sample_data TO USERS;

REVOKE

We can revoke privileges on securable objects in the exact same way.

We don’t want to actually run the command because we don’t want to try to revoke our own privileges, but here is the command:

REVOKE ALL PRIVILEGES ON SCHEMA `schema_name` FROM `user_name`;

Examples

> REVOKE ALL PRIVILEGES ON SCHEMA default FROM `alf@melmak.et`;

> REVOKE SELECT ON TABLE t FROM aliens;

Data Explorer

While we can certainly grant and revoke privileges using SQL, we can also use the Data Explorer.

  1. Click “Data” in the sidebar menu.
  2. If needed, select your schema in the dropdown.
  3. Select the table “gym_logs” from the list.
  4. Click “Permissions”.
  5. Use the “Grant” and “Revoke” buttons to change permission settings.
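For reference, the same permission changes can be sketched in SQL, using the gym_logs table and the same user as in the GRANT examples above:

USE demo;
-- Grant read access on the gym_logs table
GRANT SELECT ON TABLE gym_logs TO `phaisarn`;
-- Revoke that access again
REVOKE SELECT ON TABLE gym_logs FROM `phaisarn`;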

Databricks SQL – CTEs

Common Table Expressions (CTEs)

Here is an example of a CTE. We are going to query the high_sales view we created, but we are going to filter for total_price below 20000. We name that CTE sales_below_20000 and then immediately query the DISTINCT customer_name from it.

USE demo;
WITH sales_below_20000 AS
    (SELECT *
     FROM high_sales
     WHERE total_price < 20000)
SELECT DISTINCT customer_name FROM sales_below_20000;

Common table expression (CTE) (Databricks SQL)

Defines a temporary result set that you can reference possibly multiple times within the scope of a SQL statement. A CTE is used mainly in a SELECT statement.

WITH common_table_expression [, ...]

common_table_expression
  view_identifier [ ( column_identifier [, ...] ) ] [ AS ] ( query )

Examples

-- CTE with multiple column aliases
> WITH t(x, y) AS (SELECT 1, 2)
  SELECT * FROM t WHERE x = 1 AND y = 2;
   1   2

-- CTE in CTE definition
> WITH t AS (
    WITH t2 AS (SELECT 1)
    SELECT * FROM t2)
  SELECT * FROM t;
   1

-- CTE in subquery
> SELECT max(c) FROM (
    WITH t(c) AS (SELECT 1)
    SELECT * FROM t);
      1

-- CTE in subquery expression
> SELECT (WITH t AS (SELECT 1)
          SELECT * FROM t);
                1

-- CTE in CREATE VIEW statement
> CREATE VIEW v AS
    WITH t(a, b, c, d) AS (SELECT 1, 2, 3, 4)
    SELECT * FROM t;
> SELECT * FROM v;
   1   2   3   4

-- CTE names are scoped
> WITH t  AS (SELECT 1),
       t2 AS (
        WITH t AS (SELECT 2)
        SELECT * FROM t)
SELECT * FROM t2;
   2
-- Today and the previous three days as yyyyMMdd strings, shifting current_timestamp() by +7 hours (UTC+7)
WITH column_date AS (
  SELECT date_format(TIMESTAMPADD(HOUR, +7, current_timestamp()),'yyyyMMdd') AS T0
    ,    date_format(TIMESTAMPADD(HOUR, +7-24, current_timestamp()),'yyyyMMdd') AS T1
    ,    date_format(TIMESTAMPADD(HOUR, +7-48, current_timestamp()),'yyyyMMdd') AS T2
    ,    date_format(TIMESTAMPADD(HOUR, +7-72, current_timestamp()),'yyyyMMdd') AS T3
)
SELECT *
FROM   column_date;

---------------------------------------------
|   T0     |    T1    |   T2     |   T3     |
---------------------------------------------
| 20230310 | 20230309 | 20230308 | 20230307 |

Databricks SQL – Views

USE demo;
CREATE OR REPLACE VIEW high_sales AS
    SELECT * FROM delta.`wasbs://courseware@dbacademy.blob.core.windows.net/data-analysis-with-databricks/v01/sales` 
        WHERE total_price > 10000;
SELECT * FROM high_sales;

CREATE VIEW (Databricks SQL)

CREATE [ OR REPLACE ] [ TEMPORARY ] VIEW [ IF NOT EXISTS ] view_name
    [ column_list ]
    [ COMMENT view_comment ]
    [ TBLPROPERTIES clause ]
    AS query

column_list
   ( { column_alias [ COMMENT column_comment ] } [, ...] )

Parameters

  • OR REPLACE – If a view of the same name already exists, it is replaced. To replace an existing view you must be its owner.
  • TEMPORARY – TEMPORARY views are visible only to the session that created them and are dropped when the session ends.
  • IF NOT EXISTS – Creates the view only if it does not exist. If a view by this name already exists the CREATE VIEW statement is ignored. You may specify at most one of IF NOT EXISTS or OR REPLACE.
  • view_name – The name of the newly created view. A temporary view’s name must not be qualified. The fully qualified view name must be unique.
  • column_list – Optionally labels the columns in the query result of the view. If you provide a column list, the number of column aliases must match the number of expressions in the query. If no column list is specified, aliases are derived from the body of the view.
    • column_alias – The column aliases must be unique.
    • column_comment – An optional STRING literal describing the column alias.
  • view_comment – An optional STRING literal providing a view-level comment.
  • TBLPROPERTIES – Optionally sets one or more user defined properties.
  • AS query – A query that constructs the view from base tables or other views.

Examples

-- Create or replace view for `experienced_employee` with comments.
> CREATE OR REPLACE VIEW experienced_employee
    (id COMMENT 'Unique identification number', Name)
    COMMENT 'View for experienced employees'
    AS SELECT id, name
         FROM all_employee
        WHERE working_years > 5;

-- Create a temporary view `subscribed_movies`.
> CREATE TEMPORARY VIEW subscribed_movies
    AS SELECT mo.member_id, mb.full_name, mo.movie_title
         FROM movies AS mo
         INNER JOIN members AS mb
            ON mo.member_id = mb.id;

Databricks SQL – Tables

Note that I have “USING DELTA” at the end of the CREATE statement. This is optional because Delta is the default table format.

USE demo;

-- Create a managed Delta table and insert a row
CREATE OR REPLACE TABLE managed_table (width INT, length INT, height INT) USING DELTA;
INSERT INTO managed_table VALUES (3, 2, 1);
SELECT * FROM managed_table;

-- Inspect the table's metadata
DESCRIBE EXTENDED managed_table;
DESCRIBE TABLE EXTENDED managed_table;
SHOW CREATE TABLE managed_table;

-- View the table's Delta history
DESCRIBE HISTORY managed_table;

-- Time travel back to the first version of the table
SELECT *
FROM managed_table
VERSION AS OF 0;

-- RESTORE syntax examples (the employee table is not created in this lesson)
RESTORE TABLE employee TO TIMESTAMP AS OF '2022-08-02 00:00:00';
RESTORE TABLE employee TO VERSION AS OF 1;

-- Clean up
DROP TABLE IF EXISTS managed_table;

CREATE TABLE [USING] (Databricks SQL)

Defines a managed or external table, optionally using a data source.

{ { [CREATE OR] REPLACE TABLE | CREATE TABLE [ IF NOT EXISTS ] }
  table_name
  [ column_specification ] [ USING data_source ]
  [ table_clauses ]
  [ AS query ] }

column_specification
  ( { column_identifier column_type [ NOT NULL ]
      [ GENERATED ALWAYS AS ( expr ) |
        GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY [ ( [ START WITH start ] [ INCREMENT BY step ] ) ] ]
      [ COMMENT column_comment ] } [, ...] )

table_clauses
  { OPTIONS clause |
    PARTITIONED BY clause |
    clustered_by_clause |
    LOCATION path [ WITH ( CREDENTIAL credential_name ) ] |
    COMMENT table_comment |
    TBLPROPERTIES clause } [...]

clustered_by_clause
  { CLUSTERED BY ( cluster_column [, ...] )
    [ SORTED BY ( { sort_column [ ASC | DESC ] } [, ...] ) ]
    INTO num_buckets BUCKETS }

Examples

-- Creates a Delta table
> CREATE TABLE student (id INT, name STRING, age INT);

-- Use data from another table
> CREATE TABLE student_copy AS SELECT * FROM student;

-- Creates a CSV table from an external directory
> CREATE TABLE student USING CSV LOCATION '/mnt/csv_files';

-- Specify table comment and properties
> CREATE TABLE student (id INT, name STRING, age INT)
    COMMENT 'this is a comment'
    TBLPROPERTIES ('foo'='bar');

-- Specify table comment and properties with different clauses order
> CREATE TABLE student (id INT, name STRING, age INT)
    TBLPROPERTIES ('foo'='bar')
    COMMENT 'this is a comment';

-- Create partitioned table
> CREATE TABLE student (id INT, name STRING, age INT)
    PARTITIONED BY (age);

-- Create a table with a generated column
> CREATE TABLE rectangles(a INT, b INT,
                          area INT GENERATED ALWAYS AS (a * b));

DESCRIBE TABLE (Databricks SQL)

Returns the basic metadata information of a table. The metadata information includes column name, column type and column comment. Optionally you can specify a partition spec or column name to return the metadata pertaining to a partition or column respectively.

{ DESC | DESCRIBE } [ TABLE ] [ EXTENDED | FORMATTED ] table_name { [ PARTITION clause ] | [ column_name ] }
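Examples

These examples assume the managed_table created in the demo schema earlier:

-- Describe the whole table, including detailed table metadata
> DESCRIBE TABLE EXTENDED managed_table;

-- Describe a single column
> DESCRIBE TABLE managed_table width;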

SHOW CREATE TABLE (Databricks SQL)

Returns the CREATE TABLE statement or CREATE VIEW statement that was used to create a given table or view.

SHOW CREATE TABLE { table_name | view_name }

Examples

> CREATE TABLE test (c INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    TBLPROPERTIES ('prop1' = 'value1', 'prop2' = 'value2');

> SHOW CREATE TABLE test;
                                       createtab_stmt
 ----------------------------------------------------
 CREATE TABLE `default`.`test` (`c` INT)
 USING text
 TBLPROPERTIES (
   'transient_lastDdlTime' = '1586269021',
   'prop1' = 'value1',
   'prop2' = 'value2')

DESCRIBE HISTORY (Databricks SQL)

You can retrieve information on the operations, user, timestamp, and so on for each write to a Delta table by running the history command. The operations are returned in reverse chronological order. By default table history is retained for 30 days.

This lets you view the versions and timestamps of a Delta table.

DESCRIBE HISTORY table_name

Examples

DESCRIBE HISTORY '/data/events/'          -- get the full history of the table

DESCRIBE HISTORY delta.`/data/events/`

DESCRIBE HISTORY '/data/events/' LIMIT 1  -- get the last operation only

DESCRIBE HISTORY eventsTable

The same history can be retrieved with the Delta Lake Python API:

from delta.tables import *

deltaTable = DeltaTable.forPath(spark, pathToTable)

fullHistoryDF = deltaTable.history()    # get the full history of the table

lastOperationDF = deltaTable.history(1) # get the last operation

RESTORE (Delta Lake on Databricks)

Restores a Delta table to an earlier state. Restoring to an earlier version number or a timestamp is supported.

RESTORE [ TABLE ] table_name [ TO ] time_travel_version

time_travel_version
 { TIMESTAMP AS OF timestamp_expression |
   VERSION AS OF version }
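Examples

A minimal sketch, assuming a Delta table named managed_table exists in the demo schema with at least one earlier version (see DESCRIBE HISTORY above):

-- Roll the table back to its first version
> RESTORE TABLE managed_table TO VERSION AS OF 0;

-- The restore itself is recorded in the table history
> DESCRIBE HISTORY managed_table;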

DROP TABLE (Databricks SQL)

DROP TABLE [ IF EXISTS ] table_name

Examples

-- Assumes a table named `employeetable` exists.
> DROP TABLE employeetable;

-- Assumes a table named `employeetable` exists in the `userdb` schema
> DROP TABLE userdb.employeetable;

-- Assumes a table named `employeetable` does not exist.
-- Throws exception
> DROP TABLE employeetable;
  Error: Table or view not found: employeetable;

-- Assumes a table named `employeetable` does not exist. Try with IF EXISTS
-- this time it will not throw exception
> DROP TABLE IF EXISTS employeetable;

Dropping External Tables Does Not Delete Data

Even though we dropped the table, we are still able to query the data directly in the filesystem because it still exists in the object store.

Now, this is one of the coolest features of Databricks SQL. We’ve talked about how the use of schemas and tables is just an organizational construct. The data files in this location can be queried directly, even though they are not part of a table or schema. We use tables and schemas simply to organize data in a way familiar to you.

SELECT * FROM delta.`wasbs://courseware@dbacademy.blob.core.windows.net/data-analysis-with-databricks/v01/sales`;

Databricks SQL – Schema

The code runs three statements. The first drops the schema just in case we are running this twice. The second creates the schema.

Note that there is no CATALOG provided. Databricks is set up to use a default catalog, and this is set up by your Databricks Administrator.

The third statement runs a DESCRIBE SCHEMA EXTENDED, which gives us information about the schema, including the location where managed table data will be stored.

DROP SCHEMA IF EXISTS demo_schema CASCADE;
CREATE SCHEMA IF NOT EXISTS demo_schema;
DESCRIBE SCHEMA EXTENDED demo_schema;

DROP SCHEMA (Databricks SQL)

Drops a schema and deletes the directory associated with the schema from the file system. An exception is thrown if the schema does not exist in the system.

While usage of SCHEMA and DATABASE is interchangeable, SCHEMA is preferred.

DROP SCHEMA [ IF EXISTS ] schema_name [ RESTRICT | CASCADE ]

Parameters

  • IF EXISTS – If specified, no exception is thrown when the schema does not exist.
  • schema_name – The name of an existing schema in the system. If the name does not exist, an exception is thrown.
  • RESTRICT – If specified, will restrict dropping a non-empty schema and is enabled by default.
  • CASCADE – If specified, will drop all the associated tables and functions.
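Examples

A small illustration of RESTRICT versus CASCADE, assuming the demo_schema created earlier still exists:

-- Fails if demo_schema still contains tables (RESTRICT is the default behavior)
> DROP SCHEMA demo_schema RESTRICT;

-- Drops the schema together with all of its tables and functions
> DROP SCHEMA IF EXISTS demo_schema CASCADE;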

CREATE SCHEMA (Databricks SQL)

Creates a schema with the specified name. If a schema with the same name already exists, an exception is thrown.

CREATE SCHEMA [ IF NOT EXISTS ] schema_name
    [ COMMENT schema_comment ]
    [ LOCATION schema_directory ]
    [ WITH DBPROPERTIES ( property_name = property_value [ , ... ] ) ]

Parameters

  • schema_name – The name of the schema to be created.
  • IF NOT EXISTS – Creates a schema with the given name if it does not exist. If a schema with the same name already exists, nothing will happen.
  • schema_directory – Path of the file system in which the specified schema is to be created. If the specified path does not exist in the underlying file system, creates a directory with the path. If the location is not specified, the schema is created in the default warehouse directory, whose path is configured by the static configuration spark.sql.warehouse.dir.
  • schema_comment – The description for the schema.
  • WITH DBPROPERTIES ( property_name = property_value [ , … ] ) – The properties for the schema in key-value pairs.
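Examples

A short example combining the optional clauses; the comment and property values here are purely illustrative:

-- Create a schema with a comment and user-defined properties
> CREATE SCHEMA IF NOT EXISTS demo_schema
    COMMENT 'Schema for the data analysis examples'
    WITH DBPROPERTIES (owner = 'data-team', purpose = 'training');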

DESCRIBE SCHEMA (Databricks SQL)

Returns the metadata of an existing schema. The metadata information includes the schema’s name, comment, and location on the filesystem. If the optional EXTENDED option is specified, schema properties are also returned.

While usage of SCHEMA and DATABASE is interchangeable, SCHEMA is preferred.

{ DESC | DESCRIBE } SCHEMA [ EXTENDED ] schema_name

Parameters

  • schema_name – The name of an existing schema in the system. If the name does not exist, an exception is thrown.
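Examples

For example, using the demo_schema created earlier:

-- Basic metadata: schema name, comment, and location
> DESCRIBE SCHEMA demo_schema;

-- EXTENDED also returns any schema properties
> DESCRIBE SCHEMA EXTENDED demo_schema;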