Databricks SQL – Ingest Data

Ingest Data Using Databricks SQL

We can ingest data into Databricks SQL in a variety of ways. In this first example, we are going to start with a .csv file that exists in an object store and finish with a Delta table that contains all the data from that .csv file.

There are several things to note in these commands:

  • Since the file we are ingesting is a .csv file, we state USING csv
  • We are setting three options:
    • path – the path to the object store
    • header – whether or not the .csv file contains a header row
    • inferSchema – whether or not Databricks should infer the schema from the contents of the file
  • We are creating a second table, with default settings, from the contents of the first table using a CTAS statement (CREATE TABLE AS SELECT).

We create a second table as a copy of the first because we want the resulting table to be in Delta format, which gives us the most options. Because the second table uses default settings, it will be a Delta table.

USE demo;
DROP TABLE IF EXISTS web_events_csv;
CREATE TABLE web_events_csv 
    USING csv 
    OPTIONS (
        path='wasbs://courseware@dbacademy.blob.core.windows.net/data-analysis-with-databricks/v01/web_events/web-events.csv',
        header="true",
        inferSchema="true"
    );
DROP TABLE IF EXISTS web_events;
CREATE OR REPLACE TABLE web_events AS
    SELECT * FROM web_events_csv;
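If we want to verify that the second table really is a Delta table, Delta Lake's DESCRIBE DETAIL command returns the table's format, location, and other metadata; the format column should read "delta". (This check is a sketch and not part of the walkthrough above.)

USE demo;
DESCRIBE DETAIL web_events;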

The LOCATION Keyword

We can ingest data in-place using the LOCATION keyword to create an external/unmanaged table. There is a dataset, called Sales, that is currently in an object store. Running the command below creates an external table that is associated with this dataset.

The table’s data is stored in the external location, but the table itself is registered in the metastore. We can query the data just like any other table in the schema.

USE demo;
DROP TABLE IF EXISTS external_table;
CREATE TABLE external_table
    LOCATION 'wasbs://courseware@dbacademy.blob.core.windows.net/data-analysis-with-databricks/v01/sales';
SELECT * FROM external_table;
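Because the table is external, dropping it later removes only the entry in the metastore, not the underlying files. As an optional check, we can confirm the table type and storage location with DESCRIBE TABLE EXTENDED, which lists "Type" and "Location" in its output:

USE demo;
DESCRIBE TABLE EXTENDED external_table;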

COPY INTO

We use COPY INTO to load data from a file location into a Delta table. COPY INTO is a retriable and idempotent operation, so files in the source location that have already been loaded are skipped.

The first command creates an empty Delta table. Note that you must specify a schema when creating an empty Delta table.

The second command copies data from an object store location into the gym_logs table. Note that the file type for the files in the object store location is specified as “JSON”. The last part of the COPY INTO command is a file name, a list of file names, or a directory of files.

USE demo;
DROP TABLE IF EXISTS gym_logs;
CREATE TABLE gym_logs (first_timestamp DOUBLE, gym LONG, last_timestamp DOUBLE, mac STRING);
COPY INTO gym_logs 
    FROM 'wasbs://courseware@dbacademy.blob.core.windows.net/data-analysis-with-databricks/v01/gym-logs'
    FILEFORMAT = JSON
    FILES = ('20191201_2.json');

We can run the same COPY INTO command as above and add a second one. The first one will be skipped because the file has already been loaded.

USE demo;
COPY INTO gym_logs 
    FROM 'wasbs://courseware@dbacademy.blob.core.windows.net/data-analysis-with-databricks/v01/gym-logs'
    FILEFORMAT = JSON
    FILES = ('20191201_2.json');
COPY INTO gym_logs
    FROM 'wasbs://courseware@dbacademy.blob.core.windows.net/data-analysis-with-databricks/v01/gym-logs'
    FILEFORMAT = JSON
    FILES = ('20191201_3.json');

Do a count on the table you just populated using COPY INTO. Then rerun the COPY INTO code above and do another count. Notice that COPY INTO does not ingest data from files that have already been loaded using COPY INTO.
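The count described above might look like this; run it before and after rerunning the COPY INTO commands, and the number of rows should not change:

USE demo;
SELECT count(*) FROM gym_logs;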

Next, let’s add another COPY INTO command, but this time, we will use the PATTERN option. This allows us to load any file that fits a specific pattern.

USE demo;
COPY INTO gym_logs 
    FROM 'wasbs://courseware@dbacademy.blob.core.windows.net/data-analysis-with-databricks/v01/gym-logs'
    FILEFORMAT = JSON
    FILES = ('20191201_2.json');
COPY INTO gym_logs
    FROM 'wasbs://courseware@dbacademy.blob.core.windows.net/data-analysis-with-databricks/v01/gym-logs'
    FILEFORMAT = JSON
    FILES = ('20191201_3.json');
COPY INTO gym_logs 
    FROM 'wasbs://courseware@dbacademy.blob.core.windows.net/data-analysis-with-databricks/v01/gym-logs'
    FILEFORMAT = JSON
    PATTERN = '20191201_[0-9].json';

Privileges

Data object owners and Databricks administrators can grant and revoke a variety of privileges on securable objects. These objects include functions, files, tables, views, and more. These privileges can be granted using SQL or using the Data Explorer.

This statement returns the privileges granted on this schema.

SHOW GRANT ON SCHEMA demo;

GRANT

If we wish to grant privileges to a user, we can run GRANT commands for the specific privileges we wish to grant. The privileges available include USAGE, SELECT, CREATE, READ FILES, and more.

We need to start by granting USAGE to a user. We can then grant other privileges. If we want to grant all privileges, we can use GRANT ALL PRIVILEGES.

GRANT USAGE ON SCHEMA demo TO `phaisarn`;
GRANT SELECT ON SCHEMA demo TO `phaisarn`;

Examples

> GRANT CREATE ON SCHEMA <schema-name> TO `alf@melmak.et`;

> GRANT ALL PRIVILEGES ON TABLE forecasts TO finance;

> GRANT SELECT ON TABLE sample_data TO USERS;

REVOKE

We can revoke privileges on securable objects in much the same way.

We don’t want to actually run the command because we don’t want to try to revoke our own privileges, but here is the command:

REVOKE ALL PRIVILEGES ON SCHEMA `schema_name` FROM `user_name`;

Examples

> REVOKE ALL PRIVILEGES ON SCHEMA default FROM `alf@melmak.et`;

> REVOKE SELECT ON TABLE t FROM aliens;

Data Explorer

While we can certainly grant and revoke privileges using SQL, we can also use the Data Explorer.

  1. Click “Data” in the sidebar menu.
  2. If needed, select your schema from the dropdown.
  3. Select the table “gym_logs” from the list.
  4. Click “Permissions”.
  5. Use the “Grant” and “Revoke” buttons to change permission settings.