Ingest Data Using Databricks SQL
We can ingest data into Databricks SQL in a variety of ways. In this first example, we are going to start with a .csv file that exists in an object store and finish with a Delta table that contains all the data from that .csv file.
There are several things to note in these commands:
- Since the file we are ingesting is a .csv file, we state USING csv
- We are setting three options:
  - path – the path to the object store
  - header – whether or not the .csv file contains a header row
  - inferSchema – whether or not Databricks should infer the schema from the contents of the file
- We are creating a second table, with default settings, from the contents of the first table, using a CTAS statement (CREATE TABLE AS SELECT).
We create a second table as a copy of the first because we want the resulting table to be in Delta format, which gives us the most options. Because the second table uses default settings, it is created as a Delta table.
USE demo;

DROP TABLE IF EXISTS web_events_csv;

CREATE TABLE web_events_csv
  USING csv
  OPTIONS (
    path = 'wasbs://courseware@dbacademy.blob.core.windows.net/data-analysis-with-databricks/v01/web_events/web-events.csv',
    header = "true",
    inferSchema = "true"
  );

DROP TABLE IF EXISTS web_events;

CREATE OR REPLACE TABLE web_events AS
  SELECT * FROM web_events_csv;
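If you want to verify that the two tables ended up in the expected formats, one quick check (a sketch that assumes the demo schema and the two tables created above) is to inspect each table's metadata and look at the Provider row:

USE demo;
DESCRIBE TABLE EXTENDED web_events_csv;  -- Provider: CSV
DESCRIBE TABLE EXTENDED web_events;      -- Provider: delta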
The LOCATION Keyword
We can ingest data in-place using the LOCATION keyword to create an external/unmanaged table. There is a dataset, called Sales, that is currently in an object store. Running the command below creates an external table that is associated with this dataset.
The table’s data is stored in the external location, but the table itself is registered in the metastore. We can query the data just like any other table in the schema.
USE demo;

DROP TABLE IF EXISTS external_table;

CREATE TABLE external_table
  LOCATION 'wasbs://courseware@dbacademy.blob.core.windows.net/data-analysis-with-databricks/v01/sales';

SELECT * FROM external_table;
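To confirm that the table is registered as an external (unmanaged) table and to see where its data lives, you could inspect its metadata; a minimal sketch, reusing the table name from the command above:

DESCRIBE TABLE EXTENDED external_table;  -- the Type row shows EXTERNAL, and Location shows the object store path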
COPY INTO
- Load data with COPY INTO | Databricks on AWS
- COPY INTO – Azure Databricks | Microsoft Docs
- Common data loading patterns with COPY INTO – Azure Databricks | Microsoft Docs
We use COPY INTO to load data from a file location into a Delta table. COPY INTO is a retriable and idempotent operation, so files in the source location that have already been loaded are skipped.
The first command creates an empty Delta table. Note that you must specify a schema when creating an empty Delta table.
The second command copies data from an object store location into the gym_logs table. Note that the file type for the files in the object store location is specified as JSON. The last part of the COPY INTO command is a file name, a list of file names, or a directory of files.
USE demo;

DROP TABLE IF EXISTS gym_logs;

CREATE TABLE gym_logs (
  first_timestamp DOUBLE,
  gym LONG,
  last_timestamp DOUBLE,
  mac STRING
);

COPY INTO gym_logs
  FROM 'wasbs://courseware@dbacademy.blob.core.windows.net/data-analysis-with-databricks/v01/gym-logs'
  FILEFORMAT = JSON
  FILES = ('20191201_2.json');
We can run the same COPY INTO command as above and add a second one. The first one will be skipped because the file has already been loaded.
USE demo;

COPY INTO gym_logs
  FROM 'wasbs://courseware@dbacademy.blob.core.windows.net/data-analysis-with-databricks/v01/gym-logs'
  FILEFORMAT = JSON
  FILES = ('20191201_2.json');

COPY INTO gym_logs
  FROM 'wasbs://courseware@dbacademy.blob.core.windows.net/data-analysis-with-databricks/v01/gym-logs'
  FILEFORMAT = JSON
  FILES = ('20191201_3.json');
Do a count on the table you just populated using COPY INTO. Then, rerun the COPY INTO code above, and do another count. Notice that COPY INTO does not ingest data from files that have already been loaded with COPY INTO.
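A count check might look like the following sketch, which reuses the gym_logs table from above:

USE demo;
SELECT count(*) FROM gym_logs;
-- Re-run the COPY INTO statements above, then run the count again;
-- the number stays the same because already-loaded files are skipped.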
Next, let’s add another COPY INTO command, but this time we will use the PATTERN option. This allows us to load any file that fits a specific pattern.
USE demo;

COPY INTO gym_logs
  FROM 'wasbs://courseware@dbacademy.blob.core.windows.net/data-analysis-with-databricks/v01/gym-logs'
  FILEFORMAT = JSON
  FILES = ('20191201_2.json');

COPY INTO gym_logs
  FROM 'wasbs://courseware@dbacademy.blob.core.windows.net/data-analysis-with-databricks/v01/gym-logs'
  FILEFORMAT = JSON
  FILES = ('20191201_3.json');

COPY INTO gym_logs
  FROM 'wasbs://courseware@dbacademy.blob.core.windows.net/data-analysis-with-databricks/v01/gym-logs'
  FILEFORMAT = JSON
  PATTERN = '20191201_[0-9].json';
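PATTERN accepts glob-style expressions, so a wildcard works as well as a character class. The following variant (a sketch reusing the same source path; the wildcard is illustrative) would load every file whose name starts with 20191201_:

COPY INTO gym_logs
  FROM 'wasbs://courseware@dbacademy.blob.core.windows.net/data-analysis-with-databricks/v01/gym-logs'
  FILEFORMAT = JSON
  PATTERN = '20191201_*.json';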
Privileges
Data object owners and Databricks administrators can grant and revoke a variety of privileges on securable objects. These objects include functions, files, tables, views, and more. These privileges can be granted using SQL or using the Data Explorer.
This statement returns the privileges granted on this schema.
SHOW GRANT ON SCHEMA demo;
GRANT
If we wish to grant privileges to a user, we can run GRANT commands for the specific privileges we wish to grant. The privileges available include USAGE, SELECT, CREATE, READ FILES, and more.
We need to start by granting USAGE to a user. We can then grant other privileges. If we want to grant all privileges, we can use GRANT ALL PRIVILEGES.
GRANT USAGE ON SCHEMA demo TO `phaisarn`;
GRANT SELECT ON SCHEMA demo TO `phaisarn`;
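To confirm what the user can now do, you could scope the SHOW GRANT statement to that principal; a sketch reusing the demo schema and the `phaisarn` principal from above:

SHOW GRANT `phaisarn` ON SCHEMA demo;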
Examples
> GRANT CREATE ON SCHEMA <schema-name> TO `alf@melmak.et`;
> GRANT ALL PRIVILEGES ON TABLE forecasts TO finance;
> GRANT SELECT ON TABLE sample_data TO USERS;
REVOKE
We can revoke privileges on securable objects in much the same way.
We won’t actually run the command here, because we don’t want to revoke our own privileges, but this is what it looks like:
REVOKE ALL PRIVILEGES ON SCHEMA `schema_name` FROM `user_name`;
Examples
> REVOKE ALL PRIVILEGES ON SCHEMA default FROM `alf@melmak.et`;
> REVOKE SELECT ON TABLE t FROM aliens;
Data Explorer
While we can certainly grant and revoke privileges using SQL, we can also use the Data Explorer.
- Click “Data” in the sidebar menu.
- If needed, select your schema in the dropdown.
- Select the “gym_logs” table from the list.
- Click “Permissions”.
- Use the “Grant” and “Revoke” buttons to change permission settings.