Delta Lake Features

1. สร้างตาราง students

%sql
CREATE TABLE students (
  id INT, name STRING, value DOUBLE);

ดูคำสั่งสร้างตารางนี้ด้วย SHOW CREATE TABLE

%sql
SHOW CREATE TABLE students
%sql
CREATE TABLE spark_catalog.default.students (
  id INT,
  name STRING,
  value DOUBLE)
USING delta
TBLPROPERTIES (
  'delta.minReaderVersion' = '1',
  'delta.minWriterVersion' = '2')

Using DESCRIBE EXTENDED allows us to see important metadata about our table.

%sql
DESCRIBE EXTENDED students
%python
df1 = spark.sql('DESCRIBE EXTENDED students')
df1.show()
+--------------------+--------------------+-------+
|            col_name|           data_type|comment|
+--------------------+--------------------+-------+
|                  id|                 int|   null|
|                name|              string|   null|
|               value|              double|   null|
|                    |                    |       |
|# Detailed Table ...|                    |       |
|             Catalog|       spark_catalog|       |
|            Database|             default|       |
|               Table|            students|       |
|                Type|             MANAGED|       |
|            Location|dbfs:/user/hive/w...|       |
|            Provider|               delta|       |
|               Owner|                root|       |
| Is_managed_location|                true|       |
|    Table Properties|[delta.minReaderV...|       |
+--------------------+--------------------+-------+

DESCRIBE DETAIL is another command that allows us to explore table metadata.

%sql
DESCRIBE DETAIL students
%python
df2 = spark.sql('DESCRIBE DETAIL students')
df2.show(vertical=True)
-RECORD 0--------------------------------
 format           | delta                
 id               | 46681f33-7201-4c6... 
 name             | spark_catalog.def... 
 description      | null                 
 location         | dbfs:/user/hive/w... 
 createdAt        | 2023-04-17 16:24:... 
 lastModified     | 2023-04-17 16:24:32  
 partitionColumns | []                   
 numFiles         | 0                    
 sizeInBytes      | 0                    
 properties       | {}                   
 minReaderVersion | 1                    
 minWriterVersion | 2        

ดู Delta Lake Files

%python
display(dbutils.fs.ls('dbfs:/user/hive/warehouse/students'))
%python
li_file = dbutils.fs.ls('dbfs:/user/hive/warehouse/students')
df3 = sqlContext.createDataFrame(li_file)
df3.show()
+--------------------+--------------------+----+----------------+
|                path|                name|size|modificationTime|
+--------------------+--------------------+----+----------------+
|dbfs:/user/hive/w...|         _delta_log/|   0|   1681750558289|
+--------------------+--------------------+----+----------------+

Reviewing Delta Lake Transactions

%sql
DESCRIBE HISTORY students
%python
df4 = spark.sql('DESCRIBE HISTORY students')
df4.show(vertical=True)
-RECORD 0-----------------------------------
 version             | 0                    
 timestamp           | 2023-04-17 16:24:32  
 userId              | 8501686721698164     
 userName            | odl_user_915759@d... 
 operation           | CREATE TABLE         
 operationParameters | {isManaged -> tru... 
 job                 | null                 
 notebook            | {1477724271071511}   
 clusterId           | 0415-162149-6ai590aw 
 readVersion         | null                 
 isolationLevel      | WriteSerializable    
 isBlindAppend       | true                 
 operationMetrics    | {}                   
 userMetadata        | null                 
 engineInfo          | Databricks-Runtim... 

2. เพิ่มข้อมูล 3 ครั้ง

%sql
INSERT INTO students VALUES (1, "Yve", 1.0);
INSERT INTO students VALUES (2, "Omar", 2.5);
INSERT INTO students VALUES (3, "Elia", 3.3);
%python
df2 = spark.sql('DESCRIBE DETAIL students')
df2.show(vertical=True)
RECORD 0--------------------------------
 format           | delta                
 id               | 46681f33-7201-4c6... 
 name             | spark_catalog.def... 
 description      | null                 
 location         | dbfs:/user/hive/w... 
 createdAt        | 2023-04-17 16:24:... 
 lastModified     | 2023-04-17 16:51:01  
 partitionColumns | []                   
 numFiles         | 3                    
 sizeInBytes      | 2613                 
 properties       | {}                   
 minReaderVersion | 1                    
 minWriterVersion | 2  
%python
li_file = dbutils.fs.ls('dbfs:/user/hive/warehouse/students')
df3 = sqlContext.createDataFrame(li_file)
df3.show()
+--------------------+--------------------+----+----------------+
|                path|                name|size|modificationTime|
+--------------------+--------------------+----+----------------+
|dbfs:/user/hive/w...|         _delta_log/|   0|   1681750558289|
|dbfs:/user/hive/w...|part-00000-1d6df3...| 868|   1681750256000|
|dbfs:/user/hive/w...|part-00000-57bc12...| 872|   1681750261000|
|dbfs:/user/hive/w...|part-00000-ec8db3...| 873|   1681750259000|
+--------------------+--------------------+----+----------------+
%python
display(spark.sql(f"SELECT * FROM json.`dbfs:/user/hive/warehouse/students/_delta_log/00000000000000000001.json`"))

รันคำสั่งนี้จะ error

%sql
SELECT * FROM parquet.`dbfs:/user/hive/warehouse/students/part-00000-1d6df344-3187-42fd-8591-df4ed47b403f.c000.snappy.parquet`
AnalysisException: Incompatible format detected.

A transaction log for Delta was found at `dbfs:/user/hive/warehouse/students/_delta_log`,
but you are trying to read from `dbfs:/user/hive/warehouse/students/part-00000-57bc1277-813b-41c0-990d-4ba8fd05c9b1.c000.snappy.parquet` using format("parquet"). You must use
'format("delta")' when reading and writing to a delta table.

To disable this check, SET spark.databricks.delta.formatCheck.enabled=false
To learn more about Delta, see https://docs.databricks.com/delta/index.html; line 1 pos 14

ให้ SET ค่านี้ก่อน

%sql
SET spark.databricks.delta.formatCheck.enabled=false

รันใหม่จะได้ละ

%sql
SELECT * FROM parquet.`dbfs:/user/hive/warehouse/students/part-00000-1d6df344-3187-42fd-8591-df4ed47b403f.c000.snappy.parquet`