Delta Lake Features

1. Create a students table

CREATE TABLE students (
  id INT, name STRING, value DOUBLE);

View the statement that created this table with SHOW CREATE TABLE:

SHOW CREATE TABLE students;

CREATE TABLE spark_catalog.default.students (
  id INT,
  name STRING,
  value DOUBLE)
USING delta
TBLPROPERTIES (
  'delta.minReaderVersion' = '1',
  'delta.minWriterVersion' = '2')

Using DESCRIBE EXTENDED allows us to see important metadata about our table.

df1 = spark.sql('DESCRIBE EXTENDED students')
df1.show()
|            col_name|           data_type|comment|
|                  id|                 int|   null|
|                name|              string|   null|
|               value|              double|   null|
|                    |                    |       |
|# Detailed Table ...|                    |       |
|             Catalog|       spark_catalog|       |
|            Database|             default|       |
|               Table|            students|       |
|                Type|             MANAGED|       |
|            Location|dbfs:/user/hive/w...|       |
|            Provider|               delta|       |
|               Owner|                root|       |
| Is_managed_location|                true|       |
|    Table Properties|[delta.minReaderV...|       |

DESCRIBE DETAIL is another command that allows us to explore table metadata.

df2 = spark.sql('DESCRIBE DETAIL students')
df2.show(vertical=True)
-RECORD 0--------------------------------
 format           | delta                
 id               | 46681f33-7201-4c6... 
 name             | spark_catalog.def... 
 description      | null                 
 location         | dbfs:/user/hive/w... 
 createdAt        | 2023-04-17 16:24:... 
 lastModified     | 2023-04-17 16:24:32  
 partitionColumns | []                   
 numFiles         | 0                    
 sizeInBytes      | 0                    
 properties       | {}                   
 minReaderVersion | 1                    
 minWriterVersion | 2        

Viewing the Delta Lake Files

li_file = dbutils.fs.ls('dbfs:/user/hive/warehouse/students')
df3 = sqlContext.createDataFrame(li_file)
df3.show()
|                path|                name|size|modificationTime|
|dbfs:/user/hive/w...|         _delta_log/|   0|   1681750558289|

Reviewing Delta Lake Transactions

df4 = spark.sql('DESCRIBE HISTORY students')
df4.show(vertical=True)
-RECORD 0-----------------------------------
 version             | 0                    
 timestamp           | 2023-04-17 16:24:32  
 userId              | 8501686721698164     
 userName            | odl_user_915759@d... 
 operation           | CREATE TABLE         
 operationParameters | {isManaged -> tru... 
 job                 | null                 
 notebook            | {1477724271071511}   
 clusterId           | 0415-162149-6ai590aw 
 readVersion         | null                 
 isolationLevel      | WriteSerializable    
 isBlindAppend       | true                 
 operationMetrics    | {}                   
 userMetadata        | null                 
 engineInfo          | Databricks-Runtim... 

2. Insert data three times

INSERT INTO students VALUES (1, "Yve", 1.0);
INSERT INTO students VALUES (2, "Omar", 2.5);
INSERT INTO students VALUES (3, "Elia", 3.3);
df2 = spark.sql('DESCRIBE DETAIL students')
df2.show(vertical=True)
-RECORD 0--------------------------------
 format           | delta                
 id               | 46681f33-7201-4c6... 
 name             | spark_catalog.def... 
 description      | null                 
 location         | dbfs:/user/hive/w... 
 createdAt        | 2023-04-17 16:24:... 
 lastModified     | 2023-04-17 16:51:01  
 partitionColumns | []                   
 numFiles         | 3                    
 sizeInBytes      | 2613                 
 properties       | {}                   
 minReaderVersion | 1                    
 minWriterVersion | 2  
li_file = dbutils.fs.ls('dbfs:/user/hive/warehouse/students')
df3 = sqlContext.createDataFrame(li_file)
df3.show()
|                path|                name|size|modificationTime|
|dbfs:/user/hive/w...|         _delta_log/|   0|   1681750558289|
|dbfs:/user/hive/w...|part-00000-1d6df3...| 868|   1681750256000|
|dbfs:/user/hive/w...|part-00000-57bc12...| 872|   1681750261000|
|dbfs:/user/hive/w...|part-00000-ec8db3...| 873|   1681750259000|
display(spark.sql("SELECT * FROM json.`dbfs:/user/hive/warehouse/students/_delta_log/00000000000000000001.json`"))

Running this query will produce an error:

SELECT * FROM parquet.`dbfs:/user/hive/warehouse/students/part-00000-1d6df344-3187-42fd-8591-df4ed47b403f.c000.snappy.parquet`
AnalysisException: Incompatible format detected.

A transaction log for Delta was found at `dbfs:/user/hive/warehouse/students/_delta_log`,
but you are trying to read from `dbfs:/user/hive/warehouse/students/part-00000-57bc1277-813b-41c0-990d-4ba8fd05c9b1.c000.snappy.parquet` using format("parquet"). You must use
'format("delta")' when reading and writing to a delta table.

To disable this check, SET spark.databricks.delta.formatCheck.enabled=false
To learn more about Delta, see; line 1 pos 14

SET this value first:

SET spark.databricks.delta.formatCheck.enabled=false

SELECT * FROM parquet.`dbfs:/user/hive/warehouse/students/part-00000-1d6df344-3187-42fd-8591-df4ed47b403f.c000.snappy.parquet`


CREATE VIEW

Constructs a virtual table that has no physical data, based on the result set of a SQL query. ALTER VIEW and DROP VIEW only change metadata.

CREATE [ OR REPLACE ] [ TEMPORARY ] VIEW [ IF NOT EXISTS ] view_name
    [ column_list ]
    [ COMMENT view_comment ]
    [ TBLPROPERTIES clause ]
    AS query

column_list
    ( { column_alias [ COMMENT column_comment ] } [, ...] )


CREATE OR REPLACE TEMPORARY VIEW demo_tmp1(name, value) AS
VALUES
  ("Yi", 1),
  ("Ali", 2),
  ("Selina", 3)

Or you can write TEMP instead of TEMPORARY:

CREATE OR REPLACE TEMP VIEW demo_tmp1(name, value) AS
VALUES
  ("Yi", 1),
  ("Ali", 2),
  ("Selina", 3)

PySpark: display a spark data frame in a table format

Create a PySpark DataFrame named df

df = sqlContext.createDataFrame([("foo", 1), ("bar", 2), ("baz", 3)], ("k", "v"))

print(type(df))
# <class 'pyspark.sql.dataframe.DataFrame'>

df.show()
# +---+---+
# |  k|  v|
# +---+---+
# |foo|  1|
# |bar|  2|
# |baz|  3|
# +---+---+

Show the table with n = 2

df.show(2)
# +---+---+
# |  k|  v|
# +---+---+
# |foo|  1|
# |bar|  2|
# +---+---+
# only showing top 2 rows
df.show(2, True)

# +---+---+
# |  k|  v|
# +---+---+
# |foo|  1|
# |bar|  2|
# +---+---+
# only showing top 2 rows


df.show?

Signature:
df.show(
    n: int = 20,
    truncate: Union[bool, int] = True,
    vertical: bool = False,
) -> None
Docstring:
Prints the first ``n`` rows to the console.

.. versionadded:: 1.3.0

n : int, optional
    Number of rows to show.
truncate : bool or int, optional
    If set to ``True``, truncate strings longer than 20 chars by default.
    If set to a number greater than one, truncates long strings to length ``truncate``
    and align cells right.
vertical : bool, optional
    If set to ``True``, print output rows vertically (one line
    per column value).

>>> df
DataFrame[age: int, name: string]
>>> df.show()
|age| name|
|  2|Alice|
|  5|  Bob|
>>> df.show(truncate=3)
|age|name|
|  2| Ali|
|  5| Bob|
>>> df.show(vertical=True)
-RECORD 0-----
 age  | 2
 name | Alice
-RECORD 1-----
 age  | 5
 name | Bob
File:      /databricks/spark/python/pyspark/sql/

Install Java JDK on Ubuntu 20.04

Installing Java

Update the package lists first

$ sudo apt update

Install Java JDK 11 (openjdk)

$ sudo apt install default-jdk

To install Java 8 instead, use:

sudo apt install openjdk-8-jdk

The programs are installed in /usr/lib/jvm/java-11-openjdk-amd64/bin/

    $ ls -l /usr/lib/jvm/java-11-openjdk-amd64/bin/java*
    -rwxr-xr-x 1 root root 14560 ม.ค.  20 16:07 /usr/lib/jvm/java-11-openjdk-amd64/bin/java
    -rwxr-xr-x 1 root root 14608 ม.ค.  20 16:07 /usr/lib/jvm/java-11-openjdk-amd64/bin/javac
    -rwxr-xr-x 1 root root 14608 ม.ค.  20 16:07 /usr/lib/jvm/java-11-openjdk-amd64/bin/javadoc
    -rwxr-xr-x 1 root root 14576 ม.ค.  20 16:07 /usr/lib/jvm/java-11-openjdk-amd64/bin/javap

    Verify the installation on Ubuntu 20.04.6

    $ java --version
    openjdk 11.0.18 2023-01-17
    OpenJDK Runtime Environment (build 11.0.18+10-post-Ubuntu-0ubuntu120.04.1)
    OpenJDK 64-Bit Server VM (build 11.0.18+10-post-Ubuntu-0ubuntu120.04.1, mixed mode, sharing)
    $ javac --version
    javac 11.0.18

    Verify the installation on Ubuntu 22.04.2

    $ java --version
    openjdk 11.0.18 2023-01-17
    OpenJDK Runtime Environment (build 11.0.18+10-post-Ubuntu-0ubuntu122.04)
    OpenJDK 64-Bit Server VM (build 11.0.18+10-post-Ubuntu-0ubuntu122.04, mixed mode, sharing)
    $ javac --version
    javac 11.0.18

    Managing Java

    Use the update-alternatives command

    $ update-alternatives --help
    Usage: update-alternatives [<option> ...] <command>
      --install <link> <name> <path> <priority>
        [--slave <link> <name> <path>] ...
                               add a group of alternatives to the system.
      --remove <name> <path>   remove <path> from the <name> group alternative.
      --remove-all <name>      remove <name> group from the alternatives system.
      --auto <name>            switch the master link <name> to automatic mode.
      --display <name>         display information about the <name> group.
      --query <name>           machine parseable version of --display <name>.
      --list <name>            display all targets of the <name> group.
      --get-selections         list master alternative names and their status.
      --set-selections         read alternative status from standard input.
      --config <name>          show alternatives for the <name> group and ask the
                               user to select which one to use.
      --set <name> <path>      set <path> as alternative for <name>.
      --all                    call --config on all alternatives.
    <link> is the symlink pointing to /etc/alternatives/<name>.
      (e.g. /usr/bin/pager)
    <name> is the master name for this link group.
      (e.g. pager)
    <path> is the location of one of the alternative target files.
      (e.g. /usr/bin/less)
    <priority> is an integer; options with higher numbers have higher priority in
      automatic mode.
      --altdir <directory>     change the alternatives directory.
      --admindir <directory>   change the administrative directory.
      --log <file>             change the log file.
      --force                  allow replacing files with alternative links.
      --skip-auto              skip prompt for alternatives correctly configured
                               in automatic mode (relevant for --config only)
      --quiet                  quiet operation, minimal output.
      --verbose                verbose operation, more output.
      --debug                  debug output, way more output.
      --help                   show this help message.
      --version                show the version.

    You can have multiple Java installations on one server. You can configure which version is the default for use on the command line by using the update-alternatives command.

    $ sudo update-alternatives --config java

    If only one java is installed, you will see something like this:

    $ sudo update-alternatives --config java
    There is only one alternative in link group java (providing /usr/bin/java): /usr/lib/jvm/java-11-openjdk-amd64/bin/java
    Nothing to configure.

    If multiple versions of java are installed, it will show a list for you to choose from.

    The same goes for javac:

    $ sudo update-alternatives --config javac

    Setting the JAVA_HOME

    $ sudo nano /etc/environment

    At the end of this file, add the following line, making sure not to include the bin/ portion of the path (you can find the path with update-alternatives):

    JAVA_HOME="/usr/lib/jvm/java-11-openjdk-amd64"
    Modifying this file will set the JAVA_HOME path for all users on your system.

    Save the file and exit the editor.

    Now reload this file to apply the changes to your current session:

    $ source /etc/environment

    Verify that the environment variable is set:

    $ echo $JAVA_HOME


    The apt command

    View apt's help:

    $ apt --help
    apt 2.4.8 (amd64)
    Usage: apt [options] command
    apt is a commandline package manager and provides commands for
    searching and managing as well as querying information about packages.
    It provides the same functionality as the specialized APT tools,
    like apt-get and apt-cache, but enables options more suitable for
    interactive use by default.
    Most used commands:
      list - list packages based on package names
      search - search in package descriptions
      show - show package details
      install - install packages
      reinstall - reinstall packages
      remove - remove packages
      autoremove - Remove automatically all unused packages
      update - update list of available packages
      upgrade - upgrade the system by installing/upgrading packages
      full-upgrade - upgrade the system by removing/installing/upgrading packages
      edit-sources - edit the source information file
      satisfy - satisfy dependency strings
    See apt(8) for more information about the available commands.
    Configuration options and syntax is detailed in apt.conf(5).
    Information about how to configure sources can be found in sources.list(5).
    Package and version choices can be expressed via apt_preferences(5).
    Security details are available in apt-secure(8).
                                            This APT has Super Cow Powers.

    Update and Upgrade packages

    update list of available packages

    $ sudo apt update

    List packages that can be upgraded:

    $ apt list --upgradable

    upgrade the system by installing/upgrading packages

    $ sudo apt upgrade

    Run apt update followed by apt upgrade in one line using &&:

    When using &&, the second command is executed only if the first one completed successfully.

    $ sudo apt update && sudo apt upgrade

    Install or Remove package

    install packages

    $ sudo apt install <package_name>

    remove packages

    $ sudo apt remove <package_name>

    Remove automatically all unused packages

    $ sudo apt autoremove

    reinstall packages

    sudo apt reinstall <package_name>
    sudo apt reinstall lighttpd

    Hold a package with apt-mark

    sudo apt-mark hold <package_name>
    sudo apt-mark hold sudo

    Unhold a package with apt-mark

    sudo apt-mark unhold <package_name>
    sudo apt-mark unhold sudo


    show package details

    To show or see information about the given package(s) including its dependencies, installation and download size, sources the package is available from, the description of the packages content and much more:

    $ apt show <package_name>
    $ apt show sudo

    List package dependency

    apt depends <package_name>
    apt depends sudo

    search in package descriptions

    apt search php
    apt search mysql-5.?
    apt search mysql-server-5.?
    apt search httpd*
    apt search ^apache
    apt search ^nginx
    apt search ^nginx$

    apt search searches package descriptions, which returns too many results and makes it hard to find the package you want; try apt list instead.

    apt list
    apt list | more
    apt list | grep foo
    apt list | grep php7-
    apt list nginx
    apt list 'php7*'

    List all installed packages

    apt list --installed
    apt list --installed | grep <package_name>


    Write Excel with PySpark

    On the cluster, install com.crealytics:spark-excel-2.12.17-3.0.1_2.12:3.0.1_0.18.1

    Create a PySpark DataFrame

    data = [('A', "1"),
            ('B', "2"),
            ('C', "3"),
            ('D', "4")]
    print(type(data))  # <class 'list'>
    df = spark.createDataFrame(data)
    print(type(df))    # <class 'pyspark.sql.dataframe.DataFrame'>

    Write the Excel file

    path = '/mnt/xxx/tmp/'
    filename = f'{path}output1.xlsx'
    print(f'filename = {filename}')

    df.write.format("com.crealytics.spark.excel")\
      .option("header", "true")\
      .mode("overwrite")\
      .save(filename)

    Try %fs ls '/mnt/xxx/tmp/' and you will see the file dbfs:/mnt/xxx/tmp/output1.xlsx

    Create another DataFrame

    columns = ['Identifier', 'Value', 'Extra Discount']
    vals = [(1, 150, 0), (2, 160, 12)]
    df2 = spark.createDataFrame(vals, columns)
    df2.show()
    # +----------+-----+--------------+
    # |Identifier|Value|Extra Discount|
    # +----------+-----+--------------+
    # |         1|  150|             0|
    # |         2|  160|            12|
    # +----------+-----+--------------+

    Write in append mode with the data placed at cells B3 to C35; since that range only spans two columns, the Extra Discount column is lost.

    df2.write.format("com.crealytics.spark.excel") \
      .option("dataAddress", "'My Sheet'!B3:C35") \
      .option("header", "true") \
      .mode("append") \
      .save(filename)

    Write Excel with Pandas

    Create a pandas DataFrame

    import pandas as pd
    import openpyxl
    df = pd.DataFrame([[11, 21, 31], [12, 22, 32], [31, 32, 33]],
                      index=['one', 'two', 'three'], columns=['a', 'b', 'c'])
    print(df)
    #         a   b   c
    # one    11  21  31
    # two    12  22  32
    # three  31  32  33
    print(type(df))
    # <class 'pandas.core.frame.DataFrame'>

    Write the pandas DataFrame to a file

    path = '/tmp/'
    filename = f'{path}output1.xlsx'
    with pd.ExcelWriter(filename) as writer:  
         df.to_excel(writer, sheet_name='Sheet_name_1')
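pd.ExcelWriter can also put several DataFrames into one workbook by calling to_excel more than once inside the same with block, one sheet per call. A small sketch (the file name and sheet names here are just examples):

```python
import os
import tempfile

import pandas as pd

df_a = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df_b = pd.DataFrame({'x': [10, 20]})

# Both DataFrames go into the same workbook, each on its own sheet.
path = os.path.join(tempfile.gettempdir(), 'output2.xlsx')
with pd.ExcelWriter(path) as writer:
    df_a.to_excel(writer, sheet_name='Sheet_A')
    df_b.to_excel(writer, sheet_name='Sheet_B')

# Read one sheet back to confirm.
back = pd.read_excel(path, sheet_name='Sheet_B', index_col=0)
print(back['x'].tolist())  # [10, 20]
```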

    Python Naming Conventions

    The naming styles

    • b (single lowercase letter)
    • B (single uppercase letter)
    • lowercase
    • lower_case_with_underscores
    • CapitalizedWords (or CapWords, or CamelCase – so named because of the bumpy look of its letters [4]). This is also sometimes known as StudlyCaps. Note: When using acronyms in CapWords, capitalize all the letters of the acronym. Thus HTTPServerError is better than HttpServerError.
    • mixedCase (differs from CapitalizedWords by initial lowercase character!)
    • Capitalized_Words_With_Underscores (ugly!)

    Class Names

    Class names should normally use the CapWords convention.

    Function Names

    Function names should be lowercase, with words separated by underscores as necessary to improve readability. (lower_case_with_underscores)

    Variable Names

    Variable names follow the same convention as function names.
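Putting the three conventions together in one short, made-up snippet:

```python
# Class name: CapWords, with acronyms fully capitalized
# (HTTPRequestParser, not HttpRequestParser).
class HTTPRequestParser:
    # Function/method name: lower_case_with_underscores.
    def parse_header_line(self, raw_line):
        # Variable names follow the same convention as function names.
        header_name, _, header_value = raw_line.partition(":")
        return header_name.strip().lower(), header_value.strip()

parser = HTTPRequestParser()
print(parser.parse_header_line("Content-Type: text/html"))
# ('content-type', 'text/html')
```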

    SQL Declare Variable

    The SET command

    Assign a value with SQL's SET command, then read it back from SQL:

    SET value = 2;
    SELECT ${hiveconf:value} 
    SELECT ${hiveconf:value} AS value
    SET LastChangeDate = current_date()
    Select ${hiveconf:LastChangeDate}

    The spark.conf.set() command

    Assign a value from Python with spark.conf.set(), then read it from SQL:

    spark.conf.set("ab", "jack")
    spark.conf.set("AB", "jack")

    Notice that Python sets the value under both ab and AB so that SQL can read it as either ab or AB.

    SELECT '${ab}' AS name

    Databricks extension for Visual Studio Code

    Before you begin

    Before you can use the Databricks extension for Visual Studio Code, your Databricks workspace and your local development machine must meet the following requirements. You must also have an access token to authenticate with Databricks.

    • Workspace requirements
    • Access token
    • Local development machine requirements

    Workspace requirements

    enable Files in Repos

    Configure the file %USERPROFILE%\.databrickscfg, e.g. C:\Users\jack\.databrickscfg. Each workspace goes under its own profile header (the DEFAULT and DEV names below are examples):

    [DEFAULT]
    host =
    token = dapi1234567xxx123456yyyy123456789012

    [DEV]
    host =
    token = dapi2345678xxx234567yyyy234567890123

    Access token

    You must have a Databricks personal access token. If you do not have one available, you can generate a personal access token now.

    Local development machine requirements

    Visual Studio Code version 1.69.1 or higher.

    Install the extension


    from pyspark.sql import SparkSession
    spark: SparkSession = spark
    print("Hello from Databricks")