Install Apache Spark on Ubuntu 22.04

Install Java

Update the package index

$ sudo apt update

Install the Java JDK (OpenJDK)

$ sudo apt install default-jdk

Verify the installation (the output below is from Ubuntu 22.04.2)

$ java --version
openjdk 11.0.19 2023-04-18
OpenJDK Runtime Environment (build 11.0.19+7-post-Ubuntu-0ubuntu122.04.1)
OpenJDK 64-Bit Server VM (build 11.0.19+7-post-Ubuntu-0ubuntu122.04.1, mixed mode, sharing)
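
The version check above can also be scripted. The sketch below is only an illustration: `java_major_version` and `is_supported_java` are hypothetical helper names, the parsing assumes the `java --version` output format of Java 9 and later, and the supported list (8, 11, 17) is taken from the Spark 3.3.x documentation. It requires bash.

```shell
# Hypothetical helper: extract the major version from the first line of
# `java --version`, e.g. "openjdk 11.0.19 2023-04-18" gives 11.
java_major_version() {
  local version
  # The second whitespace-separated field is the full version string.
  read -r _ version _ <<< "$1"
  # Keep everything before the first dot.
  echo "${version%%.*}"
}

# Hypothetical helper: Spark 3.3.x documents support for Java 8, 11, and 17.
is_supported_java() {
  case "$1" in
    8|11|17) return 0 ;;
    *) return 1 ;;
  esac
}

# Example using the first line of the output captured above.
major=$(java_major_version "openjdk 11.0.19 2023-04-18")
if is_supported_java "$major"; then
  echo "Java $major is supported by Spark 3.3.x"
else
  echo "Java $major is not supported by Spark 3.3.x"
fi
```

On a live system the first argument could be taken from `java --version | head -n 1` instead of the literal string.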

Install Apache Spark

Install the curl, mlocate, git, and scala packages

$ sudo apt install curl mlocate git scala

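Download Apache Spark from the Download Apache Spark™ page. If the mirror no longer carries this version, older releases are kept at https://archive.apache.org/dist/spark/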

$ wget https://dlcdn.apache.org/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz
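
Before extracting, it can be worth checking the download against the SHA-512 checksum that Apache publishes next to each release archive. The sketch below is a minimal illustration: `verify_sha512` is a hypothetical helper name, and the expected hash has to be copied by hand from the published `.sha512` file (it is not reproduced here).

```shell
# Hypothetical helper: succeed only if FILE hashes to EXPECTED_HEX.
# Usage: verify_sha512 FILE EXPECTED_HEX
verify_sha512() {
  local file="$1" expected="$2" actual
  # sha512sum prints "<hex digest>  <file name>"; keep only the digest.
  actual=$(sha512sum "$file" | awk '{ print $1 }')
  # Published checksums are sometimes upper case; normalise before comparing.
  expected=$(printf '%s' "$expected" | tr 'A-F' 'a-f')
  [ "$actual" = "$expected" ]
}

# Example (replace the placeholder with the published hash):
# verify_sha512 spark-3.3.2-bin-hadoop3.tgz "<hash from the .sha512 file>" \
#   && echo "checksum OK" || echo "checksum mismatch"
```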

Extract the archive

$ tar xvf spark-3.3.2-bin-hadoop3.tgz

Move the extracted directory to /opt/spark

$ sudo mv spark-3.3.2-bin-hadoop3/ /opt/spark

Set the Spark environment variables

$ nano ~/.bashrc

Append the following lines to the end of the .bashrc file

export SPARK_HOME=/opt/spark

export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

export SPARK_LOCAL_IP=localhost

export PYSPARK_PYTHON=/usr/bin/python3

export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH

Reload .bashrc so the new variables take effect

$ source ~/.bashrc
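
The PYTHONPATH line is the least obvious of these settings: it collects every .zip file under $SPARK_HOME/python/lib into a bash array and joins the entries with colons. The sketch below reproduces that logic in a hypothetical function, `join_spark_zips`, and runs it against a throwaway directory. The two file names are made up for the demo and are not the names shipped with Spark. It requires bash.

```shell
# Hypothetical helper: print a colon-separated list of every .zip file
# under DIR/python/lib, mirroring the PYTHONPATH line above.
join_spark_zips() {
  local spark_home="$1"
  local -a zips
  # The glob expands to one array element per matching file, sorted by name.
  zips=("$spark_home"/python/lib/*.zip)
  # "${zips[*]}" joins the elements with the first character of IFS.
  local IFS=:
  echo "${zips[*]}"
}

# Demo with a temporary directory and made-up file names.
demo_home=$(mktemp -d)
mkdir -p "$demo_home/python/lib"
touch "$demo_home/python/lib/first.zip" "$demo_home/python/lib/second.zip"
join_spark_zips "$demo_home"
rm -rf "$demo_home"
```

After reloading .bashrc, `join_spark_zips "$SPARK_HOME"` should print the same paths that appear at the start of `$PYTHONPATH`.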

Run the Spark shell

$ spark-shell

Run PySpark

$ pyspark
Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/07/28 23:41:19 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.3.2
      /_/

Using Python version 3.10.6 (main, Nov 14 2022 16:10:14)
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = local-1690562479751).
SparkSession available as 'spark'.
>>>

Start a standalone master server

$ start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-jack-org.apache.spark.deploy.master.Master-1-jack22042.out

The master's Web UI listens on TCP port 8080; the master itself accepts worker connections on port 7077.

$ sudo ss -tunelp | grep 8080
tcp   LISTEN 0      1      [::ffff:127.0.0.1]:8080             *:*    users:(("java",pid=6346,fd=283)) uid:1000 ino:63398 sk:e cgroup:/user.slice/user-1000.slice/session-4.scope v6only:0 <->
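
If the master is started from a script, the port check above can be turned into a wait loop. The sketch below is one possible approach, not part of Spark: `port_open` and `wait_for_port` are hypothetical helper names, and the check relies on bash's /dev/tcp pseudo-device and the coreutils `timeout` command rather than on `ss`.

```shell
# Hypothetical helper: succeed if a TCP connection to HOST:PORT can be opened.
port_open() {
  local host="$1" port="$2"
  # Opening /dev/tcp/HOST/PORT makes bash attempt a TCP connection;
  # timeout keeps the attempt from hanging on filtered ports.
  timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# Hypothetical helper: retry once per second, up to ATTEMPTS times (default 10).
wait_for_port() {
  local host="$1" port="$2" attempts="${3:-10}"
  local i
  for ((i = 0; i < attempts; i++)); do
    port_open "$host" "$port" && return 0
    sleep 1
  done
  return 1
}

# Example:
# start-master.sh && wait_for_port localhost 8080 && echo "Master Web UI is up"
```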

The master Web UI is available at http://localhost:8080

Start a Spark worker process (replace jack22042 with the master hostname shown at the top of the Web UI)

$ start-worker.sh spark://jack22042:7077
starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-jack-org.apache.spark.deploy.worker.Worker-1-jack22042.out
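
The master URL passed to start-worker.sh has the form spark://HOST:PORT. The sketch below builds it from the local hostname. It assumes the master advertises the machine's hostname on the default port 7077, as in the command above; `master_url` is a hypothetical helper name, and the URL shown in the master Web UI remains the authoritative value.

```shell
# Hypothetical helper: print a standalone master URL.
# HOST defaults to this machine's hostname, PORT to the default 7077.
master_url() {
  local host="${1:-$(hostname)}" port="${2:-7077}"
  echo "spark://${host}:${port}"
}

# Example with the hostname used in this guide.
master_url jack22042

# On the master machine itself the hostname can be left out:
# start-worker.sh "$(master_url)"
```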

Shut down the master and worker Spark processes

$ stop-worker.sh
$ stop-master.sh