Write data

Just as there are many ways to read data, there are just as many ways to write it.

1. Writing Data

  • Writing data to Parquet files

This example uses the sample JSON file from Read data in JSON format.

# Required for StructField, StringType, IntegerType, etc.
from pyspark.sql.types import *

jsonSchema = StructType([
  StructField("id", LongType(), True),
  StructField("father", StringType(), True),
  StructField("mother", StringType(), True),
  StructField("children", StructType([
    StructField("first", StringType(), True),
    StructField("second", StringType(), True),
    StructField("third", StringType(), True)
  ]), True),
])
jsonFile = "/mnt/training/sample2.json"

jsonDF = (spark.read
    .schema(jsonSchema)
    .json(jsonFile)
)
display(jsonDF)

Now that we have a DataFrame, we can write it back out as Parquet files or various other formats.

parquetFile = "/mnt/training/family.parquet"

print("Output location: " + parquetFile)

(jsonDF.write                      # Our DataFrameWriter
  .option("compression", "snappy") # One of none, snappy, gzip, and lzo
  .mode("overwrite")               # Replace existing files
  .parquet(parquetFile)            # Write DataFrame to Parquet files
)
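
The DataFrameWriter exposes other formats through the same pattern; as a minimal sketch (the output path here is hypothetical), the same DataFrame can be written back out as JSON:

jsonOutFile = "/mnt/training/family.json"  # Hypothetical output path

(jsonDF.write          # Our DataFrameWriter
  .mode("overwrite")   # Replace existing files
  .json(jsonOutFile)   # Write DataFrame to JSON files
)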

Now that the files have been written out, we can see them in DBFS, either with the %fs magic command or with dbutils.fs.ls:

%fs ls /mnt/training/family.parquet
display(dbutils.fs.ls(parquetFile))

Lastly, we can read that same Parquet file back in and display the results:

display(spark.read.parquet(parquetFile))
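
Because Parquet stores the schema alongside the data, the nested children struct survives the round trip. As a quick sanity check (a sketch using the variables above):

parquetDF = spark.read.parquet(parquetFile)
parquetDF.printSchema()                     # The nested children struct is preserved
assert parquetDF.count() == jsonDF.count()  # Same row count as the original DataFrame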

Writing to CSV

In this example, if we write to CSV it will fail, because the CSV data source does not support the struct<first:string,second:string,third:string> data type.

csvFile = "/mnt/training/family.csv"

print("Output location: " + csvFile)

(jsonDF.write                      # Our DataFrameWriter
  .mode("overwrite")               # Replace existing files
  .csv(csvFile)                    # Write DataFrame to CSV files (fails on the struct column)
)
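
One way to make the CSV write succeed is to flatten the struct into top-level columns first. A minimal sketch (the flattened column names are chosen here for illustration):

from pyspark.sql.functions import col

flatDF = jsonDF.select(
  "id", "father", "mother",
  col("children.first").alias("child_first"),    # Pull each nested field up
  col("children.second").alias("child_second"),
  col("children.third").alias("child_third")
)

(flatDF.write
  .mode("overwrite")
  .option("header", "true")  # Include column names in the output
  .csv(csvFile)              # Succeeds: every column is now a simple type
)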

If the JSON uses this schema instead, writing to CSV also fails, because the CSV data source does not support the array<string> data type. A workaround sketch follows the schema below.

# Required for StructField, StringType, IntegerType, etc.
from pyspark.sql.types import *

jsonSchema = StructType([
  StructField("id", LongType(), True),
  StructField("father", StringType(), True),
  StructField("mother", StringType(), True),
  StructField("children", ArrayType(StringType()), True)
])
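
For arrays, a common workaround is to join the elements into a single string before writing. A minimal sketch, assuming a DataFrame read with the array schema above (the input path is hypothetical):

from pyspark.sql.functions import concat_ws

# Hypothetical input file matching the array<string> schema above
arrayDF = spark.read.schema(jsonSchema).json("/mnt/training/sample3.json")

csvSafeDF = arrayDF.withColumn("children", concat_ws(",", "children"))  # array -> "a,b,c"

(csvSafeDF.write
  .mode("overwrite")
  .csv(csvFile)  # children is now a plain string, so the CSV write succeeds
)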