Just as there are many ways to read data, we have just as many ways to write data.
1. Writing Data
- Writing data to Parquet files
This example uses the sample JSON file from Read data in JSON format.
# Required for StructField, StringType, IntegerType, etc.
from pyspark.sql.types import *

jsonSchema = StructType([
    StructField("id", LongType(), True),
    StructField("father", StringType(), True),
    StructField("mother", StringType(), True),
    StructField("children", StructType([
        StructField("first", StringType(), True),
        StructField("second", StringType(), True),
        StructField("third", StringType(), True)
    ]), True),
])

jsonFile = "/mnt/training/sample2.json"

jsonDF = (spark.read
    .schema(jsonSchema)
    .json(jsonFile)
)
display(jsonDF)
Now that we have a DataFrame, we can write it back out as Parquet files or in various other formats.
parquetFile = "/mnt/training/family.parquet"

print("Output location: " + parquetFile)

(jsonDF.write                         # Our DataFrameWriter
    .option("compression", "snappy")  # One of none, snappy, gzip, and lzo
    .mode("overwrite")                # Replace existing files
    .parquet(parquetFile)             # Write DataFrame to Parquet files
)
Now that the file has been written out, we can see it in DBFS:
%fs ls /mnt/training/family.parquet
display(dbutils.fs.ls(parquetFile))
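The listing shows that family.parquet is a directory rather than a single file: Spark writes one part file per partition, alongside marker files such as _SUCCESS. Below is a minimal sketch for picking out just the data files. It assumes the entries returned by dbutils.fs.ls expose a `name` attribute; the helper name `parquet_part_files` is hypothetical and not part of any library.

```python
def parquet_part_files(entries):
    """Keep only the Parquet data files from a directory listing.

    `entries` is any iterable of objects with a `name` attribute,
    such as the FileInfo objects returned by dbutils.fs.ls.
    """
    # Part files end in ".parquet"; marker files such as _SUCCESS do not.
    return [entry for entry in entries if entry.name.endswith(".parquet")]
```

In the notebook, this could be used as `for f in parquet_part_files(dbutils.fs.ls(parquetFile)): print(f.name, f.size)`.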
Lastly, we can read that same Parquet file back in and display the results:
display(spark.read.parquet(parquetFile))
Writing to CSV
In this example, writing the DataFrame as CSV fails with an error, because the CSV data source does not support the struct<first:string,second:string,third:string> data type.
csvFile = "/mnt/training/family.csv"

print("Output location: " + csvFile)

(jsonDF.write           # Our DataFrameWriter
    .mode("overwrite")  # Replace existing files
    .csv(csvFile)       # Write DataFrame to CSV files
)
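Because CSV cannot represent a struct column, one way around the error is to promote the nested fields to top-level columns before writing. Below is a minimal sketch, assuming a DataFrame with the `children` struct schema defined earlier. The helper name `write_flat_csv` and the `header` option are illustrative choices, not part of the original notebook.

```python
def write_flat_csv(df, path):
    """Flatten the `children` struct and write the result as CSV."""
    # Selecting "children.first" pulls the nested field up to a top-level
    # column named "first"; likewise for "second" and "third".
    flatDF = df.select(
        "id", "father", "mother",
        "children.first", "children.second", "children.third",
    )
    (flatDF.write
        .option("header", "true")  # Keep the column names in the output
        .mode("overwrite")         # Replace existing files
        .csv(path)                 # Write DataFrame to CSV files
    )
    return flatDF
```

It could be called as `write_flat_csv(jsonDF, "/mnt/training/family-flat.csv")`, where the output path is a hypothetical location.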
If the JSON is read with the following schema instead, writing as CSV also fails, because the CSV data source does not support the array<string> data type.
# Required for StructField, StringType, IntegerType, etc.
from pyspark.sql.types import *

jsonSchema = StructType([
    StructField("id", LongType(), True),
    StructField("father", StringType(), True),
    StructField("mother", StringType(), True),
    StructField("children", ArrayType(StringType()), True)
])
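For the array case, a common workaround is to join the elements into a single delimited string before writing. Below is a minimal sketch using Spark SQL's `concat_ws`, assuming a DataFrame read with the array schema above. The helper name `write_array_as_csv` and the `;` delimiter are illustrative choices.

```python
def write_array_as_csv(df, path, sep=";"):
    """Join the `children` array into one string column and write CSV."""
    # concat_ws accepts an array<string> argument and joins its elements
    # with the separator, so the resulting column is a plain string.
    flatDF = df.selectExpr(
        "id", "father", "mother",
        f"concat_ws('{sep}', children) AS children",
    )
    (flatDF.write
        .option("header", "true")  # Keep the column names in the output
        .mode("overwrite")         # Replace existing files
        .csv(path)                 # Write DataFrame to CSV files
    )
    return flatDF
```

The separator should be a character that does not occur in the data, so that the string can be split back into its elements when the CSV is read again.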