Read data in JSON format

  1. ตรวจสอบไฟล์ข้อมูล
  2. Reading from JSON – InferSchema
  3. Reading from JSON – User-Defined Schema

1. ตรวจสอบไฟล์ข้อมูล

We can use %fs ls … to view the file on the DBFS.

%fs ls dbfs:/mnt/training/sample.json

We can use %fs head … to peek at the first couple thousand characters of the file.

%fs head /mnt/training/sample.json
{"id":1,"father":"Mark","mother":"Charlotte","children":["Tom"]}
{"id":2,"father":"John","mother":"Ann","children":["Jessika","Antony","Jack"]}
{"id":3,"father":"Bob","mother":"Monika","children":["Jerry","Karol"]}

และ

%fs head /mnt/training/sample2.json
{"id":1,"father":"Mark","mother":"Charlotte","children":{"first":"Tom1"}}
{"id":2,"father":"John","mother":"Ann","children":{"first":"Jack1","second":"Jack2"}}
{"id":3,"father":"Bob","mother":"Monika","children":{"first":"Karol1","second":"Karol2","third":"Karol3"}}
CREATE OR REPLACE TEMP VIEW tmp_json 
AS 
SELECT * 
FROM   JSON.`path/to/file.json` 

2. Reading from JSON – InferSchema

JSON Lines

  • That there is one JSON object per line and…
  • That it’s delineated by a new-line.

This format is referred to as JSON Lines or newline-delimited JSON. More information about this format can be found at http://jsonlines.org.

** Note: ** Spark 2.2 was released on July 11th 2016. With that comes File IO improvements for CSV & JSON, but more importantly, Support for parsing multi-line JSON and CSV files. You can read more about that (and other features in Spark 2.2) in the Databricks Blog.

Read The JSON File ตัวอย่าง 1

The command to read in JSON looks very similar to that of CSV.

jsonFile = "/mnt/training/sample.json"
tempDF = (spark.read           # The DataFrameReader
   .option("inferSchema", "true")   # Automatically infer data types & column names
   .json(jsonFile)              # Creates a DataFrame from CSV after reading in the file
)
tempDF.printSchema()
root
 |-- children: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- father: string (nullable = true)
 |-- id: long (nullable = true)
 |-- mother: string (nullable = true)

With our DataFrame created, we can now take a peak at the data. But to demonstrate a unique aspect of JSON data (or any data with embedded fields), we will first create a temporary view and then view the data via SQL:

# create a view called temp_view
tempDF.createOrReplaceTempView("temp_view")

And now we can take a peak at the data with simple SQL SELECT statement:

%sql
SELECT * FROM temp_view
%sql
SELECT id, father, mother, children
FROM temp_view

Read The JSON File ตัวอย่าง 2

jsonFile2 = "/mnt/training/sample2.json"

tempDF2 = (spark.read           # The DataFrameReader
   .option("inferSchema", "true")   # Automatically infer data types & column names
   .json(jsonFile2)              # Creates a DataFrame from CSV after reading in the file
)
tempDF2.printSchema()
root
 |-- children: struct (nullable = true)
 |    |-- first: string (nullable = true)
 |    |-- second: string (nullable = true)
 |    |-- third: string (nullable = true)
 |-- father: string (nullable = true)
 |-- id: long (nullable = true)
 |-- mother: string (nullable = true)
# create a view called temp_view2
tempDF2.createOrReplaceTempView("temp_view2")
%sql
SELECT * FROM temp_view2
%sql
SELECT id, children.first, children.second, children.third 
FROM temp_view2 
WHERE children.second IS NOT NULL

Review: Reading from JSON – InferSchema

  • We only need one job even when inferring the schema.
  • There is no header which is why there isn’t a second job in this case – the column names are extracted from the JSON object’s attributes.

3. Reading from JSON – User-Defined Schema

To avoid the extra job, we can (just like we did with CSV) specify the schema for the DataFrame.

 ตัวอย่าง 1 Step #1 – Create the Schema

Note that we can support complex data types as seen in the field children.

# Required for StructField, StringType, IntegerType, etc.
from pyspark.sql.types import *

jsonSchema = StructType([
  StructField("id", LongType(), True),
  StructField("father", StringType(), True),
  StructField("mother", StringType(), True),
  StructField("children", ArrayType(StringType()), True)
])

For a small file, manually creating the the schema may not be worth the effort. However, for a large file, the time to manually create the schema may be worth the trade off of a really long infer-schema process.

 ตัวอย่าง 1 Step #2 – Read in the JSON

Next, we will read in the JSON file and print its schema.

(spark.read             # The DataFrameReader
  .schema(jsonSchema)   # Use the specified schema
  .json(jsonFile)       # Creates a DataFrame from JSON after reading in the file
  .printSchema()
)
root
 |-- id: long (nullable = true)
 |-- father: string (nullable = true)
 |-- mother: string (nullable = true)
 |-- children: array (nullable = true)
 |    |-- element: string (containsNull = true)

 ตัวอย่าง 2 Step #1 – Create the Schema

# Required for StructField, StringType, IntegerType, etc.
from pyspark.sql.types import *

jsonSchema2 = StructType([
  StructField("id", LongType(), True),
  StructField("father", StringType(), True),
  StructField("mother", StringType(), True),
  StructField("children", StructType([
    StructField("first", StringType(), True),
    StructField("second", StringType(), True),
    StructField("third", StringType(), True)
  ]), True),
])

 ตัวอย่าง 2 Step #2 – Read in the JSON

(spark.read             # The DataFrameReader
  .schema(jsonSchema2)  # Use the specified schema
  .json(jsonFile2)      # Creates a DataFrame from JSON after reading in the file
  .printSchema()
)
root
 |-- id: long (nullable = true)
 |-- father: string (nullable = true)
 |-- mother: string (nullable = true)
 |-- children: struct (nullable = true)
 |    |-- first: string (nullable = true)
 |    |-- second: string (nullable = true)
 |    |-- third: string (nullable = true)

Review: Reading from JSON – User-Defined Schema

  • Just like CSV, providing the schema avoids the extra jobs.
  • The schema allows us to rename columns and specify alternate data types.

คำสั่ง display()

tempDF2 = (spark.read
  .schema(jsonSchema2)
  .json(jsonFile2)
)
display(tempDF2)