
Home / Cheatsheets / Spark


Client / Cluster Mode

Cluster mode runs the driver inside the cluster, in an Application Master container. Client mode runs the driver locally on the machine that submitted the application.

Data Frame API Categories

Transformations

Narrow Dependency

Performed in parallel on partitions, with no data movement between them, e.g. select(), filter(), drop(), withColumn()

Wide Dependency

Performed after grouping data from multiple partitions, which requires a shuffle, e.g. groupBy(), join(), cube(), rollup(), agg(), repartition()

Actions

Trigger a job, e.g. read(), write(), collect(), take(), count()

Job Execution Plans

The logical execution plan is broken into stages at wide dependencies; stage boundaries are where shuffle/sort operations occur.

If an executor has multiple cores, each core provides a slot. Each slot can process one partition at a time.

Memory Allocation

Spark Driver

spark.driver.memory - this is the memory allocated to the Spark driver.

spark.driver.memoryOverhead - memory for the container process and other non-JVM processes in the driver container. By default it is derived from a percentage of the driver memory, with a 384MB floor: max(spark.driver.memoryOverheadFactor * spark.driver.memory, 384MB), where spark.driver.memoryOverheadFactor defaults to 0.10.
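The overhead calculation can be sketched numerically, assuming the default 0.10 overhead factor and a heap size given in MB (the function name is illustrative):

```python
def driver_memory_overhead_mb(driver_memory_mb, overhead_factor=0.10):
    """Overhead is the larger of factor * heap and the 384 MB floor."""
    return max(overhead_factor * driver_memory_mb, 384)

print(driver_memory_overhead_mb(1024))  # 384 - the floor applies (10% would be 102.4)
print(driver_memory_overhead_mb(8192))  # 819.2 - 10% of the heap exceeds the floor
```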

PySpark

# Dataframe to a list of JSON strings
jsonStrings = df.toJSON().collect()

# Dataframe to a list of dicts (avoid shadowing the built-in name "list")
rows = [row.asDict() for row in df.collect()]
for row in rows:
    print(row)

# Create dataframe from a list - a schema needs to be defined
df = spark.createDataFrame(rows, schema)

# Get the schema definition for a dataframe
schema = df.schema
print(schema)

This page was generated by GitHub Pages. Page last modified: 24/05/20 10:14