
When we broadcast a dataframe (say, in a broadcast join), Spark sends a copy of the dataframe to all the executors, i.e. each executor (not node, but executor) holds one copy of the data. So if a node has 3 executors, it unnecessarily holds 3 copies of the same data.
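For concreteness, a minimal PySpark sketch of the kind of broadcast join I mean (the table and column names are made up):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast, col

    spark = SparkSession.builder.appName("broadcast-join-sketch").getOrCreate()

    # Hypothetical tables: a large fact table and a small dimension table.
    orders = spark.range(10_000_000).withColumn("category_id", col("id") % 100)
    categories = spark.createDataFrame(
        [(i, f"cat_{i}") for i in range(100)], ["category_id", "name"]
    )

    # broadcast() ships a full copy of `categories` to every executor,
    # so each executor on a node ends up holding its own copy.
    joined = orders.join(broadcast(categories), "category_id")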

My question is, why can't it use off-heap space? Say I give it sufficient off-heap memory: why can't it store this data there, just one copy per node, so that all executors on that node can read from this off-heap space?
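For reference, the off-heap settings I mean are the standard per-executor Spark configs (the size below is just an example):

    from pyspark.sql import SparkSession

    # Enables Spark's off-heap memory region; the size is an arbitrary example.
    spark = (
        SparkSession.builder
        .config("spark.memory.offHeap.enabled", "true")
        .config("spark.memory.offHeap.size", "4g")
        .getOrCreate()
    )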

1 Comment

Spark does not store broadcast data in off-heap memory shared across executors because of its design principles: executor isolation for fault tolerance and simplicity, avoidance of synchronization overhead and performance bottlenecks, and a distributed architecture that minimizes inter-executor dependencies. To optimize broadcast behavior, reduce the number of executors per node (a sketch follows below), or consider alternatives like custom shared-memory solutions or distributed file systems.
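As a sketch of the "fewer executors per node" suggestion (the numbers are arbitrary examples), fewer but larger executors mean fewer duplicate broadcast copies per node:

    from pyspark.sql import SparkSession

    # One large executor per node instead of, say, four small ones:
    # the node then holds one broadcast copy instead of four.
    spark = (
        SparkSession.builder
        .config("spark.executor.cores", "15")
        .config("spark.executor.memory", "50g")
        .getOrCreate()
    )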

1 Answer

That's a good question. Let me first clarify a few things:

  • Executors are independent JVM processes that run tasks in isolation so they cannot interfere with each other's tasks. Sharing data between them would require inter-process communication (IPC), which adds complexity and creates bottlenecks.
  • When Spark broadcasts data, each executor holds its own copy, so if one executor corrupts its copy, the damage is contained to that executor. With a single shared copy, corrupted data would affect every executor on the node.
  • Spark is built around fault tolerance: if a task fails, Spark re-executes it on another executor with available resources. Keeping a copy of the broadcast data on each executor simplifies that re-execution (see the sketch after this list).
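As a quick sketch of these points, a broadcast variable is fetched and deserialized once per executor, and every task on that executor reads the same local copy via .value, with no inter-process communication:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("broadcast-sketch").getOrCreate()
    sc = spark.sparkContext

    # Each executor fetches and caches its own deserialized copy of the
    # broadcast value; tasks read it locally, and a failed task can be
    # re-run on any executor because each one has (or can fetch) a copy.
    lookup = sc.broadcast({"a": 1, "b": 2})

    rdd = sc.parallelize(["a", "b", "a", "b"])
    result = rdd.map(lambda k: lookup.value[k]).collect()  # [1, 2, 1, 2]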

Why can't it use off-heap space?

  • Performance: accessing a shared off-heap region could be slower than reading a local copy, because of IPC and synchronization overhead.
  • Lack of built-in mechanisms: Spark's core design treats the executor as the primary unit of memory management; sharing data directly between executors on the same node would require significant architectural changes to Spark's memory-management system.
  • Compatibility issues with different cluster managers: introducing node-level shared memory would require modifications and support from each of them (YARN, Kubernetes, standalone, etc.).

Alternatives:

  • Distributed file systems, like HDFS.
  • Distributed caching, like Redis or Memcached.
  • Custom shared memory, e.g. node-local files that every executor reads (see the sketch after this list).
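As one sketch of the last alternative (the path, file format, and helper names here are hypothetical), a node-level lookup file can be loaded lazily and cached at module level, so each Python worker process reads it at most once instead of receiving a broadcast copy:

    import json

    _lookup_cache = None  # cached once per worker process

    def _get_lookup(path):
        # Hypothetical loader: reads a node-local (or network-mounted)
        # file the first time it is needed, then reuses the cached copy.
        global _lookup_cache
        if _lookup_cache is None:
            with open(path) as f:
                _lookup_cache = json.load(f)
        return _lookup_cache

    def enrich_partition(keys):
        lookup = _get_lookup("/mnt/shared/lookup.json")  # hypothetical path
        for key in keys:
            yield key, lookup.get(key)

    # Usage sketch: rdd.mapPartitions(enrich_partition)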

Resources used:

1. https://blog.devgenius.io/spark-on-heap-and-off-heap-memory-27b625af778b
2. https://spark.apache.org/docs/latest/tuning.html


1 Comment

Thanks so much for such a detailed explanation. One quick question, though: it's mentioned that Spark uses off-heap memory when we cache a DataFrame. Won't that cause all the problems you've described above as well? Why is caching designed to use off-heap memory while broadcasting is not?
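(For reference, a minimal sketch of off-heap caching. StorageLevel.OFF_HEAP requires the spark.memory.offHeap.* configs to be enabled, and the off-heap region is still allocated inside each executor process, so cached blocks remain per-executor rather than node-shared.)

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .config("spark.memory.offHeap.enabled", "true")
        .config("spark.memory.offHeap.size", "2g")
        .getOrCreate()
    )

    df = spark.range(1_000_000)
    df.persist(StorageLevel.OFF_HEAP)  # stored in each executor's off-heap region
    df.count()  # materialize the cache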
