
I have a DataFrame on Databricks that I would like to use the RDD API on. After reading from the catalog, its type is pyspark.sql.connect.dataframe.DataFrame, which I found out is associated with Spark Connect. The Spark Connect documentation says,

In Spark 3.4, Spark Connect supports most PySpark APIs, including DataFrame, Functions, and Column. However, some APIs such as SparkContext and RDD are not supported.

Is there any way to get around this?
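For context, here is roughly how I read the table and observe the type (the table name is a placeholder):

    # Placeholder three-level catalog name; on a shared-access cluster the
    # session is a Spark Connect session, so reads return a Connect DataFrame.
    df = spark.read.table("main.default.my_table")
    print(type(df))
    # <class 'pyspark.sql.connect.dataframe.DataFrame'>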

  • RDDs are low-level APIs that aren't supported on shared compute clusters. It's time to move to the DataFrame API, which is effective and works everywhere. What is it you are trying to do with RDDs? Please share it here. Commented Sep 27, 2024 at 13:08

1 Answer


I was having a similar issue with the RDD API on Databricks. Since you did not share more details about your issue, here is how I fixed the error I was seeing:

[NOT_IMPLEMENTED] rdd is not implemented.
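For reference, any call that goes through the RDD API triggers it; a minimal sketch:

    rows = df.rdd.collect()
    # PySparkNotImplementedError: [NOT_IMPLEMENTED] rdd is not implemented.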

Possible alternatives to fix this issue:

  • Change the cluster access mode from Shared to Single user
  • Upgrade or downgrade the Databricks Runtime (DBR) version; version 15.5 may work
  • Set the cluster configurations:
    • spark.databricks.pyspark.enablePy4JSecurity false
    • spark.databricks.pyspark.trustedFilesystems org.apache.spark.api.java.JavaRDD
  • Use a DataFrame API alternative (see the sketch below)
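For the last option, here is a minimal sketch of replacing a simple RDD map with its DataFrame equivalent; the column name value is an assumption, so adapt it to your schema:

    from pyspark.sql import functions as F

    # RDD style -- raises [NOT_IMPLEMENTED] under Spark Connect:
    # squared = df.rdd.map(lambda row: row.value ** 2).collect()

    # DataFrame style -- works under Spark Connect:
    squared = df.select((F.col("value") ** 2).alias("squared")).collect()

    # For arbitrary row-level Python logic, mapInPandas is supported by
    # Spark Connect and processes batches of pandas DataFrames:
    def square_batches(batches):
        for pdf in batches:
            pdf["value"] = pdf["value"] ** 2
            yield pdf

    df2 = df.mapInPandas(square_batches, schema=df.schema)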
