0 votes
0 answers
15 views

I need to join two RDDs as part of my programming assignment. The problem is that the first RDD is nested while the other is flat. I tried different things but nothing seems to work. Is there any expert on pyspark ...
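For the nested-versus-flat join above, a minimal sketch assuming the nested RDD holds (key, list-of-values) records and the flat RDD holds plain (key, value) pairs (both shapes are guesses, since the question is truncated): flatten the nested side first so the two RDDs share the same pair shape, then join on the key.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

nested = sc.parallelize([("a", [1, 2, 3]), ("b", [4, 5])])   # hypothetical shape
flat = sc.parallelize([("a", "x"), ("b", "y")])

# Flatten the nested side into (key, element) pairs so both RDDs have the
# same pair shape, then join on the key.
flattened = nested.flatMapValues(lambda xs: xs)   # ("a", 1), ("a", 2), ...
joined = flattened.join(flat)                     # ("a", (1, "x")), ...
print(joined.collect())
```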
1 vote
1 answer
111 views

Here is a minimal example using default data in Databricks (Spark 3.4): import org.apache.spark.sql.functions.col import org.apache.spark.sql.{Row, SparkSession} import org.apache.spark.sql.types._ sc....
Igor Railean
0 votes
2 answers
140 views

I'm working with PySpark to process large amounts of data. However, I noticed that the function called by mapPartitions is executed one more time than expected. For instance, in the following code ...
sebenitezg
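One common reason for the mapPartitions behaviour described above, sketched under the assumption that more than one action (or schema inference) is triggered on the same lineage: every action recomputes the RDD unless it is persisted, so the partition function runs again.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def process(partition):
    print("partition processed")      # appears once per execution in executor logs
    for value in partition:
        yield value * 2

rdd = sc.parallelize(range(8), 2).mapPartitions(process)

rdd.count()      # runs process() once per partition
rdd.collect()    # runs it again: the whole lineage is recomputed

cached = rdd.cache()
cached.count()   # computes and materialises the partitions
cached.collect() # now served from the cache, process() is not re-run
```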
0 votes
1 answer
30 views

I have RDD1 (col1, col2): (A, x123), (B, y123), (C, z123) and RDD2 (col1): A, C. I want to run an intersection of the two RDDs and find the common elements, i.e. for the items that are in RDD2, what is the data of ...
Sachin Shrivastava
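A minimal sketch for the RDD intersection question above, assuming RDD1 holds (col1, col2) pairs and RDD2 holds bare col1 keys as shown: turn RDD2 into a pair RDD and join on the key.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd1 = sc.parallelize([("A", "x123"), ("B", "y123"), ("C", "z123")])
rdd2 = sc.parallelize(["A", "C"])

# Give RDD2 a dummy value so both sides are (key, value) pairs, then join.
keyed = rdd2.map(lambda k: (k, None))
common = rdd1.join(keyed).mapValues(lambda v: v[0])   # keep only col2
print(common.collect())   # [('A', 'x123'), ('C', 'z123')] (order may vary)
```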
0 votes
1 answer
4k views

I have a dataframe on Databricks on which I would like to use the RDD API. The type of the dataframe is pyspark.sql.connect.dataframe.Dataframe after reading from the catalog. I found out that this ...
imawful
  • 135
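The pyspark.sql.connect.dataframe type in the question above means the session runs over Spark Connect, where the RDD API is not exposed. A minimal sketch of one common substitute for per-partition logic, mapInPandas; whether it fits the asker's actual need is an assumption.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

def per_batch(batches):
    for pdf in batches:              # each `pdf` is a pandas DataFrame chunk
        pdf["val"] = pdf["val"].str.upper()
        yield pdf

result = df.mapInPandas(per_batch, schema=df.schema)
result.show()
```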
0 votes
1 answer
67 views

The resources for this are scarce and I'm not sure that there's a solution to this issue. Suppose you have 3 simple RDDs, or more specifically 3 PairRDDs: val rdd1: RDD[(Int, Int)] = sc.parallelize(...
Nizar
  • 763
0 votes
0 answers
156 views

While using the following code: import pyspark from pyspark import SparkContext from pyspark.sql import SQLContext from pyspark.sql import SparkSession from pyspark.sql.types import Row from datetime ...
aemilius89
-1 votes
1 answer
350 views

I was using the code below in an Azure Databricks notebook before enabling a Unity Catalog cluster, but after changing to a shared-access users-enabled cluster I am not able to use this logic. How should we achieve ...
Developer Rajinikanth
1 vote
1 answer
54 views

I see that dataframe.agg(avg(Col)) works fine, but when I calculate avg() over a window spanning the whole column (not using any partition), I see different results depending on which column I use with orderBy. ...
anurag86
  • 1,707
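For the windowed avg() question above, the differing results usually come from the default frame: with an orderBy and no explicit frame, Spark uses RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, i.e. a running average. A minimal sketch with hypothetical data:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import avg, col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 10.0), (2, 20.0), (3, 30.0)], ["id", "x"])

running = Window.orderBy("id")   # default frame: unbounded preceding .. current row
whole = Window.orderBy("id").rowsBetween(
    Window.unboundedPreceding, Window.unboundedFollowing)

df.select(
    col("id"),
    avg("x").over(running).alias("running_avg"),   # 10.0, 15.0, 20.0
    avg("x").over(whole).alias("overall_avg"),     # 20.0 for every row
).show()
```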
3 votes
1 answer
89 views

I have code like the snippet below, which uses pyspark. test_truth_value = RDD. test_predictor_rdd = RDD. valuesAndPred = test_truth_value.zip(lasso_model.predict(test_predictor_rdd)).map(lambda x: ((x[0]), (x[...
Inkyu Kim
  • 175
1 vote
1 answer
61 views

I trained tf-idf on a pre-tokenized (unigram tokenizer) dataset that I converted from list[list(token1, token2, token3, ...)] to an RDD using pyspark's HashingTF and IDF implementations. I tried to ...
Caden
  • 65
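For the tf-idf question above, a minimal sketch of the RDD-based pyspark.mllib flow, assuming `docs` stands in for the pre-tokenised list-of-token-lists (hypothetical data):

```python
from pyspark.sql import SparkSession
from pyspark.mllib.feature import HashingTF, IDF

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

docs = sc.parallelize([["hello", "world"], ["hello", "spark"]])

hashing_tf = HashingTF(numFeatures=1 << 18)
tf = hashing_tf.transform(docs)        # RDD of sparse term-frequency vectors

tf.cache()                             # fit and transform each pass over tf
idf = IDF(minDocFreq=1).fit(tf)
tfidf = idf.transform(tf)              # RDD of tf-idf weighted vectors
print(tfidf.take(2))
```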
1 vote
1 answer
622 views

I want to apply a schema to specific non-technical columns of a Spark DataFrame. Beforehand, I add an artificial ID using Window and row_number so that I can later join some other technical columns to ...
stats_guy
  • 717
0 votes
0 answers
44 views

I need to solve a problem where a company wants to offer k different users free use (a kind of coupon) of their application for two months. The goal is to identify users who are likely to churn (leave ...
Yoel Ha
0 votes
1 answer
258 views

I have a PySpark DataFrame which needs ordering on a column ("Reference"). The values in the column typically look like: ["AA.1234.56", "AA.1101.88", "AA.904.33"...
pymat
  • 1,192
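For the "Reference" ordering question above, plain lexicographic sorting puts "AA.904.33" after "AA.1234.56"; a minimal sketch of a numeric sort, assuming the column always has the AA.&lt;number&gt;.&lt;number&gt; shape:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("AA.1234.56",), ("AA.1101.88",), ("AA.904.33",)], ["Reference"])

parts = split(col("Reference"), r"\.")
ordered = df.orderBy(
    parts.getItem(1).cast("int"),   # middle segment as a number
    parts.getItem(2).cast("int"),   # last segment as a number
)
ordered.show()   # AA.904.33, AA.1101.88, AA.1234.56
```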
-1 votes
1 answer
62 views

When trying to map our 6-column pyspark RDD into a 4-tuple, we get a list index out of range error for any list element besides 0, which returns the normal result. The dataset is structured like this: X,Y,FID,...
Toxicone 7
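For the "list index out of range" question above, the usual culprit is rows that do not split into 6 fields (a header or blank line); a minimal sketch with hypothetical field values:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["X,Y,FID,c4,c5,c6", "1.0,2.0,10,a,b,c", ""])

rows = (lines.map(lambda line: line.split(","))
             .filter(lambda f: len(f) == 6)      # drops blank/short rows
             .filter(lambda f: f[0] != "X"))     # drops the header row
tuples = rows.map(lambda f: (f[0], f[1], f[2], f[3]))   # the 4-tuple
print(tuples.collect())
```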
0 votes
1 answer
97 views

I have around 613 text files stored in Azure Data Lake Gen 2 at this path, e.g. '/rawdata/no=/.txt'. I want to read all the text files and base64-decode them, as they are base64 encoded. But ...
Rushank Patil
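For the base64 question above, a minimal sketch using a wildcard read and the built-in unbase64 function (the path pattern and column handling are assumptions):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import unbase64, col

spark = SparkSession.builder.getOrCreate()

# A wildcard path lets a single read pick up all of the text files.
df = spark.read.text("/rawdata/*/*.txt")                  # one "value" column
decoded = df.select(unbase64(col("value")).cast("string").alias("decoded"))
decoded.show(truncate=False)
```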
1 vote
1 answer
47 views

I have a 2MB file. When I read it using df = spark.read.option("inferSchema", "true").csv("hdfs:///data/ml-100k/u.data", sep="\t") and then df.rdd.getNumPartitions() # ...
Youssef Alaa Etman
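For the partition-count question above, a sketch of the usual explanation: for file sources the number of partitions is driven mainly by the file size versus spark.sql.files.maxPartitionBytes (128 MB by default), so a 2 MB file typically lands in a single partition unless it is repartitioned.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read
      .option("inferSchema", "true")
      .csv("hdfs:///data/ml-100k/u.data", sep="\t"))

print(df.rdd.getNumPartitions())                   # usually 1 for a 2 MB file
print(df.repartition(8).rdd.getNumPartitions())    # 8 after an explicit repartition
```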
0 votes
1 answer
120 views

I have attempted to run the code below: data(house) house_rdd = rdd_data(x=x, y=y, data=house, cutpoint=0) summary(house_rdd) plot(house_rdd) When I plot it, I get this, which makes sense. ...
Andy_H
  • 3
1 vote
0 answers
57 views

I am trying to create a Spark program to read the airport data from the airport.text file, find all the airports that are located in the United States, and output the airport's name and city's name to an ...
Uche Kalu
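For the airports question above, a minimal sketch assuming a comma-separated airport.text with the airport name, city and country in fields 1-3 (the column positions and quoting are guesses):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def parse(line):
    fields = line.split(",")
    return fields[1], fields[2], fields[3]          # name, city, country (assumed)

airports = sc.textFile("airport.text").map(parse)
us_only = airports.filter(lambda a: a[2].strip('"') == "United States")
us_only.map(lambda a: f"{a[0]}, {a[1]}").saveAsTextFile("airports_in_usa_output")
```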
1 vote
0 answers
110 views

I want to use the user-defined function (UDF) below for an RDD in Pyspark: from pyspark.sql.functions import abs as pyspark_abs, sum as pyspark_sum, \ rand as pyspark_rand, min as ...
Kunagisa Tomo
1 vote
0 answers
90 views

Hi, I'm new to Spark and I'm trying to access files from HDFS in spark-shell. I wrote the following code in spark-shell: scala> val myfile = sc.textFile("hdfs://localhost:9870/sample/demofile....
Ashok Kumar
0 votes
0 answers
97 views

I'm connecting to a SQL Server view using the JDBC connector from a Databricks notebook: df = spark.read .format("com.databricks.spark.sqldw") .option("url", url) ...
Madushan
3 votes
1 answer
3k views

If Spark is designed to "spill to disk" when there is not enough memory, then I am wondering how it is even possible to get an OOM in Spark? There is a spark.local.dir which is used as a ...
ng.newbie
  • 3,332
0 votes
1 answer
58 views

I am trying to replace 'yyyy-MM' with 'yyyy-MM' + '-01'. Below is my code, and I am not getting it right. Note that I am working on Databricks: from pyspark.sql.functions import col, concat, lit, when # Show ...
Franklyn
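For the 'yyyy-MM' question above, a minimal sketch assuming a string column holding 'yyyy-MM' values (the column name "month" is hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat, lit, to_date

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2023-01",), ("2023-11",)], ["month"])

df = df.withColumn("month_start", concat(col("month"), lit("-01")))
# Optionally turn the string into a real date column:
df = df.withColumn("month_start", to_date(col("month_start"), "yyyy-MM-dd"))
df.show()
```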
2 votes
2 answers
243 views

I have an RDD containing pairs of nodes and I need to assign unique IDs to them. But I'm getting an NPE and I can't figure out how to solve it. I'm basically putting all nodes into a distinct list, ...
user1905910
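For the unique-node-ID question above, a minimal sketch that sidesteps hand-rolled list lookups (a frequent source of NPEs) by letting zipWithIndex() assign the IDs; the edge data is hypothetical, and collecting the ID map to the driver assumes the node set is small:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

edges = sc.parallelize([("a", "b"), ("b", "c"), ("a", "c")])

node_ids = (edges.flatMap(lambda e: e)   # every node that appears
                 .distinct()
                 .zipWithIndex())        # (node, unique_id)

id_map = dict(node_ids.collect())        # small enough to hold on the driver
edges_as_ids = edges.map(lambda e: (id_map[e[0]], id_map[e[1]]))
print(edges_as_ids.collect())
```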
-1 votes
1 answer
34 views

I am new to Spark and I am trying to get an intuition for how RDDs are represented in memory. HDFS RDDs are easy as the partitions are handled by the filesystem itself. That is, HDFS itself divides a ...
ng.newbie
  • 3,332
0 votes
1 answer
92 views

I'm new to spark and trying to understand how functions like reduce, aggregate etc. work. While going through RDD.aggregate(), I tried changing the zeroValue to something other than identities (0 for ...
Jatin Rathour
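For the aggregate() question above, the key point is that zeroValue is not applied just once: it seeds the seqOp in every partition and the final combOp as well. A minimal sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3, 4], numSlices=2)

seq = lambda acc, x: acc + x     # runs within each partition
comb = lambda a, b: a + b        # merges partition results

print(rdd.aggregate(0, seq, comb))   # 10: a zero of 0 changes nothing
print(rdd.aggregate(5, seq, comb))   # 25: 10 + 5 per partition (x2) + 5 for the merge
```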
0 votes
1 answer
2k views

My objective is to fetch a column's values into a variable, if possible as a list, from a pyspark dataframe. Expected output = ["a", "b", "c", ... ] I tried : [ col....
300
  • 323
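For the column-to-list question above, a minimal sketch with a hypothetical column name; both variants pull the values to the driver, so they only suit reasonably small results:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("c",)], ["letter"])

values = [row["letter"] for row in df.select("letter").collect()]
# or via the RDD API:
values_rdd = df.select("letter").rdd.flatMap(lambda row: row).collect()
print(values, values_rdd)   # ['a', 'b', 'c'] ['a', 'b', 'c']
```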
1 vote
2 answers
48 views

I want to find out the pairs who have contacted one another. Following is the data. Input is: K -> M, H; M -> K, E; H -> F; B -> T, H; E -> K, H; F -> K, H, E; A -> Z. And the ...
Salman Rasheed
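For the mutual-contacts question above, a minimal sketch: emit each directed contact as a sorted pair, then keep the pairs seen from both directions (assumes no duplicate entries within a contact list):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

contacts = sc.parallelize([
    ("K", ["M", "H"]), ("M", ["K", "E"]), ("H", ["F"]), ("B", ["T", "H"]),
    ("E", ["K", "H"]), ("F", ["K", "H", "E"]), ("A", ["Z"]),
])

directed = contacts.flatMap(
    lambda kv: [(tuple(sorted((kv[0], c))), 1) for c in kv[1]])
mutual = (directed.reduceByKey(lambda a, b: a + b)
                  .filter(lambda kv: kv[1] >= 2)
                  .keys())
print(mutual.collect())   # [('F', 'H'), ('K', 'M')] (order may vary)
```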
0 votes
1 answer
188 views

I am working on web archives and extracting some data. Initially I used to store this data as txt in my HDFS, but due to its massive size I will have to store the output in Amazon S3 buckets. How can I ...
Kshitij Pandit
0 votes
1 answer
78 views

I am banging my head trying to convert the following Spark RDD data using code: [('4', ('1', '2')), ('10', ('5',)), ('3', ('2',)), ('6', ('2', '5')), ('7', ('2', '5')), ('1', None), ('8', ('2', '5')), ('9', ('...
Ram
  • 870
1 vote
2 answers
503 views

I have the following data in a pyspark dataframe where both the columns contain string data. data = [(123, '[{"FLD_NAME":"A","FLD_VAL":"0.1"},{"FLD_NAME&...
dontgimmehope
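For the JSON-string column above, a minimal sketch with from_json, assuming the column holds an array of {FLD_NAME, FLD_VAL} objects (the excerpt is truncated, so the sample payload is a guess):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()
data = [(123, '[{"FLD_NAME":"A","FLD_VAL":"0.1"},{"FLD_NAME":"B","FLD_VAL":"0.2"}]')]
df = spark.createDataFrame(data, ["id", "payload"])

schema = ArrayType(StructType([
    StructField("FLD_NAME", StringType()),
    StructField("FLD_VAL", StringType()),
]))
parsed = df.withColumn("fields", from_json(col("payload"), schema))
parsed.selectExpr("id", "inline(fields)").show()   # one row per FLD_NAME/FLD_VAL pair
```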
1 vote
1 answer
54 views

So I have two columns | col_arr | col_ind | |[1, 2, 3]| [0, 2] | |[5, 1] | [1] | and I'd like my result to be an extraction of the values in col_arr by col_ind resulting in col_val below: | ...
Gal Novich
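For the col_arr/col_ind extraction above, a minimal sketch using the higher-order transform() function; array indexing with [] is 0-based, which matches the question's example:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [([1, 2, 3], [0, 2]), ([5, 1], [1])], ["col_arr", "col_ind"])

df = df.withColumn("col_val", expr("transform(col_ind, i -> col_arr[i])"))
df.show()   # col_val: [1, 3] and [1]
```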
0 votes
0 answers
33 views

I have a sorted RDD, I have already applied some filters to it before. It’s not a key value pair. I want to remove rows of the RDD. Given two consecutive rows, I would like to remove the second if ...
deb
  • 425
0 votes
2 answers
165 views

I'm studying Apache Spark and found something interesting. When I create a new RDD with key-value pairs, where the key is randomly chosen from a tuple, the result of reduceByKey is not correct. from ...
AlexForExample
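For the random-key reduceByKey question above, the usual explanation is laziness: the random key is drawn again every time the lineage is recomputed, so different actions can see different keys. A minimal sketch:

```python
import random

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

base = sc.parallelize(range(10))
keyed = base.map(lambda x: (random.choice(("a", "b")), x))

keyed.cache()     # without this, every action re-runs random.choice()
print(keyed.reduceByKey(lambda a, b: a + b).collect())
print(keyed.count())   # consistent with the cached, now-stable keys
```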
0 votes
2 answers
11k views

I have a question: wide and narrow transformations in Python Spark are found in both RDDs and the Structured APIs, right? I mean, I guess I got the difference between wide and narrow transformations. My ...
Claudia Martins
1 vote
1 answer
495 views

We have a workload that computes on Spark cluster workers (CPU intensive). The results are pulled back to the driver, which has a large memory allocation, to collect the results via RDD .collect() ...
DaManJ
  • 440
1 vote
0 answers
78 views

I use the HBase export tool to put HBase data onto HDFS: hbase org.apache.hadoop.hbase.mapreduce.Export \ <tablename> <outputdir> I see the data on my HDFS, but when I want to read that data ...
CompEng
  • 7,416
1 vote
1 answer
72 views

I have two massive RDDs, and for each record in one, I need to find the physically closest (lat/lon) point in the other that has the same key. But... in each RDD, there are hundreds of millions of ...
kmh
  • 1,596
2 votes
0 answers
45 views

I am currently running an experiment in PySpark where I compare the overall performance of the wide transformations reduceByKey and groupByKey. I would expect to see a difference in performance (time ...
Sara García
0 votes
1 answer
115 views

Hi, I am trying to understand where the problem is in my code: rdd_avg=sc.parallelize([("a",10),("b",15),("c",20),("a",8)]) rdd_sp1=rdd_avg.map(lambda x: (x,1)) ...
Dwaipayan Sarkar
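For the per-key average question above, a minimal sketch of the usual pattern; the common slip is map(lambda x: (x, 1)), which wraps the whole (key, value) tuple as the key, whereas mapValues keeps the key and pairs the value with a count:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

rdd_avg = sc.parallelize([("a", 10), ("b", 15), ("c", 20), ("a", 8)])

sums = (rdd_avg.mapValues(lambda v: (v, 1))
               .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])))
avgs = sums.mapValues(lambda s: s[0] / s[1])
print(avgs.collect())   # [('a', 9.0), ('b', 15.0), ('c', 20.0)]
```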
1 vote
0 answers
291 views

I am running a structural topic model using the stm package in R. My model includes an interaction effect between faction_id and numeric_date (a measure of time). I am using the following code to ...
Elisa Benni
2 votes
1 answer
767 views

I have data consisting of a date-time, IDs, and velocity, and I'm hoping to get histogram data (start/end points and counts) of velocity for each ID using PySpark. Sample data: df = spark....
CopyOfA
  • 931
0 votes
1 answer
39 views

I'm new to Spark. I encountered some problems with the part about saving a df to a Hive table. def insert_into_hive_table(df: DataFrame, table_name: str): # for debugging - this action is working ...
amit
  • 81
1 vote
0 answers
183 views

I want to remove the first row of a dataset but this error keeps occurring: 'JavaPackage' object is not callable. I've written another simple program for easier review. Here is my code: # Create a ...
Amirhosein Somi
0 votes
1 answer
135 views

I'm running a python notebook in Azure Databricks. I am getting an IllegalArgumentException error when trying to add a line number with rdd.zipWithIndex(). The file is 2.72 GB and 1238951 lines (I ...
zBomb
  • 361
0 votes
1 answer
30 views

I have a dataframe df1 like below product start end price p1 6/12/2020 6/7/2021 12 p1 6/8/2021 10/19/2021 14 p1 10/20/2021 5/14/2022 13 p1 5/15/2022 11/20/2022 12.5 p1 11/21/2022 1/1/2099 12.5 p2 6/12/...
sri
  • 161
0 votes
1 answer
283 views

I am new to Pyspark and I tried to count the occurrences of a specific string ('Private Room') in a CSV using the RDD count action, but I get a list index out of range error. When I use a take(20) action, I am ...
Hariharan
1 vote
1 answer
130 views

I'm being forced onto a newer EMR version (5.23.1, 5.27.1, or 5.32+) by our cloud team, which is forcing me up from 5.17.0 w/ Spark 2.3.1 to Spark 2.4.x. The impetus is to allow a security ...
kmh
  • 1,596
1 vote
0 answers
317 views

I am trying to add dashed confidence interval lines (upper and lower bounds) using the RDplot command and am unable to do so. RDplot only allows adding a CI as a point estimate or a shade. My ...
ak7880
  • 23
