Spark DataFrame parallelism after filtering
I am currently trying to extract a small subset (100 rows) from a big DataFrame in Spark (5 TB in Parquet format). To do that, I am using the following code:

    df = sqlCtx.read.parquet(directory)
    subset = df.limit(100)
    subset_count = subset.count()
    subset = subset.withColumn("c", subset["a"] + subset["b"])

However, after the limit filtering, Spark operations are really slow and fail with "GC overhead limit exceeded". Looking at the Spark monitoring UI, I can see that subset has more than 25000 partitions (using repartition doesn't seem to be the solution either). So I am thinking that limit/filtering is not a good way to extract a small DataFrame from another one. Any ideas?
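A minimal sketch of one possible workaround, assuming a modern PySpark SparkSession (the question uses the older sqlCtx entry point), the columns a and b from the question, and a placeholder Parquet path: collecting the 100 rows to the driver and rebuilding a small DataFrame cuts the lineage back to the 5 TB source, so later operations no longer drag the 25000-partition parent plan along.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    directory = "/path/to/parquet"  # placeholder for the question's `directory`
    df = spark.read.parquet(directory)

    # Collecting 100 rows to the driver is cheap; recreating the DataFrame
    # from those local rows severs the dependency on the huge parent plan.
    rows = df.limit(100).collect()
    subset = spark.createDataFrame(rows, df.schema)

    subset_count = subset.count()
    subset = subset.withColumn("c", subset["a"] + subset["b"])

The rebuilt subset is backed only by the 100 local rows, so count() and withColumn() no longer touch the original Parquet data at all.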
Spark: Use Temporary Table Twice in Query?
How can I define my ENV variables once in the Dockerfile and pass them down to my Spark image, which is submitted by a supervisord-managed script?
Unbalanced keys lead to performance problems in Spark
How to remove null data from JavaPairRDD
Spark Streaming: how does communication between Spark and Kafka happen?
Error while invoking spark-shell on Windows
Best way to iterate/stream a Spark Dataframe
Is it required for data to be in the Hive metastore to be used from a SQLContext in Spark?
How to modify a Spark Dataframe with a complex nested structure?
Object not serializable error on org.apache.avro.generic.GenericData$Record
How to run Spark Sql on a 10 Node cluster
How to do a group-by range query
Visualising a Matrix
More than one hour to execute pyspark.sql.DataFrame.take(4)
How to map a JavaDStream object into a String? Spark Streaming and model prediction in Java
spark-submit: workers do not get assigned to the master