How to specify partition numbers when writing a dataframe to parquet using PySpark
I want to write a Spark dataframe to parquet, but rather than partitioning by columns with partitionBy, I want to control the number of partitions (or the size of each partition). Is there an easy way to do that in PySpark?
If all you care about is the number of partitions, the method is exactly the same as for any other output format: repartition the DataFrame to the given number of partitions and use the DataFrameWriter afterwards: df.repartition(n).write.parquet(some_path)