How can collect_set find the source? [duplicate]
According to the docs, the collect_set and collect_list functions should be available in Spark SQL. However, I cannot get them to work. I'm running Spark 1.6.0 using a Docker image.

I'm trying to do this in Scala:

    import org.apache.spark.sql.functions._

    df.groupBy("column1")
      .agg(collect_set("column2"))
      .show()

And I receive the following error at runtime:

    Exception in thread "main" org.apache.spark.sql.AnalysisException: undefined function collect_set;

I also tried it using pyspark, but it fails as well. The docs state these functions are aliases of Hive UDAFs, but I can't figure out how to enable them.

How can I fix this? Thanks!
Spark 2.0+:

SPARK-10605 introduced native collect_list and collect_set implementations. A SparkSession with Hive support or a HiveContext is no longer required.

Spark 2.0-SNAPSHOT (before 2016-05-03):

You have to enable Hive support for a given SparkSession.

In Scala:

    val spark = SparkSession.builder
      .master("local")
      .appName("testing")
      .enableHiveSupport()  // <- enable Hive support.
      .getOrCreate()

In Python:

    spark = (SparkSession.builder
        .enableHiveSupport()
        .getOrCreate())

Spark < 2.0:

To be able to use Hive UDFs you have to use a Spark build with Hive support (this is already covered when you use the pre-built binaries, which seems to be the case here) and initialize the SQLContext as a HiveContext.

In Scala:

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.SQLContext

    val sqlContext: SQLContext = new HiveContext(sc)

In Python:

    from pyspark.sql import HiveContext

    sqlContext = HiveContext(sc)
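To make the Spark 2.0+ case concrete, here is a minimal self-contained sketch in Scala showing collect_set working against a plain SparkSession without Hive support. The object name, app name, and the toy data standing in for the asker's column1/column2 DataFrame are illustrative assumptions, not from the original question.

    // Minimal sketch, Spark 2.0+: collect_set is a native function,
    // so no enableHiveSupport() / HiveContext is needed.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.collect_set

    object CollectSetExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder
          .master("local[*]")
          .appName("collect-set-example")
          .getOrCreate()

        import spark.implicits._

        // Toy data standing in for the questioner's df with column1/column2.
        val df = Seq(("a", 1), ("a", 1), ("a", 2), ("b", 3)).toDF("column1", "column2")

        df.groupBy("column1")
          .agg(collect_set("column2"))  // native implementation since SPARK-10605
          .show()

        spark.stop()
      }
    }

Running this should print one row per key with the distinct values collected into an array, e.g. ("a", [1, 2]) and ("b", [3]).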