
Spark: How to group by distinct values in a DataFrame


I have data in a file in the following format:
1,32
1,33
1,44
2,21
2,56
1,23
The code I am executing is the following:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
case class Person(a: Int, b: Int)
val ppl = sc.textFile("newfile.txt").map(_.split(","))
  .map(p => Person(p(0).trim.toInt, p(1).trim.toInt))
  .toDF()
ppl.registerTempTable("people")
val result = ppl.select("a", "b").groupBy('a).agg()
result.show
The expected output is:
1 32, 33, 44, 23
2 21, 56
Instead of aggregating by sum, count, mean, etc., I want every element of each group collected into a single row.
Try the collect_set function inside agg():
import org.apache.spark.sql.functions.{collect_list, collect_set}
val df = sc.parallelize(Seq(
  (1, 3), (1, 6), (1, 5), (2, 1), (2, 4),
  (2, 1))).toDF("a", "b")
df.show()
+---+---+
|  a|  b|
+---+---+
|  1|  3|
|  1|  6|
|  1|  5|
|  2|  1|
|  2|  4|
|  2|  1|
+---+---+
df.groupBy("a").agg(collect_set("b")).show()
+---+--------------+
|  a|collect_set(b)|
+---+--------------+
|  1|     [3, 6, 5]|
|  2|        [1, 4]|
+---+--------------+
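Note that collect_set does not guarantee the order of the elements in each array, so the output above may differ between runs. A minimal sketch of one way to make it deterministic, assuming Spark 2.x or later with a local SparkSession (the app name, the `b_values` alias and the helper name `distinctValuesPerKey` are illustrative, not part of the original answer):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{collect_set, sort_array}

// Assumption: Spark 2.x+ running locally; in spark-shell a session named `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("collect-set-sorted").getOrCreate()
import spark.implicits._

// Same sample data as in the answer above.
val df = Seq((1, 3), (1, 6), (1, 5), (2, 1), (2, 4), (2, 1)).toDF("a", "b")

// Hypothetical helper: distinct values of "b" per "a", sorted ascending within each group.
def distinctValuesPerKey(input: DataFrame): DataFrame = {
  input.groupBy("a")
    .agg(sort_array(collect_set("b")).as("b_values"))
    .orderBy("a")
}

distinctValuesPerKey(df).show()
```

With the sample data this should give the sorted arrays [3, 5, 6] for key 1 and [1, 4] for key 2.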
If you want to keep duplicate entries, use collect_list instead:
df.groupBy("a").agg(collect_list("b")).show()
+---+---------------+
|  a|collect_list(b)|
+---+---------------+
|  1|      [3, 6, 5]|
|  2|      [1, 4, 1]|
+---+---------------+
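To get closer to the comma-separated layout shown in the question, the collected array can be joined into a single string. The following is only a sketch, assuming Spark 2.x or later with a local SparkSession; the inline sample rows stand in for newfile.txt, and the `b_values` alias and the helper name `joinedValuesPerKey` are made up for illustration:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, concat_ws}

// Assumption: Spark 2.x+ running locally; in spark-shell a session named `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("collect-list-joined").getOrCreate()
import spark.implicits._

// The rows from the question, inlined instead of being read from newfile.txt.
val ppl = Seq((1, 32), (1, 33), (1, 44), (2, 21), (2, 56), (1, 23)).toDF("a", "b")

// Hypothetical helper: one row per key, with all "b" values joined by ", ".
// concat_ws works on strings, so "b" is cast before it is collected.
def joinedValuesPerKey(input: DataFrame): DataFrame = {
  input.groupBy("a")
    .agg(concat_ws(", ", collect_list(col("b").cast("string"))).as("b_values"))
    .orderBy("a")
}

// truncate = false so long value lists are printed in full.
joinedValuesPerKey(ppl).show(false)
```

The order of the values inside each string is not guaranteed, because collect_list gathers rows in whatever order they arrive after the shuffle.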
