apache-spark


Spark DataFrame parallelism after filtering


I am trying to extract a small subset (100 rows) from a big DataFrame in Spark (5 TB in Parquet format). To do that, I am using the following code:
df = sqlCtx.read.parquet(directory)
subset = df.limit(100)
subset_count = subset.count()
subset = subset.withColumn("c", subset["a"] + subset["b"])
However, after the "limit" filtering, Spark operations are really slow and exit with "GC overhead limit exceeded". When looking at the Spark monitoring, I can see that subset has more than 25000 partitions (using repartition doesn't seem to be the solution either).
So I am thinking that limit/filtering is not a good way to extract one DataFrame from another. Any ideas?
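
One direction that might be worth trying, sketched here under assumptions and not validated against the 5 TB dataset: since only 100 rows are needed, they could be collected to the driver and turned back into a fresh, small DataFrame, so that later operations no longer depend on the original partitions. The helper name `materialize_head` is hypothetical; `session` stands for the `sqlCtx` from the code above (a SparkSession would also work, since both expose `createDataFrame`).

```python
def materialize_head(session, df, n):
    """Return a new DataFrame holding only the first n rows of df.

    session: a SQLContext or SparkSession (both expose createDataFrame).
    df:      the large source DataFrame.
    n:       number of rows to keep; must be small enough for driver memory.
    """
    # take(n) collects at most n rows to the driver as a list of Row objects.
    rows = df.take(n)
    # Rebuild a DataFrame from the local rows, reusing the original schema,
    # so the result no longer depends on the source DataFrame's partitions.
    return session.createDataFrame(rows, schema=df.schema)
```

With the names from the question, this would be called as `subset = materialize_head(sqlCtx, df, 100)`, after which `subset.withColumn("c", subset["a"] + subset["b"])` operates on a 100-row DataFrame. Whether this avoids the "GC overhead limit exceeded" error in this particular setup is an assumption that would need to be checked.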
