apache-spark


Delete HdfsTarget before running a SparkSubmitTask


Community, I'd like to delete the HdfsTarget folder before running a SparkSubmitTask. What is the best practice? So far I have tried the two options marked (1) and (2) in the code below, without success:
(1) The dependent/required job doesn't get executed if the HdfsTarget already exists.
(2) The tasks would be executed in parallel if called with yield.
import luigi
import luigi.format
import luigi.contrib.hdfs
from luigi.contrib.spark import SparkSubmitTask


class CleanUp(luigi.Task):
    path = luigi.Parameter()

    def run(self):
        # Remove the HDFS folder if it is left over from a previous run.
        self.target = luigi.contrib.hdfs.HdfsTarget(self.path, format=luigi.format.Gzip)
        if self.target.exists():
            self.target.remove(skip_trash=True)


class MySparkTask(SparkSubmitTask):
    # Named output_path so the parameter is not shadowed by the output() method.
    output_path = luigi.Parameter()
    driver_memory = '8g'
    executor_memory = '3g'
    num_executors = 5
    app = 'my-app.jar'
    entry_class = 'com.company.MyJob'

    def app_options(self):
        return ['/input', self.output_path]

    def requires(self):
        pass  # (1)

    def output(self):
        return luigi.contrib.hdfs.HdfsTarget(self.output_path, format=luigi.format.Gzip)


class RunAll(luigi.Task):
    '''Dummy task that triggers execution of the other tasks.'''
    result_dir = '/output'

    def requires(self):
        # (2)
        return MySparkTask(output_path=self.result_dir)
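
One direction that might be worth trying, given that Luigi skips a task whose output already exists: let the Spark task itself report "not complete" until it has run, and remove the old output at the start of `run()`. The block below is only a minimal sketch of that idea, not a confirmed best practice. `FreshOutputMixin`, `InMemoryTarget`, `BaseJob` and `FreshJob` are hypothetical names; the in-memory classes are stand-ins so the example runs without Luigi or HDFS. It relies only on the target having `exists()` and `remove()`, which `HdfsTarget` provides.

```python
class FreshOutputMixin:
    """Hypothetical mixin: drop stale output, then run the wrapped task."""

    _ran = False  # flipped to True once run() has finished in this process

    def complete(self):
        # Report "not done" until run() has finished, even if old output exists.
        return self._ran and self.output().exists()

    def run(self):
        target = self.output()
        if target.exists():
            # For an HdfsTarget, skip_trash=True could be passed here.
            target.remove()
        super().run()
        self._ran = True


# In-memory stand-ins (assumptions) so the sketch runs without Luigi or HDFS.
class InMemoryTarget:
    def __init__(self):
        self.data = None

    def exists(self):
        return self.data is not None

    def remove(self):
        self.data = None


class BaseJob:
    """Stand-in for SparkSubmitTask: run() just writes a marker to the target."""

    def __init__(self, target):
        self._target = target

    def output(self):
        return self._target

    def run(self):
        self._target.data = "fresh"


class FreshJob(FreshOutputMixin, BaseJob):
    pass


target = InMemoryTarget()
target.data = "stale"      # output left over from a previous run
job = FreshJob(target)
print(job.complete())      # False: stale output does not count as done
job.run()
print(target.data, job.complete())
```

With Luigi installed, the mixin would presumably be applied as `class MySparkTask(FreshOutputMixin, SparkSubmitTask)`, listed first so that its `run()` wraps the Spark submission. One caveat to verify before relying on it: the `_ran` flag lives only in the process that ran the task, so with several worker processes a downstream task may still see this one as incomplete.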
