hadoop


How do I create key/value pairs in a MapReduce program when the values are stored across block boundaries?


The input file I need to process has its data organized under headers, each followed by its records. My 200 MB file has 3 such headers, and their records are spread across 4 blocks (3 × 64 MB and 1 × 8 MB).
The data is in the format below:
HEADER 1
Record 1
Record 2
.
.
Record n
HEADER 2
Record 1
Record 2
.
.
Record n
HEADER 3
Record 1
Record 2
.
.
Record n
All I need is to take each HEADER as a key and the Records below it as its values, and then process them in my mapper code.
The problem is that my Records are split across different blocks.
Suppose my first header and its records occupy 70 MB: that is 64 MB of the first block and 6 MB of the second block.
How does the mapper that runs on the second block know that those 6 MB belong to the records of HEADER 1?
Can anyone please explain how to get each header together with all of its records?
You need a custom RecordReader and a custom line reader that group the lines under each header into one record, rather than emitting each line separately.
Since the splits are calculated on the client, every mapper knows where its split starts and ends, so it can tell whether it needs to discard leading records that belong to the previous split's header.
The following question may be helpful:
How does Hadoop process records split across block boundaries?
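To make the ownership rule concrete, here is a minimal in-memory sketch in plain Java (no Hadoop classes). It assumes header lines start with the literal prefix "HEADER", uses character offsets in place of byte offsets, and scans from offset 0 for simplicity, whereas a real RecordReader would seek to the start of its split. The class and method names are hypothetical. The rule it illustrates: a reader emits every group whose header line starts inside its split, reads that group to completion even past the split end, and skips leading records that belong to a header in an earlier split.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simulates which header groups the reader for one split should emit. */
final class HeaderSplitSketch {

    /** Assumption: header lines are recognisable by this prefix. */
    static boolean isHeader(String line) {
        return line.startsWith("HEADER");
    }

    /**
     * Returns header -> records for every group whose header line starts
     * inside [splitStart, splitEnd). Records before the first such header
     * are skipped, because the previous split's reader owns them.
     */
    static Map<String, List<String>> readSplit(String data, int splitStart, int splitEnd) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        List<String> current = null; // records of the group being collected, if any
        int pos = 0;                 // offset at which the current line starts
        while (pos < data.length()) {
            int nl = data.indexOf('\n', pos);
            int lineEnd = (nl == -1) ? data.length() : nl;
            String line = data.substring(pos, lineEnd);
            if (isHeader(line)) {
                if (pos >= splitEnd) {
                    break; // this header belongs to the next split's reader
                }
                if (pos >= splitStart) {
                    current = new ArrayList<>();
                    groups.put(line, current);
                }
            } else if (current != null) {
                current.add(line); // may lie beyond splitEnd; still ours
            }
            pos = lineEnd + 1;
        }
        return groups;
    }
}
```

Calling readSplit once per split, with the same data and consecutive offset ranges, should yield each header exactly once across all splits; a split that contains no header start yields nothing.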
You have two options:
1. Use a single mapper to handle all the records, so that the complete data is available in one class and you decide how to separate it. Given the input size, this will have performance issues. For more information, see Hadoop: The Definitive Guide, MapReduce Types and Formats, Input Formats, Preventing Splitting. This takes less coding effort, and it is acceptable if your mapper handles little data and runs frequently.
2. Use a custom split and record reader, which changes the way the framework works. Your records are line-oriented, like those TextInputFormat handles, so you most likely do not need a custom record reader. You do, however, need to define how the splits are made. In general, splits are roughly equal to the block size, to take advantage of data locality. In your case a header's data can end in any block, so you should split at header boundaries instead. All of these changes are needed to make MapReduce work with the data you have.
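As a rough illustration of the custom-split idea, the sketch below computes splits whose boundaries always fall on a header line. It is plain Java working on an in-memory string, not Hadoop code; in a real job this logic would sit in a custom input format's split calculation. The class name, the "HEADER" prefix test, and the greedy target-size rule are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Computes {start, length} splits that only ever begin at a header line. */
final class HeaderAlignedSplits {

    /**
     * Greedy rule (an assumption for this sketch): close the current split at
     * the first header that starts at least targetSize characters after the
     * split's start, so a header and its records are never separated.
     */
    static List<int[]> compute(String data, int targetSize) {
        List<int[]> splits = new ArrayList<>();
        int splitStart = 0;
        int pos = 0; // offset at which the current line starts
        while (pos < data.length()) {
            int nl = data.indexOf('\n', pos);
            int lineEnd = (nl == -1) ? data.length() : nl;
            boolean header = data.startsWith("HEADER", pos);
            if (header && pos - splitStart >= targetSize) {
                splits.add(new int[] {splitStart, pos - splitStart});
                splitStart = pos;
            }
            pos = lineEnd + 1;
        }
        if (splitStart < data.length()) {
            // Whatever remains after the last boundary forms the final split.
            splits.add(new int[] {splitStart, data.length() - splitStart});
        }
        return splits;
    }
}
```

The trade-off is the one noted above: finding the headers means scanning the file when the splits are computed, and splits that follow headers no longer line up with blocks, so some data locality is given up.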
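You can increase the HDFS block size to 128 MB; a file no larger than the block size is then stored as a single block. Note that a 200 MB file would still span two 128 MB blocks, so the block size would have to be at least the file size for this to help.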
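The block arithmetic for the 200 MB file in the question can be checked with a small sketch. The helper names are hypothetical, and the 256 MB figure is only an example of a block size larger than the file.

```java
/** Block-count arithmetic for a file of a given size, in whole megabytes. */
final class BlockCount {

    /** Number of blocks needed: ceiling of fileSizeMb / blockSizeMb. */
    static long blocksFor(long fileSizeMb, long blockSizeMb) {
        return (fileSizeMb + blockSizeMb - 1) / blockSizeMb;
    }

    /** Size of the last block; it is full only when the file divides evenly. */
    static long lastBlockMb(long fileSizeMb, long blockSizeMb) {
        long remainder = fileSizeMb % blockSizeMb;
        return remainder == 0 ? blockSizeMb : remainder;
    }
}
```

With these helpers, blocksFor(200, 64) is 4 with a last block of 8 MB, matching the question; blocksFor(200, 128) is 2 with a last block of 72 MB; and blocksFor(200, 256) is 1.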
