hadoop


How to create a key-value pair in a MapReduce program if the values are stored across block boundaries?


The input file I need to process has data classified by headers and their respective records. My 200 MB file has 3 such headers, and its records are split across 4 blocks (3 × 64 MB and 1 × 8 MB).
The data is in the following format:
HEADER 1
Record 1
Record 2
.
.
Record n
HEADER 2
Record 1
Record 2
.
.
Record n
HEADER 3
Record 1
Record 2
.
.
Record n
All I need is to take each HEADER as a key and the Records below it as values, and then perform some operations on them in my mapper code.
The problem is that my Records are split across different blocks.
Suppose my first Header and its Records occupy 70 MB: that means 64 MB in the first block and 6 MB in the second block.
How does the mapper that runs on the second block know that those 6 MB belong to the records of HEADER 1?
Can anyone please explain how to get each Header and its records completely?
You need a custom RecordReader and a custom line reader to process the data this way, rather than reading it one line at a time.
Since the splits are calculated on the client, every mapper already knows the start offset of its split, and therefore whether it needs to discard the leading records that belong to the previous header.
The following link might be helpful:
How does Hadoop process records split across block boundaries?
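To make the ownership rule concrete, here is a minimal sketch in plain Java (no Hadoop classes, so it is a simulation rather than a real RecordReader). It assumes that every header line starts with the literal text `HEADER`, that header lines are unique, and that a header group belongs to the split in which its header line starts. The class name `HeaderSplitDemo` and the method `readSplit` are hypothetical. In a real job the same logic would sit inside a custom RecordReader, with the offsets taken from the `FileSplit` (`getStart()` and `getLength()`).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation (no Hadoop classes) of a header-aware record reader.
class HeaderSplitDemo {

    // Assumption: every header line starts with this literal prefix.
    static final String HEADER_PREFIX = "HEADER";

    /**
     * Returns header -> records for every group whose header line STARTS
     * inside the split [start, end). Records may be read past `end`.
     */
    static Map<String, List<String>> readSplit(byte[] data, int start, int end) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        int pos = start;
        if (start > 0) {
            // Back up one byte, then skip to the first line that begins at or after `start`.
            pos = start - 1;
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++;
        }
        List<String> current = null; // records of the header currently being collected
        while (pos < data.length) {
            // No header owned yet and already past the split: nothing to emit.
            if (current == null && pos >= end) break;
            int lineEnd = pos;
            while (lineEnd < data.length && data[lineEnd] != '\n') lineEnd++;
            String line = new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8);
            if (line.startsWith(HEADER_PREFIX)) {
                if (pos >= end) break;      // this header belongs to the next split
                current = new ArrayList<>();
                groups.put(line, current);
            } else if (current != null) {
                current.add(line);          // record of a header this split owns
            }
            // Otherwise: leading record of a header owned by an earlier split, so discard it.
            pos = lineEnd + 1;
        }
        return groups;
    }

    public static void main(String[] args) {
        byte[] data = "HEADER 1\nr1\nr2\nHEADER 2\nr3\nHEADER 3\nr4\nr5\n"
                .getBytes(StandardCharsets.UTF_8);
        int splitSize = 16; // tiny stand-in for the 64 MB block size
        for (int start = 0; start < data.length; start += splitSize) {
            int end = Math.min(start + splitSize, data.length);
            System.out.println("split [" + start + ", " + end + ") -> "
                    + readSplit(data, start, end));
        }
    }
}
```

With this rule each header group is emitted exactly once: the reader of the split where the header starts keeps reading past the end of its split until the next header, and the reader of the following split throws away everything before its first header.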
You have two options:
1. Use a single mapper to handle all the records, so that you have the complete data in a single class and can decide how to separate it yourself. Given the input size, this will have performance issues. For more information, see Hadoop: The Definitive Guide, chapter "MapReduce Types and Formats", section "Input Formats", "Preventing splitting". This takes less coding effort, and it is acceptable if the mapper handles a small amount of data and runs frequently.
2. Use a custom split and record reader, which changes the way the framework works. Because your records are line-oriented, much like the input handled by TextInputFormat, you most likely do not need a custom record reader. However, you do need to define how the splits are made. In general, splits are made roughly equal to the block size to take advantage of data locality. In your case, the data for a header can end in any block, so you should split at the header boundaries instead. These changes are needed to make MapReduce work with the data you have.
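As a rough illustration of splitting at header boundaries, the following plain-Java sketch scans a file's bytes and produces one (start, length) pair per header group. It is only a simulation under stated assumptions: the file begins with a header line, every header line starts with the literal text `HEADER`, and the names `HeaderAlignedSplits` and `computeSplits` are hypothetical. In a real job this logic would go into the overridden `getSplits()` of a `FileInputFormat` subclass, reading from HDFS instead of from a byte array.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Plain-Java simulation (no Hadoop classes) of header-aligned split calculation.
class HeaderAlignedSplits {

    // Assumption: every header line starts with this literal prefix.
    static final byte[] HEADER_PREFIX = "HEADER".getBytes(StandardCharsets.UTF_8);

    // True if the bytes at `pos` begin with the header prefix.
    static boolean isHeaderAt(byte[] data, int pos) {
        if (pos + HEADER_PREFIX.length > data.length) return false;
        for (int i = 0; i < HEADER_PREFIX.length; i++) {
            if (data[pos + i] != HEADER_PREFIX[i]) return false;
        }
        return true;
    }

    /** Returns {start, length} pairs, one per header group. */
    static List<long[]> computeSplits(byte[] data) {
        // Pass 1: record the offset of every line that is a header.
        List<Integer> starts = new ArrayList<>();
        int pos = 0;
        while (pos < data.length) {
            if (isHeaderAt(data, pos)) starts.add(pos);
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++; // move to the first byte of the next line
        }
        // Pass 2: each split runs from one header to the next (or to end of file).
        List<long[]> splits = new ArrayList<>();
        if (starts.isEmpty()) {
            // No header found: fall back to a single split over the whole file.
            if (data.length > 0) splits.add(new long[] {0, data.length});
            return splits;
        }
        for (int i = 0; i < starts.size(); i++) {
            int begin = starts.get(i);
            int finish = (i + 1 < starts.size()) ? starts.get(i + 1) : data.length;
            splits.add(new long[] {begin, finish - begin});
        }
        return splits;
    }

    public static void main(String[] args) {
        byte[] data = "HEADER 1\nr1\nr2\nHEADER 2\nr3\nHEADER 3\nr4\nr5\n"
                .getBytes(StandardCharsets.UTF_8);
        for (long[] s : computeSplits(data)) {
            System.out.println("start=" + s[0] + " length=" + s[1]);
        }
    }
}
```

The trade-off is that the client has to read the whole file once to find the headers, and the resulting splits no longer line up with HDFS blocks, so some of each split is likely to be read from a remote node.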
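You can increase the default HDFS block size to 128 MB; if the file is small enough, it will be stored as a single block. Note that a 200 MB file would still span two 128 MB blocks, so for this file the block size would have to be at least as large as the file itself.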
