How to create a key-value pair in a MapReduce program when values are stored across block boundaries?
The input file I need to process contains data organized as headers followed by their respective records. My 200 MB file has 3 such headers, and its records are split across 4 HDFS blocks (3 × 64 MB and 1 × 8 MB). The data is in the format below:

HEADER 1
Record 1
Record 2
...
Record n
HEADER 2
Record 1
Record 2
...
Record n
HEADER 3
Record 1
Record 2
...
Record n

I need to take each HEADER as a key and the records below it as values, and perform some operations on them in my mapper code. The problem is that the records are split across different blocks. Suppose my first header and its records occupy 70 MB: they fill all 64 MB of the first block and 6 MB of the second block. How does the mapper running on the second block know that those 6 MB of records belong to HEADER 1? Can anyone please explain how to get a header and its records in their entirety?
You need a custom RecordReader and a custom LineReader to process the file this way, rather than reading it line by line. Since the splits are calculated on the client, every mapper already knows whether it needs to discard the records belonging to the previous header. This link may be helpful: How does Hadoop process records split across block boundaries?
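The ownership rule the answer describes can be sketched without Hadoop itself. In the simulation below (a hypothetical helper, not Hadoop API), a split's reader "owns" a header group exactly when the HEADER line starts inside that split; any records before the first header in a split are discarded because the previous split's reader consumes them by reading past its own split boundary.

```java
import java.util.*;

// Simulates how readers over file splits divide HEADER/record groups.
// A split owns a group iff the group's HEADER line starts inside that split.
public class SplitOwnership {
    // lines: file lines in order; splitStarts: byte offset where each split begins.
    // Returns, for each split index, the headers whose groups that split's reader emits.
    static List<List<String>> ownersByHeader(List<String> lines, long[] splitStarts) {
        List<List<String>> owned = new ArrayList<>();
        for (int i = 0; i < splitStarts.length; i++) owned.add(new ArrayList<>());
        long offset = 0;
        for (String line : lines) {
            if (line.startsWith("HEADER")) {
                // Find the split whose range contains this header's start offset.
                int split = 0;
                for (int i = 0; i < splitStarts.length; i++)
                    if (splitStarts[i] <= offset) split = i;
                owned.get(split).add(line);
            }
            offset += line.length() + 1; // +1 for the newline byte
        }
        return owned;
    }
}
```

With splits starting at offsets 0 and 20, a header whose first byte lands at offset 27 is emitted by the second split's reader, even though some of its records may physically sit in the first split's trailing bytes.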
You have two options:

1. A single mapper handling all the records, so you have the complete data in a single class and decide yourself how to separate it. Given the input size, this will have performance issues, but it takes less coding effort, and if your mapper has little data and runs frequently, this approach is fine. More info in the Hadoop Definitive Guide, "MapReduce Types and Formats" chapter, under Input Formats / Preventing Splitting.

2. A custom split and record reader. Here you are modifying the way the framework works. Because your records are similar to TextInputFormat, there is mostly no need for a custom record reader; however, you need to define how the splits are made. In general, splits are sized roughly equal to the block size to take advantage of data locality. In your case, the data (mainly a header section) can end in any block, and you should split accordingly.

All of the above changes are needed to make MapReduce work with the data you have.
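The grouping step of option 1 can be sketched in plain Java (illustrative names, not Hadoop API; it assumes the file is not split, e.g. by preventing splitting as the Definitive Guide describes): one pass that attaches every record to the most recent HEADER line.

```java
import java.util.*;

// Single-pass grouping for the single-mapper approach: each record line is
// filed under the last HEADER line seen before it. Insertion order preserved.
public class HeaderGrouper {
    static Map<String, List<String>> group(List<String> lines) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        String current = null;
        for (String line : lines) {
            if (line.startsWith("HEADER")) {
                current = line;                       // start a new group
                groups.put(current, new ArrayList<>());
            } else if (current != null) {
                groups.get(current).add(line);        // record belongs to current header
            }
        }
        return groups;
    }
}
```

Since one mapper sees the whole file, no record can be orphaned from its header, which is exactly why this option trades parallelism for simplicity.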
You can increase the default HDFS block size to 128 MB; if a file fits within the block size, it will be stored as a single block.
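The block size is set in hdfs-site.xml. A minimal fragment for a 128 MB default (the value is in bytes; on older Hadoop 1.x releases the property is named dfs.block.size instead of dfs.blocksize):

```xml
<!-- hdfs-site.xml: raise the default block size to 128 MB (134217728 bytes).
     Applies to newly written files; existing files keep their block size. -->
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>
</property>
```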