How to create a key/value pair in a MapReduce program when the values are split across block boundaries?
The input file I need to process has data organized as headers followed by their respective records. My 200 MB file has 3 such headers, and its records are split across 4 blocks (3 × 64 MB and 1 × 8 MB). The data is in the following format:

HEADER 1
Record 1
Record 2
...
Record n
HEADER 2
Record 1
Record 2
...
Record n
HEADER 3
Record 1
Record 2
...
Record n

I need to take each HEADER as a key and the records below it as its values, and perform some operations on them in my mapper code. The problem is that my records are split across different blocks. Suppose my first header and its records occupy 70 MB: that is 64 MB of the first block and 6 MB of the second block. How does the mapper that runs on the second block know that those 6 MB belong to the records of HEADER 1? Can anyone please explain how to get each header together with all of its records?
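For reference, this is the grouping I am trying to achieve, sketched in plain Java on an in-memory list of lines (the HeaderGrouper class and the "HEADER" prefix check are just illustrations, not Hadoop code — the real difficulty is doing this when the lines cross block boundaries):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: groups each line under the most recent HEADER line.
public class HeaderGrouper {
    public static Map<String, List<String>> group(List<String> lines) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        String currentHeader = null;
        for (String line : lines) {
            if (line.startsWith("HEADER")) {
                currentHeader = line;                 // new key
                grouped.put(currentHeader, new ArrayList<>());
            } else if (currentHeader != null) {
                grouped.get(currentHeader).add(line); // record belongs to the current key
            }
        }
        return grouped;
    }
}
```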
You need a custom RecordReader and a custom LineReader to process the file this way, rather than reading it line by line. Since the splits are calculated on the client, every mapper already knows whether it needs to discard the records belonging to the previous header. This link may be helpful: How does Hadoop process records split across block boundaries?
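To make the discard rule concrete, here is a minimal plain-Java sketch (no Hadoop classes; SplitStart and its method are hypothetical names) of what a custom reader effectively does at the start of a split: if the split does not begin at a header, skip lines until the next HEADER, because the previous mapper reads past its own split boundary to finish the header it started.

```java
import java.util.List;

// Sketch of the skip-ahead rule a custom RecordReader applies at a split start.
public class SplitStart {
    // Returns the index of the first line this mapper should keep.
    public static int firstOwnedLine(List<String> splitLines, boolean splitStartsAtFileStart) {
        if (splitStartsAtFileStart) {
            return 0; // the first split owns everything from the beginning
        }
        for (int i = 0; i < splitLines.size(); i++) {
            if (splitLines.get(i).startsWith("HEADER")) {
                return i; // first header that starts fully inside this split
            }
        }
        // No header in this split: every line belongs to the previous split's
        // header, and that mapper reads across the boundary to consume them.
        return splitLines.size();
    }
}
```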
You have two ways:

1. A single mapper handling all the records, so you have the complete data in a single class and you decide how to separate it. Given the input size, this will have performance issues, but it takes less coding effort; if your mapper has little data and runs frequently, this approach is fine. More info in the Hadoop Definitive Guide, under MapReduce Types and Formats, Input Formats, Preventing Splitting.

2. A custom split and record reader, which means modifying the way the framework works. Because your records are similar to TextInputFormat, you mostly do not need a custom record reader; however, you do need to define how the splits are made. In general, splits are sized roughly equal to the block size to take advantage of data locality. In your case, the data (mainly the header part) can end in any block, so you should split accordingly.

All of the above changes are needed to make MapReduce work with the data you have.
You can increase the default HDFS block size to 128 MB; if the file is smaller than that, it will be stored as a single block.