Optimizing Hive Query Performance Through Mapjoin

Let us explore three parameters having significant impact to hive query performance:
hive.auto.convert.join.noconditionaltask = true;
hive.auto.convert.join.noconditionaltask.size=10000000
hive.mapjoin.smalltable.filesize:
hive.auto.convert.join.noconditionaltask
Added in Hive 0.11.0, and it is true by default. That  means, if the sum of size for n-1 of the tables/partitions for an n-way join is smaller than the size specified by hive.auto.convert.join.noconditionaltask.size(10MB by default), the join is directly converted to a mapjoin (there is no conditional task).The size configuration enables the user to control what size table can fit in memory. This value represents the sum of the sizes of tables that can be converted to hashmaps that fit in memory.
hive.auto.convert.join.noconditionaltask determines whether Hive enable the optimization about converting common join into mapjoin based on the input file size. If this parameter is on, and the sum of size for n-1 of the tables/partitions for a n-way join is smaller than the specified size, the join is directly converted to a mapjoin (there is no conditional task).
hive.mapjoin.smalltable.filesize
Added in Hive 0.8.0, and it is 25MB by default which means, if the input file size is smaller than this threshold, Hive will try to convert the common join into map join. If hive.auto.convert.join.noconditionaltask is off, this parameter does not take affect. However, if it is on, and the sum of size for n-1 of the tables/partitions for a n-way join is smaller than this size, the join is directly converted to a mapjoin(there is no conditional task). The default is 10MB.
 These three parameters, jointly enable:

  • Loading the small table into local memory (where Hive client is running)  
  • Creates hash table for that small table in local memory  
  • Push the hash table into HDFS.  
  • All the mappers load this hash table from the HDFS and does map-side join

  Restrictions
If all but one of the tables being joined are small, the join can be performed as a map only job. The query
 
select /*+ MAPJOIN(b) */ a.key, a.value
from a join b ON a.key = b.key
does not need a reducer. For every mapper of A, B is read completely.

Consider the possibility of multiple mapjoins on different keys:
 
select /*+MAPJOIN(smallTableTwo)*/ idOne, idTwo, value FROM
  ( select /*+MAPJOIN(smallTableOne)*/ idOne, idTwo, value FROM
    bigTable JOIN smallTableOne on (bigTable.idOne = smallTableOne.idOne)                                                  
  ) firstjoin                                                  
  JOIN                                                                    smallTableTwo ON (firstjoin.idTwo = smallTableTwo.idTwo)                      

The above query is not supported. Without the mapjoin hint, the above query would be executed as 2 map-only jobs. If the user knows in advance that the inputs are small enough to fit in memory, the following configurable parameters can be used to make sure that the query executes in a single map-reduce job.