Hadoop HA – Integral, Reliable & Simple

High Availability (HA) in a traditional RDBMS, like Oracle, is a complicated business, not for the faint of heart. HA in an RDBMS also comes with high capital and operational costs, as it is considered a separate component. In contrast, HA in Hadoop seems like a natural progression of its cluster- and configuration-oriented architecture. HA […]

Sizing NameNode Heap Memory

NameNode memory allocation plays a critical role in overall cluster performance. The NameNode is the core metadata server of Hadoop and the most critical piece of the system. It stores the file system image and the file system journal. The NameNode keeps all of the filesystem layout information (files, blocks, directories, permissions, etc.) […]

The Big Data Mindset

Big data and Hadoop are here today, mostly because Hadoop is an open platform with powerful tools for building enterprise data platforms cost-effectively. No doubt there is hype in rushing into a Hadoop implementation with little consideration for user adoption, sustainability and internal staff augmentation. A few worthwhile considerations may be: Can Hadoop be the total/end-to-end replacement […]

HDFS Commands

Command | Description | Usage
fsck | HDFS Command to check the health of the Hadoop file system | hdfs fsck
ls | HDFS Command to display the list of Files and Directories in HDFS | hdfs dfs -ls
mkdir | | hdfs dfs -mkdir /directory_name
touchz | HDFS Command to create a file in HDFS with file size 0 bytes | hdfs dfs -touchz /directory/filename
du | HDFS Command to […]
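The usage column above follows one pattern: file-system commands go through `hdfs dfs -<command>`, while `fsck` is invoked directly as `hdfs fsck`. A minimal sketch of that pattern is shown here; the helper name `hdfs_argv` and the example paths are hypothetical, and nothing is executed.

```python
def hdfs_argv(command, *args):
    """Build the argument list for one of the HDFS commands listed above.

    Hypothetical helper for illustration: it only assembles the command
    line and does not run anything.
    """
    if command == "fsck":
        # fsck is a subcommand of `hdfs` itself, not of `hdfs dfs`.
        return ["hdfs", "fsck", *args]
    # Other commands are flags of `hdfs dfs` and take a plain ASCII hyphen.
    return ["hdfs", "dfs", f"-{command}", *args]


print(" ".join(hdfs_argv("mkdir", "/directory_name")))
print(" ".join(hdfs_argv("touchz", "/directory/filename")))
```

On a machine with a configured Hadoop client, the returned list could be passed to `subprocess.run`.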

Optimizing Hive Query Performance Through Mapjoin

Let us explore three parameters that have a significant impact on Hive query performance:
hive.auto.convert.join.noconditionaltask = true
hive.auto.convert.join.noconditionaltask.size = 10000000
hive.mapjoin.smalltable.filesize
hive.auto.convert.join.noconditionaltask:
Added in Hive 0.11.0; it is true by default. That means that if the sum of the sizes of n-1 of the tables/partitions in an n-way join is smaller than the size specified by hive.auto.convert.join.noconditionaltask.size (10MB by default), the join is directly converted to […]
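The size rule described above can be sketched as a small check. This is only an illustration of the arithmetic, not Hive's planner logic: the function name is hypothetical, and it assumes the largest input is the one left out of the n-1 sum.

```python
# 10MB default of hive.auto.convert.join.noconditionaltask.size, in bytes.
DEFAULT_NOCONDITIONALTASK_SIZE = 10_000_000


def fits_size_rule(table_sizes_bytes, threshold_bytes=DEFAULT_NOCONDITIONALTASK_SIZE):
    """Return True if n-1 inputs of an n-way join sum to less than the threshold.

    Hypothetical helper: assumes the largest input is the one excluded.
    """
    if len(table_sizes_bytes) < 2:
        raise ValueError("a join needs at least two inputs")
    # Drop the largest input and sum the remaining n-1 sizes.
    small_side_total = sum(sorted(table_sizes_bytes)[:-1])
    return small_side_total < threshold_bytes


# Two small inputs of 4 MB and 3 MB (7 MB in total) next to a 50 GB table.
print(fits_size_rule([50_000_000_000, 4_000_000, 3_000_000]))
# Two small inputs of 6 MB and 5 MB (11 MB in total) exceed the default.
print(fits_size_rule([50_000_000_000, 6_000_000, 5_000_000]))
```

Raising `threshold_bytes` in the sketch mirrors raising hive.auto.convert.join.noconditionaltask.size: more joins pass the check, at the cost of more memory needed for the small inputs.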

TEZ Memory Tuning Checklist

TEZ Application Master
tez.am.resource.memory.mb should be a multiple of yarn.scheduler.minimum-allocation-mb but less than yarn.scheduler.maximum-allocation-mb.
The Application Master Java heap size (tez.am.launch.cmd-opts) should by default be 80% of tez.am.resource.memory.mb.
TEZ Container
Set hive.tez.container.size to be the same as, or a small multiple (1 or 2 times) of, the YARN container size yarn.scheduler.minimum-allocation-mb, but NEVER more than yarn.scheduler.maximum-allocation-mb, […]
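The checklist items above reduce to a few arithmetic checks, sketched here. The function names and sample values are hypothetical, and the sketch assumes the "multiple of" rule for the Application Master is measured against yarn.scheduler.minimum-allocation-mb.

```python
def am_heap_opts(am_mb, percent=80):
    """Derive a -Xmx value for tez.am.launch.cmd-opts as 80% of the AM memory."""
    return f"-Xmx{am_mb * percent // 100}m"


def check_tez_memory(am_mb, container_mb, yarn_min_mb, yarn_max_mb):
    """Return a list of checklist violations (empty when all checks pass).

    Assumption: multiples are measured against yarn.scheduler.minimum-allocation-mb.
    """
    problems = []
    if am_mb % yarn_min_mb != 0:
        problems.append("tez.am.resource.memory.mb is not a multiple of the YARN minimum")
    if am_mb >= yarn_max_mb:
        problems.append("tez.am.resource.memory.mb is not less than the YARN maximum")
    if container_mb not in (yarn_min_mb, 2 * yarn_min_mb):
        problems.append("hive.tez.container.size is not 1x or 2x the YARN minimum")
    if container_mb > yarn_max_mb:
        problems.append("hive.tez.container.size exceeds the YARN maximum")
    return problems


# Hypothetical cluster: 1024 MB minimum and 8192 MB maximum allocation.
print(am_heap_opts(4096))
print(check_tez_memory(am_mb=4096, container_mb=2048, yarn_min_mb=1024, yarn_max_mb=8192))
```

Returning a list of messages rather than a single boolean makes it possible to report every violated item in one pass.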