How to Set up Local Hadoop Cluster with Oozie

Yiou Zhao
5 min read · Mar 31, 2021


Currently, there are many distributed big data management platforms, such as Hue and HDP, that ship with the Hadoop ecosystem built in, but these bring problems when the Hadoop version or a component within it needs to be updated. It is much better to have full control over the environment. In this article, I would like to show how to create a local Hadoop ecosystem with Oozie using Docker.

Requirements

  • Hadoop docker-compose file (in this tutorial, I'm using the Big Data Europe repo)
  • Oozie Docker image (we are going to compile the Oozie source code and create a customized Docker image)
  • Hadoop version 3.1.1 (this can be easily changed, since Big Data Europe supports plenty of Hadoop versions)

Oozie Compiling

Requirements

  • Oozie 4.3.1 Source Code, download from here
  • Java Version 1.8
  • Maven Version 3+

Modify Source Code

  • Add the following repository to the Maven settings.xml ($Maven_Path/$Version/libexec/conf/settings.xml)
<repository>
  <id>spring-plugin-releases</id>
  <url>http://repo.spring.io/plugins-release/</url>
</repository>
  • Change hadoop version to 3.1.1 in oozie-4.3.1/pom.xml
<properties>
  <hadoop.version>3.1.1</hadoop.version>
  <hadoop.majorversion>3</hadoop.majorversion>
  <pig.classifier>h2</pig.classifier>
  <sqoop.classifier>hadoop200</sqoop.classifier>
  <jackson.version>1.9.13</jackson.version>
</properties>
  • Modify Hadoop auth to use hadoop-auth-2 in oozie-4.3.1/hadooplibs/pom.xml
<profile>
  <id>hadoop-3</id>
  <activation>
    <activeByDefault>false</activeByDefault>
  </activation>
  <modules>
    <module>hadoop-distcp-3</module>
    <module>hadoop-auth-2</module>
    <module>hadoop-utils-3</module>
  </modules>
</profile>
  • Add the hadoop-common dependency in oozie-4.3.1/core/pom.xml and oozie-4.3.1/sharelib/pig/pom.xml
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>${hadoop.version}</version>
</dependency>
  • Modify oozie-hadoop-auth to use hadoop-2-4.3.1 in oozie-4.3.1/pom.xml
<dependency>
  <groupId>org.apache.oozie</groupId>
  <artifactId>oozie-hadoop-auth</artifactId>
  <version>hadoop-2-4.3.1</version>
  <scope>provided</scope>
</dependency>
  • Comment out the findbugs and checkstyle in oozie-4.3.1/pom.xml
<!-- findbugs plugin. Execute 'mvn verify' and look for target/findbugs/findbugsXml.html under each module
<plugin>
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>findbugs-maven-plugin</artifactId>
  <configuration>
    <excludeSubProjects>false</excludeSubProjects>
    <xmlOutput>true</xmlOutput>
    <findbugsXmlOutput>true</findbugsXmlOutput>
    <findbugsXmlWithMessages>true</findbugsXmlWithMessages>
    <effort>Max</effort>
    <failOnError>false</failOnError>
    <threshold>Low</threshold>
    <xmlOutput>true</xmlOutput>
    <findbugsXmlOutputDirectory>${project.build.directory}/findbugs</findbugsXmlOutputDirectory>
  </configuration>
  <executions>
    <execution>
      <id>findbug</id>
      <phase>verify</phase>
      <goals>
        <goal>check</goal>
      </goals>
    </execution>
  </executions>
</plugin>
-->
<!-- xml plugin is used for transforming the findbugs xml output into a friendlier html page
<plugin>
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>xml-maven-plugin</artifactId>
  <configuration>
    <excludeSubProjects>false</excludeSubProjects>
    <transformationSets>
      <transformationSet>
        <dir>${project.build.directory}/findbugs</dir>
        <outputDir>${project.build.directory}/findbugs</outputDir>
        <stylesheet>fancy-hist.xsl</stylesheet>
        <fileMappers>
          <fileMapper implementation="org.codehaus.plexus.components.io.filemappers.FileExtensionMapper">
            <targetExtension>.html</targetExtension>
          </fileMapper>
        </fileMappers>
      </transformationSet>
    </transformationSets>
  </configuration>
  <executions>
    <execution>
      <phase>verify</phase>
      <goals>
        <goal>transform</goal>
      </goals>
    </execution>
  </executions>
  <dependencies>
    <dependency>
      <groupId>com.google.code.findbugs</groupId>
      <artifactId>findbugs</artifactId>
      <version>2.0.3</version>
    </dependency>
  </dependencies>
</plugin>
-->
<!-- checkstyle plugin. Execute 'mvn verify' and look for target/checkstyle-result.xml under each module
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-checkstyle-plugin</artifactId>
  <version>2.9.1</version>
  <executions>
    <execution>
      <goals>
        <goal>check</goal>
      </goals>
      <configuration>
        <consoleOutput>true</consoleOutput>
        <includeTestSourceDirectory>true</includeTestSourceDirectory>
        <configLocation>src/main/resources/checkstyle.xml</configLocation>
        <headerLocation>src/main/resources/checkstyle-header.txt</headerLocation>
      </configuration>
    </execution>
  </executions>
</plugin>
-->
  • Change fs.permission.AccessControlException to security.AccessControlException in the JavaActionExecutor and AuthorizationService classes
  • Delete file oozie-4.3.1/webapp/src/main/java/org/apache/hadoop/fs/RawLocalFileSystem.java

Compile the Source Code

sudo mvn -U clean package assembly:single -DskipTests -P hadoop-3,uber,spark-2 -Dhadoop.version=3.1.1

If the compilation is successful, you will see console output like the screenshot below.

(Screenshot: successful compilation output)

Get the Distro File

The generated distro file can be found in oozie-4.3.1/distro/target/oozie-4.3.1-distro.tar.gz

Now we can use this compiled file to create the Oozie Dockerfile.

Oozie Dockerfile Creation

The Oozie image uses the Big Data Europe Hadoop base image, since the Hadoop cluster itself is built on it.

To ensure that the Oozie service runs as expected, it also requires a correct oozie-site.xml config file; otherwise, Oozie jobs won't work. An example config file is shown below.

<?xml version="1.0"?>
<configuration>
  <property>
    <name>oozie.service.ProxyUserService.proxyuser.root.hosts</name>
    <value>*</value>
    <description>
      List of hosts the '#USER#' user is allowed to perform 'doAs' operations.
      The '#USER#' must be replaced with the username of the user who is
      allowed to perform 'doAs' operations.
      The value can be the '*' wildcard or a list of hostnames.
      For multiple users copy this property and replace the user name
      in the property name.
    </description>
  </property>
  <property>
    <name>oozie.service.ProxyUserService.proxyuser.root.groups</name>
    <value>*</value>
    <description>
      List of groups the '#USER#' user is allowed to impersonate users
      from to perform 'doAs' operations.
      The '#USER#' must be replaced with the username of the user who is
      allowed to perform 'doAs' operations.
      The value can be the '*' wildcard or a list of groups.
      For multiple users copy this property and replace the user name
      in the property name.
    </description>
  </property>
  <property>
    <name>oozie.service.HadoopAccessorService.hadoop.configurations</name>
    <value>*=/etc/hadoop/</value>
  </property>
  <property>
    <name>oozie.service.HadoopAccessorService.action.configurations</name>
    <value>*=/etc/hadoop/</value>
  </property>
  <!--
  <property>
    <name>oozie.service.WorkflowAppService.system.libpath</name>
    <value>hdfs:///user/${user.name}/share/lib;hdfs://namenode:8020</value>
    <description>
      System library path to use for workflow applications.
      This path is added to workflow application if their job properties sets
      the property 'oozie.use.system.libpath' to true.
    </description>
  </property>
  -->
  <property>
    <name>oozie.service.WorkflowAppService.system.libpath</name>
    <value>/user/root/share/lib/</value>
    <description>
      System library path to use for workflow applications.
      This path is added to workflow application if their job properties sets
      the property 'oozie.use.system.libpath' to true.
    </description>
  </property>
</configuration>

Then the Oozie Dockerfile can be created as follows:

# base image
FROM bde2020/hadoop-base:2.0.0-hadoop3.1.1-java8

RUN apt-get install wget -y
RUN apt-get install zip unzip -y

# vars
ARG OOZIE_VERSION=4.3.1
ARG HADOOP_VERSION=3.1.1
ARG SQL_CONNECTOR_VERSION=5.1.48

# env vars
ENV \
    HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/native" \
    OOZIE_HOME=/opt/oozie \
    HADOOP_HOME=/opt/hadoop-${HADOOP_VERSION} \
    HADOOP_COMMON_HOME=/opt/hadoop-${HADOOP_VERSION} \
    HADOOP_HDFS_HOME=/opt/hadoop-${HADOOP_VERSION} \
    HADOOP_MAPRED_HOME=/opt/hadoop-${HADOOP_VERSION} \
    HADOOP_YARN_HOME=/opt/hadoop-${HADOOP_VERSION} \
    HADOOP_CONF_DIR=/opt/hadoop-${HADOOP_VERSION}/etc/hadoop \
    YARN_CONF_DIR=/opt/hadoop-${HADOOP_VERSION}/etc/hadoop \
    YARN_HOME=/opt/hadoop-${HADOOP_VERSION}

# copy oozie build
COPY oozie-${OOZIE_VERSION}-distro.tar.gz /

# install
RUN tar -xzf oozie-${OOZIE_VERSION}-distro.tar.gz && \
    mv oozie-${OOZIE_VERSION} /opt/oozie && \
    mkdir /opt/oozie/libext && \
    wget http://archive.cloudera.com/gplextras/misc/ext-2.2.zip && \
    mv ext-2.2.zip ${OOZIE_HOME}/libext/ && \
    wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-${SQL_CONNECTOR_VERSION}.tar.gz && \
    tar -xzf mysql-connector-java-${SQL_CONNECTOR_VERSION}.tar.gz && \
    cp mysql-connector-java-$SQL_CONNECTOR_VERSION/mysql-connector-java-$SQL_CONNECTOR_VERSION.jar ${OOZIE_HOME}/libext/ && \
    cp ${HADOOP_HOME}/share/hadoop/common/*.jar ${OOZIE_HOME}/libext/ && \
    cp ${HADOOP_HOME}/share/hadoop/common/lib/*.jar ${OOZIE_HOME}/libext/ && \
    cp ${HADOOP_HOME}/share/hadoop/hdfs/*.jar ${OOZIE_HOME}/libext/ && \
    cp ${HADOOP_HOME}/share/hadoop/hdfs/lib/*.jar ${OOZIE_HOME}/libext/ && \
    cp ${HADOOP_HOME}/share/hadoop/mapreduce/*.jar ${OOZIE_HOME}/libext/ && \
    cp ${HADOOP_HOME}/share/hadoop/mapreduce/lib/*.jar ${OOZIE_HOME}/libext/ && \
    cp ${HADOOP_HOME}/share/hadoop/yarn/*.jar ${OOZIE_HOME}/libext/ && \
    cp ${HADOOP_HOME}/share/hadoop/yarn/lib/*.jar ${OOZIE_HOME}/libext/ && \
    echo "PATH=${JAVA_HOME}/bin:${HADOOP_HOME}/bin:${OOZIE_HOME}/bin:${PATH}" >> ~/.bashrc && \
    rm -rf oozie-${OOZIE_VERSION}-distro.tar.gz mysql-connector-java-${SQL_CONNECTOR_VERSION}.tar.gz mysql-connector-java-${SQL_CONNECTOR_VERSION} /var/lib/apt/lists/* /tmp/* /var/tmp/*

# setup the Oozie user
RUN useradd oozie \
    && mkdir -p /var/log/oozie && chown -R oozie /var/log/oozie \
    && mkdir -p /var/lib/oozie/data && chown -R oozie /var/lib/oozie \
    && ln -s /var/log/oozie /opt/oozie/log \
    && ln -s /var/lib/oozie/data /opt/oozie/data

# copy config file
COPY oozie-site.xml ${OOZIE_HOME}/conf/

# ports (Oozie web console and admin port)
EXPOSE 11000 11001

COPY run.sh /run.sh
RUN chmod a+x /run.sh
CMD ["/run.sh"]

You might notice that we have a run.sh script; what it does is start the MapReduce job history server inside the Oozie container, as shown below.

#!/bin/bash
/opt/hadoop-3.1.1/sbin/mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR start historyserver

Now we can use this Dockerfile, oozie-site.xml and run.sh to create the Oozie docker image.

docker build . -t oozie:4.3.1-3.1.1

Big Data Europe Hadoop Docker Setup

To add Oozie to the Hadoop cluster, we need to modify the docker-compose file from the repo. Just add the new service as shown below:

oozie:
  container_name: oozie
  image: oozie:4.3.1-3.1.1
  restart: always
  depends_on:
    - namenode
    - datanode
    - resourcemanager
    - nodemanager1
    - historyserver
  ports:
    - "11000:11000"
    - "11001:11001"
  env_file:
    - ./hadoop.env

The hadoop.env file is used to set up the environment variables for the Hadoop cluster.

HIVE_SITE_CONF_javax_jdo_option_ConnectionURL=jdbc:postgresql://hive-metastore-postgresql/metastore
HIVE_SITE_CONF_javax_jdo_option_ConnectionDriverName=org.postgresql.Driver
HIVE_SITE_CONF_javax_jdo_option_ConnectionUserName=hive
HIVE_SITE_CONF_javax_jdo_option_ConnectionPassword=hive
HIVE_SITE_CONF_datanucleus_autoCreateSchema=false
HIVE_SITE_CONF_hive_metastore_uris=thrift://hive-metastore:9083
CORE_CONF_fs_defaultFS=hdfs://namenode:8020
CORE_CONF_hadoop_http_staticuser_user=root
CORE_CONF_io_compression_codecs=org.apache.hadoop.io.compress.SnappyCodec
CORE_CONF_hadoop_proxyuser_oozie_hosts=*
CORE_CONF_hadoop_proxyuser_oozie_groups=*
CORE_CONF_hadoop_proxyuser_root_hosts=*
CORE_CONF_hadoop_proxyuser_root_groups=*
HDFS_CONF_dfs_webhdfs_enabled=true
HDFS_CONF_dfs_permissions_enabled=false
HDFS_CONF_dfs_namenode_datanode_registration_ip___hostname___check=false
YARN_CONF_yarn_log___aggregation___enable=true
YARN_CONF_yarn_log_server_url=http://historyserver:8188/applicationhistory/logs/
YARN_CONF_yarn_resourcemanager_recovery_enabled=true
YARN_CONF_yarn_resourcemanager_store_class=org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore
YARN_CONF_yarn_resourcemanager_scheduler_class=org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler
YARN_CONF_yarn_scheduler_capacity_root_default_maximum___allocation___mb=8192
YARN_CONF_yarn_scheduler_capacity_root_default_maximum___allocation___vcores=4
YARN_CONF_yarn_resourcemanager_fs_state___store_uri=/rmstate
YARN_CONF_yarn_resourcemanager_system___metrics___publisher_enabled=true
YARN_CONF_yarn_resourcemanager_hostname=resourcemanager
YARN_CONF_yarn_resourcemanager_address=resourcemanager:8032
YARN_CONF_yarn_resourcemanager_scheduler_address=resourcemanager:8030
YARN_CONF_yarn_resourcemanager_resource__tracker_address=resourcemanager:8031
YARN_CONF_yarn_timeline___service_enabled=true
YARN_CONF_yarn_timeline___service_generic___application___history_enabled=true
YARN_CONF_yarn_timeline___service_hostname=historyserver
YARN_CONF_yarn_timeline___service_address=historyserver:10200
YARN_CONF_yarn_timeline___service_webapp_address=historyserver:8188
YARN_CONF_mapreduce_map_output_compress=true
YARN_CONF_mapred_map_output_compress_codec=org.apache.hadoop.io.compress.SnappyCodec
YARN_CONF_yarn_nodemanager_resource_memory___mb=16384
YARN_CONF_yarn_nodemanager_resource_cpu___vcores=8
YARN_CONF_yarn_nodemanager_disk___health___checker_max___disk___utilization___per___disk___percentage=98.5
YARN_CONF_yarn_nodemanager_remote___app___log___dir=/app-logs
YARN_CONF_yarn_nodemanager_aux___services=mapreduce_shuffle
YARN_CONF_yarn_nodemanager_vmem___check___enabled=false
YARN_CONF_yarn_scheduler_fair_preemption=true
YARN_CONF_yarn_scheduler_fair_preemption_cluster___utilization___threshold=1.0
MAPRED_CONF_mapreduce_framework_name=yarn
MAPRED_CONF_mapred_child_java_opts=-Xmx4096m
MAPRED_CONF_mapreduce_map_memory_mb=4096
MAPRED_CONF_mapreduce_reduce_memory_mb=8192
MAPRED_CONF_mapreduce_map_java_opts=-Xmx3072m
MAPRED_CONF_mapreduce_reduce_java_opts=-Xmx6144m
MAPRED_CONF_yarn_app_mapreduce_am_env=HADOOP_MAPRED_HOME=/opt/hadoop-3.1.1/
MAPRED_CONF_mapreduce_map_env=HADOOP_MAPRED_HOME=/opt/hadoop-3.1.1/
MAPRED_CONF_mapreduce_reduce_env=HADOOP_MAPRED_HOME=/opt/hadoop-3.1.1/
MAPRED_CONF_mapreduce_jobhistory_address=localhost:10020
MAPRED_CONF_mapreduce_jobhistory_webapp_address=localhost:19888
MAPRED_CONF_yarn_app_mapreduce_am_staging___dir=/tmp/hadoop-yarn/staging
MAPRED_CONF_mapreduce_jobhistory_done___dir=/tmp/hadoop-yarn/staging/history/done
MAPRED_CONF_mapreduce_jobhistory_intermediate___done___dir=/tmp/hadoop-yarn/staging/history/done_intermediate
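The flattened key names in hadoop.env map onto ordinary Hadoop XML properties. As I understand the Big Data Europe entrypoint convention (treat this as an assumption, and check the repo's entrypoint script), `___` becomes `-`, `__` becomes `_`, and `_` becomes `.`. A minimal Python sketch of that translation:

```python
# Sketch of how a hadoop.env key becomes a Hadoop XML property.
# Assumed convention (from the Big Data Europe entrypoint):
#   '___' -> '-', '__' -> '_', '_' -> '.'
SITE_FILES = {
    "CORE": "core-site.xml",
    "HDFS": "hdfs-site.xml",
    "YARN": "yarn-site.xml",
    "MAPRED": "mapred-site.xml",
    "HIVE_SITE": "hive-site.xml",
}

def env_to_property(env_key):
    """Translate e.g. 'CORE_CONF_fs_defaultFS' -> ('core-site.xml', 'fs.defaultFS')."""
    prefix, flat_name = env_key.split("_CONF_", 1)
    # Use placeholders so the three replacements do not interfere with each other.
    name = flat_name.replace("___", "\x00").replace("__", "\x01").replace("_", ".")
    name = name.replace("\x00", "-").replace("\x01", "_")
    return SITE_FILES[prefix], name

print(env_to_property("YARN_CONF_yarn_log___aggregation___enable"))
# ('yarn-site.xml', 'yarn.log-aggregation-enable')
```

This is handy when debugging: if a setting does not seem to take effect inside a container, translate the key and look for the property in the generated site file.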

Now we can run the following command to start the Hadoop cluster:

docker-compose up
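The services take a while to come up. As a small convenience, here is a hedged Python sketch (not part of the original setup) that blocks until a mapped port starts accepting TCP connections; the `localhost:11000` example is an assumption based on the Oozie port mapping in the compose file above.

```python
import socket
import time

def wait_for_port(host, port, timeout=60.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection completes the TCP handshake, so success means
            # the service (or at least its listener) is up
            with socket.create_connection((host, port), timeout=2):
                return True
        except OSError:
            time.sleep(1)
    return False

# e.g. block until the Oozie web console answers (assumed port mapping):
# wait_for_port("localhost", 11000)
```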

