Nima Afzal

Build a Data Lake with MinIO

Published Nov 05, 2021

Data Lake


Introduction

A simple way to picture the difference between a Data Lake and a Data Warehouse is to imagine both as pools. A Data Lake is a pool that contains everything, not just water: data of any shape, structured or unstructured, such as a bunch of CSV files or a set of raw image files. A Data Warehouse, on the other hand, is a pool that holds only liquid substances, that is, data that is structured and already processed.

This guide walks through building a Data Lake that uses object storage as its storage layer. MinIO is used as the object storage; other options include Amazon S3 and Ceph.

Object Storage

Object Storage is an architecture that stores data as objects rather than blocks.

The term block is mostly used in Block Storage systems.

In the Object Storage architecture, each object consists of an identifier and its data. In addition, an object carries an expandable set of metadata, a powerful and customizable feature that is not available in Block Storage.

Object Storage systems also offer strong features in terms of consistency and scalability. These features, plus the capability of storing unstructured data, make the Object Storage architecture a suitable candidate for implementing Data Lakes.

MinIO


MinIO is an open-source, distributed Object Storage server that is API-compatible with the Amazon S3 cloud storage service.

MinIO server

This section explains how to run the MinIO server with Docker. Note that for this guide, the MinIO server is configured to run in standalone mode. The steps are the following:

  1. Create a directory on your machine to be used as the storage layer for MinIO, for example /mnt/storage

  2. Execute the following command to run a MinIO server inside a Docker container

    docker run \
      -d \
      --restart always \
      -p 9000:9000 \
      -p 9001:9001 \
      --name minio-server \
      -v /mnt/storage:/data \
      -e "MINIO_ROOT_USER=admin" \
      -e "MINIO_ROOT_PASSWORD=P@ssw0rd" \
      quay.io/minio/minio server /data --console-address ":9001"
    

    Here, the environment variables MINIO_ROOT_USER and MINIO_ROOT_PASSWORD are the credentials for any client that needs to communicate with the MinIO server. By default, the MinIO server API runs on port 9000, and the console (enabled by --console-address above) runs on port 9001.
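
    To confirm the server came up correctly, a quick check like the sketch below can be used. It assumes the commands are run on the Docker host itself and relies on MinIO's standard liveness endpoint.

     # Tail the container logs to make sure MinIO started without errors
     docker logs minio-server

     # MinIO answers its liveness probe with HTTP 200 once the server is up
     curl -sf http://localhost:9000/minio/health/live && echo "MinIO is live"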

MinIO Client

There are many ways to communicate with the MinIO server; two of them are using the MinIO client software, called mc, and connecting to the MinIO console by browsing to http://${YOUR_MINIO_IP_OR_DOMAIN_ADDRESS}:9001.

  1. Creating a directory for MinIO client

     mkdir -p ~/apps/minio/client
     cd ~/apps/minio/client
    
  2. Downloading the MinIO client

     wget https://dl.min.io/client/mc/release/linux-amd64/mc
    
  3. Assigning the execution permission to the MinIO Client executable

     chmod +x ./mc
    
  4. Updating the ~/.bashrc and $PATH variable

     echo -e "\nexport PATH=\$PATH:\$HOME/apps/minio/client\n"  >> ~/.bashrc
     source ~/.bashrc
    
  5. Adding the auto-completion support

     mc --autocompletion
    
  6. Setting an alias for the MinIO server

    The mc alias command eases the management of the S3-compatible hosts that the client can connect to. Consider an alias as a short name, or key, for obtaining the information required to connect to an S3-compatible host.

     mc alias set minio-server http://{YOUR_MINIO_IP_OR_DOMAIN_ADDRESS}:9000 admin P@ssw0rd
    

    You can also get the list of already configured aliases by executing the command below:

     mc alias list
    
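
    As a quick sanity check, the new alias can be exercised right away. The commands below, a minimal sketch assuming the admin credentials configured earlier, should print the server status and an (initially empty) bucket list.

     mc admin info minio-server
     mc ls minio-server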

A Bucket For Data Lake

In the world of object storage, a bucket is a container for objects. The Data Lake needs a bucket to store its data in, which can be created by executing the command below.

```shell
mc mb minio-server/datalake 
```

Here, datalake is the bucket name, and minio-server is the alias defined in the previous step.
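
To double-check the bucket and to see the object model (identifier, data, and custom metadata) in action, a small optional example like the following can be tried; the file name and the metadata key are arbitrary choices for illustration.

```shell
# Confirm the bucket exists
mc ls minio-server

# Upload a throwaway file and attach a piece of user-defined metadata to it
echo "hello data lake" > ./sample.txt
mc cp --attr "project=datalake-guide" ./sample.txt minio-server/datalake/sample.txt

# Inspect the object: key, size, ETag, and the custom metadata
mc stat minio-server/datalake/sample.txt
```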

Hadoop


This section describes the minimum steps required to configure a single-node Hadoop installation and, finally, shows how to switch the default file system from HDFS to S3A.

For this guide, Hadoop version 3.3.1 is used, and it is presumed that Java version 8 or 11 is already installed on the system that will run Hadoop.

Configure

  1. Downloading the Hadoop package

     mkdir -p ~/downloads
     cd ~/downloads
     wget https://downloads.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
    
  2. Extracting the package

     cd ~/apps
     tar xzvf ~/downloads/hadoop-3.3.1.tar.gz  
     ln -s hadoop-3.3.1/ hadoop
    
  3. Creating the required directories for HDFS (namenode and datanode) and Hadoop

     mkdir -p ~/apps/hadoop/storage/datanode
     mkdir -p ~/apps/hadoop/storage/namenode
     mkdir -p ~/apps/hadoop/temp/hdata
    
  4. Adding the required environment variables

     echo -e "\nexport HADOOP_HOME=\${HOME}/apps/hadoop\n\
     export HADOOP_CONF_DIR=\${HADOOP_HOME}/etc/hadoop\n\
     export HDFS_NAMENODE_USER=\"${YOUR_HDFS_OR_HADOOP_USER}\"\n\
     export HDFS_DATANODE_USER=\"${YOUR_HDFS_OR_HADOOP_USER}\"\n\
     export HDFS_SECONDARYNAMENODE_USER=\"${YOUR_HDFS_OR_HADOOP_USER}\"\n\
     export PATH=\${HADOOP_HOME}/sbin:\${HADOOP_HOME}/bin:\$PATH" >> ~/.bashrc
     source ~/.bashrc
    

    The variable ${YOUR_HDFS_OR_HADOOP_USER} should be replaced with the name of the user that has been created for running and managing HDFS.
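
    As an optional sanity check for this step, the variables and binaries should now resolve as expected:

     echo "${HADOOP_HOME}"     # should point at the apps/hadoop symlink created above
     which hadoop hdfs         # both should resolve to ${HADOOP_HOME}/bin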

  5. Defining the required environment variables in hadoop-env.sh

     export JAVA_HOME="${YOUR_JAVA_HOME}"
     export HADOOP_HOME=${HOME}/apps/hadoop
     export HADOOP_OPTIONAL_TOOLS="hadoop-aws"
    

    Define the variables above inside the file hadoop-env.sh, which is located at ${HADOOP_CONF_DIR}.

    The variable HADOOP_OPTIONAL_TOOLS is necessary to make Hadoop work with S3.
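
    With hadoop-env.sh in place, a minimal smoke test is to ask Hadoop for its version; if JAVA_HOME was set correctly above, the command below should print 3.3.1 without complaining about JAVA_HOME.

     hadoop version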

  6. Adding the required jar libraries to the Hadoop classpath

     cp ${HADOOP_HOME}/share/hadoop/tools/lib/{aws-java-sdk-bundle-1.11.901.jar,hadoop-aws-3.3.1.jar,wildfly-openssl-1.0.7.Final.jar} ${HADOOP_HOME}/share/hadoop/common/lib/
    

    These jar libraries are crucial for enabling Hadoop to communicate with an S3-compatible object storage as the underlying file system. The versions of the jar libraries depend on the Hadoop version in use; since Hadoop 3.3.1 is configured here, the matching versions above are selected.
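
    If a different Hadoop release is used, the exact file names bundled with it can be discovered with a quick listing such as the one below before copying them.

     ls ${HADOOP_HOME}/share/hadoop/tools/lib/ | grep -E 'hadoop-aws|aws-java-sdk-bundle|wildfly-openssl'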

  7. Editing the core-site.xml

     <property>
         <name>fs.defaultFS</name>
         <value>hdfs://{HDFS_IP_OR_DOMAIN_ADDRESS}:{HDFS_PORT}</value>
     </property>
     <property>
         <name>hadoop.tmp.dir</name>
         <value>${TEMP_DIR}</value>
     </property>
     <property>
         <name>fs.s3a.endpoint</name>
         <description>AWS S3 endpoint to connect to.</description>
         <value>http://${MINIO_SERVER_IP_OR_DOMAIN}:${MINIO_SERVER_PORT}</value>
     </property>
     <property>
         <name>fs.s3a.access.key</name>
         <description>AWS access key ID.</description>
         <value>${MINIO_ACCESS_KEY}</value>
     </property>
     <property>
         <name>fs.s3a.secret.key</name>
         <description>AWS secret key.</description>
         <value>${MINIO_SECRET_KEY}</value>
     </property>
     <property>
         <name>fs.s3a.path.style.access</name>
         <value>true</value>
         <description>Enable S3 path style access.</description>
     </property>
     <property>
         <name>fs.s3a.connection.ssl.enabled</name>
         <value>false</value>
         <description>Enables or disables SSL connections to S3.</description>
     </property>
    

    Add the properties above inside the <configuration> tag of the core-site.xml file. The variables MINIO_SERVER_IP_OR_DOMAIN, MINIO_SERVER_PORT, MINIO_ACCESS_KEY, MINIO_SECRET_KEY, and TEMP_DIR should be replaced respectively with the values obtained when configuring the MinIO server and the required directories for Hadoop.
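
    As an optional check that the properties were picked up, hdfs getconf can echo individual keys back; at this stage fs.defaultFS should still report the hdfs:// address.

     hdfs getconf -confKey fs.defaultFS
     hdfs getconf -confKey fs.s3a.endpoint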

  8. Configuring HDFS

     <property>
         <name>dfs.replication</name>
         <value>1</value>
     </property>
     <property>
         <name>dfs.namenode.name.dir</name>
         <value>${NAMENODE_DIR}</value>
     </property>
     <property>
         <name>dfs.datanode.data.dir</name>
         <value>${DATANODE_DIR}</value>
     </property>
     <property>
         <name>dfs.permissions</name>
         <value>false</value>
     </property>
    

    Put the configurations above inside the <configuration> section of the hdfs-site.xml file, which is located at ${HADOOP_CONF_DIR}. The values ${NAMENODE_DIR} and ${DATANODE_DIR} should be replaced with the directories that were created for HDFS in step 3.

    Finally, format the namenode

     hdfs namenode -format
    

Run

At this point, Hadoop is configured and ready to work with both HDFS and S3-compatible object storage services as the underlying file system.

Be aware that the default file system for Hadoop is still set to HDFS.

To start and stop HDFS, run the commands start-dfs.sh and stop-dfs.sh respectively. The HDFS web UI runs on port 9870, where the files and directories on the HDFS layer can be browsed.
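
A short smoke test of the HDFS side might look like the sketch below; jps ships with the JDK and simply lists the running Java daemons.

```shell
start-dfs.sh                      # starts the NameNode, DataNode, and SecondaryNameNode
jps                               # the three HDFS daemons should appear in the list
hdfs dfs -mkdir -p /smoke-test    # create and list a directory on the HDFS layer
hdfs dfs -ls /
```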

What about the object storage layer? To access objects in the object storage, a valid S3A URI is required. Let’s examine how Hadoop can communicate with the MinIO server to put and get a sample object in the bucket created earlier; a cross-check from the MinIO side follows the list.

  • List the objects inside the bucket

      hadoop fs -ls s3a://datalake/
    
  • Put an object inside the bucket

      echo "Hello S3 World!" >> ./message.txt
      hadoop fs -put ./message.txt s3a://datalake/message.txt
    
  • Print the content of the object

      hadoop fs -cat s3a://datalake/message.txt
    
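
The same object can also be inspected from the MinIO side with mc, which confirms that Hadoop really wrote it into the datalake bucket:

```shell
mc ls minio-server/datalake
mc cat minio-server/datalake/message.txt
```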

Default File System

The Hadoop installation is configured in a way that supports both HDFS and S3-compatible storage as its underlying storage layer. To change the default file system from HDFS to S3, modify the fs.defaultFS property inside core-site.xml so that it holds an S3A URI pointing to the bucket that will serve as the root container for the objects.

Therefore, in the file core-site.xml, change

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://{HDFS_IP_OR_DOMAIN_ADDRESS}:{HDFS_PORT}</value>
</property>

to this

<property>
    <name>fs.defaultFS</name>
    <value>s3a://datalake/</value>
</property>

Now, executing the command below displays the contents of the datalake bucket.

hadoop fs -ls /
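
Once the default file system points at the bucket, plain paths resolve against it, so a sketch like the one below (the /raw prefix is just an arbitrary example) reads and writes objects without spelling out the s3a:// scheme.

```shell
hadoop fs -mkdir -p /raw                         # becomes s3a://datalake/raw/
hadoop fs -put ./message.txt /raw/message.txt    # uploads to s3a://datalake/raw/message.txt
hadoop fs -cat /raw/message.txt
```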