Published Nov 05, 2021
When comparing a Data Lake with a Data Warehouse, a simple analogy is to picture both of them as pools, with one key difference. Consider a Data Lake as a pool that contains everything, not just water. Here, "everything" means data of any shape, structured or unstructured, such as a bunch of CSV files or multiple raw image files. On the other hand, think of a Data Warehouse as a pool that contains only liquid. Here, the liquid, in terms of data, is structured or processed data.
For this guide, constructing a Data Lake by utilizing an object storage as a storage layer is examined, and for the object storage, MinIO is used. In addition to MinIO, there are other options like Amazon S3 and Ceph.
Object Storage is an architecture that stores data as objects rather than blocks. The term block is mostly used in Block Storage systems. In the Object Storage architecture, each object consists of an identifier and data. In addition, an object carries an expandable amount of metadata, a powerful and customizable feature that is not available in Block Storage.
Let’s have a brief look at some features of Object Storage systems in terms of consistency and scalability:
Consistency
Unlike Block Storage systems such as transactional databases, which provide strong consistency, Object Storage systems are eventually consistent.
Scalability
Object Storage provides high scalability by decoupling file management from the low-level block management, which makes it capable of storing data that can grow beyond petabytes.
All the mentioned features, plus the capability of storing unstructured data, make the Object Storage architecture a suitable candidate for implementing Data Lakes.
MinIO is an open-source distributed Object Storage server that is also API compatible with the Amazon S3 cloud storage service.
In this section, running the MinIO server with Docker is explained. Note that for this guide, the MinIO server is configured to run in standalone mode. In the following, the steps for this are described:
Create a directory on your machine to be used as the storage layer for MinIO, e.g. /mnt/storage
Execute the following command to create a MinIO server running inside a Docker container
docker run \
-d \
--restart always \
-p 9000:9000 \
-p 9001:9001 \
--name minio-server \
-v /mnt/storage:/data \
-e "MINIO_ROOT_USER=admin" \
-e "MINIO_ROOT_PASSWORD=P@ssw0rd" \
quay.io/minio/minio server /data --console-address ":9001"
Here, the environment variables MINIO_ROOT_USER and MINIO_ROOT_PASSWORD are used as credentials for any client that needs to communicate with the MinIO server. By default, the MinIO server listens on port 9000, and the web console (set by --console-address) on port 9001.
For communicating with the MinIO server, utilizing the MinIO client software called mc, or connecting to the MinIO console by browsing to http://${YOUR_MINIO_IP_OR_DOMAIN_ADDRESS}:9001, are two of the many options that exist.
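As a quick sanity check before setting up any client, the server's liveness can be probed over HTTP via MinIO's healthcheck endpoint. A minimal sketch, assuming the server runs locally on the default API port:

```shell
# Probe MinIO's liveness endpoint; prints the HTTP status code
# (200 when the server is up). Adjust MINIO_HOST for your setup.
MINIO_HOST="localhost"
HEALTH_URL="http://${MINIO_HOST}:9000/minio/health/live"
curl -s -o /dev/null -w "%{http_code}\n" "$HEALTH_URL"
```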
Creating a directory for MinIO client
mkdir -p ~/apps/minio/client
cd ~/apps/minio/client
Downloading the MinIO client
wget https://dl.min.io/client/mc/release/linux-amd64/mc
Assigning the execution permission to the MinIO Client executable
chmod +x ./mc
Updating the ~/.bashrc and the $PATH variable
echo -e "\nexport PATH=\$PATH:\$HOME/apps/minio/client\n" >> ~/.bashrc
source ~/.bashrc
Adding the auto-completion support
mc --autocompletion
Setting an alias for the MinIO server
The mc alias command is used to ease the management of the S3-compatible hosts that the client can connect to. Consider an alias as a short name, or key, for obtaining the information required for connecting to an S3-compatible host.
mc alias set minio-server http://{YOUR_MINIO_IP_OR_DOMAIN_ADDRESS}:9000 admin P@ssw0rd
You can also get the list of already configured aliases by executing the command below:
mc alias list
In the world of object storage, a bucket is a container for objects. The Data Lake needs a bucket to store data in, which can be created by executing the command below.
```shell
mc mb minio-server/datalake
```
Here, datalake is the bucket name, and minio-server is the alias that was defined in the previous step.
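With the alias and bucket in place, a small round trip verifies the setup end to end. A sketch using mc (the file name and metadata key below are made up for illustration; mc cp's --attr flag attaches custom metadata, the Object Storage feature mentioned earlier):

```shell
# Upload a sample object with custom metadata, inspect it, then list the bucket.
echo "hello datalake" > sample.txt
mc cp --attr "source=tutorial" sample.txt minio-server/datalake/sample.txt
mc stat minio-server/datalake/sample.txt   # shows size, etag, and metadata
mc ls minio-server/datalake
```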
In this section, the minimum required steps to configure a single-node Hadoop installation are described, and finally, switching the default file system from HDFS to S3A is demonstrated.
For this guide, Hadoop version 3.3.1 is used, and it is presumed that Java version 8 or 11 has already been installed on the system considered for running Hadoop.
Downloading the Hadoop package
mkdir -p ~/downloads
cd ~/downloads
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
Extracting the package
cd ~/apps
tar xzvf ~/downloads/hadoop-3.3.1.tar.gz
ln -s hadoop-3.3.1/ hadoop
Creating the required directories for HDFS (namenode and datanode) and Hadoop
mkdir -p ~/apps/hadoop/storage/datanode
mkdir -p ~/apps/hadoop/storage/namenode
mkdir -p ~/apps/hadoop/temp/hdata
Adding the required environment variables
echo -e "\nexport HADOOP_HOME=\${HOME}/apps/hadoop\n\
export HADOOP_CONF_DIR=\${HADOOP_HOME}/etc/hadoop\n\
export HDFS_NAMENODE_USER=\"${YOUR_HDFS_OR_HADOOP_USER}\"\n\
export HDFS_DATANODE_USER=\"${YOUR_HDFS_OR_HADOOP_USER}\"\n\
export HDFS_SECONDARYNAMENODE_USER=\"${YOUR_HDFS_OR_HADOOP_USER}\"\n\
export PATH=\${HADOOP_HOME}/sbin:\${HADOOP_HOME}/bin:\$PATH" >> ~/.bashrc
source ~/.bashrc
The variable ${YOUR_HDFS_OR_HADOOP_USER} should be replaced with the name of the user that has been created for running and managing HDFS.
Adding the required environment variables to hadoop-env.sh
export JAVA_HOME="${YOUR_JAVA_HOME}"
export HADOOP_HOME=${HOME}/apps/hadoop
export HADOOP_OPTIONAL_TOOLS="hadoop-aws"
Define the variables above inside the file hadoop-env.sh, which is located at ${HADOOP_CONF_DIR}. The variable HADOOP_OPTIONAL_TOOLS is necessary to make Hadoop work with S3.
Adding the required jar libraries to the Hadoop classpath
cp ${HADOOP_HOME}/share/hadoop/tools/lib/{aws-java-sdk-bundle-1.11.901.jar,hadoop-aws-3.3.1.jar,wildfly-openssl-1.0.7.Final.jar} ${HADOOP_HOME}/share/hadoop/common/lib/
These jar libraries are crucial for enabling Hadoop to communicate with S3-compatible object storage as the underlying file system. The jar library versions depend on the Hadoop version in use; here, as Hadoop 3.3.1 is configured, the matching versions of the libraries are selected.
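A quick way to confirm the jars are visible to Hadoop is to inspect its classpath. A sketch, assuming the PATH additions from the previous steps:

```shell
# List the classpath entries one per line and filter for the S3A-related
# jars; both hadoop-aws and aws-java-sdk-bundle should appear.
hadoop classpath | tr ':' '\n' | grep -E 'hadoop-aws|aws-java-sdk'
```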
Editing the core-site.xml
<property>
<name>fs.defaultFS</name>
<value>hdfs://{HDFS_IP_OR_DOMAIN_ADDRESS}:{HDFS_PORT}</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>${TEMP_DIR}</value>
</property>
<property>
<name>fs.s3a.endpoint</name>
<description>AWS S3 endpoint to connect to.</description>
<value>http://${MINIO_SERVER_IP_OR_DOMAIN}:${MINIO_SERVER_PORT}</value>
</property>
<property>
<name>fs.s3a.access.key</name>
<description>AWS access key ID.</description>
<value>${MINIO_ACCESS_KEY}</value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<description>AWS secret key.</description>
<value>${MINIO_SECRET_KEY}</value>
</property>
<property>
<name>fs.s3a.path.style.access</name>
<value>true</value>
<description>Enable S3 path style access.</description>
</property>
<property>
<name>fs.s3a.connection.ssl.enabled</name>
<value>false</value>
<description>Enables or disables SSL connections to S3.</description>
</property>
Add the configurations above between the configuration tags inside the core-site.xml file. The variables MINIO_SERVER_IP_OR_DOMAIN, MINIO_SERVER_PORT, MINIO_ACCESS_KEY, MINIO_SECRET_KEY, and TEMP_DIR should be replaced respectively with the values obtained after configuring the MinIO server and the required directories for Hadoop.
Configuring HDFS
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>${NAMENODE_DIR}</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>${DATANODE_DIR}</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
Put the configurations above inside the configuration section of the hdfs-site.xml file that is located at ${HADOOP_CONF_DIR}. Values for ${NAMENODE_DIR} and ${DATANODE_DIR} should be replaced with the directories that were created for HDFS earlier.
Finally, format the namenode
hdfs namenode -format
At this point, Hadoop is configured and ready to work with both HDFS and S3-compatible object storage services as the underlying file system.
Be aware that the default file system for Hadoop is set to HDFS.
To start and stop HDFS, just run the commands start-dfs.sh and stop-dfs.sh respectively. The HDFS web UI runs on port 9870, where you can browse the files and directories that exist on the HDFS layer.
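After start-dfs.sh completes, it is worth confirming that the daemons are actually up. A sketch using the standard JDK and Hadoop tools:

```shell
# jps lists the running Java processes; NameNode, DataNode, and
# SecondaryNameNode should appear among them. dfsadmin prints
# the cluster's capacity and the number of live datanodes.
jps
hdfs dfsadmin -report
```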
What about the object storage layer? Well, to access the objects of the object storage, a valid S3A URI is required. Let's examine how to communicate with the MinIO server through Hadoop to put and get a sample object in the bucket created earlier.
List the objects inside the bucket
hadoop fs -ls s3a://datalake/
Put an object inside the bucket
echo "Hello S3 World!" >> ./message.txt
hadoop fs -put ./message.txt s3a://datalake/message.txt
List the object content
hadoop fs -cat s3a://datalake/message.txt
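The S3A URIs used above follow a simple scheme: s3a://&lt;bucket&gt;/&lt;key&gt;. A sketch of composing one from its parts, with the names taken from this guide's examples:

```shell
# Compose an S3A URI from a bucket name and an object key.
BUCKET="datalake"
KEY="message.txt"
URI="s3a://${BUCKET}/${KEY}"
echo "$URI"   # s3a://datalake/message.txt
```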
The Hadoop installation is configured in a way that supports both HDFS and S3-compatible storage as its underlying storage layer. Changing the default file system from HDFS to S3 requires modifying the fs.defaultFS attribute inside core-site.xml to an S3-compatible URI that points to the bucket that is going to be used as the root container of the objects.
Therefore, in the file core-site.xml, change
<property>
<name>fs.defaultFS</name>
<value>hdfs://{HDFS_IP_OR_DOMAIN_ADDRESS}:{HDFS_PORT}</value>
</property>
to this
<property>
<name>fs.defaultFS</name>
<value>s3a://datalake/</value>
</property>
Now, by executing the command below, the contents of the datalake bucket are displayed.
hadoop fs -ls /