Published Nov 05, 2021
When comparing a Data Lake with a Data Warehouse, a simple analogy is to picture both of them as pools, with one key difference. Consider a Data Lake as a pool that contains everything, not just water. Here, "everything" means data of any shape, structured or unstructured, such as a bunch of CSV files or multiple raw image files. On the other hand, think of a Data Warehouse as a pool that contains only liquid. Here, the liquid, in terms of data, is structured or processed data.
For this guide, constructing a Data Lake by utilizing an object storage as a storage layer is examined, and for the object storage, MinIO is used. In addition to MinIO, there are other options like Amazon S3 and Ceph.
Object Storage is an architecture that stores data as objects rather than blocks. The term block is mostly used in Block Storage systems. In the Object Storage architecture, each object consists of an identifier and data. In addition, an object carries an expandable amount of metadata, a powerful and customizable feature that is not available in Block Storage.
Let’s have a brief look at some features of Object Storage systems in terms of consistency and scalability:
Consistency
Unlike Block Storage systems such as transactional databases, which provide strong consistency, Object Storage systems are eventually consistent.
Scalability
Object Storage provides high scalability by decoupling file management from the low-level block management, which makes it capable of storing data that can grow beyond petabytes.
All the mentioned features, plus the capability of storing unstructured data, make the Object Storage architecture a suitable candidate for implementing Data Lakes.
MinIO is an open-source distributed Object Storage server that is also API compatible with the Amazon S3 cloud storage service.
In this section, running the MinIO server with Docker is explained. Note that for this guide, the MinIO server is configured to run in standalone mode. In the following, the steps for this are described:
Create a directory on your machine to be used as the storage layer for MinIO, e.g. /mnt/storage
Execute the following command to create a MinIO server running inside a Docker container
docker run \
-d \
--restart always \
-p 9000:9000 \
-p 9001:9001 \
--name minio-server \
-v /mnt/storage:/data \
-e "MINIO_ROOT_USER=admin" \
-e "MINIO_ROOT_PASSWORD=P@ssw0rd" \
quay.io/minio/minio server /data --console-address ":9001"
Here, the environment variables MINIO_ROOT_USER and MINIO_ROOT_PASSWORD are used as credentials for any client that needs to communicate with the MinIO server. By default, the MinIO server listens on port 9000, and the web console (set by --console-address) on port 9001.
For communicating with the MinIO server, utilizing the MinIO client software called mc, or connecting to the MinIO console by browsing to http://${YOUR_MINIO_IP_OR_DOMAIN_ADDRESS}:9001, are two of the many options that exist.
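As a quick sanity check before setting up any client, the server's liveness can be probed over HTTP via MinIO's healthcheck endpoint. A minimal sketch, assuming the server runs locally on the default API port:

```shell
# Probe MinIO's liveness endpoint; prints the HTTP status code
# (200 when the server is up). Adjust MINIO_HOST for your setup.
MINIO_HOST="localhost"
HEALTH_URL="http://${MINIO_HOST}:9000/minio/health/live"
curl -s -o /dev/null -w "%{http_code}\n" "$HEALTH_URL"
```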
Creating a directory for MinIO client
mkdir -p ~/apps/minio/client
cd ~/apps/minio/client
Downloading the MinIO client
wget https://dl.min.io/client/mc/release/linux-amd64/mc
Assigning the execution permission to the MinIO Client executable
chmod +x ./mc
Updating the ~/.bashrc and the $PATH variable
echo -e "\nexport PATH=\$PATH:\$HOME/apps/minio/client\n" >> ~/.bashrc
source ~/.bashrc
Adding the auto-completion support
mc --autocompletion
Setting an alias for the MinIO server
The mc alias command is used to ease the management of the S3-compatible hosts that the client can connect to. Consider an alias as a short name, or key, for obtaining the information required for connecting to an S3-compatible host.
mc alias set minio-server http://{YOUR_MINIO_IP_OR_DOMAIN_ADDRESS}:9000 admin P@ssw0rd
You can also get the list of already configured aliases by executing the command below:
mc alias list
In the world of object storage, a bucket is a container for objects. The Data Lake needs a bucket to store data in, which can be created by executing the command below.
```shell
mc mb minio-server/datalake
```
Here, datalake is the bucket name, and minio-server is the alias that was defined in the previous step.
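With the alias and bucket in place, a small round trip verifies the setup end to end. A sketch using mc (the file name and metadata key below are made up for illustration; mc cp's --attr flag attaches custom metadata, the Object Storage feature mentioned earlier):

```shell
# Upload a sample object with custom metadata, inspect it, then list the bucket.
echo "hello datalake" > sample.txt
mc cp --attr "source=tutorial" sample.txt minio-server/datalake/sample.txt
mc stat minio-server/datalake/sample.txt   # shows size, etag, and metadata
mc ls minio-server/datalake
```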
In this section, the minimum required steps to configure a single-node Hadoop installation are described, and finally, switching the default file system from HDFS to S3A is demonstrated.
For this guide, Hadoop version 3.3.1 is used, and it is presumed that Java version 8 or 11 has already been installed on the system considered for running Hadoop.
Downloading the Hadoop package
mkdir -p ~/downloads
cd ~/downloads
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.1/hadoop-3.3.1.tar.gz
Extracting the package
cd ~/apps
tar xzvf ~/downloads/hadoop-3.3.1.tar.gz
ln -s hadoop-3.3.1/ hadoop
Creating the required directories for HDFS (namenode and datanode) and Hadoop
mkdir -p ~/apps/hadoop/storage/datanode
mkdir -p ~/apps/hadoop/storage/namenode
mkdir -p ~/apps/hadoop/temp/hdata
Adding the required environment variables
echo -e "\nexport HADOOP_HOME=\${HOME}/apps/hadoop\n\
export HADOOP_CONF_DIR=\${HADOOP_HOME}/etc/hadoop\n\
export HDFS_NAMENODE_USER=\"${YOUR_HDFS_OR_HADOOP_USER}\"\n\
export HDFS_DATANODE_USER=\"${YOUR_HDFS_OR_HADOOP_USER}\"\n\
export HDFS_SECONDARYNAMENODE_USER=\"${YOUR_HDFS_OR_HADOOP_USER}\"\n\
export PATH=\${HADOOP_HOME}/sbin:\${HADOOP_HOME}/bin:\$PATH" >> ~/.bashrc
source ~/.bashrc
The variable ${YOUR_HDFS_OR_HADOOP_USER} should be replaced with the name of the user that has been created for running and managing HDFS.
Adding the required environment variables to hadoop-env.sh
export JAVA_HOME="${YOUR_JAVA_HOME}"
export HADOOP_HOME=${HOME}/apps/hadoop
export HADOOP_OPTIONAL_TOOLS="hadoop-aws"
Define the variables above inside the file hadoop-env.sh, which is located at ${HADOOP_CONF_DIR}. The variable HADOOP_OPTIONAL_TOOLS is necessary to make Hadoop work with S3.
Adding the required jar libraries to the Hadoop classpath
cp ${HADOOP_HOME}/share/hadoop/tools/lib/{aws-java-sdk-bundle-1.11.901.jar,hadoop-aws-3.3.1.jar,wildfly-openssl-1.0.7.Final.jar} ${HADOOP_HOME}/share/hadoop/common/lib/
These jar libraries are crucial for enabling Hadoop to communicate with S3-compatible object storage as the underlying file system. The jar library versions depend on the Hadoop version in use; here, as Hadoop 3.3.1 is configured, the matching versions of the libraries are selected.
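A quick way to confirm the jars are visible to Hadoop is to inspect its classpath. A sketch, assuming the PATH additions from the previous steps:

```shell
# List the classpath entries one per line and filter for the S3A-related
# jars; both hadoop-aws and aws-java-sdk-bundle should appear.
hadoop classpath | tr ':' '\n' | grep -E 'hadoop-aws|aws-java-sdk'
```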
Editing the core-site.xml
<property>
<name>fs.defaultFS</name>
<value>hdfs://{HDFS_IP_OR_DOMAIN_ADDRESS}:{HDFS_PORT}</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>${TEMP_DIR}</value>
</property>
<property>
<name>fs.s3a.endpoint</name>
<description>AWS S3 endpoint to connect to.</description>
<value>http://${MINIO_SERVER_IP_OR_DOMAIN}:${MINIO_SERVER_PORT}</value>
</property>
<property>
<name>fs.s3a.access.key</name>
<description>AWS access key ID.</description>
<value>${MINIO_ACCESS_KEY}</value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<description>AWS secret key.</description>
<value>${MINIO_SECRET_KEY}</value>
</property>
<property>
<name>fs.s3a.path.style.access</name>
<value>true</value>
<description>Enable S3 path style access.</description>
</property>
<property>
<name>fs.s3a.connection.ssl.enabled</name>
<value>false</value>
<description>Enables or disables SSL connections to S3.</description>
</property>
Add the configurations above between the configuration tags inside the core-site.xml file. The variables MINIO_SERVER_IP_OR_DOMAIN, MINIO_SERVER_PORT, MINIO_ACCESS_KEY, MINIO_SECRET_KEY, and TEMP_DIR should be replaced respectively with the values obtained after configuring the MinIO server and the required directories for Hadoop.
Configuring HDFS
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>${NAMENODE_DIR}</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>${DATANODE_DIR}</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
Put the configurations above inside the configuration section of the hdfs-site.xml file that is located at ${HADOOP_CONF_DIR}. Values for ${NAMENODE_DIR} and ${DATANODE_DIR} should be replaced with the directories that were created for HDFS earlier.
Finally, format the namenode
hdfs namenode -format
At this point, Hadoop is configured and ready to work with both HDFS and S3-compatible object storage services as the underlying file system.
Be aware that the default file system for Hadoop is set to HDFS.
To start and stop HDFS, just run the commands start-dfs.sh and stop-dfs.sh respectively. The HDFS web UI runs on port 9870, where you can browse the files and directories that exist on the HDFS layer.
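After start-dfs.sh completes, it is worth confirming that the daemons are actually up. A sketch using the standard JDK and Hadoop tools:

```shell
# jps lists the running Java processes; NameNode, DataNode, and
# SecondaryNameNode should appear among them. dfsadmin prints
# the cluster's capacity and the number of live datanodes.
jps
hdfs dfsadmin -report
```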
What about the object storage layer? Well, to access the objects of the object storage, a valid S3A URI is required. Let's examine how to communicate with the MinIO server through Hadoop to put and get a sample object in the bucket created earlier.
List the objects inside the bucket
hadoop fs -ls s3a://datalake/
Put an object inside the bucket
echo "Hello S3 World!" >> ./message.txt
hadoop fs -put ./message.txt s3a://datalake/message.txt
List the object content
hadoop fs -cat s3a://datalake/message.txt
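The S3A URIs used above follow a simple scheme: s3a://&lt;bucket&gt;/&lt;key&gt;. A sketch of composing one from its parts, with the names taken from this guide's examples:

```shell
# Compose an S3A URI from a bucket name and an object key.
BUCKET="datalake"
KEY="message.txt"
URI="s3a://${BUCKET}/${KEY}"
echo "$URI"   # s3a://datalake/message.txt
```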
The Hadoop installation is configured in a way that supports both HDFS and S3-compatible storage as its underlying storage layer. Changing the default file system from HDFS to S3 requires modifying the fs.defaultFS attribute inside core-site.xml to an S3-compatible URI that points to the bucket that is going to be used as the root container of the objects.
Therefore, in the file core-site.xml, change
<property>
<name>fs.defaultFS</name>
<value>hdfs://{HDFS_IP_OR_DOMAIN_ADDRESS}:{HDFS_PORT}</value>
</property>
to this
<property>
<name>fs.defaultFS</name>
<value>s3a://datalake/</value>
</property>
Now, by executing the command below, the contents of the datalake bucket are displayed.
hadoop fs -ls /