
Commit

fix formatting and HBase spelling
Felix Hennig committed Sep 12, 2024
1 parent 6cddc91 commit 7baba0d
Showing 1 changed file with 21 additions and 21 deletions: docs/modules/demos/pages/hbase-hdfs-load-cycling-data.adoc
Install this demo on an existing Kubernetes cluster:

[source,console]
----
$ stackablectl demo install hbase-hdfs-load-cycling-data
----

WARNING: This demo should not be run alongside other demos.

[#system-requirements]
== System requirements
This demo will

* Install the required Stackable operators.
* Spin up the following data products:
** *HBase:* An open source distributed, scalable, big data store. This demo uses it to store the
{kaggle}[cyclist dataset] and enable access.
** *HDFS:* A distributed file system used to store the dataset as an intermediate step before importing it into HBase.
* Use {distcp}[distcp] to copy a {kaggle}[cyclist dataset] from an S3 bucket into HDFS.
* Create HFiles, a file format for HBase consisting of sorted key/value pairs. Both keys and values are byte arrays.
* Load HFiles into an existing table via the `ImportTsv` utility, which will load data in `TSV` or `CSV` format into
HBase.
* Query data via the `hbase` shell, which is an interactive shell to execute commands on the created table.

This demo will run two jobs to automatically load data.

=== distcp-cycling-data

{distcp}[DistCp] (distributed copy) is used for large inter/intra-cluster copying.
It uses MapReduce to effect its distribution, error handling, recovery, and reporting.
It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list.
Therefore, the first Job uses DistCp to copy data from an S3 bucket into HDFS.
Below, you'll see parts from the logs.

[source]
----
Copying s3a://public-backup-nyc-tlc/cycling-tripdata/demo-cycling-tripdata.csv.g
----
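
For orientation, a comparable standalone DistCp invocation is sketched below.
The source bucket is the one from the log line above, the HDFS target path is only illustrative, and the demo's Job wires up its own arguments.

[source,console]
----
# Illustrative sketch, not the demo's exact invocation.
# Copies the public dataset from S3 (s3a://) into HDFS; assumes s3a access is configured.
$ hadoop distcp \
    s3a://public-backup-nyc-tlc/cycling-tripdata/ \
    hdfs:///data/raw
----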

The second Job consists of 2 steps.

First, we use `org.apache.hadoop.hbase.mapreduce.ImportTsv` (see {importtsv}[ImportTsv Docs]) to create a table and HFiles.
HFile is a dedicated HBase file format that is optimized for performance.
It stores meta-information about the data and thus improves the performance of HBase.
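
Conceptually, an `ImportTsv` run that writes HFiles instead of inserting rows directly looks like the sketch below.
The table name, column mapping and paths are placeholders and are not taken from the demo's Job definition.

[source,console]
----
# Illustrative sketch only: map the CSV columns to HBase columns and write HFiles to HDFS.
# Table name, column mapping and paths are placeholders.
$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
    -Dimporttsv.separator=, \
    -Dimporttsv.columns=HBASE_ROW_KEY,started_at:started_at \
    -Dimporttsv.bulk.output=hdfs:///data/hfile \
    cycling-tripdata \
    hdfs:///data/raw
----
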
When connecting to the HBase master, opening an HBase shell and executing `list`, you will see the created table.
However, it'll contain 0 rows at this point.
You can connect to the shell via:
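
A minimal sketch, assuming the `hbase-master-default-0` pod name that is also used for the port-forward further down, and that the `hbase` binary is on the container's `PATH`:

[source,console]
----
# Open an interactive HBase shell on the master pod.
$ kubectl exec -it hbase-master-default-0 -- hbase shell
----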


== Inspecting the Table

You can now use the table and the data. You can use all available HBase shell commands.

[source,sql]
----
COLUMN FAMILIES DESCRIPTION
{NAME => 'started_at', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
----
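
For example, assuming the table created by `ImportTsv` is called `cycling-tripdata` (the name here is only illustrative), `describe`, `count` and `scan` are typical starting points:

[source,sql]
----
describe 'cycling-tripdata'
count 'cycling-tripdata'
scan 'cycling-tripdata', {LIMIT => 5}
----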

== Accessing the HBase web interface

[TIP]
====
Run `stackablectl stacklet list` to get the address of the _ui-http_ endpoint.
If the UI is unavailable, use a port-forward instead: `kubectl port-forward hbase-master-default-0 16010`.
====
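
With the port-forward in place, the master UI is served on your local machine:

[source,console]
----
# Forward the HBase master UI port, then open http://localhost:16010 in a browser.
$ kubectl port-forward hbase-master-default-0 16010
----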

The HBase web UI will give you information on the status and metrics of your HBase cluster. See below for the start page.

image::hbase-hdfs-load-cycling-data/hbase-ui-start-page.png[]

image::hbase-hdfs-load-cycling-data/hbase-table-ui.png[]

== Accessing the HDFS web interface

You can also see HDFS details via a UI by running `stackablectl stacklet list` and following the link next to one of the namenodes.
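
If none of the listed links is reachable, a port-forward works for the namenode UI as well; the pod name below only assumes the same naming pattern as the HBase pods and may differ in your installation:

[source,console]
----
# Pod name is an assumption; adjust it to the namenode pod in your cluster.
# The namenode web UI listens on port 9870, so it becomes reachable at http://localhost:9870.
$ kubectl port-forward hdfs-namenode-default-0 9870
----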

Below you will see the overview of your HDFS cluster.

You can also browse the file system by clicking on the `Utilities` tab and selecting `Browse the file system`.

image::hbase-hdfs-load-cycling-data/hdfs-data.png[]

Navigate in the file system to the folder `data` and then the `raw` folder.
Here you can find the raw data from the distcp job.

image::hbase-hdfs-load-cycling-data/hdfs-data-raw.png[]
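
If you prefer the command line over the UI, the same listing can be fetched with the HDFS client; the pod name is again an assumption, as is the `hdfs` CLI being on the container's `PATH`:

[source,console]
----
# List the raw data copied by the distcp job, using the HDFS client inside a namenode pod.
$ kubectl exec -it hdfs-namenode-default-0 -- hdfs dfs -ls /data/raw
----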
