@@ -15,10 +15,7 @@ Install this demo on an existing Kubernetes cluster:
 $ stackablectl demo install hbase-hdfs-load-cycling-data
 ----
 
-[WARNING]
-====
-This demo should not be run alongside other demos.
-====
+WARNING: This demo should not be run alongside other demos.
 
 [#system-requirements]
 == System requirements
@@ -35,11 +32,11 @@ This demo will
 
 * Install the required Stackable operators.
 * Spin up the following data products:
-** *Hbase :* An open source distributed, scalable, big data store. This demo uses it to store the
+** *HBase:* An open source, distributed, scalable big data store. This demo uses it to store the
   {kaggle}[cyclist dataset] and enable access.
-** *HDFS:* A distributed file system used to intermediately store the dataset before importing it into Hbase
+** *HDFS:* A distributed file system used to store the dataset temporarily before importing it into HBase.
 * Use {distcp}[distcp] to copy a {kaggle}[cyclist dataset] from an S3 bucket into HDFS.
-* Create HFiles, a File format for hbase consisting of sorted key/value pairs. Both keys and values are byte arrays.
+* Create HFiles, a file format for HBase consisting of sorted key/value pairs. Both keys and values are byte arrays.
 * Load HFiles into an existing table via the `ImportTsv` utility, which will load data in `TSV` or `CSV` format into
   HBase.
 * Query data via the `hbase` shell, which is an interactive shell to execute commands on the created table
@@ -87,10 +84,20 @@ This demo will run two jobs to automatically load data.
 
 === distcp-cycling-data
 
-{distcp}[DistCp] (distributed copy) is used for large inter/intra-cluster copying. It uses MapReduce to effect its
-distribution, error handling, recovery, and reporting. It expands a list of files and directories into input to map
-tasks, each of which will copy a partition of the files specified in the source list. Therefore, the first Job uses
-DistCp to copy data from a S3 bucket into HDFS. Below, you'll see parts from the logs.
+{distcp}[DistCp] (distributed copy) is used for large inter/intra-cluster copying.
+It uses MapReduce to effect its distribution, error handling, recovery, and reporting.
+It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list.
+Therefore, the first Job uses DistCp to copy data from an S3 bucket into HDFS.
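+
+A minimal sketch of the kind of DistCp invocation the Job issues (the destination path is an assumption based on the HDFS folders shown later; the demo's exact arguments may differ):
+
+[source,console]
+----
+$ hadoop distcp \
+    s3a://public-backup-nyc-tlc/cycling-tripdata \
+    hdfs:///data/raw
+----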
+Below, you'll see parts from the logs.
 
 [source]
 ----
@@ -111,11 +109,23 @@ Copying s3a://public-backup-nyc-tlc/cycling-tripdata/demo-cycling-tripdata.csv.g
 
 The second Job consists of two steps.
 
-First, we use `org.apache.hadoop.hbase.mapreduce.ImportTsv` (see {importtsv}[ImportTsv Docs]) to create a table and
-Hfiles. Hfile is an Hbase dedicated file format which is performance optimized for hbase. It stores meta-information
-about the data and thus increases the performance of hbase. When connecting to the hbase master, opening a hbase shell
-and executing `list`, you will see the created table. However, it'll contain 0 rows at this point. You can connect to
-the shell via:
+First, we use `org.apache.hadoop.hbase.mapreduce.ImportTsv` (see {importtsv}[ImportTsv Docs]) to create a table and HFiles.
+HFile is a file format dedicated to HBase.
+It stores meta-information about the data and thus increases the performance of HBase.
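+
+For illustration, the call this step makes can look roughly like the following sketch (table name, column mapping, and paths are assumptions, not the demo's exact arguments):
+
+[source,console]
+----
+$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
+    -Dimporttsv.separator=',' \
+    -Dimporttsv.columns=HBASE_ROW_KEY,started_at:started_at \
+    -Dimporttsv.bulk.output=/data/hfile \
+    cycling-tripdata /data/raw
+----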
+When connecting to the HBase master, opening an HBase shell, and executing `list`, you will see the created table.
+However, it'll contain 0 rows at this point.
+You can connect to the shell via:
 
 [source,console]
 ----
@@ -163,7 +162,14 @@ Took 13.4666 seconds
 
 == Inspecting the Table
 
-You can now use the table and the data. You can use all available hbase shell commands.
+You can now use the table and the data. You can use all available HBase shell commands.
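+For instance, counting the rows and scanning a single record looks like this (the table name here is an assumption for the sketch; `list` shows the actual name):
+
+[source,console]
+----
+count 'cycling-tripdata'
+scan 'cycling-tripdata', { LIMIT => 1 }
+----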
 
 [source,sql]
 ----
@@ -191,15 +190,15 @@ COLUMN FAMILIES DESCRIPTION
 {NAME => 'started_at', BLOOMFILTER => 'ROW', IN_MEMORY => 'false', VERSIONS => '1', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', COMPRESSION => 'NONE', TTL => 'FOREVER', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
 ----
 
-== Accessing the Hbase web interface
+== Accessing the HBase web interface
 
 [TIP]
 ====
 Run `stackablectl stacklet list` to get the address of the _ui-http_ endpoint.
 If the UI is unavailable, do a port-forward `kubectl port-forward hbase-master-default-0 16010`.
 ====
 
-The Hbase web UI will give you information on the status and metrics of your Hbase cluster. See below for the start page.
+The HBase web UI will give you information on the status and metrics of your HBase cluster. See below for the start page.
 
 image::hbase-hdfs-load-cycling-data/hbase-ui-start-page.png[]
 
@@ -209,7 +208,8 @@ image::hbase-hdfs-load-cycling-data/hbase-table-ui.png[]
 
 == Accessing the HDFS web interface
 
-You can also see HDFS details via a UI by running `stackablectl stacklet list` and following the link next to one of the namenodes. 
+You can also see HDFS details via a UI by running `stackablectl stacklet list` and following the link next to one of the namenodes.
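+If no link is shown, a port-forward along the lines of `kubectl port-forward hdfs-namenode-default-0 9870` should also work (the pod name is an assumption by analogy with the HBase pod above; 9870 is the default namenode HTTP port).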
 
 Below you will see the overview of your HDFS cluster.
 
@@ -223,7 +222,15 @@ You can also browse the file system by clicking on the `Utilities` tab and selec
 
 image::hbase-hdfs-load-cycling-data/hdfs-data.png[]
 
-Navigate in the file system to the folder `data` and then the `raw` folder. Here you can find the raw data from the distcp job.
+Navigate in the file system to the folder `data` and then the `raw` folder.
+Here you can find the raw data from the distcp job.
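+
+The same listing is also available on the command line from within an HDFS pod, assuming the folder layout named above:
+
+[source,console]
+----
+$ hdfs dfs -ls /data/raw
+----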
 
 image::hbase-hdfs-load-cycling-data/hdfs-data-raw.png[]
 