
[Feature request] add a datanode-exporter #4

Open

lhoss opened this issue Sep 19, 2017 · 6 comments

Comments


lhoss commented Sep 19, 2017

Building on the latest updates of https://github.com/Datatamer/hadoop_exporter,
which already contains an extra exporter for the 'journalnode' (added by @laferrieren) 👍

@laferrieren

@lhoss are there any specific stats you are looking for/hoping to get out of a datanode exporter?

I might have some time this week to add a basic exporter that maps 1:1 to a datanode (i.e. you would want to install the exporter on each server that is a datanode).
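Roughly, such a 1:1 exporter would just be a small process on each datanode host that scrapes the local /jmx endpoint (port 50075 by default) and re-exposes selected values for Prometheus. A minimal hypothetical sketch, assuming Python with requests and prometheus_client and an arbitrarily chosen listen port; not the eventual implementation:

import time
import requests
from prometheus_client import start_http_server
from prometheus_client.core import GaugeMetricFamily, REGISTRY


class DataNodeCollector(object):
    """Scrapes the local datanode /jmx endpoint on every Prometheus scrape."""

    def __init__(self, jmx_url="http://localhost:50075/jmx"):
        self.jmx_url = jmx_url

    def collect(self):
        beans = requests.get(self.jmx_url, timeout=10).json().get("beans", [])
        for bean in beans:
            # FSDatasetState carries the disk-usage figures discussed below
            if bean.get("name") == "Hadoop:service=DataNode,name=FSDatasetState-null":
                yield GaugeMetricFamily("hadoop_datanode_dfs_used_bytes",
                                        "DfsUsed reported by FSDatasetState",
                                        value=bean["DfsUsed"])


if __name__ == "__main__":
    REGISTRY.register(DataNodeCollector())
    start_http_server(9070)   # hypothetical port; one exporter per datanode host
    while True:
        time.sleep(60)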


lhoss commented Sep 20, 2017

Hi @laferrieren
Having the base JVM metrics already (as currently for the journalnode) is a good start. (PS: that generic metrics-collection code could be extracted and re-used across all the exporters, avoiding too much code duplication.)
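To illustrate that refactoring idea, something along these lines could work as a shared helper. This is only a hypothetical sketch assuming the exporters stay Python with requests and prometheus_client; the metric names are made up, not actual project code:

import requests
from prometheus_client.core import GaugeMetricFamily


def fetch_jmx_beans(url):
    """Fetch all MBeans from a Hadoop /jmx endpoint as a list of dicts."""
    return requests.get(url, timeout=10).json().get("beans", [])


def jvm_metrics(beans, service):
    """Yield a few generic JVM gauges shared by every Hadoop daemon (hypothetical metric names)."""
    for bean in beans:
        if bean.get("name") == "Hadoop:service={0},name=JvmMetrics".format(service):
            for key in ("MemHeapUsedM", "MemHeapMaxM", "GcCount", "GcTimeMillis"):
                if key in bean:
                    yield GaugeMetricFamily(
                        "hadoop_{0}_jvm_{1}".format(service.lower(), key.lower()),
                        "Generic JVM metric {0} for the {1}".format(key, service),
                        value=bean[key])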

I went through all the metrics available (see the reference dump below), and here are the ones that would be especially useful, mostly around disk usage plus some VolumeFailures error/block metrics (the commented-out ones would be less important); a rough sketch of how these could be exposed follows the list:

    "name" : "Hadoop:service=DataNode,name=DataNodeActivity-dev4-50010",

    "VolumeFailures" : 0,
    "DatanodeNetworkErrors" : 7,
#  "ReadBlockOpNumOps" : 4,
#  "ReadBlockOpAvgTime" : 155.25,
#  "WriteBlockOpNumOps" : 2777,
#   "WriteBlockOpAvgTime" : 11609.653943104076,

    "name" : "Hadoop:service=DataNode,name=FSDatasetState-null",

    "Remaining" : 34914304,
    "Capacity" : 298764926976,
    "DfsUsed" : 282915610624,
    "NumFailedVolumes" : 0,
    "EstimatedCapacityLostTotal" : 0,
#  "NumBlocksCached" : 0,
#  "NumBlocksFailedToCache" : 0,
#  "NumBlocksFailedToUncache" : 869

Datanode JMX metrics reference

For reference, here are all the DataNode-specific metrics theoretically available (generic JVM metrics have been omitted here):

$ curl -i http://localhost:50075/jmx | less

...
}, {
    "name" : "Hadoop:service=DataNode,name=DataNodeActivity-dev4-50010",
    "modelerType" : "DataNodeActivity-dev4-50010",
    "tag.SessionId" : null,
    "tag.Context" : "dfs",
    "tag.Hostname" : "dev4",
    "BytesWritten" : 371616975022,
    "TotalWriteTime" : 2512619,
    "BytesRead" : 2780495,
    "TotalReadTime" : 229,
    "BlocksWritten" : 2777,
    "BlocksRead" : 4,
    "BlocksReplicated" : 0,
    "BlocksRemoved" : 869,
    "BlocksVerified" : 0,
    "BlockVerificationFailures" : 0,
    "BlocksCached" : 0,
    "BlocksUncached" : 0,
    "ReadsFromLocalClient" : 0,
    "ReadsFromRemoteClient" : 4,
    "WritesFromLocalClient" : 1169,
    "WritesFromRemoteClient" : 1608,
    "BlocksGetLocalPathInfo" : 109,
    "RemoteBytesRead" : 2780495,
    "RemoteBytesWritten" : 214491758202,
    "RamDiskBlocksWrite" : 0,
    "RamDiskBlocksWriteFallback" : 0,
    "RamDiskBytesWrite" : 0,
    "RamDiskBlocksReadHits" : 0,
    "RamDiskBlocksEvicted" : 0,
    "RamDiskBlocksEvictedWithoutRead" : 0,
    "RamDiskBlocksEvictionWindowMsNumOps" : 0,
    "RamDiskBlocksEvictionWindowMsAvgTime" : 0.0,
    "RamDiskBlocksLazyPersisted" : 0,
    "RamDiskBlocksDeletedBeforeLazyPersisted" : 0,
    "RamDiskBytesLazyPersisted" : 0,
    "RamDiskBlocksLazyPersistWindowMsNumOps" : 0,
    "RamDiskBlocksLazyPersistWindowMsAvgTime" : 0.0,
    "FsyncCount" : 0,
    "VolumeFailures" : 0,
    "DatanodeNetworkErrors" : 7,
    "ReadBlockOpNumOps" : 4,
    "ReadBlockOpAvgTime" : 155.25,
    "WriteBlockOpNumOps" : 2777,
    "WriteBlockOpAvgTime" : 11609.653943104076,
    "BlockChecksumOpNumOps" : 0,
    "BlockChecksumOpAvgTime" : 0.0,
    "CopyBlockOpNumOps" : 0,
    "CopyBlockOpAvgTime" : 0.0,
    "ReplaceBlockOpNumOps" : 0,
    "ReplaceBlockOpAvgTime" : 0.0,
    "HeartbeatsNumOps" : 914867,
    "HeartbeatsAvgTime" : 2.2573554407361267,
    "BlockReportsNumOps" : 130,
    "BlockReportsAvgTime" : 19.200000000000006,
    "IncrementalBlockReportsNumOps" : 8363,
    "IncrementalBlockReportsAvgTime" : 4.572641396627999,
    "CacheReportsNumOps" : 0,
    "CacheReportsAvgTime" : 0.0,
    "PacketAckRoundTripTimeNanosNumOps" : 4072945,
    "PacketAckRoundTripTimeNanosAvgTime" : 2.6202025743151914E7,
    "FlushNanosNumOps" : 5770250,
    "FlushNanosAvgTime" : 20419.89643637526,
    "FsyncNanosNumOps" : 0,
    "FsyncNanosAvgTime" : 0.0,
    "SendDataPacketBlockedOnNetworkNanosNumOps" : 55,
    "SendDataPacketBlockedOnNetworkNanosAvgTime" : 3365114.454545455,
    "SendDataPacketTransferNanosNumOps" : 55,
    "SendDataPacketTransferNanosAvgTime" : 229675.47272727254
 }, {
    "name" : "Hadoop:service=DataNode,name=DataNodeInfo",
    "modelerType" : "org.apache.hadoop.hdfs.server.datanode.DataNode",
    "XceiverCount" : 2,
    "DatanodeNetworkCounts" : [ {
      "key" : "/192.168.161.103",
      "value" : [ {
        "key" : "networkErrors",
        "value" : 3
      } ]
    }, {
      "key" : "/192.168.161.101",
      "value" : [ {
        "key" : "networkErrors",
        "value" : 1
      } ]
    }, {
      "key" : "/192.168.161.104",
      "value" : [ {
        "key" : "networkErrors",
        "value" : 3
      } ]
    } ],
    "Version" : "2.7.3",
    "RpcPort" : "50020",
    "HttpPort" : null,
    "NamenodeAddresses" : "{\"dev1\":\"BP-843475092-192.168.161.101-1483356716950\",\"dev4\":\"BP-843475092-192.168.161.101-1483356716950\"}",
    "VolumeInfo" : "{\"/mnt/hdfs01/hdfs-slave-datadir/current\":{\"usedSpace\":282915610624,\"freeSpace\":34914304,\"reservedSpace\":10737418240}}",
    "ClusterId" : "CID-xxxxxx"
  }, {
    "name" : "Hadoop:service=DataNode,name=FSDatasetState-null",
    "modelerType" : "org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl",
    "Remaining" : 34914304,
    "StorageInfo" : "FSDataset{dirpath='[/mnt/hdfs01/hdfs-slave-datadir/current]'}",
    "Capacity" : 298764926976,
    "DfsUsed" : 282915610624,
    "CacheCapacity" : 0,
    "CacheUsed" : 0,
    "NumFailedVolumes" : 0,
    "FailedStorageLocations" : [ ],
    "LastVolumeFailureDate" : 0,
    "EstimatedCapacityLostTotal" : 0,
    "NumBlocksCached" : 0,
    "NumBlocksFailedToCache" : 0,
    "NumBlocksFailedToUncache" : 869
  }, {
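Side note for the exporter: DatanodeNetworkCounts above is a nested key/value structure, so it cannot be exposed as a single gauge; the natural mapping is one gauge labeled by remote host. A hypothetical sketch (not the PR's code), again assuming `beans` is the parsed /jmx "beans" list:

from prometheus_client.core import GaugeMetricFamily


def network_error_metrics(beans):
    """Flatten DatanodeNetworkCounts into one gauge labeled by remote host."""
    for bean in beans:
        if bean.get("name") == "Hadoop:service=DataNode,name=DataNodeInfo":
            gauge = GaugeMetricFamily(
                "hadoop_datanode_network_errors",
                "DataNode network errors per remote host",
                labels=["remote_host"])
            for host_entry in bean.get("DatanodeNetworkCounts", []):
                host = host_entry["key"].lstrip("/")
                for counter in host_entry["value"]:
                    if counter["key"] == "networkErrors":
                        gauge.add_metric([host], counter["value"])
            yield gauge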


lhoss commented Oct 17, 2017

@laferrieren any progress?

@laferrieren

@lhoss Sorry, I was a little absent-minded and forgot to push it back up to GitHub. Here is a branch with a datanode exporter (https://github.com/laferrieren/hadoop_exporter/tree/datanode); working on getting it back upstream.


lhoss commented Oct 17, 2017

Awesome @laferrieren 👍 (PR https://github.com/Datatamer/hadoop_exporter/pull/4/files)
We will gladly help with testing it!


lhoss commented Oct 23, 2017

Quick heads up: I did some initial tests, deployed the datanode exporter (from this PR commit) on our test cluster, and already collected metrics; it's working well so far 👍

On a side note, I detected a small bug in the (duplicated) code shared by all exporters, and thus also in the new datanode exporter: #5
(PS: it would definitely be a great idea to refactor out the common logic of those exporters.)
