Nomad Server Cluster

TST, Hong Kong

Extend your single-node setup to a scalable, production-grade cluster with multiple nodes.

Setting up the Server

The first step is to create the configuration file for the server:

mkdir ~/nomad
cd ~/nomad
nano server.hcl

Paste the following into a file called server.hcl:

# Increase log verbosity
log_level = "DEBUG"

# Setup data dir
data_dir = "/tmp/server1"

# Give the agent a unique name. Defaults to hostname
name = "server1"

# Enable the server
server {
enabled = true

# Self-elect, should be 3 or 5 for production
bootstrap_expect = 1
}

This will start an agent in server-only mode and have it elected as leader. The major change to make for production is to run more than one server and to adjust the corresponding bootstrap_expect value. Once the file is created, start the agent in a new tab:

nomad agent -config server.hcl
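
For comparison, a production server block might look roughly like the following - the three IP addresses are placeholders for your own server nodes, and the server_join stanza lets the servers discover each other on startup:

server {
  enabled = true

  # Wait for three servers before electing a leader
  bootstrap_expect = 3

  server_join {
    # Placeholder addresses - replace with your own server IPs
    retry_join = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
  }
}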

Setting up a Client

Similar to the server, you first have to configure the client by pasting the following into ~/nomad/client1.hcl on your Client Server (if your Client and Master run on the same machine, change the server address to 127.0.0.1):

# Increase log verbosity
log_level = "DEBUG"

# Setup data dir
data_dir = "/tmp/client1"

# Give the agent a unique name. Defaults to hostname
name = "client1"

# Enable the client
client {
  enabled = true

  # For demo assume we are talking to server1 - in my case this is IP 192.168.2.110.
  # For production, this should be like "nomad.service.consul:4647" and a system
  # like Consul used for service discovery.
  servers = ["192.168.2.110"]
}

# Modify our port to avoid a collision with server1
ports {
  http = 5656
}

# Because we will potentially have two clients talking to the same
# Docker daemon, we have to disable the dangling container cleanup,
# otherwise they will stop each other's work thinking it was orphaned.
plugin "docker" {
config {
gc {
dangling_containers {
enabled = false
}
}
}
}
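
As the comment above suggests, a production setup would usually not hardcode the server IP but point the client at a Consul service name instead - a minimal sketch, assuming Consul DNS is available on the client:

client {
  enabled = true
  servers = ["nomad.service.consul:4647"]
}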

Now create the data directory and start the Nomad agent:

mkdir /tmp/client1
nomad agent -config client1.hcl

Back on the Master Server, you can now check whether the Client was able to connect:

nomad node status
ID        DC   Name     Class   Drain  Eligibility  Status
dfa0023b  dc1  client1  <none>  false  eligible     ready

Your Client was added with the Client ID dfa0023b.
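
If you want to see more details about the node - its resources, attributes and driver status - you can query it by that ID (replace it with the ID from your own output):

# Show details for the node, -verbose prints the full attribute list
nomad node status -verbose dfa0023b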

Submit a Job

Because the sample job contains a Consul health check, Nomad's deployment watcher will wait for the check to pass by default. This will cause your deployment to stall after the first allocation update. Resolve this by adding the following attribute inside the update stanza:

nano ~/nomad/example.nomad
health_check = "task_states"

Use the job run command to submit the job:

nomad job run example.nomad
==> Monitoring evaluation "0cd401ff"
    Evaluation triggered by job "example"
    Evaluation within deployment: "fdc580b0"
    Allocation "74f0acf4" created: node "dfa0023b", group "cache"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "0cd401ff" finished with status "complete"

You can see that the Master allocated the deployment onto the Client server with the ID dfa0023b. To stop the job, run:

nomad job stop example
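
The stopped job will still show up as dead in nomad status until it is garbage collected. If you want to remove it from the job list right away, you can purge it:

# Verify that the job was stopped
nomad job status example

# Optionally remove the stopped job from the job list completely
nomad job stop -purge example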

Multiple Instances

We can now edit the example.nomad file to start more than one instance of redis:

group "cache" {
# The "count" parameter specifies the number of the task groups that should
# be running under this group. This value must be non-negative and defaults
# to 1.
count = 3

...
}

You can verify the modified file with nomad job plan example.nomad.
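
The plan command performs a dry run and shows the changes the scheduler would make. It also prints a check index that you can pass to nomad job run to guard against concurrent modifications - the <index> below is just a placeholder for the value from your own plan output:

nomad job plan example.nomad

# Submit with version verification, using the index printed by the plan
nomad job run -check-index <index> example.nomad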

To handle this extra load we will create another Client - this time on the same PC that runs the Master Server (I only have those 2 at the moment ¯\_(ツ)_/¯):

nano ~/nomad/client2.hcl
# Increase log verbosity
log_level = "DEBUG"

# Setup data dir
data_dir = "/tmp/client2"

# Give the agent a unique name. Defaults to hostname
name = "client2"

# Enable the client
client {
  enabled = true
  servers = ["127.0.0.1"]
}

# Modify our port to avoid a collision with server1
ports {
  http = 5657
}

# Because we will potentially have two clients talking to the same
# Docker daemon, we have to disable the dangling container cleanup,
# otherwise they will stop each other's work thinking it was orphaned.
plugin "docker" {
config {
gc {
dangling_containers {
enabled = false
}
}
}
}

Now create the data directory and start the Nomad agent:

mkdir /tmp/client2
nomad agent -config client2.hcl

You can now check if the second Client was able to connect:

nomad node status

ID        DC   Name     Class   Drain  Eligibility  Status
f5893fd2  dc1  client1  <none>  false  eligible     ready
849470a4  dc1  client2  <none>  false  eligible     ready

Use the job run command to submit the job:

nomad job run example.nomad

==> Monitoring evaluation "b13a03df"
    Evaluation triggered by job "example"
    Evaluation within deployment: "5ce135ac"
    Allocation "2074bce6" created: node "849470a4", group "cache"
    Allocation "da4c9a66" created: node "f5893fd2", group "cache"
    Allocation "0bd967d0" created: node "f5893fd2", group "cache"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "b13a03df" finished with status "complete"

You can see that the Master allocated the deployment onto both Client servers:

nomad status example

ID = example
Name = example
Submit Date = 2020-08-29T12:45:14Z
Type = service
Priority = 50
Datacenters = dc1
Namespace = default
Status = running
Periodic = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
cache       0       0         3        0       0         0

Latest Deployment
ID = 9d27882a
Status = failed
Description = Failed due to progress deadline

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
cache       3        3       0        0          2020-08-29T12:55:14Z

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created     Modified
be729abd  849470a4  cache       4        run      running  16m42s ago  13m42s ago
5daea777  f5893fd2  cache       4        run      running  16m42s ago  13m42s ago
88c1eeaa  f5893fd2  cache       4        run      running  16m42s ago  13m42s ago

nomad alloc status be729abd

ID = be729abd-865c-f70d-2557-0e364d066b81
Eval ID = fd465084
Name = example.cache[1]
Node ID = 849470a4
Node Name = client2
Job ID = example
Job Version = 4
Client Status = running
Client Description = Tasks are running
Desired Status = run
Desired Description = <none>
Created = 17m56s ago
Modified = 14m56s ago
Deployment ID = 9d27882a
Deployment Health = unhealthy

Task "redis" is "running"
Task Resources
CPU        Memory           Disk     Addresses
4/500 MHz  952 KiB/256 MiB  300 MiB  db: 192.168.2.110:27149

Task Events:
Started At = 2020-08-29T12:46:47Z
Finished At = N/A
Total Restarts = 0
Last Restart = N/A

Recent Events:
Time                  Type             Description
2020-08-29T12:48:14Z  Alloc Unhealthy  Task not running for min_healthy_time of 10s by deadline
2020-08-29T12:46:47Z  Started          Task started by client
2020-08-29T12:45:48Z  Driver           Docker image pull progress: Pulled 5/6 (24.59MiB/28.03MiB) layers: 0 waiting/1 pulling - est 121.5s remaining
2020-08-29T12:45:14Z  Driver           Downloading image
2020-08-29T12:45:14Z  Task Setup       Building Task Directory
2020-08-29T12:45:14Z  Received         Task received by client
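
Since the allocation exposes the Redis port on the dynamic address shown under Task Resources, you can run a quick connectivity check from any machine with redis-cli installed - address and port come from your own alloc status output:

redis-cli -h 192.168.2.110 -p 27149 ping
# Expected reply: PONG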

WebUI

You can check out the Nomad UI on your Server's IP address and port 4646, e.g. http://192.168.2.110:4646/ui/jobs:

Nomad Cluster UI

You can use the Stop button to stop the deployment.
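
If you prefer the command line over the UI, the same information is available via the HTTP API on port 4646 - a quick sketch, assuming the API is reachable from your machine:

# List all registered jobs
curl http://192.168.2.110:4646/v1/jobs

# Show the allocations of the example job
curl http://192.168.2.110:4646/v1/job/example/allocations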