Zookeeper has been designed to make it easier to develop distributed applications. For example, in an active/standby configuration, we can use it to switch from a failing active process to a standby one. That is what this blog post is about: it demonstrates how a process can switch from the standby to the active role when the active process crashes.
This is the 3rd story on Zookeeper. The previous ones can be found here:
Design and Architecture
Before we delve into actual code let’s see an overview of the architecture of this demo.
The above is a hypothetical system in which a task producer gives commands to a master. The master distributes the commands to workers. Although we could have had only 1 master process, we want to make master functionality highly available. In order to do that, we bring up multiple processes to play the role of master. However, only 1 master process will be active. The other ones will be standby.
That part of the system architecture is what we are going to implement here: multiple master processes, one of which is active while the others are standby.
The rest of the system architecture will be implemented and demonstrated in following Zookeeper stories.
Zookeeper Coordinates Active/Standby Election
In order to implement the active/standby architecture, we will rely on Zookeeper to keep track of the master process. How can we do that? These are the rules of the game:
- Assume that the active process owns the active token, whereas the standby does not.
- When a process starts, it requests the active token from Zookeeper.
- If Zookeeper has not given the active token to any other process, it will give the token to the process that is requesting it.
- If Zookeeper has given the active token to another process, it will deny the token to the requesting process.
- When a process is denied the active token, then it asks Zookeeper to be notified when the active token is free.
- When a process is notified that the active token is free, it requests the active token from Zookeeper. It might get it, or it might not. This is because, between the notification that the active token is free and the request to acquire it, another process might have grabbed it.
- If a process that has the active token crashes, Zookeeper will consider the active token free, and will give it to the next process that will request it.
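To make the rules above concrete, here is a minimal in-memory sketch. It does not talk to Zookeeper at all: ActiveToken and compete are hypothetical names invented for this illustration, and a Mutex stands in for the serialization that Zookeeper would give us. Three threads race for the token, exactly one gets it, and the others ask to be notified.

```ruby
# In-memory stand-in for the active-token rules. ActiveToken is a
# hypothetical class for illustration only; it is not part of Zookeeper.
class ActiveToken
  attr_reader :owner

  def initialize
    @lock = Mutex.new   # serializes requests, as Zookeeper would do for us
    @owner = nil        # name of the process holding the token, if any
    @waiting = []       # callbacks of processes that were denied the token
  end

  # Returns true if the token was granted to +name+, false if it was denied.
  def request(name)
    @lock.synchronize do
      return false if @owner
      @owner = name
      true
    end
  end

  # A denied process asks to be notified when the token becomes free.
  def notify_when_free(&callback)
    @lock.synchronize { @waiting << callback }
  end

  # Simulates a crash of the owner: frees the token and notifies each waiter once.
  def release
    callbacks = @lock.synchronize do
      @owner = nil
      @waiting.shift(@waiting.size)
    end
    callbacks.each(&:call)
  end
end

# Request the token; if denied, wait for a notification and try again.
def compete(token, name)
  return if token.request(name)
  token.notify_when_free { compete(token, name) }
end

token = ActiveToken.new
%w[master-1 master-2 master-3].map { |name| Thread.new { compete(token, name) } }.each(&:join)

first_owner = token.owner
token.release                 # the active process "crashes"
second_owner = token.owner

puts "first active: #{first_owner}, active after crash: #{second_owner}"
```

Which process wins the first race depends on thread scheduling, so the names printed can differ between runs. What does not change is that there is exactly one owner at any time.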
Zookeeper Properties
The Zookeeper properties that will help us implement this active token concept are:
- When Zookeeper is requested to create a node, it will return an error if the node already exists.
- Requests to create nodes are serialized and are atomic.
- There is a node type which is called ephemeral and which lives as long as the client session lives. It is automatically deleted when the session with the client terminates.
- A client can request a watch on an existing node. If the node is deleted, the client is notified.
Implement Active Token
Given the above Zookeeper properties:
- When a process creates an ephemeral node, it becomes the active process. We assume that it takes the active token.
- Any other process that tries to create the same ephemeral node will fail. We assume that this process will be the standby one, without the active token.
- The standby process will set a watch on the ephemeral node. Hence, it will be notified when that node is deleted.
- When the active process fails/crashes for any reason, its session with Zookeeper will terminate and Zookeeper will delete the ephemeral node. Hence, any standby process will be notified.
- The standby processes are notified, and the first that manages to create the new ephemeral node will become the active process. The others will ask to be notified again, by installing a new watch on the ephemeral node.
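The steps above can be played out without a Zookeeper server. The sketch below uses FakeZk, a hypothetical single-threaded stand-in written only for this illustration. Its methods (create_ephemeral, watch, close_session, owner_of) are made-up names and are not the real Zookeeper API. It models just the behaviour the steps rely on: creation fails if the node exists, ephemeral nodes disappear with their session, and a watch fires once on deletion.

```ruby
# FakeZk is a hypothetical stand-in for a Zookeeper server (illustration only).
class FakeZk
  def initialize
    @nodes = {}                                # path => owning session
    @watches = Hash.new { |h, k| h[k] = [] }   # path => one-shot callbacks
  end

  # Creates an ephemeral node; returns false if the node already exists.
  def create_ephemeral(path, session)
    return false if @nodes.key?(path)
    @nodes[path] = session
    true
  end

  # Registers a one-shot watch that fires when +path+ is deleted.
  def watch(path, &callback)
    @watches[path] << callback
  end

  # Ends a session: its ephemeral nodes are deleted and the watchers notified.
  def close_session(session)
    deleted = @nodes.select { |_, owner| owner == session }.keys
    deleted.each do |path|
      @nodes.delete(path)
      @watches.delete(path).to_a.each(&:call)
    end
  end

  def owner_of(path)
    @nodes[path]
  end
end

# Hypothetical master process that follows the steps listed above.
class Master
  attr_reader :name, :mode

  def initialize(zk, name)
    @zk = zk
    @name = name
    @mode = :not_connected
  end

  def start
    if @zk.create_ephemeral('/master', name)
      @mode = :active
    else
      @mode = :standby
      @zk.watch('/master') { start }   # try again when the node is deleted
    end
  end
end

zk = FakeZk.new
masters = %w[a b c].map { |name| Master.new(zk, name) }
masters.each(&:start)
before = zk.owner_of('/master')
puts "active before crash: #{before}"   # prints "active before crash: a"

zk.close_session('a')                   # the active process crashes
after = zk.owner_of('/master')
puts "active after crash: #{after}"     # prints "active after crash: b"
```

Because the fake is single-threaded, the standby that registered its watch first always wins. With a real Zookeeper ensemble, any of the standby processes could be the one that creates the node first.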
Example Interaction
In the following diagram, you can see how the ephemeral node is being used in order to implement the concept of the active token.
The active master process is the one that initially created the ephemeral Zookeeper node /master. The standby master process fails to create the node, and installs a watch. When the active master process crashes, Zookeeper notifies the standby process that the /master ephemeral node does not exist any more. When the standby process receives the notification, it creates the /master ephemeral node and becomes the active master process.
Implementation using Ruby
We have had enough of theory. Let’s write some code.
Note: The demo code can be found in this Github repo (tag: active-stand-by-master-nodes)
Master Process
The master process code is given below:
# File: master.rb
#
master_app = MasterApp.new
master_app.connect_to_zk
result = master_app.register_as_active
master_app.watch_for_failing_active unless result
while true
  sleep 3
  puts "I am #{master_app.mode} master"
end
The process does not actually deal with tasks yet; that will be implemented in future Zookeeper stories. All this version does is sit in an infinite loop, printing its mode between 3-second sleeps.
The master process mode will be either active or standby.
- The implementation relies on the class MasterApp. Read about it further down in the post.
- The process initially connects to the Zookeeper server: master_app.connect_to_zk
- Then it tries to register as an active master process: result = master_app.register_as_active
- If it manages to register as active (result being true), it goes to the main loop.
- If it does not manage to register as active (result being false), it registers a watch to be notified if the active process crashes/fails. Then it goes to the main loop.
Note that our client master process has another thread waiting for notifications from Zookeeper. Hence, while the main thread may be in the while loop, this does not prevent the standby process from being notified by Zookeeper that the ephemeral node has been deleted.
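The threading idea can be sketched in plain Ruby. In the snippet below, a background thread stands in for the thread on which notifications are delivered. This is an assumption made for illustration; no Zookeeper connection is made, and the Queue and the :node_deleted symbol are invented for the sketch.

```ruby
# Sketch of the threading idea only (no Zookeeper involved).
mode = :standby
events = Queue.new

# Stand-in for the notification thread: it blocks until an event arrives,
# then runs the "callback" that promotes this process.
dispatcher = Thread.new do
  event = events.pop
  mode = :active if event == :node_deleted
end

# Main loop: keeps printing its mode, unaware of when the event will arrive.
seen = []
5.times do |i|
  events << :node_deleted if i == 2   # simulate the notification mid-loop
  sleep 0.05
  seen << mode
  puts "I am #{mode} master"
end
dispatcher.join
```

The main loop never polls for the event; the mode simply changes underneath it, which is what happens to our standby master when the active one crashes.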
MasterApp
The MasterApp class is the one that has the logic to connect to Zookeeper and register as active or standby.
# File: master_app.rb
#
require 'zookeeper'
require 'zookeeper_client_configuration'
require 'zookeeper_client_api_result'
require 'zookeeper_client_watcher_callback'

class MasterApp
  MASTER_NODE = '/master'

  attr_reader :mode

  def initialize
    self.mode = :not_connected
  end

  def connect_to_zk
    @zookeeper_client = Zookeeper.new(zookeeper_client_configuration.servers)
  end

  def register_as_active
    result = create_ephemeral_node(MASTER_NODE, Process.pid)
    if result.no_error?
      self.mode = :active
      true
    elsif result.node_already_exists?
      self.mode = :standby
      false
    else
      raise "ERROR: Cannot start master app: #{result.inspect}"
    end
  end

  def watch_for_failing_active
    watcher_callback = Zookeeper::Callbacks::WatcherCallback.create do |callback_object|
      callback_object = ZookeeperClientWatcherCallback.new(callback_object)
      if callback_object.node_deleted?(MASTER_NODE)
        result = register_as_active
        watch_for_failing_active unless result
      end
    end
    zookeeper_client.stat(path: MASTER_NODE, watcher: watcher_callback)
  end

  private

  attr_reader :zookeeper_client
  attr_writer :mode

  def zookeeper_client_configuration
    ZookeeperClientConfiguration.instance
  end

  def create_ephemeral_node(node, data)
    ZookeeperClientApiResult.new(zookeeper_client.create(path: node, data: data.to_s, ephemeral: true))
  end
end
zookeeper gem
The communication of the master process with the Zookeeper server relies on the zookeeper gem.
Keeping Track of Mode
There is an @mode instance variable that takes the values :not_connected, :active and :standby. The process that manages to create the ephemeral node will be in mode :active; any other process will be in mode :standby. On initialization, the mode is :not_connected.
MasterApp#register_as_active
This is the method that tries to register the process as active.
# File: master_app.rb
#
...
  def register_as_active
    result = create_ephemeral_node(MASTER_NODE, Process.pid)
    if result.no_error?
      self.mode = :active
      true
    elsif result.node_already_exists?
      self.mode = :standby
      false
    else
      raise "ERROR: Cannot start master app: #{result.inspect}"
    end
  end
...
- It tries to create the ephemeral node.
- If it succeeds, it changes its mode to :active and returns true.
- If it fails and the error code says that the node already exists, it sets its mode to :standby and returns false.
- For any other error, it raises an exception.
Watch For Failing Active
If the master process does not manage to get the active token, i.e. to create the ephemeral node, then it will attach a watch on the existing ephemeral node, in order to be notified by Zookeeper for any change in that node.
# File: master_app.rb
#
...
  def watch_for_failing_active
    watcher_callback = Zookeeper::Callbacks::WatcherCallback.create do |callback_object|
      callback_object = ZookeeperClientWatcherCallback.new(callback_object)
      if callback_object.node_deleted?(MASTER_NODE)
        result = register_as_active
        watch_for_failing_active unless result
      end
    end
    zookeeper_client.stat(path: MASTER_NODE, watcher: watcher_callback)
  end
...
In order for a client process to install a watch, it first creates the callback and then calls the #stat method to install it.
Defining Watch Callback
This is how the watch callback is defined:
# File: master_app.rb
#
...
watcher_callback = Zookeeper::Callbacks::WatcherCallback.create do |callback_object|
  callback_object = ZookeeperClientWatcherCallback.new(callback_object)
  if callback_object.node_deleted?(MASTER_NODE)
    result = register_as_active
    watch_for_failing_active unless result
  end
end
...
The code inside the block is going to be executed when Zookeeper notifies our process. It is given a callback_object, which we then wrap in ZookeeperClientWatcherCallback in order to interpret its content. If the callback_object informs us that the ephemeral node has been deleted, we call register_as_active. If register_as_active succeeds, we are now the active process. If it fails, we call the watch_for_failing_active method again, because a Zookeeper watch fires only once and has to be installed anew.
Registering Watch
After defining the callback logic, we need to register the watch:
# File: master_app.rb
#
...
zookeeper_client.stat(path: MASTER_NODE, watcher: watcher_callback)
...
The method #stat tells Zookeeper to return the status of the node given in the path argument. We also pass the watcher argument, which specifies how we want to be notified of any changes in that node.
Creating Ephemeral Nodes
The method that creates the ephemeral node is:
# File: master_app.rb
#
...
  def create_ephemeral_node(node, data)
    ZookeeperClientApiResult.new(zookeeper_client.create(path: node, data: data.to_s, ephemeral: true))
  end
...
The #create() method called on the Zookeeper client is also sent the ephemeral: true argument. This is the way to create an ephemeral node.
Interpreting API Result
The MasterApp class relies on the ZookeeperClientApiResult class in order to interpret the result/response from Zookeeper.
# File: zookeeper_client_api_result.rb
#
class ZookeeperClientApiResult
  def initialize(result)
    @result = result
  end

  def request_id
    result[:req_id]
  end
  alias :req_id :request_id

  def node_already_exists?
    result[:rc] == Zookeeper::Constants::ZNODEEXISTS
  end

  def no_error?
    result[:rc] == 0
  end

  attr_reader :result
end
- If the result code (:rc) is equal to Zookeeper::Constants::ZNODEEXISTS, then we know that we have tried to create a node that was already there.
- If the result code (:rc) is 0, then no error has been returned and everything succeeded on the Zookeeper side.
- The :req_id key returns the request identifier.
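To see how these result codes drive the mode, here is a small sketch that runs without the gem. The hashes are hand-written stand-ins for what #create returns, ApiResult is a trimmed copy of the class above, and mode_after_create is a hypothetical helper. The value -110 for ZNODEEXISTS and -4 for a connection loss are the codes used by the ZooKeeper C client; treat them as assumptions here rather than something to hard-code.

```ruby
# Stand-in for Zookeeper::Constants::ZNODEEXISTS so that the sketch runs
# without the gem (assumed value, taken from the ZooKeeper C client).
ZNODEEXISTS = -110

# Trimmed copy of ZookeeperClientApiResult, using the stand-in constant.
class ApiResult
  attr_reader :result

  def initialize(result)
    @result = result
  end

  def node_already_exists?
    result[:rc] == ZNODEEXISTS
  end

  def no_error?
    result[:rc] == 0
  end
end

# Hypothetical helper mirroring the branches of MasterApp#register_as_active.
def mode_after_create(result_hash)
  result = ApiResult.new(result_hash)
  return :active if result.no_error?
  return :standby if result.node_already_exists?
  raise "ERROR: Cannot start master app: #{result.result.inspect}"
end

puts mode_after_create({ req_id: 0, rc: 0, path: '/master' })   # prints "active"
puts mode_after_create({ req_id: 1, rc: ZNODEEXISTS })          # prints "standby"

begin
  mode_after_create({ req_id: 2, rc: -4 })   # any other code is treated as fatal
rescue RuntimeError => e
  puts e.message
end
```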
Interpreting Watcher Callback
When Zookeeper calls our master process back, which happens when the ephemeral node is deleted, it wraps information inside an object that exposes two important properties:
- the type property, which tells us what type of event has taken place, and
- the path property, which tells us which node the event has taken place against.
We take advantage of these pieces of information within the class ZookeeperClientWatcherCallback:
# File: zookeeper_client_watcher_callback.rb
#
class ZookeeperClientWatcherCallback
  def initialize(callback_object)
    @callback_object = callback_object
  end

  def node_deleted?(node)
    callback_object.type == Zookeeper::Constants::ZOO_DELETED_EVENT && callback_object.path == node
  end

  private

  attr_reader :callback_object
end
The method #node_deleted?() checks whether the node given is the one that has just been deleted.
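The check can be tried out without a Zookeeper connection. In the sketch below, FakeEvent is a hypothetical Struct standing in for the object the gem passes to the callback, WatcherCallback is a trimmed copy of the class above, and the event type values (2 for deleted, 3 for changed) are the ones used by the ZooKeeper C client, assumed here so that the code runs without the gem.

```ruby
# Stand-ins for Zookeeper::Constants event types (assumed C client values).
ZOO_DELETED_EVENT = 2
ZOO_CHANGED_EVENT = 3

# Hypothetical stand-in for the object passed to the watcher callback.
FakeEvent = Struct.new(:type, :path)

# Trimmed copy of ZookeeperClientWatcherCallback, using the stand-in constant.
class WatcherCallback
  def initialize(callback_object)
    @callback_object = callback_object
  end

  # True only when the event is a deletion AND it concerns the given node.
  def node_deleted?(node)
    @callback_object.type == ZOO_DELETED_EVENT && @callback_object.path == node
  end
end

events = [
  FakeEvent.new(ZOO_DELETED_EVENT, '/master'),    # the event we care about
  FakeEvent.new(ZOO_CHANGED_EVENT, '/master'),    # right node, wrong event type
  FakeEvent.new(ZOO_DELETED_EVENT, '/workers')    # right event type, wrong node
]

results = events.map { |event| WatcherCallback.new(event).node_deleted?('/master') }
p results   # prints [true, false, false]
```

Checking both the type and the path matters: a callback that reacted to any event on /master would try to take over the active role even when the node had only changed.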
Example Run
The README.md file, in the source code of this demo, explains how you can configure and run this demo.
Closing Note
In this blog post, we have learnt:
- About ephemeral Zookeeper nodes.
- About Zookeeper create requests being serialized and atomic.
- How to use the zookeeper gem to create ephemeral nodes and how to install watches on them.
- How to process callback calls from Zookeeper when an ephemeral node is deleted.
- How to create an active/standby process architecture using all the above.
Thank you for reading this blog post. Your comments below are more than welcome. I am willing to answer any questions that you may have and give you feedback on any comments that you may post. Don’t forget that I learn from you as much as you learn from me.
About the Author
Panayotis Matsinopoulos works as Development Lead at Simply Business and, in his free time, enjoys giving and taking classes about Web development at Tech Career Booster.