We utilise a variety of custom socket servers to support our applications, and most of our apps have at least one. We write RPC servers to interact with repository storage in Deploy and Codebase, the new Deploy Agent has a socket server for users to connect to, and AppMail runs its own SMTP server.

Restarting these services poses a problem. You can't start a new version and then kill the old one, as the new version will be unable to bind to the socket. Nor can you kill the old service and then start a new one, as you'll have downtime while the new service starts. You can tell your clients to retry, but this only works if you control all of the clients.

These problems mean services only get restarted when absolutely necessary, and usually manually. If someone updates a server, they must remember to restart the service after deploying it. This is a sure way to end up with outdated code running in production.

In a perfect world, restarts would be seamless. The old service goes away and the new one immediately starts serving requests.

Making the world more perfect

To demonstrate how we accomplished our super-slick restarting services, we'll create a simple service that writes "Hello World!" to a TCP Socket and disconnects. Nothing fancy, no concurrency. Connect, write, close.

# super_simple_service.rb
require 'socket'

class SuperSimpleService
  attr_reader :bind_address, :bind_port

  def initialize
    @bind_address = 'localhost'
    @bind_port = 12345
  end

  def run
    @socket_server = TCPServer.new(bind_address, bind_port)

    loop do
      client_socket = @socket_server.accept # blocks until a new connection is made
      client_socket.puts "Hello World!"
      client_socket.close
    end
  end
end

SuperSimpleService.new.run

Controlling the restart

We've established that we need something smarter than simply stopping and starting our service. To address this we're going to hand over control of our restarts to the service itself.

Instead of sending our service a TERM signal to stop it, we're going to send it a USR1 signal. USR1 is a user-defined signal with no fixed meaning, so we're free to catch it and use it to restart our server. For more information on signals, Tim Uruski has a great blog post on catching signals in Ruby.
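If you haven't trapped signals before, here's a minimal sketch of the idea, separate from our service. The method name usr1_round_trip and its timeout argument are made up for illustration. The process installs a USR1 handler, sends itself USR1, and waits briefly for the handler to fire.

```ruby
# Hypothetical helper: returns true once the USR1 handler has run.
def usr1_round_trip(timeout: 2)
  received = false
  previous_handler = trap('USR1') { received = true }

  # Same effect as running `kill -USR1 <pid>` from a shell.
  Process.kill('USR1', Process.pid)

  # Handlers may run slightly after the signal is sent, so poll briefly.
  deadline = Time.now + timeout
  sleep 0.01 until received || Time.now > deadline
  received
ensure
  # Put back whatever handler was installed before the demo.
  trap('USR1', previous_handler || 'DEFAULT')
end

puts "USR1 handled: #{usr1_round_trip}"
```

Without the trap, USR1's default action is to terminate the process, which is why the handler has to be installed before the signal is ever sent.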

Restarting will involve spawning a new copy of the service in a fork. The new fork will then kill the old version once it's taken over the socket connection.

# super_simple_service.rb
class SuperSimpleService
  # ...

  def run
    @socket_server = TCPServer.new(bind_address, bind_port)

    kill_parent if ENV['RESTARTED']
    setup_signal_traps
    # ...
  end

  def setup_signal_traps
    trap('USR1') { hot_restart }
  end

  def hot_restart
    fork do
      # close_others: false allows open file descriptors to be inherited by the new process
      exec("RESTARTED=true ruby super_simple_service.rb", close_others: false)
    end
  end

  def kill_parent
    parent_process_id = Process.ppid
    Process.kill('TERM', parent_process_id)
  end
end

A new copy of the service will be spawned whenever USR1 is received. This new service is passed a RESTARTED flag in its environment variables. Upon seeing this flag, the new service sends a TERM to its parent (the old copy of the service).
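The parent/child handshake can be tried on its own with a rough sketch like the one below. It's an illustration rather than our service code: child_signals_parent is a made-up name, and USR2 stands in for TERM so the demo doesn't terminate the process running it.

```ruby
# Hypothetical helper: returns true once the forked child has signalled us.
def child_signals_parent(timeout: 2)
  notified = false
  previous_handler = trap('USR2') { notified = true }

  child_pid = fork do
    # Process.ppid is the PID of the process that forked us. It stays the
    # same after an exec, which is what lets a re-exec'd service find the
    # old copy.
    Process.kill('USR2', Process.ppid)
    exit!(0) # skip at_exit hooks inherited from the parent
  end
  Process.wait(child_pid)

  # The handler may run slightly after the signal arrives, so poll briefly.
  deadline = Time.now + timeout
  sleep 0.01 until notified || Time.now > deadline
  notified
ensure
  trap('USR2', previous_handler || 'DEFAULT')
end

puts "Parent was signalled by its child: #{child_signals_parent}"
```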

To prevent the old server from exiting immediately when receiving TERM and dropping any existing connections, a graceful shutdown is implemented. This allows any active connections to complete before exiting.

class SuperSimpleService
  def run
    # ...

    loop do
      client_socket = @socket_server.accept # blocks until a new connection is made
      begin
        @connection_active = true # keeps track of if we have an active connection
        client_socket.puts "Hello World!"
        client_socket.close
      ensure
        @connection_active = false
      end
      Process.exit(0) if @shutting_down # finish a shutdown requested mid-connection
    end
  end

  def setup_signal_traps
    # ...
    trap('TERM') { graceful_shutdown }
  end

  def graceful_shutdown
    @socket_server.close # Stop listening for new connections
    # Trap handlers run on the main thread, so sleeping here would stall the
    # very connection we're waiting on. Let the loop exit once it completes.
    @shutting_down = true
    Process.exit(0) unless @connection_active
  end
end

The service now keeps a flag in @connection_active, which indicates if a connection is currently being processed. On TERM the service will now stop accepting new connections and wait for any existing connection to complete before exiting cleanly.

Sharing sockets

We've got our restarting process down. Unfortunately, when we send USR1 to a running service we'll get the following error:

➜ pkill -USR1 -f super_simple_service.rb
super_simple_service.rb:13:in `initialize': Address already in use - bind(2) for "localhost" port 12345 (Errno::EADDRINUSE)
    from super_simple_service.rb:13:in `new'
    from super_simple_service.rb:13:in `run'
    from super_simple_service.rb:54:in `<main>'

This error is caused when the new version of the service attempts to bind to the port. The port is still in use by the original version, so the binding fails.
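You can reproduce the same failure inside a single process. The sketch below uses a made-up method name, second_bind_error, and binds to port 0 so the operating system picks a free port; it then tries to bind a second listener to that same port.

```ruby
require 'socket'

# Hypothetical helper: returns the exception raised by the second bind,
# or nil if the second bind unexpectedly succeeds.
def second_bind_error
  first = TCPServer.new('127.0.0.1', 0) # port 0 = let the OS choose a free port
  port = first.addr[1]

  TCPServer.new('127.0.0.1', port) # a second listener on the same address and port
  nil
rescue Errno::EADDRINUSE => e
  e
ensure
  first&.close
end

puts second_bind_error.class
```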

We require a way to rebind to an already open socket. Fortunately, in the world of POSIX, every open file gets a numeric ID assigned to it, a file descriptor. Open sockets are treated as files, therefore they are also given a file descriptor. You can find the file descriptor of any IO object in Ruby by calling #fileno on it.

File.open('/tmp/my_file.txt').fileno
# => 19

Conveniently, Ruby's BasicSocket class includes a .for_fd method, which opens a socket based on a passed file descriptor. The built-in TCPServer and UNIXServer classes both inherit from BasicSocket and so support binding to a descriptor out of the box.
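To see .for_fd in action without involving a second process, here's a small sketch. The method name greet_via_descriptor is ours, purely for illustration. It opens a listener, wraps the same descriptor in a second TCPServer object, and accepts a connection through that second object.

```ruby
require 'socket'

# Hypothetical helper: returns the line the client reads from the server.
def greet_via_descriptor
  original = TCPServer.new('127.0.0.1', 0) # port 0 = let the OS choose a free port
  port = original.addr[1]

  # A second Ruby object wrapping the very same listening descriptor.
  rebound = TCPServer.for_fd(original.fileno)
  rebound.autoclose = false # the descriptor belongs to `original`; don't close it twice

  client = TCPSocket.new('127.0.0.1', port)
  connection = rebound.accept # accepted via the wrapper, not the original object
  connection.puts 'Hello World!'
  connection.close

  client.gets
ensure
  client&.close
  original&.close
end

puts greet_via_descriptor
```

Nothing was bound a second time here, which is why no "Address already in use" error appears: both objects refer to one open socket.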

Passing the file descriptor to the newly spawned service will allow it to rebind to the existing port. We already pass the RESTARTED flag to the new server in the environment. Instead of this, we can pass the file descriptor to signify a restart, and bind to this. We also need to set a couple of options on the socket to prevent it from being closed when we start the new server.

class SuperSimpleService
  # ...

  def run
    if ENV['SOCKET_FD']
      @socket_server = TCPServer.for_fd(ENV['SOCKET_FD'].to_i)
      kill_parent
    else
      @socket_server = TCPServer.new(bind_address, bind_port)
    end

    @socket_server.autoclose = false
    @socket_server.close_on_exec = false

    # ...
  end

  def hot_restart
    fork do
      # close_others: false allows open file descriptors to be inherited by the new process
      exec("SOCKET_FD=#{@socket_server.fileno} ruby super_simple_service.rb", close_others: false)
    end
  end
end
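The descriptor hand-off can also be tried in isolation. The following is a rough sketch rather than the service itself: hand_off_listener is a made-up name, the child is an inline ruby -e script instead of super_simple_service.rb, and Process.spawn stands in for the fork and exec pair. The parent opens a listener, passes its descriptor number in SOCKET_FD, closes its own copy, and a client is then served by the child.

```ruby
require 'socket'
require 'rbconfig'

# Hypothetical helper: returns [line read by the client, child PID].
def hand_off_listener
  server = TCPServer.new('127.0.0.1', 0) # port 0 = let the OS choose a free port
  port = server.addr[1]
  server.close_on_exec = false # keep the descriptor open across exec

  # Stand-in for the restarted service: adopt the descriptor and serve once.
  child_source = <<~'RUBY'
    require 'socket'
    server = TCPServer.for_fd(ENV.fetch('SOCKET_FD').to_i)
    connection = server.accept
    connection.puts "Hello from #{Process.pid}"
    connection.close
  RUBY

  child_pid = Process.spawn(
    { 'SOCKET_FD' => server.fileno.to_s },
    RbConfig.ruby, '-e', child_source,
    close_others: false
  )
  server.close # the "old" process stops listening; the child's copy stays open

  client = TCPSocket.new('127.0.0.1', port)
  reply = client.gets
  client.close
  Process.wait(child_pid)
  [reply, child_pid]
end

reply, child_pid = hand_off_listener
puts "Child #{child_pid} replied: #{reply}"
```

The client connects after the parent has closed its copy of the socket, yet still gets an answer, because the listening socket only goes away once every process holding it has closed it.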

Voila! Your server is complete, and you can see it in all of its finished glory here. However, it's grown from 20 lines to over 60, which is a lot of extra code. If only someone could make this into a nice, easy-to-use gem.

Enter Uninterruptible

Uninterruptible is a gem that takes all of the pain out of the hot-restart process. It manages all of the signals, ports and file descriptors. It supports both UNIX and TCP sockets for maximum flexibility. Implementing our SuperSimpleService with Uninterruptible is simple:

class SuperSimpleService
  include Uninterruptible::Server

  def handle_request(client_socket)
    client_socket.puts("Hello World!")
  end
end

server = SuperSimpleService.new
server.configure do |config|
  config.bind = "tcp://localhost:12345"
  config.start_command = 'ruby super_simple_service.rb'
end
server.run

Uninterruptible is currently running many of the socket servers across our applications and is production ready. If you've got any questions, give us a shout; contact details are below.
