Postgres-BDR + Docker: shared-nothing databases made easy

To achieve fault tolerance, you need redundant systems. There are two basic approaches to redundancy: active-standby and active-active.

Active-standby

Active-standby means that in the event of failure of the active node, a failover to a standby node is carried out.

Active-active

Active-active means that all nodes are continuously active. In the event of failure of a node, that node simply stops being used and the other nodes assume the full load.

The problem with active-standby

Active-standby has a huge problem in the real world – at the moment a node fails, the chances of the failover occurring smoothly are greatly reduced, because the problem that caused the failure is quite likely to also affect the system’s ability to fail over. In other words, when things are failing, it’s a bad idea to start trying to switch over to standby nodes.

For this reason, active-active is the preferred approach to achieving robust fault-tolerance.

Node independence and shared-nothing architecture

Furthermore, redundant nodes should be as independent from one another as possible. For this, a shared-nothing architecture is the ideal. That way, when a node fails for any reason, there’s no reason to fear that the remaining nodes will also fail.

For example, the nodes should not share a server, a network or a database. It’s relatively easy to distribute redundant nodes across independent servers, slightly harder to distribute them across independent networks, and very hard to achieve independent databases.

Postgres-BDR

Postgres-BDR (bidirectional replication) provides shared-nothing database redundancy. Postgres-BDR is a special distribution of Postgres with extensions for replication. It can be relatively complex to install, but Docker makes it easy to use. We’ve deployed it in several recent projects and it’s been great.

Below I’ll provide step-by-step instructions for setting up a two-node pair of Postgres-BDR databases as two Docker containers.

Setting up a Postgres-BDR node pair as Docker containers

Here’s how to set up a pair of BDR databases on a pair of Docker hosts (we’ll call them host1 and host2) to replicate a database called “testdb”.

We’re basing our containers on the Docker image “jgiannuzzi/postgres-bdr”, which is available from the central Docker registry.

Here’s the docker-compose file entries required on each host. Each host additionally has a /data directory which will hold the database files.

version: "3"

services:
 database:
   image: jgiannuzzi/postgres-bdr
   restart: always
   ports:
     - 5432:5432
   volumes:
     - /data:/var/lib/postgresql/data
   environment:
     POSTGRES_PASSWORD: <xxxxxxx>
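
Assuming the compose file above is saved as docker-compose.yml on each host, bringing the containers up is just the usual command, run once on host1 and once on host2:

docker-compose up -d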

Once the containers are up on the two hosts, connect with pgAdmin to host1 and do the following:

Create a database “testdb” and run the following queries against it to initialise replication.
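
If you prefer a shell to pgAdmin, the database can just as well be created with psql against the container’s published port (this assumes the postgres superuser and the POSTGRES_PASSWORD set in the compose file):

psql -h host1 -p 5432 -U postgres -c 'CREATE DATABASE testdb;'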

NOTE: make sure you have the testdb database selected before running the queries!

CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE EXTENSION IF NOT EXISTS bdr;

SELECT bdr.bdr_group_create(
  local_node_name := 'host1',
  node_external_dsn := 'host=host1 port=5432 dbname=testdb password=<xxxxxxx>'
);

Then connect pgAdmin to host2:

Create a database “testdb” and run the following queries against it to join the replication group.

CREATE EXTENSION IF NOT EXISTS btree_gist;
CREATE EXTENSION IF NOT EXISTS bdr;

SELECT bdr.bdr_group_join(
  local_node_name := 'host2',
  node_external_dsn := 'host=host2 port=5432 dbname=testdb password=<xxxxxxx>',
  join_using_dsn := 'host=host1 port=5432 dbname=testdb password=<xxxxxxx>'
);
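
The join runs asynchronously, so it can be worth waiting for it to finish and then checking the node list – roughly like this, assuming the standard BDR 1.x helper function and catalog table:

-- run against testdb on host2
SELECT bdr.bdr_node_join_wait_for_ready();

-- both nodes should eventually show status 'r' (ready)
SELECT node_name, node_status FROM bdr.bdr_nodes;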


You can now test that replication is working correctly by creating a test table in the testdb database on host1 (testdb/schemas/public/Tables/Create in pgAdmin) and then checking that it is correctly replicated to host2.
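
If you prefer to test with SQL rather than clicking through pgAdmin, a quick check looks something like this (the table name is just an example):

-- on host1, connected to testdb
CREATE TABLE replication_test (id serial PRIMARY KEY, note text);
INSERT INTO replication_test (note) VALUES ('hello from host1');

-- on host2, connected to testdb, a moment later
SELECT * FROM replication_test;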


Notes:
(1) Replication occurs on the standard Postgres port 5432, so this port must be open between the hosts.
(2) If you need multiple replicated databases, you need to repeat this process for each database.
(3) Sequences run in independent ranges on each node (by default 50000 apart) to avoid collisions if records are added on both sides.
(4) The most efficient application deployment strategy is to treat one database as the primary and the other as the secondary – this results in the least replication traffic and the lowest potential for conflicts.

Fault-tolerant ESX datastores for free

Preamble: NFS works great for ESX datastores. It’s a whole lot easier to manage than iSCSI, and although iSCSI is generally considered to perform better, we find that the flexibility of NFS more than makes up for the lost performance.

If you have sufficient budget, there are great solutions from Dell, HP etc. where you can get fault-tolerant ESX installations already set up in a rack, which provide not only datastore fault tolerance but also VM failover and so on. But there are plenty of people and companies out there who have several servers running the free ESXi 5 hypervisor and who would still like to have some fault tolerance. This article is for those people.

What’s not so great about shared storage (like NFS or iSCSI) is that you generally end up with a lot of extra boxes and complicated network configuration. If you want fault tolerance, you’ll need two ESX servers, two NFS servers and two switches (and a bunch of cables to connect them all). Furthermore, you’ll add a lot of complexity configuring those (often proprietary) NFS boxes. I recently configured a couple of LeftHand boxes for a customer and it was not trivial to set up.

So I figured there must be an easier way – after all, ESX is an amazing platform for reducing the number of boxes in the rack, so why would I want to start adding boxes again if I don’t have to?

The first important point is that ESX provides the VMXNET3 10Gb virtual ethernet adapter, so even if your ESX server does not have 10Gb network cards, the VMs running on the server can communicate with each other at 10Gb and, more importantly for our purposes, ESX itself can communicate with its VMs at 10Gb speeds. So if we run an NFS server as a VM on the ESX server and use it as an NFS datastore for that same ESX server, the server will see it as a 10Gbps NFS server.
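
On the NFS VM itself the export is nothing special – a single line in /etc/exports along these lines, where the export path and the subnet of the ESX VMkernel interface are just placeholders:

# /etc/exports on the NFS VM
# no_root_squash is needed because ESX mounts NFS datastores as root
/export/datastore  192.168.10.0/24(rw,no_root_squash,sync,no_subtree_check)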

OK, but what about the fault tolerance? For that, we need to replicate the NFS server’s disk to another server. So, if we don’t have a real 10Gb network, that replication will have to run across 1Gb. Does that slow things down? Apparently not much – we’re using DRBD asynchronously, which causes only a minimal performance hit.

So what we do is clone the NFS server VM to a second ESX server and set up DRBD replication between the two VMs. We’re using Ubuntu 11.10 server (you’ll need a reasonably recent Linux distribution to get 10Gbps support with the VMXNET3 virtual network adapter).
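
For reference, a minimal DRBD resource definition for this kind of setup might look roughly like the following – hostnames, addresses and device names are placeholders, and protocol A is the asynchronous mode mentioned above:

# /etc/drbd.d/nfsdata.res – identical on both NFS VMs
resource nfsdata {
  protocol A;               # asynchronous replication
  device    /dev/drbd0;
  disk      /dev/sdb;       # the virtual disk that holds the datastore filesystem
  meta-disk internal;

  on nfs-vm1 {
    address 192.168.1.11:7789;
  }
  on nfs-vm2 {
    address 192.168.1.12:7789;
  }
}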

Because you only get 10Gbps datastore access to the NFS server when the NFS VM is hosted on the local ESX server, this is not really shared storage (or at least it’s shared only at 1Gbps with other ESX servers). However, for our fault-tolerance purposes that doesn’t matter much. In fact, from a scalability point of view, it makes sense to provide each ESX server with a locally-running NFS datastore, accessed at 10Gbps and replicated to another ESX server. This scheme also has the advantage that each ESX server is autonomous – ESX servers with remote datastores always make me a bit nervous; any problems on the network and the VMs are likely to freak out. This way, the ESX server is completely self-contained – it only needs another ESX server for fault tolerance. Even if the network fails, the local NFS datastore will be unaffected (except that fault tolerance is temporarily suspended), and when the network is available again, the DRBD secondary will simply catch up automatically and fault tolerance is restored.

ESX Server 2 can do the same thing with another pair of NFS servers (a local one for fast access and a remote one on ESX Server 1 for fault tolerance). This idea can be scaled indefinitely – each ESX server has its own local NFS VM running its datastore and replicating to another ESX server. The major advantage of this approach is that it’s more scalable than a single fault-tolerant pair of shared NFS servers, and you get 10Gbps access for free. Conversely, the price you pay is a separate NFS server for each ESX server, which makes administration more complex than a single shared datastore (but hey, you can’t have everything, at least not for free).

You could additionally configure the NFS servers to fail over a shared IP address to the secondary – we haven’t bothered to do this, since if the primary NFS server fails, it’s likely that the whole ESX server has failed. If that’s the case, we need to promote the DRBD secondary to primary manually, start the NFS server and register the VMs, as sketched below.
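
For the record, the manual promotion on the surviving NFS VM is only a handful of commands – roughly the following, using the hypothetical resource and export names from the DRBD sketch above:

drbdadm primary nfsdata                # promote the surviving DRBD node
mount /dev/drbd0 /export/datastore     # mount the replicated filesystem
service nfs-kernel-server start        # start serving NFS
# then mount the NFS datastore on the surviving ESX host and register the VMs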

And how does it perform? Pretty well, actually. The benchmarks below were made on an ESX5 host with a single i7 930 CPU, 24GB RAM, an Adaptec 5405z controller and 4x SATAII disks in RAID5.

Disk performance of a VM running directly on the ESX host (i.e. on the local datastore).

hdparm -tT /dev/sda

/dev/sda:
Timing cached reads:   12396 MB in  2.00 seconds = 6201.81 MB/sec
Timing buffered disk reads: 398 MB in  3.00 seconds = 132.59 MB/sec

dd if=/dev/zero of=ddfile bs=8k count=20000
20000+0 records in
20000+0 records out
163840000 bytes (164 MB) copied, 0.274105 s, 598 MB/s

And now the disk performance of a VM running on our fault-tolerant NFS datastore:

hdparm -tT /dev/sda

/dev/sda:
Timing cached reads:   12242 MB in  2.00 seconds = 6124.43 MB/sec
Timing buffered disk reads: 258 MB in  3.00 seconds =  85.89 MB/sec

dd if=/dev/zero of=ddfile bs=8k count=20000
20000+0 records in
20000+0 records out
163840000 bytes (164 MB) copied, 0.707547 s, 232 MB/s

As you can see from the above, the VM running on the fault-tolerant NFS datastore is not as fast as the VM running on the local datastore, but it’s sufficiently fast for the subset of your VMs which require more fault-tolerance than that provided by daily backups.

Speaking of backups, we’re using ghettoVCB to back up 150 VMs (>1TB) every night via NFS to a Netgear ReadyNAS Ultra 6 device. We then rsync them to a remote data-center for offsite, versioned storage.
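
One way such an rsync-based offsite copy can be kept versioned is rsync’s --link-dest option – paths and hostname below are purely illustrative, and a "previous" symlink is assumed to be maintained on the remote side:

# copy last night's ghettoVCB output into a dated directory offsite;
# --link-dest hard-links unchanged files against the previous snapshot,
# so each dated directory is a full copy at roughly incremental cost
rsync -a --delete --link-dest=/backups/vms/previous \
    /mnt/readynas/vmbackups/ \
    backupuser@offsite.example.com:/backups/vms/$(date +%F)/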

Weird racing clock problem on VMWare Linux Guests

A while back we installed a new Dell server with a low-power 3GHz Xeon L3110 CPU to run some essential network infrastructure. We chose this specific server configuration because it dissipates less than 30W while running 5 VMware VMs. It runs for hours on a UPS and doesn’t require cooling, so even if our server room air-conditioning were to die, this server should keep our network, firewalls, VPNs, DNS, DHCP, primary Terracotta server and a few other vital services up long enough for us to figure out what’s going on.

The server is running CentOS 5.5 and VMware 1.0.10 – normally a very stable combination. However, we found that Linux guests running on this server were not keeping time properly: their clocks were running too fast – 50% too fast, in fact. We finally figured out that the CPU’s power-saving features were making Linux think it has a 2GHz processor instead of a 3GHz one, and counting 3GHz worth of ticks against a 2GHz calibration makes the clocks in the Linux guests under VMware advance 50% too fast. We disabled the demand-based power-saving feature of the CPU in the BIOS and the clocks now run correctly.