GreenArrow Email Software Documentation

GreenArrow Cluster

Overview

This page provides an overview of how GreenArrow instances can be clustered together to achieve High Availability and High Scalability, using either HAProxy or GreenArrow Proxy.

For more information on using GreenArrow with GreenArrow Proxy, see the following pages:

Cluster architecture using HAProxy

Many large ESPs and senders run a fleet of GreenArrow Engine servers that cooperate to achieve High Availability and High Scalability in a Mail Transfer Agent. The architecture described below does not use GreenArrow’s clustering features. (The next section describes GreenArrow’s clustering features.)

The architecture is typically as follows:

Injector Application

This is the application that creates emails that need to be delivered. The application connects to the Load Balancer to hand off the emails.

Load Balancer

This is a standard load balancer that distributes the email evenly to each MTA in the cluster. This can be an OSI Layer 7 load balancer that operates at the HTTP or HTTPS level, or an OSI Layer 4 load balancer that operates on TCP sessions (for HTTP, HTTPS, SMTP, or SMTPS). Both options work well.

SMTP or HTTP Injection API

Messages are handed from the Injector Application, through the Load Balancer, into the MTA Tier using either SMTP injection (typically authenticated by username/password instead of IP address) or the GreenArrow HTTP injection API.
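
For illustration, here is a minimal sketch of SMTP injection through the Load Balancer, written in Python. The hostname, port, and credentials are placeholders for this example, not values from GreenArrow’s documentation:

    import smtplib
    from email.message import EmailMessage

    # Hypothetical Load Balancer address and SMTP AUTH credentials.
    LOAD_BALANCER_HOST = "smtp-lb.example.internal"
    LOAD_BALANCER_PORT = 587
    SMTP_USERNAME = "injector-app"
    SMTP_PASSWORD = "example-password"

    msg = EmailMessage()
    msg["From"] = "news@example.com"
    msg["To"] = "recipient@example.org"
    msg["Subject"] = "Example injection"
    msg.set_content("Hello from the Injector Application.")

    # The Load Balancer hands this session to one of the MTAs in the cluster.
    with smtplib.SMTP(LOAD_BALANCER_HOST, LOAD_BALANCER_PORT) as smtp:
        smtp.starttls()                           # encrypt the injection session
        smtp.login(SMTP_USERNAME, SMTP_PASSWORD)  # username/password rather than IP-based auth
        smtp.send_message(msg)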

MTA Tier

Each server carries an identical configuration containing all of the IP addresses, DKIM keys, and domains, so each MTA is able to handle any outgoing message for any sending IP and any tenant. The configuration files are typically stored in a git repository and published out to all MTAs using software like Puppet, so that a push to git triggers a continuous integration process that publishes the config to each MTA.

The MTAs do not directly originate connections to MX Servers. Instead, they connect to the Proxy Tier using SMTP Over HAProxy and request that HAProxy originate a connection from the Sending IP (which is bound to the HAProxy server) to the appropriate MX Server.
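
As an illustration of this hand-off, connections of this kind are commonly expressed with the HAProxy PROXY protocol, in which the client announces the source and destination addresses it wants used before the SMTP conversation begins. The Python sketch below shows that generic pattern; the addresses are placeholders, and exactly how HAProxy maps these fields onto the outgoing connection depends on its configuration, so treat this as an assumption rather than GreenArrow’s actual implementation:

    import socket

    # Hypothetical addresses: the private address of the HAProxy node, the Sending IP
    # bound to that HAProxy node, and the destination MX Server.
    PROXY_PRIVATE_ADDR = ("10.0.1.21", 2525)    # where the MTA contacts HAProxy
    SENDING_IP = "203.0.113.10"                 # source address the connection should use
    MX_IP, MX_PORT = "198.51.100.25", 25        # remote MX Server

    sock = socket.create_connection(PROXY_PRIVATE_ADDR)

    # PROXY protocol v1 header: "PROXY TCP4 <src-ip> <dst-ip> <src-port> <dst-port>\r\n"
    header = f"PROXY TCP4 {SENDING_IP} {MX_IP} 40000 {MX_PORT}\r\n"
    sock.sendall(header.encode("ascii"))

    # From here the MTA speaks normal SMTP with the MX Server through the proxy.
    greeting = sock.recv(4096)
    print(greeting.decode("ascii", "replace"))
    sock.close()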

Each MTA makes its own independent throttling decisions.

  • Therefore, for N servers, each MTA server must be configured with 1/N of the desired cluster-wide sending limits for each Sending IP (maximum concurrent connections and maximum delivery attempts per hour). For example, if there are 10 servers in the MTA Tier, then to achieve a cluster-wide maximum concurrent connection limit of 20, each server must be configured with a maximum concurrent connection limit of 2 (see the sketch after this list).
  • This has the downside that it is not possible to configure a maximum concurrent connection limit less than the number of MTA servers.
  • This also has the downside that the limits are not shared between MTAs. For example, with a cluster-wide maximum concurrent connection limit of 10 split across 10 MTAs, each MTA is only allowed one concurrent connection; the cluster as a whole may have only five outstanding connections, yet a message on one MTA may still be made to wait simply because that particular MTA already has a connection open.
  • These downsides are removed with GreenArrow cluster-aware throttling.
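
To make the split described above concrete, here is a minimal Python sketch of the per-server arithmetic (the numbers are illustrative and the function name is hypothetical):

    def per_server_limit(cluster_wide_limit: int, server_count: int) -> int:
        """Split a cluster-wide throttle evenly across independent MTA servers."""
        per_server = cluster_wide_limit // server_count
        if per_server < 1:
            # Each MTA needs a limit of at least 1, which is why a cluster-wide
            # limit smaller than the number of MTA servers is not achievable.
            raise ValueError("cluster-wide limit must be at least the number of MTA servers")
        return per_server

    # 10 MTA servers, desired cluster-wide maximum of 20 concurrent connections:
    print(per_server_limit(20, 10))   # -> 2 concurrent connections per server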

These servers are sized for high network bandwidth, high compute, and high disk IO. A common AWS instance type is c6in.32xlarge.

Proxy Tier / HAProxy

This is a set of servers running the HAProxy software.

Each HAProxy server has a subset of the public Sending IPs bound to it. The purpose of this tier is to allow every MTA to make connections from any Sending IP. If the sending IPs were bound to the MTAs, then each MTA would only be able to establish connections from the sending IPs bound to it.

Each HAProxy server also has an IP address on the Private Network, which is used by the MTAs to contact it.

These nodes are stateless.

These servers are sized for high network bandwidth, but do not require much compute or disk IO. A common AWS instance type is c6in.8xlarge.

Private Network

The communication between the Injector Application, Load Balancer, MTA, and HAProxy servers is typically done on a private network address space, and firewalled off from the public Internet – although this is not required.

HTTPS web-hooks

The MTA servers communicate information on delivery attempts and synchronous bounces (sometimes called “accounting information”) to your Data Warehouse or Data Lake, typically using web-hooks over HTTPS. GreenArrow is very flexible in batching event data for high throughput. This can also be done by having GreenArrow write accounting files (such as CSV or SQLite format) and shipping them off the MTA servers, or by having the MTA establish a database connection directly to your Data Warehouse and insert rows into a table or run a stored procedure.
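
As a sketch of the web-hook path, the following Python example accepts a batch of delivery/bounce events posted as JSON and hands them to a placeholder warehouse loader. The port, payload shape, and loader are assumptions for illustration only (a production endpoint would sit behind TLS), not GreenArrow’s actual web-hook format:

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    def load_into_warehouse(events):
        """Placeholder for inserting rows into your Data Warehouse or Data Lake."""
        print(f"loaded {len(events)} events")

    class EventWebhook(BaseHTTPRequestHandler):
        def do_POST(self):
            # Read the batched event payload posted by the MTA tier.
            length = int(self.headers.get("Content-Length", 0))
            events = json.loads(self.rfile.read(length))  # assumed: a JSON list of event dicts
            load_into_warehouse(events)
            self.send_response(200)   # a 2xx response acknowledges the batch
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), EventWebhook).serve_forever()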

Here is a description of a cluster using this architecture run by Klaviyo.

Cluster architecture using GreenArrow Proxy

A cluster using GreenArrow’s clustering features is similar to the above, but there are some key differences (which are noted in bold in the diagram below):

  • The Proxy Tier is running GreenArrow Proxy instead of HAProxy.
  • Throttle decisions for each Sending IP are made on the GreenArrow Proxy instance that hosts that Sending IP, instead of independently on each GreenArrow MTA server. This allows for coordinated cluster-wide throttling limits, without introducing a new single-point-of-failure into the system.
  • The protocol between the MTA Tier and the Proxy Tier is now the Proprietary GreenArrow Proxy Protocol. This protocol has certain advantages described below.

Here are descriptions of parts of the system that are different:

MTA Tier

Each server carries an identical configuration containing all of the IP addresses, DKIM keys, and domains, so each MTA is able to handle any outgoing message for any sending IP and any tenant. The configuration files are typically stored in a git repository and published out to all MTAs using software like Puppet, so that a push to git triggers a continuous integration process that publishes the config.

The MTAs do not directly originate connections to MX Servers. Instead, they connect to the Proxy Tier using the Proprietary GreenArrow Proxy Protocol to request that GreenArrow Proxy originate a connection from the Sending IP (which is bound to the Proxy Server) to the appropriate MX Server.

Each GreenArrow MTA leans on GreenArrow Proxy to enforce coordinated cluster-wide throttle limits (maximum messages per hour and maximum concurrent connections). This means that you can configure each MTA with the actual cluster-wide throttle limits that you want.

(Configuration from the different MTAs is merged using one of two methods. Average: the cluster-wide limit is the average of each MTA’s configured limit for the particular throttle. Sum: the cluster-wide limit is the total of each MTA’s configured limits for the particular throttle. The Sum method is designed to make it easy to transition from the HAProxy-style cluster described above, where each MTA is already configured with 1/N of the cluster-wide limit.)
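
A small Python sketch of the two merge methods, using illustrative numbers:

    def merged_limit(per_mta_limits, method):
        """Combine each MTA's configured limit into one cluster-wide limit."""
        if method == "average":
            return sum(per_mta_limits) / len(per_mta_limits)
        if method == "sum":
            return sum(per_mta_limits)
        raise ValueError(f"unknown merge method: {method}")

    # Four MTAs, each configured with a limit of 5 for a particular throttle:
    print(merged_limit([5, 5, 5, 5], "average"))  # -> 5.0 (each MTA states the desired cluster-wide limit)
    print(merged_limit([5, 5, 5, 5], "sum"))      # -> 20  (each MTA carries 1/N, as in the HAProxy-style cluster)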

Proprietary GreenArrow Proxy Protocol

The GreenArrow protocol has the following advantages over the HAProxy protocol:

  • GreenArrow Proxy can differentiate between two errors: (a) being unable to connect to the MX Server, and (b) connecting to the MX Server but getting disconnected before reading a greeting. (With the HAProxy Protocol, these errors cannot be differentiated, although in practice this is not a problem.)
  • Authentication between GreenArrow MTA and GreenArrow Proxy is performed using a shared secret and HMAC (instead of just by source IP), allowing more flexibility in network architecture and adding more security. This authentication is bi-directional, with each end proving it has the correct shared secret without revealing the shared secret (a generic sketch of this pattern appears at the end of this section). (For additional security, source IPs may still also be restricted in the GreenArrow Proxy configuration.)
  • The GreenArrow Proxy tier performs the service of holding open connections for potential reuse. Connections that are established by one GreenArrow MTA may be reused for subsequent deliveries performed by any other GreenArrow MTA. This increases performance, as STARTTLS connection establishment can be computationally expensive, so holding a connection open longer is helpful.
  • The TCP connections between GreenArrow MTA and GreenArrow Proxy are reused between deliveries of different messages, allowing for TCP receive windows to increase.
  • All traffic between the GreenArrow MTA and GreenArrow Proxy is TLS encrypted, regardless of whether the connection from GreenArrow Proxy to the MX Server is using STARTTLS.

The GreenArrow protocol also handles communication to implement cluster-wide throttling limits.

  • This is done with each GreenArrow MTA opening a persistent connection to each GreenArrow Proxy for communication about throttle limits.
  • This connection is also TLS encrypted and authenticated with the same shared secret and an HMAC.
  • If this TCP connection is broken, or if the GreenArrow Proxy server is restarted, then the GreenArrow MTA server will reconnect and transparently re-establish all of the required state – so that the system continues operating as if there had been no interruption (other than delivery attempts not happening during the interruption).
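
The wire format of the Proprietary GreenArrow Proxy Protocol is not documented on this page, but the shared-secret HMAC pattern described above can be sketched generically in Python. The challenge sizes, hash choice, and message flow below are illustrative assumptions, not the actual protocol:

    import hashlib
    import hmac
    import os

    SHARED_SECRET = b"example-shared-secret"   # assumed to be configured on both ends

    def prove(secret: bytes, challenge: bytes) -> bytes:
        """Answer a challenge without ever sending the secret itself."""
        return hmac.new(secret, challenge, hashlib.sha256).digest()

    # Each side challenges the other, making the authentication bi-directional.
    mta_challenge = os.urandom(32)     # sent by the MTA to the proxy
    proxy_challenge = os.urandom(32)   # sent by the proxy to the MTA

    proxy_response = prove(SHARED_SECRET, mta_challenge)    # computed by the proxy
    mta_response = prove(SHARED_SECRET, proxy_challenge)    # computed by the MTA

    # Each end verifies the other's response with a constant-time comparison.
    assert hmac.compare_digest(proxy_response, prove(SHARED_SECRET, mta_challenge))
    assert hmac.compare_digest(mta_response, prove(SHARED_SECRET, proxy_challenge))
    print("both ends hold the same shared secret")
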
Proxy Tier / GreenArrow Proxy

This is a set of servers running the GreenArrow Proxy software.

Each GreenArrow Proxy server has a subset of the public Sending IPs bound to it. The purpose of this tier is to allow every MTA to make connections from any Sending IP. If the sending IPs were bound to the MTAs, then each MTA would only be able to establish connections from the sending IPs bound to it.

Each GreenArrow server also has an IP address on the Private Network, which is used by the MTAs to contact it.

These nodes are stateless.

GreenArrow Proxy performs the cluster-wide throttle decision making for all deliveries from all IP Addresses that are accessed through that GreenArrow Proxy instance. This provides cluster-wide throttle decision making without introducing any new single-point-of-failure.

These servers are sized for high network bandwidth, but do not require much compute or disk IO. A common AWS instance type is c6in.8xlarge.

Future roadmap for GreenArrow Cluster development

The following are on our development roadmap:

  • A feature to allow draining queued messages from a GreenArrow MTA node to other GreenArrow MTA nodes in the same cluster. This, along with cluster-wide throttling, will enable an auto-scaling cluster. This is already well under development and is planned for release in October 2024.

  • A Reference Architecture of an auto-scaling cluster of GreenArrow MTA nodes in Kubernetes. This is planned for release in Q4 2024.


Copyright © 2012–2024 GreenArrow Email