Introduction to MN (Part 5 – Flow State)

In this article we discuss how MidoNet manages flow state in order to implement distributed, fault-tolerant versions of advanced network services like connection tracking, load-balancing, and port-masquerading. This post is the fifth in a series intended to familiarize users with MidoNet’s overlay networking models:

  • Part 1 covered MN’s Provider Router.
  • Part 2 covered Tenant Routers and Networks.
  • Part 3 covered how the MN Agent simulates packet traversal of the virtual topology in order to compute a flow rule.
  • Part 4 covered Security Groups.

Introduction

Many advanced network services make per-flow decisions that need to be preserved for the duration of the flow. For example, IPv4 port-masquerading (NAPT) chooses a free public IP:port pair to replace the private IP:port pair in TCP/UDP packets sent from a client in the cloud to a server on the internet.
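
As a minimal, hypothetical sketch (not MidoNet’s actual implementation), the mapping choice can be pictured as follows; the public IP, port range, and table layout are illustrative assumptions.

    import random

    # Hypothetical NAPT sketch: the public IP, port range, and table layout are
    # illustrative assumptions, not MidoNet's actual code.
    PUBLIC_IP = "203.0.113.10"
    PUBLIC_PORTS = range(1024, 65536)

    nat_forward = {}   # (private_ip, private_port) -> (public_ip, public_port)
    nat_reverse = {}   # (public_ip, public_port) -> (private_ip, private_port)

    def masquerade(private_ip, private_port):
        """Pick a free public IP:port pair and remember both translations."""
        key = (private_ip, private_port)
        if key in nat_forward:                     # this flow already has a mapping
            return nat_forward[key]
        for public_port in random.sample(PUBLIC_PORTS, k=64):
            candidate = (PUBLIC_IP, public_port)
            if candidate not in nat_reverse:       # pair is still free
                nat_forward[key] = candidate
                nat_reverse[candidate] = key       # needed to translate the return flow
                return candidate
        raise RuntimeError("no free public port found")

    print(masquerade("10.0.0.5", 40001))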

Traditionally, in physical networking, this per-flow state could live in a single location, e.g. a network device. Both forward and return flows traversed the same device, allowing reverse translation in the case of port-masquerading, or connection-status-based filtering in the case of a stateful firewall.

The single location was a SPOF (single point of failure), so network devices and appliances have long had solutions for this – e.g. sharing session state between a fault-tolerant pair or among a cluster (e.g. of load-balancers).

MidoNet requires sharing per-flow state among (MN Agents at) multiple nodes because:

  • it allows us to avoid forwarding an overlay flow to an intermediate hypervisor or network device to reach a stateful network service (whether a SPOF or a fault-tolerant cluster) such as load-balancing or firewalling.
  • when flows traverse L2 and L3 Gateways, sharing per-flow state supports asymmetric return paths and gateway node fault-tolerance.

The Old Approach

Recap: The (MN Agent at the) ingress node that receives a flow – whether because it was generated by a local VM or because it arrived on a non-tunnel L2 or L3 Gateway NIC that connects physical and virtual workloads – computes, by simulating the flow’s traversal of the overlay network, whether it is dropped or, if not, the virtual device and port where it egresses the overlay. It then maps the egress virtual port onto a physical node and sets up a fast-path flow that matches, modifies, encapsulates, and tunnels the traffic to the computed egress node. MidoNet has progressively added more distributed network services to the overlay in a way that allows more of the flow computation to happen at the ingress host, eliminating intermediate hops and the management of middle boxes/appliances.
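
In rough pseudo-Python, that ingress-side pipeline looks like the sketch below; the names (simulate, PORT_LOCATIONS, install_flow) are hypothetical stand-ins for the MN Agent’s internals, not its real API.

    # Hypothetical sketch of the ingress-side pipeline recapped above.
    PORT_LOCATIONS = {"vport-web-1": "node-2"}     # virtual port -> hosting node

    def simulate(packet):
        """Pretend traversal of the virtual topology: drop, or name the egress port."""
        return {"action": "output", "egress_vport": "vport-web-1",
                "modifications": ["rewrite-dst-mac"]}

    def install_flow(match, actions):
        print(f"fast-path flow: match={match} actions={actions}")

    def handle_first_packet(packet):
        result = simulate(packet)
        if result["action"] == "drop":
            install_flow(match=packet, actions=["drop"])
            return
        egress_node = PORT_LOCATIONS[result["egress_vport"]]   # virtual port -> node
        install_flow(match=packet,
                     actions=result["modifications"] + [f"tunnel-to:{egress_node}"])

    handle_first_packet({"src": "10.0.0.5", "dst": "198.51.100.7"})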

Previously, a distributed database (Cassandra) was used to store any flow state created or required during computation of the flow’s traversal of the overlay network. This state could answer questions like these (a sketch of such keyed state follows the list):

  • is this a forward or return flow?
  • in what state is this TCP connection?
  • where should load-balancer LB1 send this flow?
  • what public source IPv4 address and TCP port would this port-masquerading instance select? what private source IPv4 address and TCP port is being masked by this public IPv4 address and TCP port?
  • FYI – we could not answer “which route did the ECMP algorithm choose?” We didn’t and still don’t consider this a use-case because once a single route is chosen it continues to be used during the lifetime of the flow installed by the Simulation that ran the ECMP algorithm.
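
A rough way to picture that state is as a handful of keyed records shared between nodes. The sketch below is a simplification under assumed key shapes (5-tuples for connection tracking, per-device keys for NAT and load-balancer decisions), not MidoNet’s actual schema.

    # Illustrative flow-state records; the key shapes are assumptions, not MidoNet's schema.
    conntrack = {}   # (proto, src_ip, src_port, dst_ip, dst_port) -> connection status
    nat_map   = {}   # (nat_device, private_ip, private_port) -> (public_ip, public_port)
    lb_map    = {}   # (lb_id, client_ip, client_port, vip, vip_port) -> chosen backend

    fwd_key = ("tcp", "10.0.0.5", 40001, "198.51.100.7", 443)
    conntrack[fwd_key] = "ESTABLISHED"

    def is_return_flow(proto, src_ip, src_port, dst_ip, dst_port):
        """A packet belongs to a return flow if the reversed 5-tuple is already known."""
        return (proto, dst_ip, dst_port, src_ip, src_port) in conntrack

    print(is_return_flow("tcp", "198.51.100.7", 443, "10.0.0.5", 40001))   # True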

Diagrams 1, 2 and 3 below illustrate how flow-state was managed in previous MidoNet versions. In all the diagrams, flow-computation is not called out. It’s implied that flow-state is leveraged by the flow computation and encap/tunnel selection.

For a new flow (diagram 1), the ingress node, Node 1, made two round trips to the Flow-state Database (Cassandra): one to look for existing state (i.e. to detect whether it was a new flow) and another to store the newly created flow-state. Many of the flow-state lookups had to be performed synchronously, which introduced higher first-packet latency, flow-computation pauses and restarts, and thread-management complexity in the MidoNet Agent.
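
In pseudo-Python the old per-flow pattern looked roughly like this; FlowStateDB is a stand-in for the Cassandra-backed store, and the method names and latencies are illustrative.

    import time

    class FlowStateDB:
        """Stand-in for the Cassandra-backed flow-state store (hypothetical API)."""
        def __init__(self, rtt=0.002):
            self._rows, self._rtt = {}, rtt
        def read(self, key):
            time.sleep(self._rtt)               # synchronous round trip #1
            return self._rows.get(key)
        def write(self, key, value):
            time.sleep(self._rtt)               # synchronous round trip #2
            self._rows[key] = value

    db = FlowStateDB()

    def handle_first_packet(flow_key):
        state = db.read(flow_key)               # blocks the simulation: is this a new flow?
        if state is None:
            state = {"status": "new", "created_at": time.time()}
            db.write(flow_key, state)           # blocks again before the packet is tunneled
        return state

    handle_first_packet(("tcp", "10.0.0.5", 40001, "198.51.100.7", 443))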

Diagram 2 shows how the ingress node for the return flow (either Node 2 from Diagram 1 or some Node 3 if the return path is asymmetric) finds the flow state in the Flow-state DB, computes the return flow and tunnels it to Node 1 (or perhaps some Node 4 if the return path terminates with some ECMP routes).

Diagram 3 shows that if the forward flow ingresses some Node 4 (because of upstream ECMP routing) or an amnesiac Node 1, the ingress node is able to find the flow-state in the DB and correctly compute the flow – in this case tunneling it to Node 2 as in Diagram 1.

Diagram 1:

MNFlowStateOld1

Diagram 2:

MNFlowStateOld2

Diagram 3:

MNFlowStateOld3

The New Approach

Summary: the ingress node pushes the flow-state directly to the Interested Set of nodes.

We reasoned that although each flow’s state was required by 2 (and often more) nodes, for performance reasons the state should be local to those nodes. Then we realized that a new flow’s ingress node can easily determine (with a little hinting, explained below) the set of nodes – the Interested Set – that might possibly need that flow’s state and push it to them. Remember that each of the forward and return flows has potentially several ingress nodes and several egress nodes.

Two improvements were therefore feasible:

  • reduce latency because flow computation only requires locally available state
  • remove a dependency on a database

The new approach has the same kinds of race conditions as the old one. In the old approach MidoNet’s implementation had to choose: how many replicas of the state should be written (usually 3); how many replicas must be confirmed before tunneling the flow to the egress node (which determines the likelihood that the return flow’s computation will find the flow-state); and how many replicas to read when the flow-state is needed for the return flow or to re-compute the forward flow. In the new approach the ingress node must push the state directly to all the nodes in the Interested Set (possibly more than 3, but just 1 in the vast majority of cases). The choices are slightly different, but lead to similar race conditions: fire-and-forget or confirm? If retrying, how often? Delay tunneling the forward flow until the flow-state has propagated, or accept the risk of the return flow being computed without the flow-state?
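
One way to see that the trade-offs change shape but not nature is to put the knobs side by side; the parameter names below are illustrative, not real MidoNet configuration keys.

    # Illustrative tunables only; these are not real MidoNet configuration keys.
    OLD_APPROACH = {
        "replicas_written": 3,                      # copies of the state stored in Cassandra
        "writes_confirmed_before_tunneling": 1,     # higher = return flow more likely to find the state
        "replicas_read": 1,                         # copies consulted when the return flow is computed
    }

    NEW_APPROACH = {
        "push_to": "interested_set",                # usually 1 node, sometimes more than 3
        "confirm_delivery": False,                  # fire-and-forget vs. wait for acknowledgement
        "retry_interval_ms": None,                  # if retrying, how often?
        "delay_data_until_state_propagated": False, # or accept the race on the return flow
    }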

Diagram 4:

MNFlowStateNew1

Diagram 5:

MNFlowStateNew2

Diagram 6:

MNFlowStateNew3

Diagrams 4, 5 and 6 above illustrate how flow-state is managed in newer MidoNet versions. In all the diagrams, flow-computation is not called out. It’s implied that flow-state is leveraged by the flow computation and encap/tunnel selection.

In diagram 4, a new flow ingresses Node 1. Node 1 queries its local state and determines this is a new flow. It performs the flow computation during which flow state is created and stored locally. Finally, “before” forwarding the flow to the egress, Node 2, Node 1 determines the Interested Set {Node 1, Node 2, Node 3, and Node 4} and pushes the flow state. In this case the Interested Set includes Nodes 1 and 4 because they’re potential ingresses for the forward flow and Nodes 2 and 3 because they’re potential ingresses for the return flow.
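
That walkthrough reduces to a short ordering at the ingress node, sketched below with hypothetical helpers (push_state, tunnel); it is not MidoNet’s real internals.

    # Hypothetical sketch of the ingress-side ordering in the new approach.
    local_state = {}                      # Node 1's local flow-state table

    def push_state(node, flow_key, state):
        print(f"state packet -> {node}: {flow_key} = {state}")

    def tunnel(node, packet):
        print(f"data packet  -> {node}")

    def handle_first_packet(packet, flow_key, egress_node, interested_set):
        if flow_key not in local_state:                  # local lookup only: this is a new flow
            local_state[flow_key] = {"status": "new"}    # state created during the simulation
            for node in interested_set:                  # e.g. Nodes 1-4 in Diagram 4
                if node != "node-1":                     # Node 1 already holds it locally
                    push_state(node, flow_key, local_state[flow_key])
        tunnel(egress_node, packet)                      # then forward to the egress, Node 2

    handle_first_packet("pkt", ("tcp", "10.0.0.5", 40001, "198.51.100.7", 443),
                        "node-2", {"node-1", "node-2", "node-3", "node-4"})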

Algorithm/protocol for pushing flow-state between Agents

The flow-state is carried in a tunnel packet (currently GRE/UDP) whose tunnel key is set to a special value indicating “flow state”. The state packet is sent unreliably, fire-and-forget, from the ingress node to the egress node before the related data packets. The data packets are not delayed, however, because they do not wait for the egress to reply (even if the egress is configured to reply/confirm).

The upcoming release of MidoNet will have the option of setting DSCP values in the outer header of the state packets. Operators can map this DSCP value to a higher-priority traffic class in the underlay switches. This is important because the MN Agent cannot always detect when a state packet needs to be retransmitted; failure to receive a state packet can result in a failed connection – for example, when the MN Agent at the host where the return flow ingresses does not have the forward flow state.
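
For illustration only, the sketch below shows the idea of a fire-and-forget state packet carrying a reserved tunnel key and a DSCP-marked outer header; the port number, key value, DSCP class, and payload layout are placeholders, not MidoNet’s real wire format.

    import socket
    import struct

    # Placeholders only: not MidoNet's real constants or wire format.
    FLOW_STATE_TUNNEL_KEY = 0xFFFFFF   # special value meaning "this is flow state"
    TUNNEL_UDP_PORT = 6677
    DSCP_CS5 = 40                      # operators can map this class to a priority queue

    def send_flow_state(peer_ip, state_bytes):
        """Fire-and-forget: no retransmission, no wait for a reply."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        # Set DSCP in the outer IP header (the TOS byte carries DSCP in its upper 6 bits).
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, DSCP_CS5 << 2)
        payload = struct.pack("!I", FLOW_STATE_TUNNEL_KEY) + state_bytes
        sock.sendto(payload, (peer_ip, TUNNEL_UDP_PORT))
        sock.close()

    send_flow_state("192.0.2.22", b"serialized flow-state entry")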

Interested Set Hinting

Earlier we mentioned that the ingress host can easily determine the Interested Set with a little hinting. Specifically, the Northbound API integration programmer or the cloud Admin is aware of “Ingress Sets”, i.e. sets of ports that are equivalent ingress points for a single flow. For example (a sketch of how these hints might be used follows the list):

  • Depending on upstream routes, any of the static or dynamic uplinks of the Provider Router may receive a North-to-South flow. The Provider Router’s uplinks should therefore all be added to a single “Ingress Set”.
  • A tenant may have a redundant L3 VPN from their office network to their Tenant Router. Depending on the implementation, this tenant’s VPN traffic may ingress MidoNet at more than one node (on-ramp). The situation is very similar to that of the Provider Router.
  • A VLAN L2 Gateway allows 2 physical links into an 802.1Q virtual bridge (which in turn has untagged ports virtually linked to ports on VLAN-agnostic bridges – one for each Neutron network that needs access to a physical VLAN). Depending on STP, traffic from the physical workloads can ingress MidoNet at either of the 802.1Q bridge’s “uplink” ports.
  • The VXLAN L2 Gateway allows any MidoNet node to tunnel traffic directly to a physical VTEP; the traffic is forwarded to the set of port/VLAN pairs associated with the VNI. The physical VTEP forwards on-virtual-switch traffic directly to the MidoNet host that is local to (the VM interface that owns) the destination MAC. For on-virtual-switch traffic, MN need only consider forward ingress sets for traffic headed to the VTEP, and return ingress sets for traffic coming from the VTEP.
    • The physical VTEP is instructed to forward off-virtual-switch traffic (i.e. from a physical server, through a virtual router in MidoNet) to specific MidoNet hosts that act as VXLAN proxy nodes. For this traffic, MN must also consider the proxies (which may be per-virtual-bridge) to be an ingress set for the flow.
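
Given those hints, determining the Interested Set is essentially a union of the relevant Ingress Sets; the sketch below assumes hypothetical port groupings and helper names.

    # Hypothetical sketch: compute the Interested Set from Ingress Set hints.
    INGRESS_SETS = {
        # ports that are equivalent ingress points for a single flow
        "provider-router-uplinks": {"node-1", "node-4"},
        "tenant-web-vms":          {"node-2", "node-3"},
    }
    PORT_TO_INGRESS_SET = {
        "uplink-port-a": "provider-router-uplinks",
        "web-vm-port-b": "tenant-web-vms",
    }

    def interested_set(forward_ingress_port, forward_egress_port):
        """All nodes that might ingress the forward flow or the return flow
        (the return flow ingresses on the forward flow's egress side)."""
        nodes = set()
        for port in (forward_ingress_port, forward_egress_port):
            group = PORT_TO_INGRESS_SET.get(port)
            if group is not None:
                nodes |= INGRESS_SETS[group]      # every equivalent ingress point
        return nodes

    print(interested_set("uplink-port-a", "web-vm-port-b"))
    # {'node-1', 'node-2', 'node-3', 'node-4'}  (order may vary)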

MidoNet provides a new optional API and CLI commands to allow MidoNet users to group ports into Ingress Sets.

 
