- Part 1: Let's overlay -- Basic info about VXLAN, addressing and headers.
- Part 2: It's all about knowledge -- Packet forwarding overview, VTEP control plane learning options
- Part 3: Hands On #1 -- Configuration on Cisco Nexus Devices, Flood and Learn.
- Part 4: Hands On #2 -- Configuration on Cisco Nexus Devices, EVPN.
- Part 5: NSX Overview
If you're interested in any other topic that you think won't be covered, kindly ping me and I will add it.
After all that prelude, I think we can start. In Part 1 we covered the header that is added to the original frame so it can be forwarded across an L3 network, and we ended the post with an overview of packet forwarding. For later reference, here is a picture of a VXLAN packet:
Figure 1: VXLAN Packet header
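To make the header in the figure a bit more concrete, here is a minimal Python sketch of the 8-byte VXLAN header. The layout follows RFC 7348 (a flags byte with the I bit set, a 24-bit VNI, reserved bits set to zero). The function names are mine, just for illustration, and the outer Ethernet/IP/UDP headers are left out.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # "I" flag: the VNI field is valid


def pack_vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header for a 24-bit VNI."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags (8 bits) + reserved (24 bits)
    # Word 2: VNI (24 bits) + reserved (8 bits)
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def unpack_vni(header: bytes) -> int:
    """Return the VNI carried in an 8-byte VXLAN header."""
    word1, word2 = struct.unpack("!II", header[:8])
    if not (word1 >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("I flag not set, VNI is not valid")
    return word2 >> 8


header = pack_vxlan_header(30000)
```

Here `header` is 8 bytes long, and `unpack_vni(header)` gives back the VNI that was packed.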
In the previous post we reached the point where a host (a hypervisor or a device with a VTEP; I will use these terms interchangeably) gets a frame that is not local and needs to be delivered over VXLAN. Think of it like any L2 forwarding plane: we need to know where to send this packet. The host does a lookup based on the destination MAC address of the original L2 frame (see the picture above), and instead of a destination port (hehe, no L2 switching) the result is a destination VTEP address. This post covers the different methods of learning and populating this internal table; as usual with forwarding, it's all about knowledge.
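As a rough sketch, that internal table can be thought of as a dictionary keyed by VNI and destination MAC. The addresses below are made up, and real implementations keep more state (aging timers, flags, and so on).

```python
from typing import Dict, Optional, Tuple

# (VNI, destination MAC) -> remote VTEP IP address
ForwardingTable = Dict[Tuple[int, str], str]


def lookup_vtep(table: ForwardingTable, vni: int, dst_mac: str) -> Optional[str]:
    """Return the remote VTEP for dst_mac in this VNI, or None if it is unknown."""
    return table.get((vni, dst_mac.lower()))


# Hypothetical entry: MAC-B lives behind the VTEP with address 10.0.0.2.
table: ForwardingTable = {(30000, "00:00:00:00:00:0b"): "10.0.0.2"}

known = lookup_vtep(table, 30000, "00:00:00:00:00:0B")
unknown = lookup_vtep(table, 30001, "00:00:00:00:00:0B")
```

A `None` result is the interesting case: what the VTEP does next depends on the learning method in use.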
VxLAN Flood and Learn
This scenario was the first one introduced. It relies on head-end replication, meaning that when a VTEP has no entry for the destination MAC address, the request (typically the ARP sent by the end host) is sent out to the other devices / VTEPs in the VXLAN network. This is done by sending the request to the VXLAN multicast group for this bridge domain; remote VTEPs get the packet, and the answer comes back directly to the originating VTEP. (Here we can see two requirements for running this: a multicast-enabled core, and IGP or unicast reachability between VTEP addresses.)
Figure 2: VXLAN Peer Discoveries and Tenant Address Learning
I will base the explanation on this amazing picture that I just stole from Cisco's web page :)
- End System A (ES-A) sends out an ARP request for IP-B on its Layer 2 VXLAN network (note the Dst MAC Address).
- VTEP-1 receives the ARP request. Since it doesn't have a mapping for IP-B yet, it encapsulates the ARP request in an IP multicast packet and forwards it to the VXLAN multicast group for that specific segment (VNI). The encapsulated multicast packet has the IP address of VTEP-1 as the source IP address and the VXLAN multicast group address as the destination IP address.
- The IP multicast packet is distributed to all members of the tree. VTEP-2 and VTEP-3 receive the encapsulated multicast packet because they've joined that specific VXLAN multicast group; they then decapsulate it and forward it onto their local VXLAN network. In this process, if there was no prior communication with VTEP-1, they insert the mapping between the MAC address of ES-A and the IP of VTEP-1 into their local tables.
- After the local transport of the ARP request, End System B (ES-B) gets the request forwarded by VTEP-2, responds with its own MAC address (MAC-B), and learns the IP-A-to-MAC-A mapping.
- VTEP-2 receives the ARP reply from ES-B, which has MAC-A as the destination MAC address. From the learning in the previous steps it knows the MAC-A-to-VTEP-1 mapping, so it can use the unicast tunnel to forward the ARP reply back to VTEP-1. The ARP reply is encapsulated in the UDP payload of a packet sourced from VTEP-2 and destined to VTEP-1.
- VTEP-1 receives the encapsulated ARP reply from VTEP-2. It decapsulates it and forwards the ARP reply back to ES-A. It also learns the IP address of VTEP-2 from the outer IP header and inspects the original packet to learn the MAC-B-to-VTEP-2 IP mapping.
- Subsequent IP packets between ES-A and ES-B are unicast forwarded, based on the mapping information on VTEP-1 and VTEP-2, using the VXLAN tunnel between them.
- VTEP-1 can optionally perform proxy ARPs for subsequent ARP requests for IP-B to reduce the flooding over the transport network.
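The steps above can be sketched as a tiny simulation. This is only a toy model of data-plane learning under my own assumptions: the class and method names are made up, the multicast group is just a Python list, and there is no real encapsulation or proxy ARP.

```python
class Vtep:
    def __init__(self, ip):
        self.ip = ip
        self.mac_table = {}  # remote MAC -> remote VTEP IP

    def receive(self, outer_src_ip, inner_src_mac):
        # Data-plane learning: the inner source MAC is reachable
        # through the VTEP that appears as the outer source IP.
        self.mac_table[inner_src_mac] = outer_src_ip

    def send(self, src_mac, dst_mac, group_members):
        """Deliver a frame and return the list of VTEPs that received it."""
        remote_ip = self.mac_table.get(dst_mac)
        if remote_ip is None:
            # Unknown or broadcast destination: flood to the multicast group.
            targets = [v for v in group_members if v is not self]
        else:
            # Known destination: unicast tunnel to a single VTEP.
            targets = [v for v in group_members if v.ip == remote_ip]
        for vtep in targets:
            vtep.receive(self.ip, src_mac)
        return targets


vtep1, vtep2, vtep3 = Vtep("10.0.0.1"), Vtep("10.0.0.2"), Vtep("10.0.0.3")
group = [vtep1, vtep2, vtep3]

# ES-A's ARP request is flooded; ES-B's ARP reply goes back as unicast.
request_targets = vtep1.send("MAC-A", "ff:ff:ff:ff:ff:ff", group)
reply_targets = vtep2.send("MAC-B", "MAC-A", group)
```

After these two calls VTEP-1 knows MAC-B is behind VTEP-2, while VTEP-2 and VTEP-3 both know MAC-A is behind VTEP-1.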
Head-end Replication
When you work with VXLAN and read the literature, it is common to come across the concept of head-end replication. What this essentially means is that the local VTEP carries the overhead of replicating broadcast traffic out to the other VTEPs. In the original release of VXLAN, which uses multicast as the underlying layer to reach VTEPs, this only means encapsulating the packet and sending it to the multicast group. But there is also the possibility of unicast peering (full mesh) with all the VTEPs, and in that scenario head-end replication has a noticeable impact: the source VTEP must send one copy per remote VTEP.
Figure 3: Head-end replication example in unicast VTEP reachability
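A small sketch of the difference in work for the source VTEP, assuming a hypothetical list of VTEP addresses and a made-up multicast group. Each tuple stands for one packet the source VTEP actually transmits.

```python
from typing import List, Tuple


def head_end_replicate(frame: bytes, local_vtep: str,
                       vteps: List[str]) -> List[Tuple[str, str, bytes]]:
    """Unicast underlay: one (outer_src, outer_dst, frame) copy per remote VTEP."""
    return [(local_vtep, peer, frame) for peer in vteps if peer != local_vtep]


def multicast_send(frame: bytes, local_vtep: str,
                   group: str) -> List[Tuple[str, str, bytes]]:
    """Multicast underlay: a single copy to the group; the core replicates it."""
    return [(local_vtep, group, frame)]


vteps = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
unicast_copies = head_end_replicate(b"arp-request", "10.0.0.1", vteps)
multicast_copies = multicast_send(b"arp-request", "10.0.0.1", "239.1.1.1")
```

With four VTEPs the source sends three unicast copies versus a single multicast packet, and the gap grows linearly with the number of VTEPs.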
VxLAN MAC Distribution
Another well-known method is VXLAN MAC Distribution. Head-end replication is still used to deliver broadcast and multicast frames to remote VTEPs, but... what about unknown unicast? You shouldn't have any (that's the wish, read further). In this scenario MAC learning is not based on data plane activity; instead, a central control unit (Nexus 1000V VSM, NSX controller, etc.) keeps track of all MAC addresses in the domain and sends this information to the VTEPs in the system.
Why do I say that this is a wish? Basically, things are there to be broken. Just like in any mapping table (a CAM table, for example), entries have an aging timer associated with them. So, going back to the first scenario, if VTEP-2 announces the MAC-B entry and VTEP-1 gets populated with it, all traffic will flow accordingly; and if VTEP-1 doesn't have an entry for MAC-B, it will query the controller to get this info. Here two branches appear: a) the controller has an entry and replies to VTEP-1, the entry gets installed and unicast traffic flows; b) the controller doesn't have an entry for MAC-B and replies with an invalid entry, so VTEP-1 must use head-end replication to learn where to send its packet (*this may vary depending on the VTEP's OS/SW implementation).
There is also another case, in which VTEP-1 has a valid entry but loses connectivity to the controller and the entry ages out (and is removed from the table). In this case the controller can't be queried, and head-end replication will be used again.
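The three cases (controller has the entry, controller doesn't have it, controller unreachable and the entry aged out) can be sketched like this. All names and the 300-second aging value are assumptions for illustration; as noted above, real behavior depends on the VTEP implementation.

```python
class Controller:
    def __init__(self):
        self.mac_to_vtep = {}

    def register(self, mac, vtep_ip):
        self.mac_to_vtep[mac] = vtep_ip

    def query(self, mac):
        return self.mac_to_vtep.get(mac)  # None models an "invalid entry" reply


class Vtep:
    def __init__(self, controller, aging_seconds):
        self.controller = controller  # None models lost connectivity
        self.aging_seconds = aging_seconds
        self.cache = {}  # MAC -> (remote VTEP IP, time the entry was installed)

    def resolve(self, mac, now):
        entry = self.cache.get(mac)
        if entry is not None and now - entry[1] < self.aging_seconds:
            return ("unicast", entry[0])
        self.cache.pop(mac, None)  # aged out, or never present
        if self.controller is not None:
            vtep_ip = self.controller.query(mac)
            if vtep_ip is not None:
                self.cache[mac] = (vtep_ip, now)
                return ("unicast", vtep_ip)
        return ("head-end-replication", None)


controller = Controller()
controller.register("MAC-B", "10.0.0.2")
vtep1 = Vtep(controller, aging_seconds=300)

case_a = vtep1.resolve("MAC-B", now=0)         # controller has the entry
case_b = vtep1.resolve("MAC-C", now=0)         # controller has no entry
vtep1.controller = None                        # connectivity to controller lost
still_cached = vtep1.resolve("MAC-B", now=100)
aged_out = vtep1.resolve("MAC-B", now=400)     # entry expired, nobody to ask
```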
VxLAN BGP EVPN Control plane
Quick disclaimer: before starting, I will say that you will find a lot of literature on this approach, and also a lot of information about the configuration needed to make it work. This is the desired scenario for any real / production environment; Flood and Learn was shown just to understand what we had in the beginning and how we arrived at a real control plane solution (and in a standards-based way!).
EVPN overlay specifies adaptations to the BGP MPLS-based EVPN solution that enable it to be applied as a network virtualization overlay with VXLAN encapsulation. Essentially, this brings us great benefits (I will add more later):
- Standardized solution: BGP plus VxLAN
- Real Control Plane learning
For this approach, what we have is (for those who know MPLS EVPN):
- The VTEP / network virtualization edge (NVE) is the equivalent of the PE node.
- VTEPs use control plane learning/distribution via BGP for remote MAC addresses instead of data plane learning.
- Broadcast, unknown unicast and multicast (BUM) data traffic is sent using a shared multicast tree.
- To reduce the need for a full mesh between VTEPs, we can rely on BGP route reflectors (RR).
- Enhanced security by using well-known route filtering and constrained route distribution (control plane traffic for a given overlay is only distributed to the VTEPs that are in that overlay instance).
- A host (MAC) mobility mechanism to ensure that all the VTEPs in the overlay instance know the specific VTEP associated with the MAC.
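To illustrate control plane learning and the mobility point, here is a simplified sketch of how a VTEP could process received MAC advertisements. It assumes the sequence-number idea from the EVPN MAC Mobility extended community (RFC 7432), where a higher sequence number wins. Tie-breaking, withdrawals and everything else BGP does are left out, and the class names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class MacRoute:
    vni: int
    mac: str
    next_hop: str  # IP of the advertising VTEP
    seq: int = 0   # mobility sequence number (simplified)


class EvpnMacTable:
    def __init__(self):
        self.best: Dict[Tuple[int, str], MacRoute] = {}

    def receive(self, route: MacRoute) -> bool:
        """Install the route if it is new or has a higher sequence number."""
        key = (route.vni, route.mac)
        current = self.best.get(key)
        if current is None or route.seq > current.seq:
            self.best[key] = route
            return True
        return False


table = EvpnMacTable()
first = table.receive(MacRoute(30000, "MAC-A", "10.0.0.1"))         # initial learn
moved = table.receive(MacRoute(30000, "MAC-A", "10.0.0.3", seq=1))  # host moved
stale = table.receive(MacRoute(30000, "MAC-A", "10.0.0.1", seq=0))  # old advert
```

After the move, lookups for MAC-A in VNI 30000 point at 10.0.0.3, and the stale advertisement is ignored.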
MP-BGP can be used for L2 VXLAN and also for L3 VXLAN (instead of MAC address learning, think of IP-to-VTEP association; do you remember LISP?). It's not my goal to enumerate all the benefits of running a BGP EVPN control plane for VXLAN (greater scalability, well-known and proven protocols, etc.). Instead, I will focus on the life of a packet in this new scenario, and hopefully in the next post we can cover all the variations (anycast GW, asymmetric and symmetric IRB, etc.).
Packet forwarding in L2 VxLAN Segment
In this scenario we cover L2 VxLAN communication: Host-A and Host-B belong to the same VNI, 30000.
- Host-A sends traffic with destination MAC B to its local VTEP V1 (after ARP resolution).
- V1 looks up MAC B in its table.
- V1 has an entry for MAC B via VTEP V2, so it encapsulates the packet and sends it as unicast to V2.
- V2 gets the packet, decapsulates it and delivers it locally to Host-B.
End of the happy tale, right? What about L3 traffic between VxLAN segments? (Note that we didn't cover this in flood and learn, since in that approach traffic has to reach a device that has both VxLAN segments involved and is routed there.)
Packet forwarding between different L2 VxLAN VNI
In this scenario Host-A (VNI 30000) sends a packet to Host-F (VNI 30001), and the core network uses VNI 50000. The process looks like this:
- Host-A sends traffic to its default gateway (DG), after ARP resolution; the DG is configured on the locally attached VTEP V1.
- V1 makes a FIB lookup based on the destination IP.
- V1 routes the packet to VTEP V2, but the VXLAN packet uses the core VNI 50000.
- V2 gets the packet, decapsulates it, makes a FIB lookup that determines the destination VNI is 30001, rewrites the packet and delivers it locally.
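The routed path above can be sketched as two lookups, one on each VTEP. The IP address of Host-F and the table contents are made up for the example; only the VNI numbers come from the scenario.

```python
import ipaddress

CORE_VNI = 50000  # VNI used across the core in this scenario

# Ingress VTEP V1: host routes (assumed learned through BGP EVPN) -> remote VTEP.
v1_fib = {ipaddress.ip_address("10.1.1.6"): "V2"}

# Egress VTEP V2: locally attached hosts -> (L2 VNI, MAC).
v2_local_hosts = {ipaddress.ip_address("10.1.1.6"): (30001, "MAC-F")}


def ingress_route(dst_ip: str) -> dict:
    """V1: FIB lookup on the destination IP, then encapsulate with the core VNI."""
    remote_vtep = v1_fib[ipaddress.ip_address(dst_ip)]
    return {"outer_dst": remote_vtep, "vni": CORE_VNI, "dst_ip": dst_ip}


def egress_route(packet: dict) -> dict:
    """V2: decapsulate, FIB lookup, rewrite and deliver in the destination VNI."""
    dst_vni, dst_mac = v2_local_hosts[ipaddress.ip_address(packet["dst_ip"])]
    return {"vni": dst_vni, "dst_mac": dst_mac, "dst_ip": packet["dst_ip"]}


in_core = ingress_route("10.1.1.6")   # Host-A -> Host-F, as seen in the core
delivered = egress_route(in_core)     # what V2 hands to Host-F's segment
```

Note that neither the source VNI (30000) nor the destination VNI (30001) shows up on the wire between V1 and V2, only the core VNI.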
OK, so now what about the ugly tale? As you can see, both examples say "this happens after ARP resolution", but how do we process ARP?
There is an ARP suppression mechanism. Essentially, the IP-to-MAC bindings learnt locally via ARP, as well as those learnt over BGP EVPN, are stored in a local ARP suppression cache. An ARP request sent from the end host is trapped at the source ToR, and a lookup is performed in the ARP suppression cache with the destination IP as the key. If there is a HIT, the ToR proxies on behalf of the destination and replies with the destination MAC.
If the lookup results in a MISS, because the destination is unknown or is a silent end host, the ToR re-injects the ARP request received from the requesting end host and broadcasts it within the layer-2 VNI. This entails sending the ARP request out locally over the server-facing ports as well as sending a VXLAN-encapsulated packet with the layer-2 VNI over the IP core. This follows the same process that we saw before; the only difference is that, on the reply, the ToR stores the binding in its ARP suppression cache for further use.
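The HIT / MISS logic is simple enough to sketch in a few lines. This is a toy model with made-up names and addresses; a real ToR also ages entries and distinguishes local from remote bindings.

```python
class ArpSuppressionCache:
    def __init__(self):
        self.ip_to_mac = {}

    def learn(self, ip, mac):
        """Store a binding learned locally via ARP or remotely via BGP EVPN."""
        self.ip_to_mac[ip] = mac

    def handle_arp_request(self, target_ip):
        mac = self.ip_to_mac.get(target_ip)
        if mac is not None:
            # HIT: the ToR answers on behalf of the destination.
            return ("proxy-reply", mac)
        # MISS: re-inject the request and broadcast it within the L2 VNI.
        return ("flood-in-l2-vni", None)


cache = ArpSuppressionCache()
miss = cache.handle_arp_request("10.1.1.2")  # silent or unknown host
cache.learn("10.1.1.2", "MAC-B")             # binding stored once the reply is seen
hit = cache.handle_arp_request("10.1.1.2")   # next request never leaves the ToR
```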