Mainly Networking, SDN, Automation, Datacenter and OpenStack as an overlay for my life

Thursday, May 18, 2017

Stretched DC, really?... ok, for L3, BGP conditional forwarding

A long time ago (I think it was years back) I was reviewing a DR solution for an internal customer who had two datacenters and a DCI between them (dark fiber). They initially moved to a stretched design, extending VLANs across both sites and using the L3 gateway on only one side at a time, since as a business requirement traffic should always leave from the primary DC. However, they expected some kind of solution that could automatically switch over to the secondary DC in case of a failure at DC1.

For cases like this it's always a pleasure to read Ivan and see how he predicts the design issues that I will face in the future (Stretched DCI). Luckily, no stateful firewalls were involved here.

The main issue was not only to detect which side is alive (which is not easy without a witness, and we don't have one at all) but also how to decide which traffic should be served and from where.

So here is a big stop. Before going any further we need to make some assumptions and business decisions:

  • If the DC1 site fails but the DCI and the DC2 site are alive, traffic will enter from the DC2 side and traverse the DCI.
  • If the DCI fails, traffic for the stretched VLAN subnets will continue to be served from DC1. This implies moving the servers on the other side to the surviving side by some other method, or at least shutting them down.
  • If the DC2 site fails but the DCI and the DC1 site are alive, traffic will enter from the DC1 side and traverse the DCI to reach the DC2-side servers.
  • Traffic should leave and enter through DC1 whenever possible, and the DC2 site should not be used unless strictly necessary (this was imposed by the customer).
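The decisions above can be read as a small truth table. Here is a minimal Python sketch of it; the names `Plan` and `plan_traffic` are hypothetical, and the combinations the list does not cover (for example DC1 and the DCI failing at the same time) are filled in as my own assumption, not something the customer specified.

```python
from typing import NamedTuple, Optional


class Plan(NamedTuple):
    ingress: Optional[str]  # site whose edge attracts the traffic, None if both sites are down
    via_dci: bool           # True when servers on the other side are reached across the DCI


def plan_traffic(dc1_up: bool, dc2_up: bool, dci_up: bool) -> Plan:
    """Map the health of DC1, DC2 and the DCI to the expected entry point."""
    if dc1_up:
        # DC1 is preferred whenever it is alive; with the DCI down,
        # only the servers local to DC1 remain reachable.
        return Plan("DC1", via_dci=dci_up)
    if dc2_up:
        # DC1 is gone, so DC2 takes over. Assumption: this also holds
        # when the DCI is down, a case the list does not spell out.
        return Plan("DC2", via_dci=dci_up)
    return Plan(None, via_dci=False)


if __name__ == "__main__":
    print(plan_traffic(dc1_up=False, dc2_up=True, dci_up=True))
```

Writing it down this way makes the awkward case visible: with the DCI down and both sites up, DC1 keeps the traffic and whatever lives on the DC2 side of a stretched VLAN is simply unreachable.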

So after reviewing a lot of options, and accepting that the design can still fail in some cases that we will have to work around (and the fact that we need to build a stretched cluster after all), we came across a nice BGP feature called conditional advertisement (the "conditional forwarding" of the title).
Just for reference, BGP conditional advertisement lets us advertise a given network depending on whether another prefix is present in our BGP table. This is really useful for this scenario: define a witness network on each side and advertise it to the other side. It should be a dummy network, like 1.1.1.0/30 for DC1 and 1.1.2.0/30 for DC2. The match statement then checks whether we are receiving that advertisement and, based on that, either withdraws our own advertisement or lets it flow.
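The witness-network check described above can be sketched in a few lines of Python. This is only a model of the decision, not router code: `should_advertise` is a hypothetical name, prefixes are compared as exact matches, and I am assuming the condition is met only while none of the watched prefixes is present.

```python
import ipaddress
from typing import Iterable


def should_advertise(bgp_table: Iterable[str], watched: Iterable[str]) -> bool:
    """Return True when none of the watched (witness) prefixes is in the table."""
    present = {ipaddress.ip_network(p) for p in bgp_table}
    witnesses = {ipaddress.ip_network(p) for p in watched}
    # Exact prefix/length comparison: 1.1.1.0/30 only matches 1.1.1.0/30.
    return present.isdisjoint(witnesses)


if __name__ == "__main__":
    # DC2 still hears the DC1 witness route: keep quiet.
    print(should_advertise(["0.0.0.0/0", "1.1.1.0/30"], ["1.1.1.0/30"]))
    # The witness route is gone: start advertising from DC2.
    print(should_advertise(["0.0.0.0/0"], ["1.1.1.0/30"]))
```

Note that the function cannot tell why the witness route disappeared: a dead DC1 and a dead DCI look exactly the same from the DC2 side.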

OK, enough reading. Let's have a quick look at the configuration (on NXOS) and the behaviour:



Here is the config for the eBGP side of the DC2

router bgp 65533
 bgp log-neighbor-changes
 neighbor 19.21.54.245 remote-as 666
 neighbor 19.21.54.245 update-source Vlan30
 !
 neighbor 19.21.54.245 activate
 neighbor 19.21.54.245 advertise-map ADV-MAP non-exist-map NON-EXIST
 neighbor 19.21.54.245 next-hop-self
 neighbor 19.21.54.245 soft-reconfiguration inbound
 neighbor 19.21.54.245 route-map SOME_RANGE_ONLY_AT_DC2 out
!
route-map SOME_RANGE_ONLY_AT_DC2 permit 10
 match ip address prefix-list SOME_RANGE_ONLY_AT_DC2
!
route-map NON-EXIST permit 10
 match ip address prefix-list NON-EXIST
!
route-map ADV-MAP permit 10
 match ip address prefix-list ADV-MAP
!
!# ADV-MAP: These are the routes that will be advertised when the non-exist condition is met.
ip prefix-list ADV-MAP seq 5 permit 201.212.14.128/26
ip prefix-list ADV-MAP seq 10 permit 201.212.14.0/24
ip prefix-list ADV-MAP seq 15 permit 201.212.15.0/24
!
!# NON-EXIST: The existence of these networks triggers the withdrawal.
ip prefix-list NON-EXIST seq 5 permit 0.0.0.0/32
ip prefix-list NON-EXIST seq 10 permit 1.1.1.0/30
!
ip prefix-list SOME_RANGE_ONLY_AT_DC2 seq 5 permit 200.1.33.80/28
!

With that in place, the normal (steady-state) behaviour looks like this (the routes are withdrawn):
dc1-side# sh ip bgp summary
BGP summary information for VRF default, address family IPv4 Unicast
BGP router identifier 10.120.32.240, local AS number 65533
BGP table version is 276, IPv4 Unicast config peers 3, capable peers 3
49 network entries and 84 paths using 7756 bytes of memory
BGP attribute entries [5/640], BGP AS path entries [1/10]
BGP community entries [0/0], BGP clusterlist entries [0/0]
42 received paths for inbound soft reconfiguration
41 identical, 1 modified, 0 filtered received paths using 8 bytes
Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
10.x.143.5      4    xx  296153  296254
10.12.32.2      4 65533  296111  296115
10.33.32.21     4 65533   10212   10280      276    0    0 1w0d     5
dc1-side# sh ip bgp neighbors 10.120.32.248 advertised-routes
Peer 10.120.32.248 routes for address family IPv4 Unicast:
BGP table version is 276, local router ID is 10.120.32.240
Status: s-suppressed, x-deleted, S-stale, d-dampened, h-history, *-valid, >-best
Path type: i-internal, e-external, c-confed, l-local, a-aggregate, r-redist
Origin codes: i - IGP, e - EGP, ? - incomplete, | - multipath
   Network            Next Hop         Metric  LocPrf  Weight Path
*>e0.0.0.0/0          10.110.143.5                150       0 xx 3549 i
*>l1.1.1.0/30         0.0.0.0                     100   32768 i    // Trigger route injected
# !!!!!!!!!!!!!!
dc2-side#sh ip bgp summary
BGP router identifier 10.120.32.74, local AS number 65533
BGP table version is 129, main routing table version 129
8 network entries using 936 bytes of memory
11 path entries using 572 bytes of memory
5/3 BGP path/bestpath attribute entries using 800 bytes of memory
2 BGP AS-PATH entries using 48 bytes of memory
0 BGP route-map cache entries using 0 bytes of memory
0 BGP filter-list cache entries using 0 bytes of memory
BGP using 2356 total bytes of memory
BGP activity 44/36 prefixes, 85/74 paths, scan interval 60 secs
Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
10.x.32.2       4 65533      xy   15496      129    0    0 1w3d     2
10.12.32.1      4 65533      xu   15494      129    0    0 1w3d     2
19.21.54.245    4  3549      xn   11109      129    0    0 1w0d     1
dc2-side#sh ip bgp neighbors 10.y.6.12 received-routes
BGP table version is 129, local router ID is 10.120.32.74
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete
   Network          Next Hop            Metric LocPrf Weight Path
*>i0.0.0.0          10.120.32.240                 150      0 xxx 3549 i
*>i1.1.1.0/30       10.1.2.240                    100      0 i    // Trigger route received
Total number of prefixes 2
dc2-side#sh ip bgp neighbors 10.120.6.12 received-routes
BGP table version is 129, local router ID is 10.120.32.74
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal,
              r RIB-failure, S Stale
Origin codes: i - IGP, e - EGP, ? - incomplete
   Network          Next Hop            Metric LocPrf Weight Path
*  i0.0.0.0         10.120.32.241                 150      0 xxx 3549 i
*  i1.1.1.0/30      10.120.32.241                 100      0 i    // Trigger route received
Total number of prefixes 2
###
### Important verification here, check at the end
###
dc2-side#sh ip bgp neighbors 19.21.54.245
BGP neighbor is 19.21.54.245, remote AS 3549, external link
  BGP version 4, remote router ID 67.17.82.239
  BGP state = Established, up for 1w0d
  Last read 00:00:38, last write 00:00:56, hold time is 180, keepalive interval is 60 seconds
  Neighbor capabilities:
    Route refresh: advertised and received(new)
    Four-octets ASN Capability: advertised and received
    Address family IPv4 Unicast: advertised and received
  Message statistics:
    InQ depth is 0
    OutQ depth is 0
                         Sent       Rcvd
    Opens:                  2          2
    Notifications:          0          0
    Updates:               13      12291
    Keepalives:         11093      74085
    Route Refresh:          1          0
    Total:              11109      86378
  Default minimum time between advertisement runs is 30 seconds
 For address family: IPv4 Unicast
  BGP table version 129, neighbor version 129/0
  Output queue size : 0
  Index 2, Offset 0, Mask 0x4
  2 update-group member
  Inbound soft reconfiguration allowed
  NEXT_HOP is always this router
  Outbound path policy configured
  Route map for outgoing advertisements is SOME_RANGE_ONLY_AT_DC2
###
  Condition-map NON-EXIST, Advertise-map ADV-MAP, status: Withdraw


Now, if there is a failure on the DC1 side, the condition triggers and DC2 starts advertising the prefixes.



dc2-side#sh ip bgp 1.1.1.0/30
% Network not in table    // Trigger route NOT received
dc2-side#sh ip bgp summary
BGP router identifier 10.120.32.74, local AS number 65533
BGP table version is 134, main routing table version 134
7 network entries using 819 bytes of memory
9 path entries using 468 bytes of memory
4/2 BGP path/bestpath attribute entries using 640 bytes of memory
2 BGP AS-PATH entries using 48 bytes of memory
0 BGP route-map cache entries using 0 bytes of memory
0 BGP filter-list cache entries using 0 bytes of memory
BGP using 1975 total bytes of memory
BGP activity 44/37 prefixes, 85/76 paths, scan interval 60 secs
Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
19.21.54.245    4   xys   86397   11126      131    0    0 1w0d     1    // Only L3 peer is alive
dc2-side#sh ip bgp neighbors 19.21.54.245
BGP neighbor is 19.21.54.245, remote AS 3549, external link
  BGP version 4, remote router ID 67.17.82.239
  BGP state = Established, up for 1w0d
  Last read 00:00:24, last write 00:00:40, hold time is 180, keepalive interval is 60 seconds
  Neighbor capabilities:
    Route refresh: advertised and received(new)
    Four-octets ASN Capability: advertised and received
    Address family IPv4 Unicast: advertised and received
  Message statistics:
    InQ depth is 0
    OutQ depth is 0
                         Sent       Rcvd
    Opens:                  2          2
    Notifications:          0          0
    Updates:               13      12310
    Keepalives:         11110      74085
    Route Refresh:          1          0
    Total:              11126      86397
  Default minimum time between advertisement runs is 30 seconds
 For address family: IPv4 Unicast
  BGP table version 134, neighbor version 134/0
  Output queue size : 0
  Index 2, Offset 0, Mask 0x4
  2 update-group member
  Inbound soft reconfiguration allowed
  NEXT_HOP is always this router
  Outbound path policy configured
  Route map for outgoing advertisements is SOME_RANGE_ONLY_AT_DC2
  Condition-map NON-EXIST, Advertise-map ADV-MAP, status: Advertise    // Routes in advertise-map ADV-MAP are being advertised.
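Since the whole verification hangs on that single "status:" line, it is easy to check from a script. Below is a minimal Python sketch that pulls the condition state out of captured "sh ip bgp neighbors" text; the helper name `conditional_status` is hypothetical, the regular expression only assumes the line format shown in the outputs above, and how you collect the text from the device is left out on purpose.

```python
import re
from typing import Dict, Optional

# Matches e.g. "Condition-map NON-EXIST, Advertise-map ADV-MAP, status: Withdraw"
STATUS_RE = re.compile(
    r"Condition-map\s+(?P<condition>\S+),\s+"
    r"Advertise-map\s+(?P<advertise>\S+),\s+"
    r"status:\s+(?P<status>\w+)"
)


def conditional_status(show_output: str) -> Optional[Dict[str, str]]:
    """Return the condition-map, advertise-map and status, or None if the line is absent."""
    match = STATUS_RE.search(show_output)
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = "  Condition-map NON-EXIST, Advertise-map ADV-MAP, status: Withdraw"
    print(conditional_status(sample))
```

In steady state the status field should read Withdraw on the DC2 side, and Advertise once the witness route from DC1 is gone.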


Is this all we need? Definitely not. There are still a lot of things to resolve and we don't have an optimal design (we can debate this: if we are meeting the business requirements, is there anything else to do?). Apart from that, notice that stretching a VLAN is not a good choice. Guess why? You're extending your fault domain, and that doesn't simplify things; it also makes isolation and failure detection more complex. So let's start asking why we made such poor decisions, and why we can't start talking about application-level resiliency instead. That would make our life better by letting us use different subnets/networks at each site and handle inbound and outbound traffic more flexibly with existing methods (a long talk about BGP attributes and policy control enters here).


Some references:

Cisco. (August 2010). Cisco IP Routing. http://www.cisco.com/en/US/tech/tk365/technologies_configuration_example09186a0080094309.shtml












Article By: Ariel Liguori

CCIE DC #55292 / VCIX-NV / JNCIP "Network Architect mainly focused on SDN/NFV, Openstack adoption, Datacenter technologies and automations running on top of it :) "
