Building a Scalable Kubernetes Network Architecture with Cilium, BGP, and L2 Announcements
Introduction
When I began setting up my Kubernetes cluster’s networking infrastructure, I faced a common but challenging problem: how to efficiently manage a limited public IP range while maintaining a clear separation between public and private services. I had a small /29 subnet (X.X.55.88/29) for public services, which meant I needed to be extremely conservative with IP usage. This constraint led me to develop a hybrid networking solution that combines Border Gateway Protocol (BGP) for public services with Layer 2 (L2) announcements for private services.
In this article, I’ll share my journey through this setup, including the challenges I encountered, the solutions I implemented, and the reasoning behind my decisions. I’ll provide a detailed walkthrough that will enable you to replicate this setup in your own environment.
Infrastructure Overview
My setup is built on a Talos-based Kubernetes cluster with five nodes:
- A control plane node (control-01) at 10.121.0.10
- Four worker nodes, each with dual network interfaces:
  - node01: 10.121.0.11 and 10.122.0.11
  - node02: 10.121.0.12 and 10.122.0.12
  - node03: 10.121.0.13 and 10.122.0.13
  - node04: 10.121.0.14 and 10.122.0.14
The dual network interfaces serve a crucial purpose in my design: the 10.121.0.0/24 network handles public-facing traffic through BGP, while the 10.122.0.0/24 network manages private services using L2 announcements. This separation provides both security and flexibility in handling different types of network traffic.
IP Pools for BGP and L2
First, let’s look at how we define our IP pools in the platform configuration (values-platform-prod.yaml):
```yaml
externalIP:
  ipAddressIngressPublicPool:
    - "X.X.55.93/32"
  ipAddressServicePublicPool:
    - "X.X.55.94/32"
  ipAddressIngressPrivatePool:
    - "10.122.0.128/28"
  ipAddressServicePrivatePool:
    - "10.122.0.160/28"
```
This configuration establishes four planned IP pools:
- Public Ingress Pool (X.X.55.93/32): A single dedicated public IP for the public-facing ingress controller.
- Public Service Pool (X.X.55.94/32): A single public IP address for services that need direct public access.
- Private Ingress Pool (10.122.0.128/28): Provides 16 IP addresses (10.122.0.128 to 10.122.0.143) for private ingress services, giving us enough addresses for multiple ingress controllers and their failovers.
- Private Service Pool (10.122.0.160/28): Another 16 IP addresses (10.122.0.160 to 10.122.0.175) reserved for internal services that need stable, dedicated IP addresses.
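These ranges are easy to sanity-check with Python’s standard `ipaddress` module. The sketch below (with the masked public octets written as `0.0`, purely as a placeholder) reproduces the pool sizes and boundaries described above:

```python
import ipaddress

# The four pools from values-platform-prod.yaml.
# "0.0" stands in for the masked public prefix (X.X).
pools = {
    "ingress-public": "0.0.55.93/32",
    "service-public": "0.0.55.94/32",
    "ingress-private": "10.122.0.128/28",
    "service-private": "10.122.0.160/28",
}

for name, cidr in pools.items():
    net = ipaddress.ip_network(cidr)
    # A /32 holds exactly one address; each /28 holds sixteen.
    print(f"{name}: {net.num_addresses} address(es), {net[0]} - {net[-1]}")
```

Running this confirms the private pools span 10.122.0.128–143 and 10.122.0.160–175, which is where the specific IPs assigned later in this article come from.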
Understanding the Network Architecture
Before diving into the implementation, it’s important to understand why I chose this hybrid approach. The architecture combines two complementary networking methods:
BGP handles public-facing services by dynamically advertising routes to my Ubiquiti Dream Machine Pro (UDM Pro). This approach is perfect for managing the limited public IP space because it allows for efficient IP utilization and automatic failover. When a service needs a public IP, BGP ensures it’s accessible through any available node, providing both high availability and efficient resource usage.
L2 announcements, on the other hand, manage private services within my internal network. This method is simpler and more efficient for internal traffic, as it operates at the data link layer and doesn’t require the overhead of BGP routing. It’s particularly well-suited for services that only need to be accessible within the cluster or from my private network.
Setting Up the UDM Pro
The first step in implementing this architecture was configuring BGP on my UDM Pro. I needed to set up the router as a BGP speaker that could communicate with my Kubernetes nodes. Here’s the configuration I uploaded to https://your.udm.pro/network/default/settings/routing/bgp:
```
frr defaults traditional
log file stdout
router bgp 65000
 bgp ebgp-requires-policy
 bgp router-id 10.121.0.1
 maximum-paths 1
 ! Peer group for ASN 65001
 neighbor CI-65001 peer-group
 neighbor CI-65001 remote-as 65001
 neighbor CI-65001 soft-reconfiguration inbound
 neighbor CI-65001 timers 15 45
 neighbor CI-65001 timers connect 15
 neighbor CI-65001 default-originate
 ! Peer group for ASN 65002
 neighbor CI-65002 peer-group
 neighbor CI-65002 remote-as 65002
 neighbor CI-65002 soft-reconfiguration inbound
 neighbor CI-65002 timers 15 45
 neighbor CI-65002 timers connect 15
 neighbor CI-65002 default-originate
 ! Neighbors for ASN 65001
 neighbor 10.121.0.11 peer-group CI-65001
 neighbor 10.121.0.12 peer-group CI-65001
 neighbor 10.121.0.13 peer-group CI-65001
 ! Neighbors for ASN 65002
 neighbor 10.122.0.11 peer-group CI-65002
 neighbor 10.122.0.12 peer-group CI-65002
 neighbor 10.122.0.13 peer-group CI-65002
 address-family ipv4 unicast
  redistribute connected
  redistribute kernel
  !
  neighbor CI-65001 activate
  neighbor CI-65001 route-map ALLOW-ALL in
  neighbor CI-65001 route-map ALLOW-ALL out
  !
  neighbor CI-65002 activate
  neighbor CI-65002 route-map ALLOW-ALL in
  neighbor CI-65002 route-map ALLOW-ALL out
 exit-address-family
route-map ALLOW-ALL permit 10
!
line vty
!
```
This configuration establishes my UDM Pro as BGP ASN 65000 and creates two peer groups: one for the public ingress network (ASN 65001) and one for the public service network (ASN 65002). The configuration includes keepalive and hold timers to ensure stable connections, and route maps to control traffic flow.
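One detail worth knowing when mixing vendors: BGP peers negotiate their timers at session establishment. Per RFC 4271, a session uses the smaller of the two proposed hold times, and the keepalive interval is conventionally one third of that. The sketch below is my own illustration of that rule, not a command from FRR or Cilium:

```python
def negotiated_timers(hold_a: int, hold_b: int) -> tuple:
    """Per RFC 4271, the session hold time is the smaller of the two
    proposed values; the keepalive interval is conventionally hold/3."""
    hold = min(hold_a, hold_b)
    return hold, hold // 3

# The UDM Pro proposes a 45s hold time ("timers 15 45" above), while the
# Cilium peer configuration shown later in this article proposes 30s,
# so the sessions effectively run with the lower value.
hold, keepalive = negotiated_timers(45, 30)
print(hold, keepalive)  # 30 10
```

In practice this means the Cilium-side timers (30s hold, 10s keepalive) govern the sessions, even though FRR is configured with 15/45.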
Installing and Configuring Cilium
With the router configured, the next step was setting up Cilium in my Kubernetes cluster. I used Helm for the installation, starting with this Chart.yaml:
```yaml
apiVersion: v2
name: cilium
version: 1.0.0
dependencies:
  - name: cilium
    repository: https://helm.cilium.io/
    version: 1.16.4
```
The Cilium configuration in values-platform-prod.yaml enables both BGP and L2 announcement features:
```yaml
cilium:
  debug:
    enabled: true
  bgpControlPlane:
    enabled: true
  ipam:
    mode: kubernetes
  kubeProxyReplacement: true
  k8sServiceHost: localhost
  k8sServicePort: 7445
  l2announcements:
    enabled: true
    leaseDuration: 15s
    leaseRenewDeadline: 5s
    leaseRetryPeriod: 2s
  externalIPs:
    enabled: true
  enableRuntimeDeviceDetection: true
```
Implementing BGP in Kubernetes
The BGP setup in Kubernetes requires three types of resources that work together to create a complete BGP implementation. Let’s examine each component in detail to understand how they create our BGP architecture.
Understanding the BGP Cluster Configuration
The CiliumBGPClusterConfig resource serves as the foundation of our BGP setup. This configuration defines our BGP topology and establishes the fundamental relationships between our Kubernetes nodes and the UDM Pro router. Let’s examine the configuration in detail (cilium-bgp-configuration.yaml):
```yaml
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPClusterConfig
metadata:
  name: cilium-bgp
spec:
  nodeSelector:
    matchLabels:
      public: "true"
  bgpInstances:
    - name: "instance-65001"
      localASN: 65001
      peers:
        - name: "peer-router-65001"
          peerASN: 65000
          peerAddress: 10.121.0.1
          peerConfigRef:
            name: "router-peer-config-65001"
    - name: "instance-65002"
      localASN: 65002
      peers:
        - name: "peer-router-65002"
          peerASN: 65000
          peerAddress: 10.122.0.1
          peerConfigRef:
            name: "router-peer-config-65002"
```
The cluster configuration establishes two BGP instances, one for each network segment. Let’s break down its components:
- NodeSelector: The `nodeSelector` field with `public: "true"` ensures that only nodes intended for public traffic participate in BGP routing. This gives us precise control over which nodes can advertise routes.
- BGP Instances: We define two separate BGP instances:
  - instance-65001: Handles public-facing ingress (ASN 65001)
  - instance-65002: Manages public-facing services (ASN 65002)
Each instance specifies:
- A local ASN (Autonomous System Number) that identifies our cluster segments
- The peer relationship with our UDM Pro (ASN 65000)
- A reference to the detailed peer configuration
This setup effectively creates two separate BGP domains, allowing us to manage public and private traffic independently while maintaining a clear network topology.
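Assuming node01 through node03 carry the `public: "true"` label (which matches the three neighbor statements per peer group on the UDM Pro), this topology implies six BGP sessions in total. A quick enumeration sketch:

```python
# Which BGP sessions should exist? Assuming node01-node03 are labeled
# public: "true" (node04 sits out), each node runs both BGP instances.
nodes = ["node01", "node02", "node03"]
instances = {65001: "10.121.0.1", 65002: "10.122.0.1"}  # localASN -> UDM Pro peer

sessions = [(node, asn, peer) for node in nodes for asn, peer in instances.items()]
for session in sessions:
    print(session)
print(len(sessions), "sessions expected on the UDM Pro")  # 6
```

That count of six is exactly what shows up later in the `show ip bgp summary` output on the router.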
Configuring BGP Peer Relationships
The CiliumBGPPeerConfig resource defines the detailed parameters for our BGP sessions. This configuration ensures reliable and secure communication between our Kubernetes nodes and the UDM Pro. Here’s the peer configuration for ASN 65001 and ASN 65002 (cilium-bgp-peer-65001-65002.yaml):
```yaml
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeerConfig
metadata:
  name: router-peer-config-65001
spec:
  timers:
    holdTimeSeconds: 30
    keepAliveTimeSeconds: 10
  ebgpMultihop: 1
  gracefulRestart:
    enabled: true
    restartTimeSeconds: 60
  families:
    - afi: ipv4
      safi: unicast
      advertisements:
        matchLabels:
          advertise: "65001"
---
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeerConfig
metadata:
  name: router-peer-config-65002
spec:
  timers:
    holdTimeSeconds: 30
    keepAliveTimeSeconds: 10
  ebgpMultihop: 1
  gracefulRestart:
    enabled: true
    restartTimeSeconds: 60
  families:
    - afi: ipv4
      safi: unicast
      advertisements:
        matchLabels:
          advertise: "65002"
```
Let’s examine each component of the peer configuration:
- Timers:
  - `holdTimeSeconds: 30`: The maximum time to wait for a KEEPALIVE or UPDATE message before considering the peer down
  - `keepAliveTimeSeconds: 10`: How often to send KEEPALIVE messages to maintain the BGP session
- EBGP Multihop: The `ebgpMultihop: 1` setting allows BGP sessions between peers that aren’t directly connected. In our case, it’s set to 1 since our peers are on the same network segment.
- Graceful Restart:
  - `enabled: true`: Allows for smooth handling of BGP session restarts
  - `restartTimeSeconds: 60`: Gives peers 60 seconds to recover from a restart without disrupting traffic
- Address Families:
  - Specifies IPv4 unicast as our address family
  - Links advertisements to services with the `advertise: "65001"` or `advertise: "65002"` label
  - This selective advertisement ensures that only intended services are exposed via BGP
Controlling Route Advertisements
The CiliumBGPAdvertisement resource determines which Kubernetes services should be advertised via BGP. This configuration gives us fine-grained control over service exposure. Here’s our advertisement configuration (cilium-bgp-advertisement-65001-65002.yaml):
```yaml
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPAdvertisement
metadata:
  name: bgp-advertisements-65001
  labels:
    advertise: "65001"
spec:
  advertisements:
    - advertisementType: "Service"
      service:
        addresses:
          - LoadBalancerIP
      selector:
        matchExpressions:
          - { key: bgp, operator: In, values: [ "65001" ] }
---
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPAdvertisement
metadata:
  name: bgp-advertisements-65002
  labels:
    advertise: "65002"
spec:
  advertisements:
    - advertisementType: "Service"
      service:
        addresses:
          - LoadBalancerIP
      selector:
        matchExpressions:
          - { key: bgp, operator: In, values: [ "65002" ] }
```
The advertisement configuration deserves careful examination:
- Labels:
  - The `advertise: "65001"` or `advertise: "65002"` label matches the corresponding peer configuration
  - This label-based approach allows for precise control over which routes are advertised to which peers
- Advertisement Specification:
  - `advertisementType: "Service"`: Indicates we’re advertising Kubernetes service IPs
  - `addresses: [ "LoadBalancerIP" ]`: Specifies that we want to advertise LoadBalancer service IPs
  - The selector ensures only services with the appropriate BGP label are advertised
- Selector Logic:
  - Uses `matchExpressions` to identify services that should be advertised
  - The `bgp: "65001"` or `bgp: "65002"` key-value pair must be present on services to be advertised
  - This provides a flexible way to control which services get exposed through BGP
This three-part BGP configuration (Cluster, Peer, and Advertisement) creates a robust and flexible system for managing service exposure through BGP. The cluster config establishes our BGP topology, the peer config ensures reliable BGP sessions, and the advertisement config controls which services are exposed. Together, they enable precise control over how our Kubernetes services are made available to the broader network.
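To make the label chain concrete, here is a small illustrative sketch of `matchLabels`-style matching. Cilium performs this matching internally, and the real advertisement selector uses `matchExpressions`; both are simplified here to plain key/value matching:

```python
def matches(selector: dict, labels: dict) -> bool:
    """matchLabels semantics: every selector key/value must be present."""
    return all(labels.get(k) == v for k, v in selector.items())

# Link 1: CiliumBGPPeerConfig.advertisements selects CiliumBGPAdvertisement
# resources by their metadata labels.
peer_config_selector = {"advertise": "65001"}
advertisement_labels = {"advertise": "65001"}

# Link 2: the CiliumBGPAdvertisement's service selector picks out
# LoadBalancer Services by their labels.
advertisement_service_selector = {"bgp": "65001"}
service_labels = {"bgp": "65001"}

assert matches(peer_config_selector, advertisement_labels)      # peer -> advertisement
assert matches(advertisement_service_selector, service_labels)  # advertisement -> service
print("service will be advertised to the 65001 peer")
```

If either link in the chain is missing, say a Service labeled `bgp: "65002"` under a `65001` advertisement, the route simply is not announced, which is the safety property this design relies on.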
Setting Up L2 Announcements
After setting up BGP, I needed to configure L2 announcements and IP pools for our private network services. This setup involves managing two distinct types of resources: ingress controllers and internal services. Each type needs its own IP pool and corresponding L2 announcement policy.
Private Service Configuration
For internal services that need their own IP addresses, we have a separate IP pool configuration (dmz-private-service-vlan11.yaml):
apiVersion: "cilium.io/v2alpha1"
kind: CiliumLoadBalancerIPPool
metadata:
name: "dmz-private-service-vlan11"
spec:
blocks:
- cidr: "10.122.0.160/28"
serviceSelector:
matchLabels:
cidr: private
This pool is dedicated to private internal services, using a different CIDR range (10.122.0.160/28) than the ingress pool. Services that need an IP from this pool must have the label cidr: private.
The corresponding L2 announcement policy for these services (dmz-private-service-vlan11-policy.yaml) ensures proper network advertisement:
apiVersion: "cilium.io/v2alpha1"
kind: CiliumL2AnnouncementPolicy
metadata:
name: dmz-private-service-vlan11-policy
spec:
serviceSelector:
matchLabels:
cidr: private
nodeSelector:
matchExpressions:
- key: private
operator: Exists
interfaces:
- enp86s0.122
externalIPs: true
loadBalancerIPs: true
This policy configuration mirrors the ingress policy but targets services with the cidr: private label instead of focusing on the ingress namespace.
Private Ingress Configuration
For internal ingress controllers that need their own IP addresses, we have a separate IP pool configuration (dmz-private-ingress-vlan11.yaml):
apiVersion: "cilium.io/v2alpha1"
kind: CiliumLoadBalancerIPPool
metadata:
name: "dmz-private-ingress-vlan11"
spec:
blocks:
- cidr: "10.122.0.128/28"
serviceSelector:
matchLabels:
cidr: private
This pool is dedicated to private internal ingress, using a different CIDR range (10.122.0.128/28) than the service pool. Ingress controllers that need an IP from this pool must have the label cidr: private.
The corresponding L2 announcement policy for these ingress (dmz-private-ingress-vlan11-policy.yaml) ensures proper network advertisement:
apiVersion: "cilium.io/v2alpha1"
kind: CiliumL2AnnouncementPolicy
metadata:
name: dmz-private-ingress-vlan11-policy
spec:
serviceSelector:
matchLabels:
"io.kubernetes.service.namespace": "ingress-nginx-private"
nodeSelector:
matchExpressions:
- key: private
operator: Exists
interfaces:
- enp86s0.122
externalIPs: true
loadBalancerIPs: true
This policy configuration mirrors the service policy, but instead of the cidr: private label it selects services in the private ingress controller’s namespace via the "io.kubernetes.service.namespace": "ingress-nginx-private" label.
Implementing Services
Our networking architecture supports three distinct types of services, each with its own configuration pattern and networking approach. Let’s examine how each type is implemented and why they’re configured differently.
Public Ingress Services with BGP
For services that need to be accessible from the internet, we use BGP to advertise their availability. Here’s how we configure our public ingress controller:
```yaml
ingressClass: nginx-public
service:
  externalTrafficPolicy: "Local"
  labels:
    bgp: "65001"
  external:
    enabled: true
  internal:
    enabled: false
```
In this configuration, several key elements work together:
- The `externalTrafficPolicy: "Local"` setting ensures that incoming traffic is processed by the node that receives it. This preserves client IP addresses and reduces unnecessary network hops.
- The label `bgp: "65001"` connects this service to our BGP advertisement policy, causing Cilium to advertise the service’s IP address through BGP ASN 65001.
- Setting `external: enabled` and `internal: disabled` explicitly marks this service for external access.
This configuration works in conjunction with our earlier BGP advertisements and peer configurations to make the service available through our UDM Pro router.
Public Service with BGP
```yaml
apiVersion: v1
kind: Service
metadata:
  name: dummy-service
  namespace: dummy
  annotations:
    "io.cilium/lb-ipam-ips": "XX.XX.55.94"
  labels:
    app: dummy
    bgp: "65002"
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  ports:
    - port: 25565
      targetPort: 25565
      protocol: TCP
  selector:
    app: dummy
```
In this configuration, several key elements work together:
- The `externalTrafficPolicy: "Local"` setting ensures that incoming traffic is processed by the node that receives it. This preserves client IP addresses and reduces unnecessary network hops.
- The label `bgp: "65002"` connects this service to our BGP advertisement policy, causing Cilium to advertise the service’s IP address through BGP ASN 65002.
- The `io.cilium/lb-ipam-ips` annotation assigns `XX.XX.55.94` to this service.
Private Services with L2 Announcements
For internal services that need stable, private IP addresses (like databases or internal APIs), we use L2 announcements. Here’s an example configuration for a PostgreSQL service:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: affine-svc-lb
  labels:
    app: cnpg-affine-cluster
    cnpg.io/cluster: affine-cluster
    postgresql: affine-cluster
    cidr: private
  annotations:
    io.cilium/lb-ipam-ips: "10.122.0.171"
spec:
  ports:
    - name: postgres
      protocol: TCP
      port: 5432
      targetPort: 5432
  selector:
    cnpg.io/cluster: affine-cluster
    role: primary
  type: LoadBalancer
  internalTrafficPolicy: Cluster
```
The key components of this configuration are:
- The `cidr: private` label connects this service to our L2 announcement policy for private services.
- The `io.cilium/lb-ipam-ips: "10.122.0.171"` annotation assigns a specific IP from our private service pool (10.122.0.160/28).
- `internalTrafficPolicy: Cluster` optimizes routing for internal access, as this service will only be accessed from within our private network.
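Cilium’s LB-IPAM only honors a requested address if it falls inside a pool whose selector matches the service, so it is worth double-checking the arithmetic when pinning IPs like this:

```python
import ipaddress

# Pools defined earlier in this article
private_service_pool = ipaddress.ip_network("10.122.0.160/28")
private_ingress_pool = ipaddress.ip_network("10.122.0.128/28")

# IPs requested via the io.cilium/lb-ipam-ips annotation
assert ipaddress.ip_address("10.122.0.171") in private_service_pool
assert ipaddress.ip_address("10.122.0.129") in private_ingress_pool
print("requested IPs fall inside their pools")
```

If a requested IP lies outside every matching pool, the Service simply stays in pending state with no LoadBalancer IP assigned, which is a common gotcha with hand-picked addresses.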
Private Ingress Services with L2 Announcements
For internal web applications that need to be accessed across our private network, we use a private ingress controller with L2 announcements:
```yaml
service:
  externalTrafficPolicy: "Local"
  annotations:
    "io.cilium/lb-ipam-ips": "10.122.0.129"
  external:
    enabled: true
  internal:
    enabled: false
```
This configuration:
- Uses `externalTrafficPolicy: "Local"` for efficient traffic routing, just like our public ingress
- Assigns a specific IP (10.122.0.129) from our private ingress pool (10.122.0.128/28)
- Enables external connections (from within our private network) while keeping the service isolated from the internet
The relationship between these configurations and our IP pools is straightforward:
- Public services use BGP labels and our public IP range (X.X.55.94/32 or X.X.55.93/32)
- Private services use the 10.122.0.160/28 pool with L2 announcements
- Private ingress uses the 10.122.0.128/28 pool with L2 announcements
This structured approach ensures that each type of service gets the appropriate networking configuration and IP address range for its purpose, whether that’s public internet access, private network access, or internal cluster communication.
Verifying the Setup
After implementing all components, I verify the setup with a few commands. To check BGP status, SSH into the Dream Machine, restart FRR to load the configuration, and query the BGP summary:
```shell
systemctl restart frr
vtysh -c 'show ip bgp summary'
```
A successful setup shows output like this:
```
IPv4 Unicast Summary (VRF default):
BGP router identifier 10.121.0.1, local AS number 65000 vrf-id 0
BGP table version 61
RIB entries 54, using 9936 bytes of memory
Peers 6, using 4338 KiB of memory
Peer groups 2, using 128 bytes of memory

Neighbor     V  AS     MsgRcvd MsgSent TblVer InQ OutQ Up/Down  State/PfxRcd PfxSnt Desc
10.121.0.11  4  65001  132702  132898  0      0   0    1d09h40m 1            29     N/A
10.121.0.12  4  65001  132702  132899  0      0   0    1d09h39m 1            29     N/A
10.121.0.13  4  65001  132705  132895  0      0   0    1d09h40m 1            29     N/A
10.122.0.11  4  65002  132680  132891  0      0   0    1d09h40m 0            29     N/A
10.122.0.12  4  65002  132682  132890  0      0   0    1d09h39m 1            29     N/A
10.122.0.13  4  65002  132676  132887  0      0   0    1d09h40m 0            29     N/A
```
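If you want to script this health check, FRR prints a prefix count in the State/PfxRcd column once a session is Established, and a state name (Idle, Active, and so on) otherwise. A rough parser over a few sample lines, assuming the column layout shown above:

```python
# Sample neighbor lines from `vtysh -c 'show ip bgp summary'`
summary = """\
10.121.0.11 4 65001 132702 132898 0 0 0 1d09h40m 1 29 N/A
10.121.0.12 4 65001 132702 132899 0 0 0 1d09h39m 1 29 N/A
10.122.0.11 4 65002 132680 132891 0 0 0 1d09h40m 0 29 N/A"""

for line in summary.splitlines():
    fields = line.split()
    # Field 9 is State/PfxRcd: numeric means the session is Established.
    neighbor, state_or_pfx = fields[0], fields[9]
    established = state_or_pfx.isdigit()
    print(neighbor, "Established" if established else state_or_pfx)
```

This makes it easy to wire the check into a cron job or monitoring probe on the router.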
Or take a look at the BGP table.
```shell
vtysh -c 'show ip bgp'
```
A successful setup shows output like this:
```
BGP table version is 61, local router ID is 10.121.0.1, vrf id 0
Default local pref 100, local AS 65000
Status codes:  s suppressed, d damped, h history, * valid, > best, = multipath,
               i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes:  i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

   Network          Next Hop      Metric LocPrf Weight Path
*> XX.XX.55.88/29   0.0.0.0       0             32768  ?
*> XX.XX.55.93/32   10.122.0.12   0             0      65002 i
*  XX.XX.55.94/32   10.121.0.12   0             0      65001 i
*                   10.121.0.13   0             0      65001 i
*>                  10.121.0.11   0             0      65001 i
```
Lessons Learned and Best Practices
Through implementing this setup, I’ve learned several valuable lessons:
Separation of Concerns: Keeping public and private traffic on different subnets simplifies security and troubleshooting. BGP handles public services efficiently, while L2 announcements provide a simpler solution for internal services.
IP Management: The combination of BGP and L2 announcements allows for efficient use of both public and private IP ranges. BGP’s dynamic nature helps manage the limited public IP space, while L2 announcements provide flexibility for internal services.
High Availability: The setup ensures service availability through both BGP’s routing capabilities and L2’s efficient local network handling. If a node fails, traffic automatically routes to available nodes.
Scalability: This architecture easily accommodates growth. Adding new nodes or services simply requires following the established patterns for either BGP or L2 announcements.
Conclusion
This hybrid networking approach has proven to be both robust and flexible. It efficiently manages my limited public IP space while providing simple and effective networking for private services. The combination of BGP and L2 announcements, managed through Cilium, creates a sophisticated yet maintainable network architecture that serves both public and private networking needs effectively.
The setup might seem complex at first, but each component serves a specific purpose in creating a complete networking solution. Whether you’re managing a small cluster like mine or a larger infrastructure, these principles and configurations can help you build a reliable and efficient network architecture.