Building a Scalable Kubernetes Network Architecture with Cilium, BGP, and L2 Announcements

Introduction

When I began setting up my Kubernetes cluster’s networking infrastructure, I faced a common but challenging problem: how to efficiently manage a limited public IP range while maintaining a clear separation between public and private services. I had a small /29 subnet (X.X.55.88/29) for public services, which meant I needed to be extremely conservative with IP usage. This constraint led me to develop a hybrid networking solution that combines Border Gateway Protocol (BGP) for public services with Layer 2 (L2) announcements for private services.

In this article, I’ll share my journey through this setup, including the challenges I encountered, the solutions I implemented, and the reasoning behind my decisions. I’ll provide a detailed walkthrough that will enable you to replicate this setup in your own environment.

Infrastructure Overview

My setup is built on a Talos-based Kubernetes cluster with five nodes:

  • A control plane node (control-01) at 10.121.0.10
  • Four worker nodes, each with dual network interfaces:
    • node01: 10.121.0.11 and 10.122.0.11
    • node02: 10.121.0.12 and 10.122.0.12
    • node03: 10.121.0.13 and 10.122.0.13
    • node04: 10.121.0.14 and 10.122.0.14

The dual network interfaces serve a crucial purpose in my design: the 10.121.0.0/24 network handles public-facing traffic through BGP, while the 10.122.0.0/24 network manages private services using L2 announcements. This separation provides both security and flexibility in handling different types of network traffic.

IP Pools for BGP and L2

First, let’s look at how we define our IP pools in the platform configuration (values-platform-prod.yaml):

externalIP:
  ipAddressIngressPublicPool:
    - "X.X.55.93/32"
  ipAddressServicePublicPool:
    - "X.X.55.94/32"
  ipAddressIngressPrivatePool:
    - "10.122.0.128/28"
  ipAddressServicePrivatePool:
    - "10.122.0.160/28"

This configuration establishes four planned IP pools:

  1. Public Ingress Pool (X.X.55.93/32): A single dedicated public IP for the public-facing ingress controller.
  2. Public Service Pool (X.X.55.94/32): A single public IP address for services that need direct public access.
  3. Private Ingress Pool (10.122.0.128/28): Provides 16 IP addresses (10.122.0.128 to 10.122.0.143) for private ingress services, giving us enough addresses for multiple ingress controllers and their failovers.
  4. Private Service Pool (10.122.0.160/28): Another 16 IP addresses (10.122.0.160 to 10.122.0.175) reserved for internal services that need stable, dedicated IP addresses.
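A quick way to sanity-check these ranges is Python’s ipaddress module. The sketch below uses 198.51.100.88/29 (a documentation range) as a stand-in for the masked public subnet:

```python
import ipaddress

# A /29 yields 8 addresses total, 6 usable hosts (network and broadcast excluded)
public = ipaddress.ip_network("198.51.100.88/29")  # stand-in for X.X.55.88/29
assert public.num_addresses == 8
assert len(list(public.hosts())) == 6

# Each private /28 pool provides 16 addresses
ingress_pool = ipaddress.ip_network("10.122.0.128/28")
service_pool = ipaddress.ip_network("10.122.0.160/28")
assert str(ingress_pool[0]) == "10.122.0.128" and str(ingress_pool[-1]) == "10.122.0.143"
assert str(service_pool[0]) == "10.122.0.160" and str(service_pool[-1]) == "10.122.0.175"

# The two pools must not overlap, or one of them would never hand out addresses
assert not ingress_pool.overlaps(service_pool)
```

The same check is worth running whenever you carve new pools out of 10.122.0.0/24, since overlapping CiliumLoadBalancerIPPool blocks are a common source of pending LoadBalancer services.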

Understanding the Network Architecture

Before diving into the implementation, it’s important to understand why I chose this hybrid approach. The architecture combines two complementary networking methods:

BGP handles public-facing services by dynamically advertising routes to my Ubiquiti Dream Machine Pro (UDM Pro). This approach is perfect for managing the limited public IP space because it allows for efficient IP utilization and automatic failover. When a service needs a public IP, BGP ensures it’s accessible through any available node, providing both high availability and efficient resource usage.

L2 announcements, on the other hand, manage private services within my internal network. This method is simpler and more efficient for internal traffic, as it operates at the data link layer and doesn’t require the overhead of BGP routing. It’s particularly well-suited for services that only need to be accessible within the cluster or from my private network.

Setting Up the UDM Pro

The first step in implementing this architecture was configuring BGP on my UDM Pro. I needed to set up the router as a BGP speaker that could communicate with my Kubernetes nodes. Here’s the configuration I uploaded to https://your.udm.pro/network/default/settings/routing/bgp:

frr defaults traditional
log file stdout

router bgp 65000
 bgp ebgp-requires-policy
 bgp router-id 10.121.0.1
 maximum-paths 1

 ! Peer group for ASN 65001
 neighbor CI-65001 peer-group
 neighbor CI-65001 remote-as 65001
 neighbor CI-65001 soft-reconfiguration inbound
 neighbor CI-65001 timers 15 45
 neighbor CI-65001 timers connect 15
 neighbor CI-65001 default-originate

 ! Peer group for ASN 65002
 neighbor CI-65002 peer-group
 neighbor CI-65002 remote-as 65002
 neighbor CI-65002 soft-reconfiguration inbound
 neighbor CI-65002 timers 15 45
 neighbor CI-65002 timers connect 15
 neighbor CI-65002 default-originate

 ! Neighbors for ASN 65001
 neighbor 10.121.0.11 peer-group CI-65001
 neighbor 10.121.0.12 peer-group CI-65001
 neighbor 10.121.0.13 peer-group CI-65001

 ! Neighbors for ASN 65002
 neighbor 10.122.0.11 peer-group CI-65002
 neighbor 10.122.0.12 peer-group CI-65002
 neighbor 10.122.0.13 peer-group CI-65002

 address-family ipv4 unicast
  redistribute connected
  redistribute kernel
  !
  neighbor CI-65001 activate
  neighbor CI-65001 route-map ALLOW-ALL in
  neighbor CI-65001 route-map ALLOW-ALL out
  !
  neighbor CI-65002 activate
  neighbor CI-65002 route-map ALLOW-ALL in
  neighbor CI-65002 route-map ALLOW-ALL out
 exit-address-family

route-map ALLOW-ALL permit 10
!
line vty
!

This configuration establishes my UDM Pro as BGP ASN 65000 and creates two peer groups: one for the public ingress network (ASN 65001) and one for the public service network (ASN 65002). The configuration includes keepalive and hold timers to ensure stable connections, and route maps to control traffic flow.

Installing and Configuring Cilium

With the router configured, the next step was setting up Cilium in my Kubernetes cluster. I used Helm for the installation, starting with this Chart.yaml:

apiVersion: v2
name: cilium
version: 1.0.0
dependencies:
  - name: cilium
    repository: https://helm.cilium.io/
    version: 1.16.4

The Cilium configuration in values-platform-prod.yaml enables both BGP and L2 announcement features:

cilium:
  debug:
    enabled: true
  bgpControlPlane:
    enabled: true
  ipam:
    mode: kubernetes
  kubeProxyReplacement: true
  k8sServiceHost: localhost
  k8sServicePort: 7445
  l2announcements:
    enabled: true
    leaseDuration: 15s
    leaseRenewDeadline: 5s
    leaseRetryPeriod: 2s
  externalIPs:
    enabled: true
  enableRuntimeDeviceDetection: true

Two settings here deserve a note: kubeProxyReplacement: true lets Cilium take over kube-proxy’s load-balancing duties, and k8sServiceHost: localhost with k8sServicePort: 7445 points Cilium at Talos’s local KubePrism endpoint for API server access. The l2announcements lease settings tune the Kubernetes lease used for per-service leader election, trading failover speed against API server load.

Implementing BGP in Kubernetes

The BGP setup in Kubernetes requires three resource types that work together: a cluster configuration, peer configurations, and advertisements. Let’s examine each component in detail to understand how they build our BGP architecture.

Understanding the BGP Cluster Configuration

The CiliumBGPClusterConfig resource serves as the foundation of our BGP setup. This configuration defines our BGP topology and establishes the fundamental relationships between our Kubernetes nodes and the UDM Pro router. Let’s examine the configuration in detail (cilium-bgp-configuration.yaml):

apiVersion: cilium.io/v2alpha1
kind: CiliumBGPClusterConfig
metadata:
  name: cilium-bgp
spec:
  nodeSelector:
    matchLabels:
      public: "true"
  bgpInstances:
    - name: "instance-65001"
      localASN: 65001
      peers:
        - name: "peer-router-65001"
          peerASN: 65000
          peerAddress: 10.121.0.1
          peerConfigRef:
            name: "router-peer-config-65001"
    - name: "instance-65002"
      localASN: 65002
      peers:
        - name: "peer-router-65002"
          peerASN: 65000
          peerAddress: 10.122.0.1
          peerConfigRef:
            name: "router-peer-config-65002"

The cluster configuration establishes two BGP instances, one for each network segment. Let’s break down its components:

  1. NodeSelector: The nodeSelector field with public: "true" ensures that only nodes intended for public traffic participate in BGP routing. This gives us precise control over which nodes can advertise routes.

  2. BGP Instances: We define two separate BGP instances:

    • Instance-65001: Handles public-facing ingress (ASN 65001)
    • Instance-65002: Manages public-facing services (ASN 65002)

Each instance specifies:

  • A local ASN (Autonomous System Number) that identifies our cluster segments
  • The peer relationship with our UDM Pro (ASN 65000)
  • A reference to the detailed peer configuration

This setup effectively creates two separate BGP domains, allowing us to manage ingress and service traffic independently while maintaining a clear network topology.
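For the nodeSelector above to match anything, the worker nodes must actually carry the public: "true" label. Since this is a Talos cluster, the label can be set declaratively in the machine configuration rather than with kubectl — a sketch, assuming you patch your worker configs with talosctl (the patch file name is hypothetical):

```yaml
# Fragment of a Talos worker machine config patch (e.g. patch-public-nodes.yaml),
# applied with something like:
#   talosctl patch mc --nodes 10.121.0.11 --patch @patch-public-nodes.yaml
machine:
  nodeLabels:
    public: "true"
```

Keeping the label in the machine config means it survives node rebuilds, which a one-off kubectl label would not.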

Configuring BGP Peer Relationships

The CiliumBGPPeerConfig resource defines the detailed parameters for our BGP sessions. This configuration ensures reliable and secure communication between our Kubernetes nodes and the UDM Pro. Here’s the peer configuration for ASN 65001 and ASN 65002 (cilium-bgp-peer-65001-65002.yaml):

apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeerConfig
metadata:
  name: router-peer-config-65001
spec:
  timers:
    holdTimeSeconds: 30
    keepAliveTimeSeconds: 10
  ebgpMultihop: 1
  gracefulRestart:
    enabled: true
    restartTimeSeconds: 60
  families:
    - afi: ipv4
      safi: unicast
      advertisements:
        matchLabels:
          advertise: "65001"
---
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeerConfig
metadata:
  name: router-peer-config-65002
spec:
  timers:
    holdTimeSeconds: 30
    keepAliveTimeSeconds: 10
  ebgpMultihop: 1
  gracefulRestart:
    enabled: true
    restartTimeSeconds: 60
  families:
    - afi: ipv4
      safi: unicast
      advertisements:
        matchLabels:
          advertise: "65002"

Let’s examine each component of the peer configuration:

  1. Timers:

    • holdTimeSeconds: 30: The maximum time to wait for a KEEPALIVE or UPDATE message before considering the peer down
    • keepAliveTimeSeconds: 10: How often to send KEEPALIVE messages to maintain the BGP session
  2. EBGP Multihop: The ebgpMultihop: 1 setting allows BGP sessions between peers that aren’t directly connected. In our case, it’s set to 1 since our peers are on the same network segment.

  3. Graceful Restart:

    • enabled: true: Allows for smooth handling of BGP session restarts
    • restartTimeSeconds: 60: Gives peers 60 seconds to recover from a restart without disrupting traffic
  4. Address Families:

    • Specifies IPv4 unicast as our address family
    • Links each peer to the CiliumBGPAdvertisement resources carrying the matching advertise: "65001" or advertise: "65002" label
    • This selective matching ensures that each peer only receives the routes intended for it
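One subtlety worth calling out: the UDM Pro side was configured with timers 15 45 (15 s keepalive, 45 s hold), while the Cilium side uses 10/30. That mismatch is harmless, because BGP peers agree on the smaller of the two hold times exchanged in their OPEN messages (RFC 4271, section 4.2), so the session effectively runs with a 30-second hold time. A minimal sketch of that negotiation rule:

```python
def negotiated_hold_time(local_hold: int, peer_hold: int) -> int:
    """BGP peers use the smaller of the two advertised hold times
    (RFC 4271, section 4.2)."""
    return min(local_hold, peer_hold)

# UDM Pro advertises a 45s hold time, the Cilium peer config advertises 30s
assert negotiated_hold_time(45, 30) == 30
```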

Controlling Route Advertisements

The CiliumBGPAdvertisement resource determines which Kubernetes services should be advertised via BGP. This configuration gives us fine-grained control over service exposure. Here’s our advertisement configuration (cilium-bgp-advertisement-65001-65002.yaml):

apiVersion: cilium.io/v2alpha1
kind: CiliumBGPAdvertisement
metadata:
  name: bgp-advertisements-65001
  labels:
    advertise: "65001"
spec:
  advertisements:
    - advertisementType: "Service"
      service:
        addresses:
          - LoadBalancerIP
      selector:
        matchExpressions:
          - { key: bgp, operator: In, values: [ "65001" ] }
---
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPAdvertisement
metadata:
  name: bgp-advertisements-65002
  labels:
    advertise: "65002"
spec:
  advertisements:
    - advertisementType: "Service"
      service:
        addresses:
          - LoadBalancerIP
      selector:
        matchExpressions:
          - { key: bgp, operator: In, values: [ "65002" ] }

The advertisement configuration deserves careful examination:

  1. Labels:

    • The advertise: "65001" or advertise: "65002" label matches the corresponding peer configuration
    • This label-based approach allows for precise control over which routes are advertised to which peers
  2. Advertisement Specification:

    • advertisementType: "Service": Indicates we’re advertising Kubernetes service IPs
    • addresses: [ "LoadBalancerIP" ]: Specifies that we want to advertise LoadBalancer service IPs
    • The selector ensures only services with the appropriate BGP label are advertised
  3. Selector Logic:

    • Uses matchExpressions to identify services that should be advertised
    • The bgp: "65001" or bgp: "65002" key-value pair must be present on services to be advertised
    • This provides a flexible way to control which services get exposed through BGP

This three-part BGP configuration (Cluster, Peer, and Advertisement) creates a robust and flexible system for managing service exposure through BGP. The cluster config establishes our BGP topology, the peer config ensures reliable BGP sessions, and the advertisement config controls which services are exposed. Together, they enable precise control over how our Kubernetes services are made available to the broader network.
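The label plumbing between these three resources is easy to get wrong, so it’s worth tracing once: the peer config selects advertisements by their advertise label, and each advertisement in turn selects services by their bgp label. The sketch below illustrates that two-hop match with plain matchLabels semantics (simplified — the advertisement actually uses a matchExpressions In operator, and this is not how Cilium implements selection internally):

```python
def matches(selector: dict, labels: dict) -> bool:
    """matchLabels semantics: every selector key/value must be present on the object."""
    return all(labels.get(k) == v for k, v in selector.items())

# Hop 1: CiliumBGPPeerConfig -> CiliumBGPAdvertisement (via metadata labels)
peer_config_selector = {"advertise": "65001"}
advertisement_labels = {"advertise": "65001"}

# Hop 2: CiliumBGPAdvertisement -> Service (via the service's own labels)
service_selector = {"bgp": "65001"}
public_ingress_labels = {"bgp": "65001"}
private_service_labels = {"cidr": "private"}

assert matches(peer_config_selector, advertisement_labels)
assert matches(service_selector, public_ingress_labels)       # advertised via BGP
assert not matches(service_selector, private_service_labels)  # stays off BGP
```

If a service’s IP never shows up on the router, checking each hop of this chain (peer config label selector, advertisement labels, service labels) is usually the fastest way to find the broken link.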

Setting Up L2 Announcements

After setting up BGP, I needed to configure L2 announcements and IP pools for our private network services. This setup involves managing two distinct types of resources: ingress controllers and internal services. Each type needs its own IP pool and corresponding L2 announcement policy.

Private Service Configuration

For internal services that need their own IP addresses, we have a separate IP pool configuration (dmz-private-service-vlan11.yaml):

apiVersion: "cilium.io/v2alpha1"
kind: CiliumLoadBalancerIPPool
metadata:
  name: "dmz-private-service-vlan11"
spec:
  blocks:
    - cidr: "10.122.0.160/28"
  serviceSelector:
    matchLabels:
      cidr: private

This pool is dedicated to private internal services, using a different CIDR range (10.122.0.160/28) than the ingress pool. Services that need an IP from this pool must have the label cidr: private.

The corresponding L2 announcement policy for these services (dmz-private-service-vlan11-policy.yaml) ensures proper network advertisement:

apiVersion: "cilium.io/v2alpha1"
kind: CiliumL2AnnouncementPolicy
metadata:
  name: dmz-private-service-vlan11-policy
spec:
  serviceSelector:
    matchLabels:
      cidr: private
  nodeSelector:
    matchExpressions:
      - key: private
        operator: Exists
  interfaces:
    - enp86s0.122
  externalIPs: true
  loadBalancerIPs: true

This policy configuration mirrors the ingress policy but targets services with the cidr: private label instead of focusing on the ingress namespace.

Private Ingress Configuration

For internal ingress controllers that need their own IP addresses, we have a separate IP pool configuration (dmz-private-ingress-vlan11.yaml):

apiVersion: "cilium.io/v2alpha1"
kind: CiliumLoadBalancerIPPool
metadata:
  name: "dmz-private-ingress-vlan11"
spec:
  blocks:
    - cidr: "10.122.0.128/28"
  serviceSelector:
    matchLabels:
      cidr: private

This pool is dedicated to private ingress controllers, using a different CIDR range (10.122.0.128/28) than the service pool. Ingress services that need an IP from this pool must carry the label cidr: private.

The corresponding L2 announcement policy for these ingress controllers (dmz-private-ingress-vlan11-policy.yaml) ensures proper network advertisement:

apiVersion: "cilium.io/v2alpha1"
kind: CiliumL2AnnouncementPolicy
metadata:
  name: dmz-private-ingress-vlan11-policy
spec:
  serviceSelector:
    matchLabels:
      "io.kubernetes.service.namespace": "ingress-nginx-private"
  nodeSelector:
    matchExpressions:
      - key: private
        operator: Exists
  interfaces:
    - enp86s0.122
  externalIPs: true
  loadBalancerIPs: true

This policy mirrors the service policy, but instead of matching on the cidr: private label it selects services by namespace, using the "io.kubernetes.service.namespace": "ingress-nginx-private" label.

Implementing Services

Our networking architecture supports four distinct types of services, each with its own configuration pattern and networking approach. Let’s examine how each type is implemented and why they’re configured differently.

Public Ingress Services with BGP

For services that need to be accessible from the internet, we use BGP to advertise their availability. Here’s how we configure our public ingress controller:

ingressClass: nginx-public
service:
  externalTrafficPolicy: "Local"
  labels:
    bgp: "65001"
  external:
    enabled: true
  internal:
    enabled: false

In this configuration, several key elements work together:

  • The externalTrafficPolicy: "Local" setting ensures that incoming traffic is processed by the node that receives it. This preserves client IP addresses and reduces unnecessary network hops.
  • The label bgp: "65001" connects this service to our BGP advertisement policy, causing Cilium to advertise the service’s IP address through BGP ASN 65001.
  • Setting external.enabled: true and internal.enabled: false explicitly marks this service for external access only.

This configuration works in conjunction with our earlier BGP advertisements and peer configurations to make the service available through our UDM Pro router.

Public Service with BGP

For a standalone workload that needs direct public access (here a dummy service exposing TCP port 25565), we define a LoadBalancer service with the BGP label and a pinned public IP:

apiVersion: v1
kind: Service
metadata:
  name: dummy-service
  namespace: dummy
  annotations:
    "io.cilium/lb-ipam-ips": "XX.XX.55.94"
  labels:
    app: dummy
    bgp: "65002"
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  ports:
    - port: 25565
      targetPort: 25565
      protocol: TCP
  selector:
    app: dummy

In this configuration, several key elements work together:

  • The externalTrafficPolicy: "Local" setting ensures that incoming traffic is processed by the node that receives it. This preserves client IP addresses and reduces unnecessary network hops.
  • The label bgp: "65002" connects this service to our BGP advertisement policy, causing Cilium to advertise the service’s IP address through BGP ASN 65002.
  • The io.cilium/lb-ipam-ips annotation pins this service to XX.XX.55.94 from the public service pool.

Private Services with L2 Announcements

For internal services that need stable, private IP addresses (like databases or internal APIs), we use L2 announcements. Here’s an example configuration for a PostgreSQL service:

apiVersion: v1
kind: Service
metadata:
  name: affine-svc-lb
  labels:
    app: cnpg-affine-cluster
    cnpg.io/cluster: affine-cluster
    postgresql: affine-cluster
    cidr: private
  annotations:
    io.cilium/lb-ipam-ips: "10.122.0.171"
spec:
  ports:
    - name: postgres
      protocol: TCP
      port: 5432
      targetPort: 5432
  selector:
    cnpg.io/cluster: affine-cluster
    role: primary
  type: LoadBalancer
  internalTrafficPolicy: Cluster

The key components of this configuration are:

  • The cidr: private label connects this service to our L2 announcement policy for private services.
  • The io.cilium/lb-ipam-ips: "10.122.0.171" annotation assigns a specific IP from our private service pool (10.122.0.160/28).
  • internalTrafficPolicy: Cluster optimizes routing for internal access, as this service will only be accessed from within our private network.

Private Ingress Services with L2 Announcements

For internal web applications that need to be accessed across our private network, we use a private ingress controller with L2 announcements:

service:
  externalTrafficPolicy: "Local"
  annotations:
    "io.cilium/lb-ipam-ips": "10.122.0.129"
  external:
    enabled: true
  internal:
    enabled: false

This configuration:

  • Uses externalTrafficPolicy: "Local" for efficient traffic routing, just like our public ingress
  • Assigns a specific IP (10.122.0.129) from our private ingress pool (10.122.0.128/28)
  • Enables external connections (from within our private network) while keeping the service isolated from the internet

The relationship between these configurations and our IP pools is straightforward:

  • Public services use BGP labels and our public IP range (X.X.55.94/32 or X.X.55.93/32)
  • Private services use the 10.122.0.160/28 pool with L2 announcements
  • Private ingress uses the 10.122.0.128/28 pool with L2 announcements
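As a final sanity check, any address pinned via io.cilium/lb-ipam-ips must fall inside a pool whose serviceSelector matches the service, or Cilium’s LB-IPAM will leave the service without an external IP:

```python
import ipaddress

# Pools from the CiliumLoadBalancerIPPool definitions above
service_pool = ipaddress.ip_network("10.122.0.160/28")   # dmz-private-service-vlan11
ingress_pool = ipaddress.ip_network("10.122.0.128/28")   # dmz-private-ingress-vlan11

# Pinned addresses from the io.cilium/lb-ipam-ips annotations
assert ipaddress.ip_address("10.122.0.171") in service_pool  # PostgreSQL service
assert ipaddress.ip_address("10.122.0.129") in ingress_pool  # private ingress
```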

This structured approach ensures that each type of service gets the appropriate networking configuration and IP address range for its purpose, whether that’s public internet access, private network access, or internal cluster communication.

Verifying the Setup

After implementing all components, I verify the setup. To check BGP status, SSH into the UDM Pro, restart FRR once to apply the uploaded configuration, then query the peering summary:

systemctl restart frr
vtysh -c 'show ip bgp summary'

A successful setup shows output like this:

IPv4 Unicast Summary (VRF default):
BGP router identifier 10.121.0.1, local AS number 65000 vrf-id 0
BGP table version 61
RIB entries 54, using 9936 bytes of memory
Peers 6, using 4338 KiB of memory
Peer groups 2, using 128 bytes of memory
Neighbor        V         AS   MsgRcvd   MsgSent   TblVer  InQ OutQ  Up/Down State/PfxRcd   PfxSnt Desc
10.121.0.11     4      65001    132702    132898        0    0    0 1d09h40m            1       29 N/A
10.121.0.12     4      65001    132702    132899        0    0    0 1d09h39m            1       29 N/A
10.121.0.13     4      65001    132705    132895        0    0    0 1d09h40m            1       29 N/A
10.122.0.11     4      65002    132680    132891        0    0    0 1d09h40m            0       29 N/A
10.122.0.12     4      65002    132682    132890        0    0    0 1d09h39m            1       29 N/A
10.122.0.13     4      65002    132676    132887        0    0    0 1d09h40m            0       29 N/A

You can also inspect the BGP table itself:

vtysh -c 'show ip bgp'

A successful setup shows output like this:

BGP table version is 61, local router ID is 10.121.0.1, vrf id 0
Default local pref 100, local AS 65000
Status codes:  s suppressed, d damped, h history, * valid, > best, = multipath,
               i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes:  i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

   Network          Next Hop            Metric LocPrf Weight Path
*> XX.XX.55.88/29   0.0.0.0                  0         32768 ?
*> XX.XX.55.93/32   10.122.0.12                            0 65002 i
*  XX.XX.55.94/32   10.121.0.12                            0 65001 i
*                   10.121.0.13                            0 65001 i
*>                  10.121.0.11                            0 65001 i

Lessons Learned and Best Practices

Through implementing this setup, I’ve learned several valuable lessons:

  1. Separation of Concerns: Keeping public and private traffic on different subnets simplifies security and troubleshooting. BGP handles public services efficiently, while L2 announcements provide a simpler solution for internal services.

  2. IP Management: The combination of BGP and L2 announcements allows for efficient use of both public and private IP ranges. BGP’s dynamic nature helps manage the limited public IP space, while L2 announcements provide flexibility for internal services.

  3. High Availability: The setup ensures service availability through both BGP’s routing capabilities and L2’s efficient local network handling. If a node fails, traffic automatically routes to available nodes.

  4. Scalability: This architecture easily accommodates growth. Adding new nodes or services simply requires following the established patterns for either BGP or L2 announcements.

Conclusion

This hybrid networking approach has proven to be both robust and flexible. It efficiently manages my limited public IP space while providing simple and effective networking for private services. The combination of BGP and L2 announcements, managed through Cilium, creates a sophisticated yet maintainable network architecture that serves both public and private networking needs effectively.

The setup might seem complex at first, but each component serves a specific purpose in creating a complete networking solution. Whether you’re managing a small cluster like mine or a larger infrastructure, these principles and configurations can help you build a reliable and efficient network architecture.