
Kubernetes Pod Networking in Production

Explore real-world Kubernetes pod networking architecture for SaaS, including lessons learned, networking patterns, and trade-offs.

Kubernetes Pod Networking in Production: Architecture, Lessons, and Pitfalls from a Real SaaS Migration

If you’re running Kubernetes at scale, pod networking isn’t just a theoretical problem—it’s a daily operational headache. This post walks through the architecture and real-world lessons from a SaaS provider’s migration to Kubernetes, focusing on pod networking: CNI decisions, network policy, service discovery, and how to survive audits and outages. If you’ve already mastered Docker and Linux networking basics, this is the deep-dive on cluster networking trade-offs and gotchas you won’t find in hello-world guides.

Key Takeaways:

  • See a real-world Kubernetes pod networking architecture for a multi-tenant SaaS
  • Understand how pod-to-pod, pod-to-service, and cross-namespace traffic actually flows
  • Get working YAML and CLI for network policy enforcement, CNI troubleshooting, and segmentation patterns
  • Learn production lessons from outages, audits, and scale events—not just theory
  • Understand critical trade-offs and alternatives to Kubernetes networking for regulated and high-scale environments

Architecture Overview: Networking in a Multi-Tenant SaaS Cluster

Our scenario: A SaaS provider with over 50 Kubernetes clusters, each serving multiple customer tenants. Workloads range from web APIs to AI/ML batch jobs. The clusters run on a mix of AWS EKS and bare metal, with strict regulatory requirements (GDPR, PCI-DSS) and uptime SLAs.

Key Design Requirements

  • Each pod gets its own routable IP, and pod-to-pod communication does not require NAT by default.
  • Fine-grained network isolation between customer namespaces, with some shared platform services
  • Fast and reliable service discovery across namespaces
  • Support for bursty, AI-optimized workloads (dynamic scaling, in-place pod resize from Kubernetes 1.35+ per InfoQ)

Chosen Stack

  • Kubernetes: v1.35 (to leverage in-place pod resize and AI scheduling)
  • CNI: Calico (for native Kubernetes NetworkPolicy support and egress controls)
  • CoreDNS: for internal service discovery
  • MetalLB: for on-prem load balancing

Pod Networking Model (per Kubernetes documentation)

  • Every pod gets its own IP address—no port mapping needed for pod-to-pod traffic
  • Pod-to-pod communication does not require NAT; all pods can reach each other by default (unless blocked by NetworkPolicy)
  • Pod-to-service: ClusterIP services use kube-proxy to virtualize access, but traffic ultimately lands on pod IPs

This architecture enables scalable, policy-driven multi-tenant networking, but it also creates operational and security risks if misconfigured.
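To make the pod-to-service path concrete, here is a minimal ClusterIP Service sketch (illustrative only; the `metrics-service` name, namespace, and port values are assumptions, not from a verified manifest):

```yaml
# Illustrative ClusterIP Service: kube-proxy maps the virtual ClusterIP
# to the pod IPs selected by "app: metrics" (all names here are assumed).
apiVersion: v1
kind: Service
metadata:
  name: metrics-service
  namespace: platform
spec:
  type: ClusterIP          # the default; shown for clarity
  selector:
    app: metrics
  ports:
  - protocol: TCP
    port: 8080             # port clients dial on the ClusterIP
    targetPort: 8080       # container port the traffic lands on
```

Clients dial the stable ClusterIP (or its DNS name); kube-proxy rewrites each connection to one of the selected pod IPs, which is why traffic "ultimately lands on pod IPs."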

Reference Diagram

(For illustration, see official Kubernetes networking doc for pod/service communication diagrams.)

Pod-to-Pod Communication Patterns and Service Discovery

While basic guides show simple intra-namespace traffic, real SaaS environments require:

  • Cross-namespace communication (e.g., shared logging or metrics services accessed from customer namespaces)
  • Pod-to-pod traffic between tightly coupled microservices with strict affinity/anti-affinity rules

Pod Discovery via DNS

Pods discover services using internal DNS names:

The following code is an illustrative example and has not been verified against official documentation. Please refer to the official docs for production-ready code.


# Example: A pod in customer-a namespace calling a shared metrics service
curl http://metrics-service.platform.svc.cluster.local:8080/metrics

DNS is provided by CoreDNS, with *.svc.cluster.local resolving to the appropriate ClusterIP or endpoint. This works seamlessly inside the cluster but can fail if CoreDNS is overloaded or misconfigured.
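As a quick sketch of how those names are composed (the service and namespace names below are the illustrative ones from the example above):

```shell
# Service DNS names follow <service>.<namespace>.svc.<cluster-domain>.
service="metrics-service"
namespace="platform"
cluster_domain="cluster.local"   # the common default; clusters can override it

fqdn="${service}.${namespace}.svc.${cluster_domain}"
echo "$fqdn"   # metrics-service.platform.svc.cluster.local

# Callers in the same namespace can use just "metrics-service";
# cross-namespace callers need at least "metrics-service.platform".
```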

Direct Pod-to-Pod (Advanced Use Case)

For high-performance jobs (e.g., AI/ML batch), direct pod-to-pod traffic is sometimes required, bypassing services:

# Get all pod IPs for a label selector, then communicate peer-to-peer
kubectl get pods -l job=ai-batch -o wide

# Output columns include IP; applications can build a mesh using this info

However, this pattern is brittle—if pods restart, their IPs change. Consider a Service or headless Service for more robust discovery.
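A headless Service for this job might look like the following sketch (the `job: ai-batch` label matches the selector above; the Service name and port are assumptions). With `clusterIP: None`, DNS returns the individual pod IPs instead of a single virtual IP, so peers can rediscover each other after restarts without hard-coding addresses:

```yaml
# Headless Service (clusterIP: None): DNS resolves to the matching pod IPs
# directly, enabling peer discovery without a stable virtual IP.
apiVersion: v1
kind: Service
metadata:
  name: ai-batch-peers
spec:
  clusterIP: None
  selector:
    job: ai-batch
  ports:
  - protocol: TCP
    port: 7777    # illustrative peer-to-peer port
```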

Service Mesh Integration

For teams needing mTLS, traffic shaping, or complex routing, a service mesh (e.g., Istio, Linkerd) is layered on top. This adds sidecar proxies to each pod, intercepting traffic and enforcing policies beyond what CNI offers natively.

For a foundational guide to Docker and Linux container networking, see our Docker Networking Hands-On guide and Linux Networking for DevOps.

Network Policies and Segmentation: Real Enforcement Examples

  • By default, all pod-to-pod communication is allowed unless restricted by NetworkPolicy.
  • In a multi-tenant SaaS, this is unacceptable. Calico’s CNI plugin enables enforcement of NetworkPolicy objects to restrict traffic.

Baseline Namespace Isolation

The following code is an illustrative example and has not been verified against official documentation. Please refer to the official docs for production-ready code.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: customer-a
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress

This denies all ingress and egress traffic for pods in customer-a unless explicitly allowed. Without this, accidental cross-tenant leaks are possible.

Allow Ingress from a Shared Service

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-metrics
  namespace: customer-a
spec:
  podSelector: {}
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: platform
    ports:
    - protocol: TCP
      port: 8080

This allows only pods in the platform namespace to reach pods in customer-a on TCP 8080 (for metrics scraping, for example).
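One gotcha worth noting: once a default-deny policy covers Egress, pods also lose DNS resolution unless it is explicitly re-allowed. A common companion policy looks roughly like this (illustrative; the kube-dns labels below are the upstream defaults and may differ on your distribution):

```yaml
# Allow egress to cluster DNS from all pods in customer-a; without this,
# a default-deny Egress policy silently breaks name resolution.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: customer-a
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
```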

Production Debugging

To verify policy enforcement:

# Check all effective policies in a namespace
kubectl get networkpolicies -n customer-a

# Test pod connectivity (run a temporary pod)
kubectl run -n customer-a --rm -it --image=busybox testbox -- sh
# Inside the pod, use wget/curl/ping to test allowed and blocked endpoints

For ongoing enforcement and audit, integrate with logging tools and SIEMs, and regularly run synthetic traffic tests to catch regressions during upgrades.

For a dense reference on network policy patterns and gotchas, see the “Network Policies” section of our Container Security Cheat Sheet.

Production Lessons Learned and Edge Cases

Resource Contention and Pod Networking

In early deployments, resource limits were often omitted for sidecar proxies and CoreDNS pods. This led to:

• DNS timeouts under load (pods couldn’t resolve service names, leading to cascading failures)
• Network policy controller crashes (leaving traffic unfiltered until recovery)

Remedy: Always set resource requests/limits for all networking-related pods, not just application workloads. See 7 Common Kubernetes Pitfalls for more on this issue.
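As a sketch of that remedy, a resources block like the following belongs in every networking pod spec (the values are illustrative starting points, not recommendations; size them from observed usage):

```yaml
# Illustrative requests/limits for a networking pod (e.g. CoreDNS or a
# sidecar proxy). Placeholder values; tune from monitoring data.
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 256Mi
```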

Audit and Compliance Realities

During a PCI-DSS audit, network policies were found missing in several legacy namespaces—despite IaC claiming otherwise. Manual verification and automated policy scanning are now mandatory after every deployment.

Handling Pod IP Exhaustion

With thousands of pods per cluster, the default pod CIDR range was exhausted. This caused sporadic pod startup failures and service outages. The fix:

The following code is an illustrative example and has not been verified against official documentation. Please refer to the official docs for production-ready code.

# Example kube-controller-manager flag (adjust per your CNI’s docs)
--cluster-cidr=10.244.0.0/16

Monitor IP usage and right-size your CIDR blocks before scaling up.
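The arithmetic behind right-sizing is simple enough to sanity-check in a shell (illustrative math, not a kubectl command; real usable capacity also depends on how your CNI carves per-node ranges):

```shell
# How many addresses does a cluster CIDR hold, and how many nodes does it
# support with the common per-node /24 split? (Bash arithmetic, illustrative.)
prefix=16                          # from --cluster-cidr=10.244.0.0/16
total=$(( 2 ** (32 - prefix) ))    # addresses in the block
per_node=$(( 2 ** (32 - 24) ))     # addresses per /24 node range
nodes=$(( total / per_node ))

echo "/$prefix holds $total addresses -> $nodes node ranges of /24"
```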

AI Workload Scheduling and Pod Resize

With Kubernetes 1.35+, in-place pod resize enabled dynamic CPU/memory allocation for AI jobs (source: InfoQ). This reduced pod restarts during batch processing, but required CNI compatibility testing to ensure no dropped connections or policy bypasses during live resize events.

Ingress Controller Retirement

With Kubernetes setting a March 2026 retirement for NGINX Ingress (source), teams had to plan for migration to alternative ingress solutions, affecting both north-south routing and some pod-level egress rules tied to NGINX IPs.

Considerations, Trade-offs, and Alternatives

Operational Complexity

• True pod-level networking enables powerful isolation and flexibility, but debugging CNI issues and network policies is far more complex than VM-era firewalls.
• Teams need both platform expertise and application developer buy-in to avoid accidental outages due to overly restrictive rules or missing policies.

Performance and Scalability

• CNI plugins have varying performance characteristics—Calico offers eBPF acceleration, but VXLAN overlays (used by some CNIs) can add latency.
• Large clusters may require custom tuning of kube-proxy, CoreDNS, and CNI daemonsets for scale.

Security Risks

• By default, Kubernetes networking is open—without NetworkPolicy, any pod can talk to any pod, including accidental egress to the internet.
• Network policies are only as good as their coverage; missing a namespace or not matching the right labels can leave gaps.

Alternatives and Competitors

Solution | Approach | Pros | Cons
Kubernetes w/ Calico | CNI + NetworkPolicy + BGP support | Fine-grained controls, cloud/on-prem, eBPF support | Steep learning curve, policy complexity
OpenShift SDN | CNI, built-in policy, Red Hat support | Enterprise support, simplified defaults | Vendor lock-in, less flexibility
VM-based isolation (legacy) | Traditional VLAN/firewall | Familiar to legacy ops, simple audit | No pod-level isolation, poor multi-tenancy
Service Mesh (Istio/Linkerd) | Layer 7 traffic control, mTLS | Advanced policy, observability, encryption | Extra complexity, resource overhead
Alternatives (DC/OS, Mirantis MOSK) | Alternative schedulers, integrated SDN | Different operating model, sometimes lower cost | Smaller ecosystem, less mature CNI support

See TechRepublic and SDXCentral for more on Kubernetes networking alternatives and competitors.

Common Pitfalls and Pro Tips

• Omitting Resource Limits on Networking Pods: As covered above, this is one of the top causes of DNS and CNI instability. Monitor kubectl top pods -n kube-system for resource usage.
• Assuming Default-Deny: Kubernetes networking is allow-all by default. Always apply a default-deny NetworkPolicy to every new namespace.
• Label Drift: NetworkPolicies match on pod/namespace labels. If labels change (CI/CD or GitOps drift), policies can silently stop applying.
• Pod IP Reuse: Rapid pod churn can result in IP reuse, leading to stale connections and hard-to-debug traffic leaks.
• Service/Pod DNS Outages: CoreDNS is a critical dependency. Always run multiple replicas, set resource limits, and monitor for latency spikes during scale events.
• Ingress Lifecycle Management: With NGINX Ingress retiring in March 2026, plan migrations early to prevent last-minute outages or unsupported configurations (source).
• Audit Your Network Policies Regularly: Automated policy scanners or custom scripts can help verify policy coverage and catch drift or missing enforcement.
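To make the default-deny habit easy to automate, here is a sketch that prints the manifest for a given namespace so it can be reviewed, committed to Git, or applied in a loop (the function name and workflow are our own, not a standard tool):

```shell
# Print a default-deny NetworkPolicy for one namespace. Emitting YAML rather
# than applying directly lets you inspect it before it touches the cluster.
default_deny_manifest() {
  local ns="$1"
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: ${ns}
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
EOF
}

default_deny_manifest customer-a

# With cluster access, apply across every namespace:
#   for ns in $(kubectl get ns -o name | cut -d/ -f2); do
#     default_deny_manifest "$ns" | kubectl apply -f -
#   done
```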

Conclusion and Next Steps

Kubernetes pod networking offers immense power—true multi-tenant isolation, fine-grained controls, and rich service discovery—but also introduces operational complexity and real security risks. No CNI or policy framework is a silver bullet. Your architecture must match your scale, compliance, and developer workflow priorities.

For foundational skills, revisit our Docker networking guide and Linux networking for DevOps. For advanced Pod Security enforcement, see our Kubernetes Pod Security Standards case study. If you’re deploying at scale, audit your network policies, monitor resource usage, and start planning for major ingress changes ahead of the 2026 deadline.

Want to go deeper? Review the Kubernetes cluster networking docs and track upcoming CNI and ingress changes at kubernetes.io.

By Thomas A. Anderson

