Recently my team was tasked with changing the default VXLAN port used by OpenShift to create SDN networks for the individual namespaces (Projects). The problem was that we needed to do it on a running cluster, with no downtime.

Why would you ever want to do something like that?!

Well… we really did not want to do that. We were, however, forced to by VMware's highly questionable implementation of VXLAN in their NSX product. It turns out that when VXLAN management of a VMware cluster is turned on in NSX, it not only uses the standard port for VXLANs, it also blocks that same port between physical hosts in the cluster.

By “blocking” I mean that if you, as in our case, are running Open vSwitch in your OpenShift cluster using port 4789 as the VXLAN default, then your pods will not be able to communicate internally if your nodes reside on different physical hosts. In other words: your OpenShift cluster will be down.
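
If you want to confirm which UDP port your overlay is actually using, you can ask Open vSwitch directly on a node. Here is a quick check, assuming the standard OpenShift SDN layout where the tunnel interface is named vxlan0; the exact output will vary with your setup:

    # Show the VXLAN tunnel interface and its options on an OpenShift node.
    # An explicitly configured port appears as dst_port under "options";
    # if it is absent, Open vSwitch falls back to the default UDP port 4789.
    ovs-vsctl list interface vxlan0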

Reference for this post is the VXLAN standard:

https://tools.ietf.org/html/rfc7348

and the Red Hat documentation of the conflict:

https://access.redhat.com/solutions/3083121

however, the issue in question is what to do about it.

So, because IT Operations really, really wants to deploy NSX on a VMware cluster where we were already using port 4789, and because VMware for some reason insists on going beyond the VXLAN standard and operating on the network between physical hosts, we were forced to change the OpenShift standard port.

But we are already live?!

So fortunately OpenShift 3.11 supports changing the SDN port. However, not a lot of effort has gone into describing how to do it on a running cluster. I guess it is not something that is in high demand. Nor should it be.

A modern, and sensible, approach would be to just build a new cluster with the desired SDN port and ask the application teams to use the power of config-as-code to redeploy their entire workloads to the new cluster. Then you move the load balancer, destroy the old cluster, and everyone lives happily ever after.

Unfortunately, this is not currently an option for us. We have too little spare capacity and far too long a delivery time for changes on the hypervisor. Due to timelines and other projects, we were forced to consider a live port migration.

What makes this a problem?

Kubernetes (OpenShift) is a self-correcting platform. If it experiences a problem or an inconsistency, it will do everything in its power to correct it. That is a strong trait in many situations, but it is also a trap that can bite you on a number of occasions. To change the SDN port we will need to

  1. change the configuration
  2. restart the SDN pod on all nodes

and that would create a small network glitch on all nodes. Should the health check discover this (it runs at a 30-second interval), it will react. And the reaction will be to destroy all networking on that node and rebuild it from scratch, in effect restarting all containers. On a node running 60 (or even 250, as is supported) Java-based microservices, that is going to take much longer to get up and running again than the network change itself.
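
To get a feel for the blast radius on a single node, you can simply list everything that would be restarted if the SDN on that node were torn down and rebuilt. A small sketch, where node01.example.com is just a placeholder for one of your nodes:

    # List all pods currently scheduled on the node; every one of these would be
    # restarted if the health check decides to rebuild the node's networking.
    oc adm manage-node node01.example.com --list-pods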

Furthermore, if this is done one node at a time, a distributed application running on multiple nodes will, for a short while, be split into two clusters without connectivity. This is a secondary problem that needs to be addressed by the applications: will they survive that? Most resilient applications should survive a glitch in the network, but knowing developers, there is probably no uniform answer. I will not consider this challenge further here.

Another thing we have (by accident) experienced before is that if you don’t take care to pre-pull the ose-node image to every node, you may have to wait significantly longer than expected to get your SDN up again. This naturally enlarges the problem for the “chatty” applications on the platform.
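
Pre-pulling is easy to script up front; this is the same pre-pull that appears as step 1 in the plan below. A minimal sketch, assuming password-less SSH to the nodes and a Docker-based 3.11 installation:

    # Pre-pull the custom ose-node image on every Ready node, so the SDN pods
    # can restart from the local image cache instead of waiting for a pull.
    oc get nodes | grep -w Ready | awk '{print $1}' | while read nodename; do
      ssh -n -q $nodename "docker pull registry.[domain]/default/ose-node:v3.11-custom"
    done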

So what will you do about it?

Not wanting to create havoc in the development departments, while still being forced to do something, we decided to go beyond the standard procedure.

We decided to pull the source code for OpenShift and modify it to not react to the planned network glitch.

At this point, all credit for pulling the code, finding the exact health check, building our custom ose-node container, and testing this entire plan, goes to Alin Balutoiu.

So we came up with the following solution:

First we built a custom node container with the health check disabled. This prevents the node from reacting when the port is changed.

Here is the exact function that returns “false” when it finds a network error, and where we basically just commented out the return false:

https://github.com/openshift/origin/blob/1221cce3133d3d98e926843ffe03559164a04d68/pkg/network/node/ovscontroller.go#L61-L66

We then compiled the new ose-node container and pushed it to the registry with the tag:

registry.[domain]/default/ose-node:v3.11-custom
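
The tag and push themselves are plain Docker. A hedged sketch, assuming the patched image was built locally under the name ose-node:v3.11-custom (the local name is only an illustration; the registry path and tag are the ones the rest of the procedure expects):

    # Tag the locally built, patched node image and push it to the internal registry.
    docker tag ose-node:v3.11-custom registry.[domain]/default/ose-node:v3.11-custom
    docker push registry.[domain]/default/ose-node:v3.11-custom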

For the change procedure itself we have the following plan.

Steps to follow:

  1. Pre-pull the image “registry.[domain]/default/ose-node:v3.11-custom” on ALL nodes to avoid extended downtime!
  2. Change the image for the SDN daemonset:

    • backup sdn daemonset yaml:

      oc get --export ds sdn -o yaml > sdn_saved.yml

    • change sdn image:

      • oc edit ds sdn
      • delete the whole “image.openshift.io/triggers” field
      • change from image: registry.redhat.io/openshift3/ose-node:v3.11 to image: registry.[domain]/default/ose-node:v3.11-custom
      • save changes
    • Wait for SDN pods to be updated with the new image

  3. Add iptables rules on ALL nodes! IMPORTANT STEP, NETWORK CONNECTION WILL BE DOWN IF THIS STEP IS NOT EXECUTED ON ALL NODES

     iptables -A OS_FIREWALL_ALLOW -p udp -m state --state NEW -m udp --dport 4889 -j ACCEPT
    
  4. Change the SDN port using the following script:

    date
    # Only operate on nodes in Ready state (-w avoids also matching "NotReady").
    oc get nodes | grep -w Ready | awk '{print $1}' | while read nodename; do
     # Find the SDN container on the node and point its vxlan0 tunnel at the new port.
     PODID=$(ssh -n -q $nodename "docker ps | grep k8s_sdn_sdn" | awk '{print $1}')
     ssh -n -q $nodename "hostname
      docker exec $PODID ovs-vsctl set interface vxlan0 options:dst_port=4889
      echo \"... Done\"
     "
    done
    date
    

    Note: the above shouldn’t take more than a few seconds; in our tests it took around 2-3 seconds for 4 nodes in a lab environment.

    From this point on, the nodes are already using the new VXLAN port for their overlay tunnels (a quick way to verify this on every node is sketched after the steps).

  5. Follow the Red Hat documentation for changing the port: https://docs.openshift.com/container-platform/3.11/install_config/configuring_sdn.html#config-changing-vxlan-port-for-cluster-network

    • delete the cluster network: oc delete clusternetwork default
    • edit the /etc/origin/master/master-config.yaml and change the vxlan port
    • restart the master API and controllers on the master nodes: “/usr/local/bin/master-restart api restart && /usr/local/bin/master-restart controllers restart”

    Note: There is no need to add the iptables rules again, since they are already there from STEP 3.

  6. Change back the image for the SDN daemonset:

    • reverting back to the original image
    • change from “image: registry.[domain]/default/ose-node:v3.11-custom” to “image: registry.redhat.io/openshift3/ose-node:v3.11”
    • add back the “image.openshift.io/triggers” field by getting it from the sdn_saved.yml file

Note: SDN pods will be automatically restarted with the new image. No network rebuild will happen.
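
As mentioned in step 4, here is a small verification sketch to confirm that every node’s vxlan0 tunnel really is using the new destination port. It follows the same pattern as the migration script and assumes the same password-less SSH access and container naming:

    # Print the vxlan0 destination port on every Ready node; after step 4
    # each node should report dst_port="4889".
    oc get nodes | grep -w Ready | awk '{print $1}' | while read nodename; do
      PODID=$(ssh -n -q $nodename "docker ps | grep k8s_sdn_sdn" | awk '{print $1}')
      echo -n "$nodename: "
      ssh -n -q $nodename "docker exec $PODID ovs-vsctl get interface vxlan0 options:dst_port"
    done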

Downtime as a result of this:

  • a few seconds for STEP 4 while it runs the command via SSH on all nodes in the cluster; the network will be restored once the loop finishes
  • API will be inaccessible during STEP 5 while the API and controller pods are being restarted; this can be reduced by waiting for one master to restart before proceeding to the next one
  • When this is done sequentially, it will look to the applications as if they are quickly being transferred to a different cluster. It is faster than an actual restart, but a few applications might be sensitive to this glitch.

Risk

The analogy we use when explaining the risk of this is that we are driving a car on the highway and have to take off the seatbelt for a short while to fix something in the back seat. If everything continues as expected, we should be fine for that short while. If not, we could be in trouble rather quickly. Now please stay focused, everyone…