
Conversation

Contributor

@vsoch vsoch commented Jun 26, 2025

@AkihiroSuda - this is a small note for the README to comment on the order of installing components. As you know, the setup uses annotation targeted at flannel to use a host external ip for a multi-node setup. The issue arises with order of operations. If we install flannel with the control plane, that means when new nodes come up, their flannel pods will be created (along with the control plane) to use the "host" discovered IP, which is the usernetes 10.x one. If these addresses that are in the private space can be routed between nodes (possible in some clouds) this is not an issue. It becomes an issue in an HPC or similar environment where the private 10.x address goes to a router and is not known, and the packets are dropped. We ran into this issue on our HPC system, and I realized it was because of the order of operations - we should make sync-external-ip first (adding the annotation) and then make install-flannel to use it. This would only be a bug for specific, multi-node environments. In summary, the current instructions describe:

1. bring up control plane
2. install flannel
3. bring up workers
4. add annotation and patches

And the order should be:

1. bring up control plane
2. bring up workers
3. add annotation and patches
4. install flannel
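
To make the proposed ordering concrete, here is a rough sketch using the Makefile targets the CI already exercises; the split across hosts is illustrative, not the exact commands:

```sh
# Control-plane host: bring up the control plane only (no flannel yet)
make kubeadm-init kubeconfig join-command

# Each worker host: join the cluster
make kubeadm-join

# Control-plane host, once all workers have joined:
make sync-external-ip    # annotate the nodes with the external (physical) host IP
make install-flannel     # flannel pods then start with the annotation already in place
```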

README.md Outdated

The container engine defaults to Docker.
To change the container engine, set `export CONTAINER_ENGINE=podman` or `export CONTAINER_ENGINE=nerdctl`.
For multi-host, you will want to run `make install-flannel` after `make sync-external-ip`, once the worker nodes are up. The sync command adds the `flannel.alpha.coreos.com/public-ip-overwrite` annotation, which directs flannel to use the physical host IP for each node. If a node's flannel pod has already been created, it needs to be restarted to pick up the annotation, so the easiest approach is to install flannel after the annotations have been applied.
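
For illustration, the annotation that `make sync-external-ip` manages can be inspected (or applied by hand) with kubectl; the node name and IP below are placeholders:

```sh
# Check whether a node already carries the flannel override annotation
kubectl describe node worker-1 | grep public-ip-overwrite

# Manually set it to a placeholder external IP (roughly what the sync target does per node)
kubectl annotate node worker-1 \
  flannel.alpha.coreos.com/public-ip-overwrite=192.0.2.10 --overwrite
```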
Member

This step doesn't seem needed on the CI AFAICS?

# Bootstrap a cluster with host0
${LIMACTL} shell host0 ${SERVICE_PORTS} CONTAINER_ENGINE="${CONTAINER_ENGINE}" make -C "${guest_home}/usernetes" kubeadm-init install-flannel kubeconfig join-command
# Let host1 join the cluster
${LIMACTL} copy host0:~/usernetes/join-command host1:~/usernetes/join-command
${LIMACTL} shell host1 ${SERVICE_PORTS} CONTAINER_ENGINE="${CONTAINER_ENGINE}" make -C "${guest_home}/usernetes" kubeadm-join
${LIMACTL} shell host0 ${SERVICE_PORTS} CONTAINER_ENGINE="${CONTAINER_ENGINE}" make -C "${guest_home}/usernetes" sync-external-ip

Contributor Author

If the router between the nodes can route the address, there shouldn't be an issue. I don't know how the setup here (GitHub Actions) is designed, but HPC environments have really strict rules and may not know how to forward packets for the initial address that flannel is assigned. I'm not even allowed to know the details; I just know, from working with our sysadmin, that with the current setup the packets for that address are dropped when I run things in the current order.

What would be the point of adding the annotation, period, if it has no effect unless the pods are restarted?

Member

The issue seems specific to your network, so this should be hinted only in https://github.com/rootless-containers/usernetes?tab=readme-ov-file#advanced-topics

Contributor Author

What is the point of adding the annotation in the current state if it isn't used?

Member

I agree that the pods have to be recreated after setting the node annotations, but how does `make install-flannel` recreate pods?
Does it automatically do `kubectl rollout restart`?

Contributor Author

@vsoch vsoch Jun 30, 2025

`make install-flannel` calls helm to install the flannel chart, which deploys the DaemonSet. If the nodes have already been created (they will have been, with the current setup), their flannel pods will already exist, so they would need to be restarted or recreated. A `kubectl rollout restart` of the DaemonSet would probably work, although I haven't tested it. `make install-flannel` does not currently recreate anything (which makes it erroneous if that is expected).
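
(For reference, such a restart would look roughly like this; the namespace and DaemonSet name are assumptions based on the upstream flannel chart defaults, not something verified here:)

```sh
# Assumed names: the upstream flannel chart typically creates a DaemonSet
# called kube-flannel-ds in the kube-flannel namespace
kubectl -n kube-flannel rollout restart daemonset kube-flannel-ds
kubectl -n kube-flannel rollout status daemonset kube-flannel-ds
```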

Instead of having to do the pod creation twice, I'm just issuing `make install-flannel` once at the end, after the nodes are annotated for flannel. I did experiments this weekend on a larger cluster (N=32), and I liked this ordering because I can see the nodes go explicitly from NotReady to Ready, which is something I can (eventually) programmatically wait for. A rollout restart would already deem the nodes Ready, and I would not have that same signal.

Contributor Author

I see this comment in my email, but I can't find it here:

Wondering if we can just let make sync-external-ip call kubectl rollout restart

I do think that if we decide `make sync-external-ip` is the last command to run for any Usernetes setup - at which point the nodes have been created for multi-node setups too - that might work. It would also just work to install flannel at that point... what is the benefit of doing it earlier?

Member

Maybe you can just append `make install-flannel` here and call it a day:

@echo 'make sync-external-ip'

make sync-external-ip
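
(i.e. the tail of that recipe would end up running both targets, something like this sketch rather than the exact Makefile contents:)

```sh
make sync-external-ip   # apply the external-IP annotations first
make install-flannel    # then install flannel so its pods pick them up
```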

Contributor Author

@vsoch vsoch Jun 30, 2025

We are OK for now, but I wanted to point this out since the annotations (without the restart) are essentially non-functional. Most of our logic (you can see here) for the control plane and worker nodes is orchestrated by a service script, and `make install-flannel` is actually not represented there because it's going to be run one level up, by the orchestrator that starts the services on each node. This will be done under a flux batch job that brings up the lead broker, waits for the join-command, and then starts all the workers; when the node count is reached, it runs each of:

make sync-external-ip
make install-flannel

And then it waits for all nodes to transition from NotReady to Ready before the kubeconfig is provided to the running instance, for running services and apps alongside traditional HPC work (simulation, etc.). I haven't done that part yet because it's not possible to start a user-level service (I get a bus error), but we have a change going into Flux, hopefully this week, that will allow cgroups and services to start cleanly under a flux allocation - right now I need to shell into each node to get the functionality needed. Starting the size-32 cluster took me about 15 minutes! 😆 But I did ~8 hours of experiments, so that's a small amount in comparison.
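
(Roughly, the tail of that orchestration; the kubeconfig path and timeout are placeholders:)

```sh
# Once all workers have joined:
make sync-external-ip
make install-flannel

# Wait for every node to report Ready before handing the kubeconfig to the job
export KUBECONFIG=$HOME/usernetes/kubeconfig   # placeholder path
kubectl wait node --all --for=condition=Ready --timeout=10m
```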

We are close! It's really exciting. 🥳 If you are interested, here is my size 32 cluster (which I deployed twice) over the weekend.

[image: screenshot of the size-32 cluster]

That's the first of that size ever deployed on our systems, and maybe the first ever on a production cluster. It was so lovely to be running kubectl on HPC, something I never imagined would work. I don't know the user base that you intended, but if we get this working (and spread the word), Usernetes is going to be a game-changer for the HPC community. We just don't have anything like it.

Member

@AkihiroSuda AkihiroSuda left a comment

@vsoch vsoch force-pushed the update-deploy-order branch from 351a334 to 5833fe6 on June 26, 2025 13:59
Contributor Author

vsoch commented Jun 26, 2025

@AkihiroSuda I've moved it to Advanced topics. I also fixed a detail that, as stated, was incomplete: higher ports are not required for multi-tenancy. The reason we need them on some systems is that the system does not allow the lower port range. If the ports are allowed, the different nodes have no issue using the same ports.

@vsoch vsoch force-pushed the update-deploy-order branch from 5833fe6 to b0ea001 on June 26, 2025 14:01
Contributor Author

vsoch commented Jun 26, 2025

And I think it would be unlikely for multiple users to be using the same physical node with Usernetes.

Update: I added it back, but marked it experimental. I suppose it could be done, but it's unlikely, and the port customization does support that.

flannel requires an annotation to use a host external IP for a multi-node setup. If the IP addresses in the private space can be routed between nodes (possible in some clouds), this is not an issue. It is only an issue in an HPC or similar environment where the private 10.x address might go to a router, not be understood, and be dropped. We ran into this issue on our HPC system, and I realized it was because of the order of operations - we should run make sync-external-ip first (adding the annotation) and then make install-flannel to use it. This would only be a bug for specific, multi-node environments.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
@vsoch vsoch force-pushed the update-deploy-order branch from b0ea001 to 3c88803 on June 26, 2025 14:08
### Multi-tenancy

Multiple users on the hosts may create their own instances of Usernetes, but the port numbers have to be changed to avoid conflicts.
Multiple users on the hosts may create their own instances of Usernetes. For systems that do not allow the lower port range, or for multiple usernetes deployments on the same physical node (experimental), the port numbers can be changed.
Member

"multiple usernetes deployments on the same physical node" is different from "Multiple users on the hosts" ?

Contributor Author

Yes. You could have a single user that has a physical node under a job and, on that node, creates two Usernetes "nodes." That is different from two users having jobs on the same physical node, each wanting their own Usernetes node. Both cases need to consider port conflicts. The second reason the customization is needed is for centers that are strict about users only having access to the higher port range.

Member

It sounds too complicated to mix multiple topics into this "Multi-tenancy" section.

Contributor Author

To step back, the point is about the ports. You'd want to be able to customize them for any of these cases:

- I have multiple users sharing a physical node (and thus ports could conflict).
- I am a single user running multiple Usernetes nodes on one physical node (this is technically a variant of multi-tenancy, but the tenant is the rootless container node).
- I am only allowed to run on higher ports.
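
(To see the port-conflict cases concretely - the port numbers below are the standard Kubernetes/flannel ones, not Usernetes-specific settings - you can check what the first instance already binds before picking free, higher ports for a second one:)

```sh
# Standard ports: 6443 (kube-apiserver), 10250 (kubelet), 2379-2380 (etcd),
# 8472/udp (flannel VXLAN). A second instance on the same physical node needs
# a non-conflicting set, e.g. in the unprivileged range.
ss -tulpn | grep -E ':(6443|10250|2379|2380|8472)\b'
```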

I think making these cases clear has value to the reader. Please let me know which sections you'd like, or how to divide this, and I'll do it tomorrow - going to sleep now.
