
Conversation

Contributor

@vsoch vsoch commented Jun 26, 2025

@AkihiroSuda - this is a small note for the README to comment on the order of installing components. As you know, the setup uses annotation targeted at flannel to use a host external ip for a multi-node setup. The issue arises with order of operations. If we install flannel with the control plane, that means when new nodes come up, their flannel pods will be created (along with the control plane) to use the "host" discovered IP, which is the usernetes 10.x one. If these addresses that are in the private space can be routed between nodes (possible in some clouds) this is not an issue. It becomes an issue in an HPC or similar environment where the private 10.x address goes to a router and is not known, and the packets are dropped. We ran into this issue on our HPC system, and I realized it was because of the order of operations - we should make sync-external-ip first (adding the annotation) and then make install-flannel to use it. This would only be a bug for specific, multi-node environments. In summary, the current instructions describe:

1. bring up control plane
2. install flannel
3. bring up workers
4. add annotation and patches

And the order should be:

1. bring up control plane
2. bring up workers
3. add annotation and patches
4. install flannel
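
To make the proposed ordering concrete, here is a rough sketch using the Makefile targets the CI already exercises; the split across hosts is illustrative, not the exact commands:

```sh
# Control-plane host: bring up the control plane only (no flannel yet)
make kubeadm-init kubeconfig join-command

# Each worker host: join the cluster
make kubeadm-join

# Control-plane host, once all workers have joined:
make sync-external-ip    # annotate the nodes with the external (physical) host IP
make install-flannel     # flannel pods then start with the annotation already in place
```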

README.md Outdated

The container engine defaults to Docker.
To change the container engine, set `export CONTAINER_ENGINE=podman` or `export CONTAINER_ENGINE=nerdctl`.
For multi-host, you will want to run `make install-flannel` after `make sync-external-ip`, once the worker nodes are up. The sync command adds the `flannel.alpha.coreos.com/public-ip-overwrite` annotation, which directs flannel to use the physical host IP for each node. If a node's flannel pod has already been created, it needs to be restarted to pick up the annotation, so the easiest approach is to install flannel after the annotations have been applied.
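
For illustration, the annotation that `make sync-external-ip` manages can be inspected (or applied by hand) with kubectl; the node name and IP below are placeholders:

```sh
# Check whether a node already carries the flannel override annotation
kubectl describe node worker-1 | grep public-ip-overwrite

# Manually set it to a placeholder external IP (roughly what the sync target does per node)
kubectl annotate node worker-1 \
  flannel.alpha.coreos.com/public-ip-overwrite=192.0.2.10 --overwrite
```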
Member

This step doesn't seem needed on the CI AFAICS?

# Bootstrap a cluster with host0
${LIMACTL} shell host0 ${SERVICE_PORTS} CONTAINER_ENGINE="${CONTAINER_ENGINE}" make -C "${guest_home}/usernetes" kubeadm-init install-flannel kubeconfig join-command
# Let host1 join the cluster
${LIMACTL} copy host0:~/usernetes/join-command host1:~/usernetes/join-command
${LIMACTL} shell host1 ${SERVICE_PORTS} CONTAINER_ENGINE="${CONTAINER_ENGINE}" make -C "${guest_home}/usernetes" kubeadm-join
${LIMACTL} shell host0 ${SERVICE_PORTS} CONTAINER_ENGINE="${CONTAINER_ENGINE}" make -C "${guest_home}/usernetes" sync-external-ip

Contributor Author

If the router between the nodes can route the address, there shouldn't be an issue. I don't know how the setup here (GitHub Actions) is designed, but HPC environments have really strict rules and may not know how to forward packets for the initial address that flannel is assigned. I'm not even allowed to know the details; I just know, from working with our sysadmin, that with the current setup the packets for that address are dropped when I run things in the current order.

What would be the point of adding the annotation, period, if it has no effect unless the pods are restarted?

Member

The issue seems specific to your network, so this should be hinted only in https://github.com/rootless-containers/usernetes?tab=readme-ov-file#advanced-topics

Contributor Author

What is the point of adding the annotation in the current state if it isn't used?

Member

I agree that the pods have to be recreated after setting the node annotations, but how does `make install-flannel` recreate pods?
Does it automatically do `kubectl rollout restart`?

Contributor Author

@vsoch vsoch Jun 30, 2025

`make install-flannel` calls helm to install the flannel chart, which deploys the DaemonSet. If the nodes have already been created (they will have been, with the current setup), their flannel pods will already exist, so they would need to be restarted or recreated. A `kubectl rollout restart` of the DaemonSet would probably work, although I haven't tested it. `make install-flannel` does not currently recreate anything (which makes it erroneous if that is expected).
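
(For reference, such a restart would look roughly like this; the namespace and DaemonSet name are assumptions based on the upstream flannel chart defaults, not something verified here:)

```sh
# Assumed names: the upstream flannel chart typically creates a DaemonSet
# called kube-flannel-ds in the kube-flannel namespace
kubectl -n kube-flannel rollout restart daemonset kube-flannel-ds
kubectl -n kube-flannel rollout status daemonset kube-flannel-ds
```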

Instead of having to do the pod creation twice, I'm just issuing `make install-flannel` once at the end, after the nodes are annotated for flannel. I did experiments this weekend on a larger cluster (N=32), and I liked this ordering because I can see the nodes go explicitly from NotReady to Ready, which is something I can (eventually) programmatically wait for. A rollout restart would already deem the nodes Ready, and I would not have that same signal.

Contributor Author

I see this comment in my email, but I can't find it here:

Wondering if we can just let make sync-external-ip call kubectl rollout restart

I do think that if we decide `make sync-external-ip` is the last command to run for any Usernetes setup - at which point the nodes have been created for multi-node setups too - that might work. It would also just work to install flannel at that point... what is the benefit of doing it earlier?

Member

Maybe you can just append `make install-flannel` here and call it a day:

@echo 'make sync-external-ip'

make sync-external-ip
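
(i.e. the tail of that recipe would end up running both targets, something like this sketch rather than the exact Makefile contents:)

```sh
make sync-external-ip   # apply the external-IP annotations first
make install-flannel    # then install flannel so its pods pick them up
```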

Contributor Author

@vsoch vsoch Jun 30, 2025

We are OK for now, but I wanted to point this out since the annotations (without the restart) are essentially non-functional. Most of our logic (you can see here) for the control plane and worker nodes is orchestrated by a service script, and `make install-flannel` is actually not represented there because it's going to be run one level up, by the orchestrator that starts the services on each node. This will be done under a flux batch job that brings up the lead broker, waits for the join-command, and then starts all the workers; when the node count is reached, it runs each of:

make sync-external-ip
make install-flannel

And then it waits for all nodes to transition from NotReady to Ready before the kubeconfig is provided to the running instance, for running services and apps alongside traditional HPC work (simulation, etc.). I haven't done that part yet because it's not possible to start a user-level service (I get a bus error), but we have a change going into Flux, hopefully this week, that will allow cgroups and services to start cleanly under a flux allocation - right now I need to shell into each node to get the functionality needed. Starting the size-32 cluster took me about 15 minutes! 😆 But I did ~8 hours of experiments, so that's a small amount in comparison.
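
(Roughly, the tail of that orchestration; the kubeconfig path and timeout are placeholders:)

```sh
# Once all workers have joined:
make sync-external-ip
make install-flannel

# Wait for every node to report Ready before handing the kubeconfig to the job
export KUBECONFIG=$HOME/usernetes/kubeconfig   # placeholder path
kubectl wait node --all --for=condition=Ready --timeout=10m
```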

We are close! It's really exciting. 🥳 If you are interested, here is my size 32 cluster (which I deployed twice) over the weekend.

[image: screenshot of the size-32 cluster]

That's the first of that size ever deployed on our systems, and maybe the first ever on a production cluster. It was so lovely to be running kubectl on HPC, something I never imagined would work. I don't know the user base that you intended, but if we get this working (and spread the word), Usernetes is going to be a game-changer for the HPC community. We just don't have anything like it.

Member

@AkihiroSuda AkihiroSuda left a comment

@vsoch vsoch force-pushed the update-deploy-order branch from 351a334 to 5833fe6 on June 26, 2025 13:59
Contributor Author

vsoch commented Jun 26, 2025

@AkihiroSuda I've moved it to Advanced topics. I also fixed a detail that, as stated, was incomplete: higher ports are not required for multi-tenancy. The reason we need them on some systems is that the system does not allow the lower port range. If the ports are allowed, the different nodes have no issue using the same ports.

@vsoch vsoch force-pushed the update-deploy-order branch from 5833fe6 to b0ea001 on June 26, 2025 14:01
Contributor Author

vsoch commented Jun 26, 2025

And I think it would be unlikely for multiple users to be using the same physical node with Usernetes.

Update: I added it back, but marked it experimental. I suppose it could be done, but it's unlikely, and the port customization does support that.

flannel requires an annotation to use a host external IP for a multi-node setup. If the IP addresses in the private space can be routed between nodes (possible in some clouds), this is not an issue. It is only an issue in an HPC or similar environment where the private 10.x address might go to a router, not be understood, and be dropped. We ran into this issue on our HPC system, and I realized it was because of the order of operations - we should run make sync-external-ip first (adding the annotation) and then make install-flannel to use it. This would only be a bug for specific, multi-node environments.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
@vsoch vsoch force-pushed the update-deploy-order branch from b0ea001 to 3c88803 on June 26, 2025 14:08
### Multi-tenancy

Multiple users on the hosts may create their own instances of Usernetes, but the port numbers have to be changed to avoid conflicts.
Multiple users on the hosts may create their own instances of Usernetes. For systems that do not allow the lower port range, or for multiple usernetes deployments on the same physical node (experimental), the port numbers can be changed.
Member

"multiple usernetes deployments on the same physical node" is different from "Multiple users on the hosts" ?

Contributor Author

Yes. You could have a single user that has a physical node under a job and, on that node, creates two Usernetes "nodes." That is different from two users having jobs on the same physical node, each wanting their own Usernetes node. Both cases need to consider port conflicts. The second reason the customization is needed is for centers that are strict about users only having access to the higher port range.

Member

It sounds too complicated to mix multiple topics into this "Multi-tenancy" section.

Contributor Author

To step back, the point is about the ports. You'd want to be able to customize them for any of these cases:

- I have multiple users sharing a physical node (and thus ports could conflict).
- I am a single user running multiple Usernetes nodes on one physical node (this is technically a variant of multi-tenancy, but the tenant is the rootless container node).
- I am only allowed to run on higher ports.
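
(To see the port-conflict cases concretely - the port numbers below are the standard Kubernetes/flannel ones, not Usernetes-specific settings - you can check what the first instance already binds before picking free, higher ports for a second one:)

```sh
# Standard ports: 6443 (kube-apiserver), 10250 (kubelet), 2379-2380 (etcd),
# 8472/udp (flannel VXLAN). A second instance on the same physical node needs
# a non-conflicting set, e.g. in the unprivileged range.
ss -tulpn | grep -E ':(6443|10250|2379|2380|8472)\b'
```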

I think making these cases clear has value to the reader. Please let me know which sections you'd like, or how to divide this, and I'll do it tomorrow - going to sleep now.
