docs: note on order of starting components #374
Conversation
README.md (outdated diff):

> The container engine defaults to Docker.
> To change the container engine, set `export CONTAINER_ENGINE=podman` or `export CONTAINER_ENGINE=nerdctl`.
> For multi-host, you will want to `make install-flannel` after `make sync-external-ip` when worker pods are up. The sync command adds an annotation `flannel.alpha.coreos.com/public-ip-overwrite` for flannel to direct the nodes to use the physical node host IP. If the flannel pod has already been created for a node, it would need to be restarted to recheck the annotation. The easiest approach is to install flannel after the annotations have been applied.
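For illustration only (the node name and IP below are placeholders, and `make sync-external-ip` is what normally writes this), the annotation can be applied and inspected with kubectl:

```bash
# Hypothetical node name and IP; normally `make sync-external-ip` writes this annotation.
kubectl annotate node u7s-host0 \
  flannel.alpha.coreos.com/public-ip-overwrite=192.0.2.10 --overwrite

# Confirm the annotation landed on the node.
kubectl describe node u7s-host0 | grep public-ip-overwrite
```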
This step doesn't seem needed on the CI AFAICS?
usernetes/hack/create-cluster-lima.sh, lines 45 to 51 at e337e8d:

    # Bootstrap a cluster with host0
    ${LIMACTL} shell host0 ${SERVICE_PORTS} CONTAINER_ENGINE="${CONTAINER_ENGINE}" make -C "${guest_home}/usernetes" kubeadm-init install-flannel kubeconfig join-command
    # Let host1 join the cluster
    ${LIMACTL} copy host0:~/usernetes/join-command host1:~/usernetes/join-command
    ${LIMACTL} shell host1 ${SERVICE_PORTS} CONTAINER_ENGINE="${CONTAINER_ENGINE}" make -C "${guest_home}/usernetes" kubeadm-join
    ${LIMACTL} shell host0 ${SERVICE_PORTS} CONTAINER_ENGINE="${CONTAINER_ENGINE}" make -C "${guest_home}/usernetes" sync-external-ip
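A hedged way to check whether the override actually took effect on a given setup (CI or otherwise), assuming flannel's usual node annotations, is to compare the address flannel recorded against the overwrite:

```bash
# flannel records the address it uses as flannel.alpha.coreos.com/public-ip;
# flannel.alpha.coreos.com/public-ip-overwrite is what `make sync-external-ip` adds.
kubectl describe nodes | grep -E 'Name:|flannel\.alpha\.coreos\.com/public-ip'
```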
If the router between the nodes can route the address, there shouldn't be an issue. I don't know how the setup here (GitHub Actions) is designed, but in HPC environments we have really strict rules, and the routers may not know how to forward packets for the initial address that flannel is assigned. I'm not even allowed to know the details; I just know, from working with our sysadmin, that with the current setup the packets for that address are dropped when I run things in the current order.
What would be the point of adding the annotation, period, if it has no effect unless the pods are restarted?
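To make the routing point concrete, here is a hypothetical check from one worker host (both addresses are placeholders: a Usernetes-internal 10.x address versus the physical host IP that the annotation points flannel at):

```bash
# The internal address may have no usable route between physical hosts,
# while the physical host IP does.
ip route get 10.100.0.2   # placeholder for the flannel-assigned internal address
ip route get 192.0.2.10   # placeholder for the physical host IP
```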
The issue seems specific to your network, so this should be hinted only in https://github.com/rootless-containers/usernetes?tab=readme-ov-file#advanced-topics
What is the point of adding the annotation in the current state if it isn't used?
I agree that the pods have to be recreated after setting the pod annotations, but how does `make install-flannel` recreate pods?
Does it automatically do `kubectl rollout restart`?
`make install-flannel` makes a call to install flannel via helm, which deploys the DaemonSet. If the nodes have already joined (they will have with the current setup), their flannel pods will already exist, so they would need to be restarted or recreated. A `kubectl rollout restart` of the DaemonSet would probably work, although I haven't tested it. `make install-flannel` does not currently recreate anything (which makes it erroneous if that is expected).
Instead of having the pods created twice, I'm just issuing `make install-flannel` once at the end, after the nodes have been annotated for flannel. I did experiments this weekend on a larger cluster (N=32) and I liked this ordering because I can see the different nodes go explicitly from NotReady to Ready, and that is something I can (eventually) wait for programmatically. With a rollout restart the nodes would already be deemed Ready, and I would lose that signal.
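For reference, the restart being discussed would look roughly like this; untested here, and the namespace and DaemonSet name assume the upstream flannel chart defaults:

```bash
# Restart the flannel pods so they re-read the node annotations, then wait for the rollout.
kubectl -n kube-flannel rollout restart daemonset kube-flannel-ds
kubectl -n kube-flannel rollout status daemonset kube-flannel-ds --timeout=5m
```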
I see this comment in my email, but I can't find it here:

> Wondering if we can just let `make sync-external-ip` call `kubectl rollout restart`

I do think that if we decide `make sync-external-ip` is definitively the last command to run for any Usernetes setup - at which point the nodes have been created for multi-node setups too - that might work. It would also just work to install flannel at that point... what is the benefit to doing it earlier?
We are OK for now, but I wanted to point this out, since the annotations (without the restart) are essentially non-functional. Most of our logic (you can see here) for each of the control plane and worker nodes is orchestrated by a service script, and `make install-flannel` is actually not represented there because it's going to be run one level up, by the orchestrator that starts services on each node. This will be done under a Flux batch job that brings up the lead broker, waits for the join-command, starts all the workers, and then, when the expected node count is reached, runs `make sync-external-ip` followed by `make install-flannel`, and waits for all nodes to transition from NotReady to Ready before the kubeconfig is provided to the running instance for running services and apps alongside traditional HPC work (simulation, etc.).

I haven't done that last part yet because it's not possible to start a user-level service (I get a bus error). But we have a change going into Flux, hopefully this week, that will allow cgroups and services to start cleanly under a Flux allocation - right now I need to shell into each node to get the functionality needed. Starting the size-32 cluster took me about 15 minutes! 😆 But I did ~8 hours of experiments, so that is small in comparison.
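As an aside, the programmatic wait described above could be as simple as the following sketch, once a kubeconfig is available (the timeout value is arbitrary):

```bash
# Block until every node reports the Ready condition.
kubectl wait node --all --for=condition=Ready --timeout=15m
```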
We are close! It's really exciting. 🥳 If you are interested, here is my size 32 cluster (which I deployed twice) over the weekend.
That's the first cluster of that size ever deployed on our systems, and maybe the first ever on a production cluster. It was so lovely to be running kubectl on HPC, something I never imagined would work. I don't know what user base you intended, but if we get this working (and spread the word), Usernetes is going to be a game-changer for the HPC community. We just don't have anything like it.
AkihiroSuda left a comment
@AkihiroSuda I've moved it to the advanced topics section. I also fixed a detail that, as previously stated, was incomplete: higher ports are not required for multi-tenancy. The reason we need them on some systems is that the system does not allow the lower port range. If the ports are allowed, the different nodes have no issue using the same ports.
And I think it would be unlikely for multiple users to be using the same physical node with Usernetes. Update: I added it back, but marked it experimental. I suppose it could be done, but it's unlikely, and the port customization does support that.
flannel requires an annotation to use a host external IP for a multi-node setup. If the IP addresses in the private space can be routed between nodes (possible in some clouds), this is not an issue. It is only an issue in an HPC or similar environment where the private 10.x address might go to a router, not be understood, and be dropped. We ran into this issue on our HPC system, and I realized it was because of the order of operations: we should run `make sync-external-ip` first (adding the annotation) and then `make install-flannel` to use it. This would only be a bug for specific, multi-node environments.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
README.md, `### Multi-tenancy` section:

Before:
> Multiple users on the hosts may create their own instances of Usernetes, but the port numbers have to be changed to avoid conflicts.

After:
> Multiple users on the hosts may create their own instances of Usernetes. For systems that do not allow the lower port range, or for multiple usernetes deployments on the same physical node (experimental), the port numbers can be changed.
"multiple usernetes deployments on the same physical node" is different from "Multiple users on the hosts" ?
Yes. You could have a single user that has a physical node under a job and, on that node, creates two Usernetes "nodes." That is different from two users having jobs on the same physical node, each wanting their own Usernetes node. Both cases need to consider port conflicts. As for the second reason, the customization is needed for centers that are strict about users only having access to the higher port ranges.
Sounds too complicated to mix up multiple topics in this "Multi-tenancy" section here.
To step back, the point is about the ports. You'd want to be able to customize them in any of these cases (a small port-conflict check is sketched below):
- I have multiple users sharing a physical node (and thus ports could conflict).
- I am a single user running multiple Usernetes nodes on one physical node (this is technically a variant of multi-tenancy, but the tenant is the rootless container node).
- I am only allowed to run on higher-numbered ports.
I think making these cases clear has value to the reader. Please let me know the sections you'd like, or how to divide this, and I'll do it tomorrow; going to sleep now.
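As a small illustration of the port-conflict cases (the ports listed are common Kubernetes defaults, not values this repo mandates), a second instance on the same physical host could check its planned ports before starting:

```bash
# Check whether each planned port is already bound by another instance on this host.
for port in 6443 2379 10250; do
  if ss -tln | grep -q ":${port} "; then
    echo "port ${port} is already in use; choose a different one for this instance"
  fi
done
```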

@AkihiroSuda - this is a small note for the README to comment on the order of installing components. As you know, the setup uses an annotation targeted at flannel to make it use a host external IP for a multi-node setup. The issue arises with the order of operations. If we install flannel along with the control plane, then when new nodes come up, their flannel pods will be created (along with the control plane's) using the "host" discovered IP, which is the Usernetes-internal 10.x one. If these private-space addresses can be routed between nodes (possible in some clouds), this is not an issue. It becomes an issue in an HPC or similar environment where the private 10.x address goes to a router, is not known, and the packets are dropped. We ran into this issue on our HPC system, and I realized it was because of the order of operations: we should run `make sync-external-ip` first (adding the annotation) and then `make install-flannel` to use it. This would only be a bug for specific, multi-node environments.

In summary, the current instructions run `make install-flannel` alongside the control plane init and only afterwards `make sync-external-ip`. And the order should be: `make sync-external-ip` first, then `make install-flannel`.
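A sketch of the proposed ordering (the make targets are the repo's; the host layout comments are my assumptions):

```bash
# 1. Control plane host:   make kubeadm-init          (no install-flannel yet)
# 2. Each worker host:     make kubeadm-join
# 3. Control plane, last:
make sync-external-ip    # annotate every node with its physical host IP
make install-flannel     # flannel pods are created with the annotation already in place
```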