HDDS-14619. Add option in Container Balancer CLI for excluding containers #9785

Open
sravani-revuri wants to merge 4 commits into apache:master from sravani-revuri:HDDS-14619

sravani-revuri commented Feb 18, 2026

What changes were proposed in this pull request?

A configuration key for this already exists: hdds.container.balancer.exclude.containers. The aim of this JIRA is to expose that configuration as an option of ContainerBalancerStartSubcommand, in the same way as the existing options.
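To illustrate the kind of value the option carries, here is a minimal sketch of turning an exclude list into container IDs. It assumes the option accepts the same comma-separated list of IDs as the configuration key (the manual test below only passes a single ID, "1"). `ExcludeContainersParser` and `parse` are hypothetical names for illustration, not code from this patch or from Ozone.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Ozone: models how an exclude list
// such as "1, 4,5" could be turned into a set of container IDs.
final class ExcludeContainersParser {
  private ExcludeContainersParser() { }

  /**
   * Parses a comma-separated list of container IDs.
   * Assumption: IDs are separated by commas, optionally surrounded by spaces.
   * A null or blank value means "exclude nothing".
   * A non-numeric entry makes Long.parseLong throw NumberFormatException.
   */
  static Set<Long> parse(String value) {
    if (value == null || value.trim().isEmpty()) {
      return new TreeSet<>();
    }
    return Arrays.stream(value.split(","))
        .map(String::trim)
        .filter(s -> !s.isEmpty())
        .map(Long::parseLong)
        .collect(Collectors.toCollection(TreeSet::new));
  }

  public static void main(String[] args) {
    // The value used in the manual test below.
    System.out.println(parse("1"));
  }
}
```

A sorted set is used here only so that printed output is stable; the real implementation may store the value differently.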

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-14619

How was this patch tested?

Manually tested using the following commands.

Creating an imbalance:

bash-5.1$ ozone admin datanode maintenance datanode1
Entering maintenance mode on datanode(s):
datanode1

bash-5.1$ ozone sh volume create /volume1
ozone sh bucket create --replication THREE --type RATIS /volume1/bucket1

dd if=/dev/urandom of=/tmp/100mb bs=1048576 count=100
for i in {1..3}; do
  ozone sh key put /volume1/bucket1/file-$i.txt /tmp/100mb --replication=THREE --type=RATIS
done

100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 0.20387 s, 514 MB/s

Recommissioning the node:

bash-5.1$ ozone admin datanode recommission datanode1
Started recommissioning datanode(s):
datanode1

Running the balancer with the --exclude-containers CLI option:

bash-5.1$ ozone admin containerbalancer start --exclude-containers "1" -t 0.1 -d 100 -i 3
ozone admin containerbalancer status --verbose

Container Balancer started successfully.
ContainerBalancer is Running.
Started at: 2026-02-18 06:48:54
Balancing duration: 1s

Container Balancer Configuration values:
Key                                                Value
Threshold                                          0.1
Max Datanodes to Involve per Iteration(percent)    100
Max Size to Move per Iteration                     0GB
Max Size Entering Target per Iteration             26GB
Max Size Leaving Source per Iteration              26GB
Number of Iterations                               3
Time Limit for Single Container's Movement         65min
Time Limit for Single Container's Replication      50min
Interval between each Iteration                    0min
Whether to Enable Network Topology                 false
Whether to Trigger Refresh Datanode Usage Info     false
Container IDs to Exclude from Balancing            1
Datanodes Specified to be Balanced                 None
Datanodes Excluded from Balancing                  None

Current iteration info:
Key                                                Value
Iteration number                                   1
Iteration duration                                 1s
Iteration result                                   -
Size scheduled to move                             200 MB
Moved data size                                    0 B
Scheduled to move containers                       2
Already moved containers                           0
Failed to move containers                          0
Failed to move containers by timeout               0
Entered data to nodes                              
47b58f63-35c7-484f-99f4-79c337c33036 <- 100 MB
e7b69dd1-0be0-4bc8-99d8-ca01c87ddb45 <- 100 MB
Exited data from nodes                             
cf2e4b7a-278a-4354-adbc-7119e9ffec71 -> 100 MB
c6befdb9-ef5c-4b40-8592-11dee4d0bd2a -> 100 MB

Logs showing container 1 is avoided:

2026-02-18 06:48:54,566 [scm1-ContainerBalancerTask-1] INFO balancer.ContainerBalancerTask: ContainerBalancer is trying to move container #3 with size 104857600B from source datanode c6befdb9-ef5c-4b40-8592-11dee4d0bd2a(ozone-balancer-datanode3-1.ozone-balancer_default/172.19.0.3) to target datanode e7b69dd1-0be0-4bc8-99d8-ca01c87ddb45(ozone-balancer-datanode4-1.ozone-balancer_default/172.19.0.12)
2026-02-18 06:48:54,570 [scm1-ContainerBalancerTask-1] INFO balancer.ContainerBalancerTask: ContainerBalancer is trying to move container #2 with size 104857600B from source datanode cf2e4b7a-278a-4354-adbc-7119e9ffec71(ozone-balancer-datanode5-1.ozone-balancer_default/172.19.0.15) to target datanode 47b58f63-35c7-484f-99f4-79c337c33036(ozone-balancer-datanode1-1.ozone-balancer_default/172.19.0.13)
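The check made by eye on these log lines could also be automated. The sketch below extracts the container IDs from "trying to move container #N" messages, using the wording of the log lines above, and verifies that no excluded ID appears. `BalancerLogCheck` and its methods are hypothetical names, not part of this patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Ozone: scans SCM log text for the
// "trying to move container #N" messages shown above.
final class BalancerLogCheck {
  // Assumption: the log message keeps the wording seen in this PR's logs.
  private static final Pattern MOVE =
      Pattern.compile("trying to move container #(\\d+)");

  private BalancerLogCheck() { }

  /** Returns the container IDs the balancer tried to move, in log order. */
  static List<Long> scheduledContainerIds(String log) {
    List<Long> ids = new ArrayList<>();
    Matcher m = MOVE.matcher(log);
    while (m.find()) {
      ids.add(Long.parseLong(m.group(1)));
    }
    return ids;
  }

  /** True if none of the excluded IDs appears in a move message. */
  static boolean noExcludedContainerScheduled(String log, Set<Long> excluded) {
    for (Long id : scheduledContainerIds(log)) {
      if (excluded.contains(id)) {
        return false;
      }
    }
    return true;
  }
}
```

Applied to the two log lines above with container 1 excluded, `scheduledContainerIds` yields containers 3 and 2, and `noExcludedContainerScheduled` returns true.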

Running the balancer without the --exclude-containers CLI option:

bash-5.1$ ozone admin containerbalancer start -t 0.1 -d 100 -i 3
ozone admin containerbalancer status --verbose

Container Balancer started successfully.
ContainerBalancer is Running.
Started at: 2026-02-18 07:26:01
Balancing duration: 0s

Container Balancer Configuration values:
Key                                                Value
Threshold                                          0.1
Max Datanodes to Involve per Iteration(percent)    100
Max Size to Move per Iteration                     0GB
Max Size Entering Target per Iteration             26GB
Max Size Leaving Source per Iteration              26GB
Number of Iterations                               3
Time Limit for Single Container's Movement         65min
Time Limit for Single Container's Replication      50min
Interval between each Iteration                    0min
Whether to Enable Network Topology                 false
Whether to Trigger Refresh Datanode Usage Info     false
Container IDs to Exclude from Balancing            None
Datanodes Specified to be Balanced                 None
Datanodes Excluded from Balancing                  None

Current iteration info:
Key                                                Value
Iteration number                                   1
Iteration duration                                 0s
Iteration result                                   -
Size scheduled to move                             300 MB
Moved data size                                    0 B
Scheduled to move containers                       3
Already moved containers                           0
Failed to move containers                          0
Failed to move containers by timeout               0
Entered data to nodes                              
36eb6bbd-c1cc-42ae-b20c-720d9fa35c24 <- 100 MB
964d38ff-e20c-434f-b80b-52a662c3c4da <- 100 MB
bfa848dc-ac96-4f2f-a88d-9be15112aa29 <- 100 MB
Exited data from nodes                             
fb8b6e1f-82f5-4a0e-afc1-7300f08abf84 -> 100 MB
f8135171-5323-408f-94e6-faf59f3165db -> 100 MB
8b651307-140c-4b0c-a44f-bc8bfc85d35f -> 100 MB

Logs showing container 1 is not avoided:

2026-02-18 07:26:01,063 [scm1-ContainerBalancerTask-1] INFO balancer.ContainerBalancerTask: ContainerBalancer is trying to move container #3 with size 104857600B from source datanode 8b651307-140c-4b0c-a44f-bc8bfc85d35f(ozone-balancer-datanode6-1.ozone-balancer_default/172.19.0.14) to target datanode 36eb6bbd-c1cc-42ae-b20c-720d9fa35c24(ozone-balancer-datanode1-1.ozone-balancer_default/172.19.0.8)
2026-02-18 07:26:01,065 [scm1-ContainerBalancerTask-1] INFO balancer.ContainerBalancerTask: ContainerBalancer is trying to move container #2 with size 104857600B from source datanode f8135171-5323-408f-94e6-faf59f3165db(ozone-balancer-datanode4-1.ozone-balancer_default/172.19.0.10) to target datanode 964d38ff-e20c-434f-b80b-52a662c3c4da(ozone-balancer-datanode5-1.ozone-balancer_default/172.19.0.15)
2026-02-18 07:26:01,066 [scm1-ContainerBalancerTask-1] INFO balancer.ContainerBalancerTask: ContainerBalancer is trying to move container #1 with size 104857600B from source datanode fb8b6e1f-82f5-4a0e-afc1-7300f08abf84(ozone-balancer-datanode3-1.ozone-balancer_default/172.19.0.4) to target datanode bfa848dc-ac96-4f2f-a88d-9be15112aa29(ozone-balancer-datanode2-1.ozone-balancer_default/172.19.0.9)

@adoroszlai adoroszlai changed the title HDDS-14619. Include an option for excluding containers in the Container Balancer CLI HDDS-14619. Add option in Container Balancer CLI for excluding containers Feb 18, 2026

@sreejasahithi sreejasahithi left a comment

Thanks @sravani-revuri for working on this.

@siddhantsangwan siddhantsangwan left a comment

The change largely looks good. However, I'm not sure the robot test explicitly checks whether containers were actually excluded. It only tests that the balancer started and was able to balance, right?

What I think you can do is add another robot test case/function to the existing robot test that starts the balancer with all containers excluded (before the current code, where the balancer is started normally). It should start successfully but stop without moving any containers or data. The existing test case, which starts the balancer without excluding containers, can then run, move some data, and balance the cluster.

Additionally, you can add some code to testIfCBCLIOverridesConfigs() to test that excluded containers passed in from the CLI override the default, which is empty.
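As a rough illustration of the behaviour such a test would assert, the following plain-Java model resolves the effective value: a value given on the CLI wins, otherwise the empty default applies. It does not use the real test or any Ozone class; `ExcludeContainersOverrideSketch` and its members are hypothetical names.

```java
import java.util.Optional;

// Hypothetical model, not Ozone code: a CLI-supplied exclude list
// overrides the default, which is empty.
final class ExcludeContainersOverrideSketch {
  /** Default stated in the review comment: no containers excluded. */
  static final String DEFAULT_EXCLUDE_CONTAINERS = "";

  private ExcludeContainersOverrideSketch() { }

  /** Returns the CLI value when one was passed, else the default. */
  static String effectiveExcludeContainers(Optional<String> cliValue) {
    return cliValue.orElse(DEFAULT_EXCLUDE_CONTAINERS);
  }
}
```

In the real test, the same two cases would presumably be checked against the configuration the balancer actually receives.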

@siddhantsangwan

I'm happy with the manual test as well: you can try excluding all containers to prove that no containers are picked. That, plus some changes to testIfCBCLIOverridesConfigs(), will be sufficient for now if changing the robot test turns out to be too messy.
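One way to confirm the "no containers are picked" outcome is to read the "Scheduled to move containers" line from the `status --verbose` output shown in the PR description. The sketch below assumes that line is still printed, in the same format, when every container is excluded. That may not hold if the balancer has already stopped by the time status is queried. `BalancerStatusCheck` is a hypothetical name.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Ozone: reads one counter from the
// "Current iteration info" table of `containerbalancer status --verbose`.
final class BalancerStatusCheck {
  // Assumption: the row label matches the output pasted in this PR.
  private static final Pattern SCHEDULED = Pattern.compile(
      "^Scheduled to move containers\\s+(\\d+)\\s*$", Pattern.MULTILINE);

  private BalancerStatusCheck() { }

  /** Returns the scheduled-container count, or empty if the row is absent. */
  static OptionalLong scheduledContainers(String statusOutput) {
    Matcher m = SCHEDULED.matcher(statusOutput);
    return m.find()
        ? OptionalLong.of(Long.parseLong(m.group(1)))
        : OptionalLong.empty();
  }

  /** True only if the row is present and reports zero containers. */
  static boolean nothingScheduled(String statusOutput) {
    OptionalLong n = scheduledContainers(statusOutput);
    return n.isPresent() && n.getAsLong() == 0;
  }
}
```

A missing row is deliberately treated as "not verified" rather than as zero, so that a stopped balancer does not look like a passing check.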
