
Commit 5372008

Merge remote-tracking branch 'origin/4.15' into main

2 parents: 045266e + 65f540a

File tree

4 files changed: +197 -35 lines

source/adminguide/api.rst

Lines changed: 20 additions & 17 deletions
@@ -43,8 +43,15 @@ possible as well. For example, see Using an LDAP Server for User
 Authentication.
 
 
-User Data and Meta Data via the Virtual Router
-----------------------------------------------
+User Data and Meta Data
+~~~~~~~~~~~~~~~~~~~~~~~
+
+The user-data service on a Shared or Isolated Network can be provided through the
+Virtual Router or through an attached ISO called the Config drive.
+
+User Data and Meta Data Via Virtual Router
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
 
 CloudStack provides API access to attach up to 32KB of user data to a
 deployed VM. Deployed VMs also have access to instance metadata via the
@@ -57,16 +64,12 @@ the user data:
 #. Run the following command to find the virtual router.
 
    .. code:: bash
-
       # cat /var/lib/dhclient/dhclient-eth0.leases | grep dhcp-server-identifier | tail -1
-
 #. Access user data by running the following command using the result of
    the above command:
 
    .. code:: bash
-
       # curl http://10.1.1.1/latest/user-data
-
 Meta Data can be accessed similarly, using a URL of the form
 http://10.1.1.1/latest/meta-data/{metadata type}. (For backwards
 compatibility, the previous URL http://10.1.1.1/latest/{metadata type}
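
For instance, one of the metadata types listed in the next hunk can be
retrieved with the same curl pattern, using the virtual router address
found above:

.. code:: bash

   # Retrieve the instance name of the VM (metadata type: instance-id)
   curl http://10.1.1.1/latest/meta-data/instance-id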
@@ -88,10 +91,7 @@ is also supported.) For metadata type, use one of the following:
 - instance-id. The instance name of the VM
 
 User Data and Meta Data via Config Drive
-----------------------------------------
-
-The user-data service on a Shared or L2 Network can be provided through the
-Virtual Router or through an attached iso called the Config drive.
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Config drive is an ISO file that is mounted as a cd-rom on a user VM and
 contains the user VM related userdata, metadata (incl. ssh-keys) and
@@ -103,8 +103,8 @@ To use the config drive the network offering must have the “ConfigDrive”
 provider selected for the userdata service.
 
 If the network offering uses ConfigDrive for userdata and the template is
-password enabled, the password string for the VM is placed in password.txt file
-and it is included in the ISO.
+password enabled, the password string for the VM is placed in the
+vm_password.txt file and it is included in the ISO.
 
 ConfigDrive availability
 ~~~~~~~~~~~~~~~~~~~~~~~~
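
As a sketch of the above, a network offering that uses ConfigDrive as its
userdata provider could be created through the createNetworkOffering API,
shown here via CloudMonkey; the offering name and the service/provider map
parameters below are illustrative assumptions, not part of this commit:

.. code:: bash

   # Hypothetical isolated network offering with ConfigDrive for UserData
   cmk create networkoffering name=isolated-configdrive \
       displaytext="Isolated network, userdata via ConfigDrive" \
       guestiptype=Isolated traffictype=GUEST \
       supportedservices=Dhcp,Dns,SourceNat,UserData \
       serviceproviderlist[0].service=UserData \
       serviceproviderlist[0].provider=ConfigDrive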
@@ -113,8 +113,8 @@ user instance, such that any other ISO image (e.g. boot image or vmware tools)
 is mounted on 1st cd/dvd drive. This means existing functionality of
 supporting 1 cd rom drive is still available.
 
-At Password reset or update of user data, Secondary Storage VM will rebuild the
-ConfigDrive ISO image. That is the existing ISO is mounted on a temporary directory,
+At password reset or update of user data, the Config Drive ISO
+will be rebuilt. The existing ISO is mounted in a temporary directory,
 password, userdata or ssh-keys are updated and a new ISO is built from the
 updated directory structure.
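
To read the rebuilt files from inside the guest, the ISO can be remounted
and inspected. The device name and the cloudstack/ directory layout below
are assumptions that may differ per guest OS and CloudStack version:

.. code:: bash

   # Remount the config drive and read the regenerated files
   mount -o ro /dev/cdrom /mnt
   cat /mnt/cloudstack/userdata/user_data.txt   # assumed path
   cat /mnt/cloudstack/password/vm_password.txt # assumed path
   umount /mnt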

@@ -123,9 +123,12 @@ To access the updated userdata, the user needs to remount the config drive ISO.
 
 When a VM is stopped, the ConfigDrive network element will trigger the
 Secondary Storage VM to remove the ISO from the secondary storage.
+If the config drive is stored on primary storage, the network element will
+trigger the host to remove the ISO.
 
-Since the ISO is available on secondary storage, there is no need for an extra
-implementation in case of migration.
+The config drive ISO can be stored on primary storage by setting the global
+setting vm.configdrive.primarypool.enabled to true. This is currently only
+supported when using the KVM hypervisor.
 
 Supporting ConfigDrive
 ~~~~~~~~~~~~~~~~~~~~~~
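
For example, the global setting named above can be changed through the
updateConfiguration API; a minimal CloudMonkey sketch:

.. code:: bash

   # Store config drive ISOs on primary storage (KVM only)
   cmk update configuration name=vm.configdrive.primarypool.enabled value=true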
@@ -176,4 +179,4 @@ VMdata - a list of String arrays representing [“directory”, “filename”,
 
 For more detailed information about the Config Drive implementation refer to
 the `Wiki Article
-<https://cwiki.apache.org/confluence/display/CLOUDSTACK/Using+ConfigDrive+for+Metadata%2C+Userdata+and+Password#:~:text=CLOUDSTACK%2D9813%20%2D%20(),%2Dkeys)%20and%20password%20files>`_
+<https://cwiki.apache.org/confluence/display/CLOUDSTACK/Using+ConfigDrive+for+Metadata%2C+Userdata+and+Password#:~:text=CLOUDSTACK%2D9813%20%2D%20(),%2Dkeys)%20and%20password%20files>`_

source/adminguide/networking.rst

Lines changed: 2 additions & 2 deletions
@@ -88,8 +88,8 @@ Basic zones or Advanced Zones with Security Groups.
 Network” <networking_and_traffic.html#configuring-a-shared-guest-network>`_.
 
 
-L2 Networks
-~~~~~~~~~~~
+L2 (Layer 2) Networks
+~~~~~~~~~~~~~~~~~~~~~
 
 L2 networks provide network isolation without any other services. This
 means that there will be no virtual router. It is assumed that the end

source/adminguide/networking/advanced_zone_config.rst

Lines changed: 1 addition & 0 deletions
@@ -60,6 +60,7 @@ configure the base guest network:
    want to assign a special domain name to the guest VM network, specify a
    DNS suffix.
 
+
 #. Click OK.
 

source/adminguide/reliability.rst

Lines changed: 174 additions & 16 deletions
@@ -61,25 +61,82 @@ still available but the system VMs will not be able to contact the
 management server.
 
 
-HA-Enabled Virtual Machines
----------------------------
+Multiple Management Servers Support on Agents
+---------------------------------------------
 
-The user can specify a virtual machine as HA-enabled. By default, all
-virtual router VMs and Elastic Load Balancing VMs are automatically
-configured as HA-enabled. When an HA-enabled VM crashes, CloudStack
-detects the crash and restarts the VM automatically within the same
-Availability Zone. HA is never performed across different Availability
-Zones. CloudStack has a conservative policy towards restarting VMs and
-ensures that there will never be two instances of the same VM running at
-the same time. The Management Server attempts to start the VM on another
-Host in the same cluster.
+In a CloudStack environment with multiple management servers, an agent can
+be configured to choose, based on an algorithm, which management server to
+connect to. This can serve as an internal load balancer or provide high
+availability. An administrator is responsible for setting the list of
+management servers and choosing a sorting algorithm using global settings.
+The management server is responsible for propagating the settings to the
+connected agents (running inside of the Secondary Storage
+Virtual Machine, Console Proxy Virtual Machine or the KVM hosts).
 
-HA features work with iSCSI or NFS primary storage. HA with local
-storage is not supported.
+The three global settings that need to be configured are the following:
+
+- host: a comma-separated list of management server IP addresses
+- indirect.agent.lb.algorithm: the algorithm for the indirect agent LB
+- indirect.agent.lb.check.interval: the preferred host check interval
+  for the agent's background task that checks and switches to an agent's
+  preferred host.
+
+These settings can be configured from the global settings page in the UI or
+using the updateConfiguration API call; see the sketch after this diff.
+
+The indirect.agent.lb.algorithm setting supports the following algorithm
+options:
 
+- static: use the list of management server IP addresses as provided.
+- roundrobin: evenly spread hosts across management servers, based on the
+  host's id.
+- shuffle: pseudo-randomly sort the list (this is not recommended for
+  production).
 
-HA for Hosts
-------------
+.. note::
+   The 'static' and 'roundrobin' algorithms strictly check the order of the
+   comma-separated management server host addresses; the 'shuffle' algorithm
+   only checks their content, not their order.
+
+Changes to the `indirect.agent.lb.algorithm` and `host` global settings do
+not require restarting the management server(s) or the agents; a change to
+these settings is propagated to all connected agents.
+
+The comma-separated management server list is propagated to agents in the
+following cases:
+
+- An addition of an agent (including ssvm and cpvm system VMs).
+- Connection or reconnection of an agent to a management server.
+- After an administrator changes the 'host' and/or the
+  'indirect.agent.lb.algorithm' global settings.
+
+On the agent side, the 'host' setting is saved in its properties file as:
+`host=<comma separated addresses>@<algorithm name>`.
+
+From the agent's perspective, the first address in the propagated list is
+considered the preferred host. A background task that checks and switches
+back to the agent's preferred host can be activated by configuring
+`indirect.agent.lb.check.interval`, which is a cluster-level global setting;
+administrators can also override this per agent by configuring
+'host.lb.check.interval' in the `agent.properties` file.
+
+When an agent receives a host and algorithm combination, the host-specific
+background check interval is also sent and is dynamically reconfigured in
+the background task without the need to restart agents.
+
+As an example, suppose an environment has 3 management servers (A, B and C)
+and 3 KVM agents. With 'host' = 'A,B,C', agents will receive lists
+depending on the 'indirect.agent.lb.algorithm' value:
+
+- 'static': each agent receives the list 'A,B,C'.
+- 'roundrobin': the first agent receives 'A,B,C', the second agent receives
+  'B,C,A', and the third agent receives 'C,A,B'.
+- 'shuffle': each agent receives the list in random order.
+
+HA-Enabled Virtual Machines
+---------------------------
 
 The user can specify a virtual machine as HA-enabled. By default, all
 virtual router VMs and Elastic Load Balancing VMs are automatically
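
As a sketch of the agent load-balancing settings added in the hunk above
(the management server addresses and the interval value are examples):

.. code:: bash

   # Configure the management server list and algorithm via the
   # updateConfiguration API
   cmk update configuration name=host value=10.0.32.11,10.0.32.12,10.0.32.13
   cmk update configuration name=indirect.agent.lb.algorithm value=roundrobin
   cmk update configuration name=indirect.agent.lb.check.interval value=1800

   # Resulting propagated entry in an agent's agent.properties, following
   # the host=<addresses>@<algorithm> format described above:
   # host=10.0.32.12,10.0.32.13,10.0.32.11@roundrobin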
@@ -96,7 +153,7 @@ storage is not supported.
 
 
 Dedicated HA Hosts
-~~~~~~~~~~~~~~~~~~
+------------------
 
 One or more hosts can be designated for use only by HA-enabled VMs that
 are restarting due to a host failure. Setting up a pool of such
@@ -126,6 +183,107 @@ that you want to dedicate to HA-enabled VMs.
 a crash.
 
 
+HA-Enabled Hosts
+----------------
+
+The user can specify a host as HA-enabled. In the event of a host
+failure, attempts will be made to recover the failed host by first
+issuing OOBM commands. If the host recovery fails, the host will be
+fenced and placed into maintenance mode. To restore the host to normal
+operation, manual intervention is then required.
+
+Out-of-band management (OOBM) is a requirement for HA-enabled hosts and
+has to be configured on all intended participating hosts
+(see `“Out of band management” <hosts.html#out-of-band-management>`_).
+
+Host-HA can be configured at a granular host/cluster/zone level. In a large
+environment, some hosts in a cluster can be HA-enabled and some not.
+
+Host-HA uses a state machine design to manage the operations of recovering
+and fencing hosts. The current status of a host is reported when querying a
+specific host.
+
+Periodic health investigations are done on HA-enabled hosts to monitor for
+any failures. Specific thresholds can be set for failed investigations;
+only when a threshold is exceeded will the host transition to a different
+state.
+
+Host-HA uses both health checks and activity checks to make decisions on
+recovering and fencing actions. Once it is determined that a host is in a
+faulty state (health checks failed), activity checks are run to figure out
+whether there is any disk activity on the VMs running on that host.
+
+The HA Resource Management Service manages the check/recovery cycle including
+periodic execution, concurrency management, persistence, back pressure and
+clustering operations. Administrators associate a provider with a partition
+type (e.g. KVM HA Host provider to clusters) and may override the provider on a
+per-partition (i.e. zone, cluster, or pod) basis. The service operates on all
+resources of the type supported by the provider contained in a partition.
+Administrators can also enable or disable HA operations globally or on a
+per-partition basis.
+
+Only one (1) HA provider per resource type may be specified for a partition.
+Nested HA providers by resource type are not supported (e.g. a pod
+specifying an HA resource provider for hosts and a containing cluster
+specifying an HA resource provider for hosts). The service is designed to be
+opt-in, whereby only resources with a defined provider and HA enabled will be
+managed.
+
+For each resource in an HA partition, the HA Resource Management Service
+maintains and persists a "Finite State Machine" composed of the following
+states:
+
+- AVAILABLE - The feature is enabled and Host-HA is available.
+- SUSPECT - There are health checks failing with the host.
+- CHECKING - Activity checks are being performed.
+- DEGRADED - The host is passing the activity check ratio and still providing
+  service to the end user, but it cannot be managed from the CloudStack
+  management server.
+- RECOVERING - The Host-HA framework is trying to recover the host by issuing
+  OOBM jobs.
+- RECOVERED - The Host-HA framework has recovered the host successfully.
+- FENCING - The Host-HA framework is trying to fence the host by issuing OOBM
+  jobs.
+- FENCED - The Host-HA framework has fenced the host successfully.
+- DISABLED - The feature is disabled for the host.
+- INELIGIBLE - The feature is enabled, but it cannot be managed successfully by
+  the Host-HA framework. (OOBM is possibly not configured properly.)
+
+When HA is enabled for a partition, the HA state of all contained resources
+will be transitioned from DISABLED to AVAILABLE. Based on the state models, the
+following failure scenarios and their responses will be handled by the HA
+resource management service:
+
+- Activity check operation fails on the resource: The activity check protocol
+  provides a semantic to express that an error occurred while performing the
+  activity check, together with a reason for the failure (e.g. unable to
+  access the NFS mount). If the maximum number of activity check attempts has
+  not been exceeded, the activity check will be retried.
+
+- Slow activity check operation: After a configurable timeout, the HA resource
+  management service abandons the check. The response to this condition is
+  the same as a failure to recover the resource.
+
+- Traffic flood due to a large number of resource recoveries: The HA resource
+  management service must limit the number of concurrent recovery operations
+  permitted to avoid overwhelming the management server with resource status
+  updates as recovery operations complete.
+
+- Processor/memory starvation due to a large number of activity check
+  operations: The HA resource management service must limit the number of
+  concurrent activity check operations permitted per management server to
+  prevent checks from starving other management server activities of scarce
+  processor and/or memory resources.
+
+- A SUSPECT, CHECKING, or RECOVERING resource passes a health check before the
+  state action completes: The HA resource management service refreshes the HA
+  state of the resource before transition. If it does not match the expected
+  current state, the result of the state action is ignored.
+
+For further information about the inner workings of Host-HA, refer
+to the design document at
+`https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
+<https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA>`_.
+
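
As an illustrative sketch, Host-HA can be driven through the Host-HA API
family; the command and provider names below are assumptions from memory,
so verify them against your CloudStack release:

.. code:: bash

   # List available Host-HA providers for the KVM hypervisor
   cmk list hosthaproviders hypervisor=KVM
   # Associate a provider with a host, then enable HA for it
   cmk configure haforhost hostid=<host-uuid> provider=kvmhaprovider
   cmk enable haforhost hostid=<host-uuid>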
 Primary Storage Outage and Data Loss
 ------------------------------------
 