still available but the system VMs will not be able to contact the
management server.


Multiple Management Servers Support on agents
---------------------------------------------

In a CloudStack environment with multiple management servers, an agent can be
configured, based on an algorithm, to choose which management server to
connect to. This can be useful as an internal load balancer or for high
availability. An administrator is responsible for setting the list of
management servers and for choosing a sorting algorithm using global settings.
The management server is responsible for propagating these settings to the
connected agents (running inside the Secondary Storage Virtual Machine and
the Console Proxy Virtual Machine, or on the KVM hosts).

The three global settings that need to be configured are the following:

- host: a comma-separated list of management server IP addresses
- indirect.agent.lb.algorithm: the algorithm for the indirect agent load
  balancing
- indirect.agent.lb.check.interval: the preferred host check interval for the
  agent's background task that checks, and if needed switches to, the agent's
  preferred host

These settings can be configured from the global settings page in the UI or
by using the updateConfiguration API call.

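For example, using the CloudMonkey CLI, the settings could be applied along
these lines; the management server addresses and the interval value are
placeholders, and the same changes can be made from the UI::

   update configuration name=host value=10.0.32.11,10.0.32.12,10.0.32.13
   update configuration name=indirect.agent.lb.algorithm value=roundrobin
   update configuration name=indirect.agent.lb.check.interval value=300
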
The indirect.agent.lb.algorithm setting supports the following algorithm
options:

- static: Use the list of management server IP addresses as provided.
- roundrobin: Evenly spread the hosts across the management servers, based on
  each host's id.
- shuffle: Pseudo-randomly sort the list (this is not recommended for
  production).

.. note::
   The 'static' and 'roundrobin' algorithms strictly enforce the order of the
   management server list, whereas the 'shuffle' algorithm only checks the
   content of the comma-separated management server addresses and not their
   order.

Changes to the `indirect.agent.lb.algorithm` and `host` global settings do
not require a restart of the management server(s) or the agents. A change in
these global settings is propagated to all connected agents.

The comma-separated management server list is propagated to agents in the
following cases:

- An agent is added (including the SSVM and CPVM system VMs).
- An agent connects or reconnects to a management server.
- An administrator changes the 'host' and/or the 'indirect.agent.lb.algorithm'
  global settings.

On the agent side, the 'host' setting is saved in the agent's properties file
as: `host=<comma separated addresses>@<algorithm name>`.

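For example, with three hypothetical management server addresses and the
'static' algorithm, the line stored in the agent's `agent.properties` file
would look like::

   host=10.0.32.11,10.0.32.12,10.0.32.13@static
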
From the agent's perspective, the first address in the propagated list is
considered the preferred host. A background task that checks, and if needed
switches back to, the preferred host can be activated by configuring
`indirect.agent.lb.check.interval`, which is a cluster-level global setting
in CloudStack. Administrators can also override this per agent by configuring
'host.lb.check.interval' in the `agent.properties` file.

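For instance, a single agent could be given its own check interval by adding
a line such as the following to its `agent.properties` file (the value shown
is purely illustrative)::

   host.lb.check.interval=120
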
When an agent receives a host list and algorithm combination, the
host-specific background check interval is also sent, and the background task
is dynamically reconfigured without the need to restart the agent.

To make things clearer, consider this example: suppose an environment with
three management servers, A, B and C, and three KVM agents.

With 'host' = 'A,B,C', the agents will receive lists depending on the
'indirect.agent.lb.algorithm' value:

- 'static': each agent receives the list 'A,B,C'.
- 'roundrobin': the first agent receives 'A,B,C', the second agent receives
  'B,C,A', and the third agent receives 'C,B,A'.
- 'shuffle': each agent receives the list in a random order.

HA-Enabled Virtual Machines
---------------------------

The user can specify a virtual machine as HA-enabled. By default, all
virtual router VMs and Elastic Load Balancing VMs are automatically
configured as HA-enabled. When an HA-enabled VM crashes, CloudStack
detects the crash and restarts the VM automatically within the same
Availability Zone. HA is never performed across different Availability
Zones. CloudStack has a conservative policy towards restarting VMs and
ensures that there will never be two instances of the same VM running at
the same time. The Management Server attempts to start the VM on another
Host in the same cluster.

HA features work with iSCSI or NFS primary storage. HA with local
storage is not supported.


Dedicated HA Hosts
------------------

One or more hosts can be designated for use only by HA-enabled VMs that
are restarting due to a host failure. Setting up a pool of such

a crash.


HA-Enabled Hosts
----------------

The user can specify a host as HA-enabled. In the event of a host failure,
attempts will be made to recover the failed host by first issuing out-of-band
management (OOBM) commands. If host recovery fails, the host will be fenced
and placed into maintenance mode. Manual intervention is then required to
restore the host to normal operation.

Out-of-band management is a requirement for HA-enabled hosts and has to be
configured on all of the intended participating hosts
(see `“Out of band management” <hosts.html#out-of-band-management>`_).

Host-HA can be configured at a granular host, cluster or zone level. In a
large environment, some hosts in a cluster can be HA-enabled while others
are not.

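As a sketch, assuming the Host-HA API calls configureHAForHost and
enableHAForHost that accompany this feature, a single KVM host could be
HA-enabled from CloudMonkey roughly as follows; the host UUID and the
provider name are placeholders::

   configure haforhost hostid=<host-uuid> provider=kvmhaprovider
   enable haforhost hostid=<host-uuid>
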
Host-HA uses a state machine design to manage the operations of recovering
and fencing hosts. The current HA state of a host is reported when querying
that specific host.

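For example, assuming the listHostHAResources API call exposed by the Host-HA
framework, the current HA state of a host could be inspected from CloudMonkey
(the UUID is a placeholder)::

   list hostharesources hostid=<host-uuid>
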
Periodic health investigations are performed on HA-enabled hosts to monitor
for failures. Thresholds can be set for failed investigations; only when a
threshold is exceeded will the host transition to a different state.

Host-HA uses both health checks and activity checks to make decisions about
recovery and fencing actions. Once a host is determined to be in a faulty
state (its health checks have failed), activity checks are run to find out
whether there is any disk activity from the VMs running on that host.

The HA Resource Management Service manages the check/recovery cycle, including
periodic execution, concurrency management, persistence, back pressure and
clustering operations. Administrators associate a provider with a partition
type (e.g. the KVM HA Host provider with clusters) and may override the
provider on a per-partition (i.e. zone, cluster, or pod) basis. The service
operates on all resources of the type supported by the provider contained in
a partition. Administrators can also enable or disable HA operations globally
or on a per-partition basis.

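For instance, assuming the cluster-level counterparts of these calls
(enableHAForCluster and disableHAForCluster), HA operations could be toggled
for an entire cluster from CloudMonkey; the cluster UUID is a placeholder::

   enable haforcluster clusterid=<cluster-uuid>
   disable haforcluster clusterid=<cluster-uuid>
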
Only one HA provider per resource type may be specified for a partition.
Nesting HA providers by resource type is not supported (e.g. a pod specifying
an HA resource provider for hosts while a containing cluster specifies another
HA resource provider for hosts). The service is designed to be opt-in, whereby
only resources with a defined provider and with HA enabled will be managed.

For each resource in an HA partition, the HA Resource Management Service
maintains and persists a finite state machine composed of the following
states:

- AVAILABLE - The feature is enabled and Host-HA is available.
- SUSPECT - Health checks are failing for the host.
- CHECKING - Activity checks are being performed.
- DEGRADED - The host is passing the activity check ratio and is still
  providing service to the end user, but it cannot be managed from the
  CloudStack management server.
- RECOVERING - The Host-HA framework is trying to recover the host by issuing
  OOBM jobs.
- RECOVERED - The Host-HA framework has recovered the host successfully.
- FENCING - The Host-HA framework is trying to fence the host by issuing OOBM
  jobs.
- FENCED - The Host-HA framework has fenced the host successfully.
- DISABLED - The feature is disabled for the host.
- INELIGIBLE - The feature is enabled, but the host cannot be managed
  successfully by the Host-HA framework (for example, OOBM is not configured
  properly).

When HA is enabled for a partition, the HA state of all contained resources
is transitioned from DISABLED to AVAILABLE. Based on the state model, the
following failure scenarios and their responses are handled by the HA
resource management service:

- Activity check operation fails on the resource: the activity check protocol
  provides a way to express that an error occurred while performing the
  activity check, together with a reason for the failure (e.g. unable to
  access the NFS mount). If the maximum number of activity check attempts has
  not been exceeded, the activity check will be retried.

- Slow activity check operation: after a configurable timeout, the HA resource
  management service abandons the check. The response to this condition is
  the same as a failure to recover the resource.

- Traffic flood due to a large number of resource recoveries: the HA resource
  management service limits the number of concurrent recovery operations
  permitted, to avoid overwhelming the management server with resource status
  updates as recovery operations complete.

- Processor/memory starvation due to a large number of activity check
  operations: the HA resource management service limits the number of
  concurrent activity check operations permitted per management server, to
  prevent checks from starving other management server activities of scarce
  processor and/or memory resources.

- A SUSPECT, CHECKING, or RECOVERING resource passes a health check before the
  state action completes: the HA resource management service refreshes the HA
  state of the resource before the transition. If it does not match the
  expected current state, the result of the state action is ignored.

For further information about the inner workings of Host-HA, refer to the
`Host HA design document
<https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA>`_.


Primary Storage Outage and Data Loss
------------------------------------

131289
0 commit comments