Commit 70277be

Authored by Anurag Awasthi (anuragaw) with co-author andrijapanicsb
[WIP] Added health checks section for virtual routers (#89)
* [WIP] updates to service monitoring script
* Added health checks section
* Some formatting
* Some formatting
* Some more formatting
* Missing space for numbering
* Update systemvm.rst

Co-authored-by: Andrija Panic <45762285+andrijapanicsb@users.noreply.github.com>
Co-authored-by: Anurag Awasthi <anurag.awasthi@shapeblue.com>
1 parent b80aadd commit 70277be

1 file changed

+209
-12
lines changed

source/adminguide/systemvm.rst

Lines changed: 209 additions & 12 deletions
@@ -192,7 +192,7 @@ Using a SSL Certificate for the Console Proxy
192192

193193
By default, the console viewing functionality uses plaintext HTTP. In
194194
any production environment, the console proxy connection should be
195-
encrypted via SSL at the mininum.
195+
encrypted via SSL at the minimum.
196196

197197
A CloudStack administrator has 2 ways to secure the console proxy
198198
communication with SSL:
@@ -241,7 +241,7 @@ proxy domain, SSL certificate, and private key:
241241
242242
openssl req -new -key yourprivate.key -out yourcertificate.csr
243243
244-
#. Head to the website of your favorite trusted Certificate
244+
#. Head to the website of your favourite trusted Certificate
245245
Authority, purchase an SSL certificate, and submit the CSR. You
246246
should receive a valid certificate in return
247247

@@ -304,11 +304,11 @@ If you still have problems and folowing errors in management.log while destroyin
304304
- Unable to build keystore for CPVMCertificate due to CertificateException
305305
- Cold not find and construct a valid SSL certificate
306306

307-
that means that still some of the Root/intermediate/server certificates or the key is not in a good format, or incorrectly encoded or multiply Root CA/Intemediate CA present in database by mistake.
307+
that means that still some of the Root/intermediate/server certificates or the key is not in a good format, or incorrectly encoded or multiply Root CA/Intermediate CA present in database by mistake.
308308

309309
Other way to renew Certificates (Root,Intermediates,Server certificates and key) - although not recommended
310310
unless you fill comfortable - is to directly edit the database,
311-
while still respect the main requirement that the private key is PKCS8 encoded, while Root CA, Intemediate and Server certificates
311+
while still respect the main requirement that the private key is PKCS8 encoded, while Root CA, Intermediate and Server certificates
312312
are still in default PEM format (no URL encoding needed here).
313313
After editing the database, please restart management server, and destroy SSVM and CPVM after that,
314314
so the new SSVM and CPVM with new certificates are created.
@@ -411,7 +411,7 @@ Service Monitoring Tool for Virtual Router
411411
Various services running on the CloudStack virtual routers can be
412412
monitored by using a Service Monitoring tool. The tool ensures that
413413
services are successfully running until CloudStack deliberately disables
414-
them. If a service goes down, the tool automatically restarts the
414+
them. If a service goes down, the tool automatically attempts to restart the
415415
service, and if that does not help bringing up the service, an alert as
416416
well as an event is generated indicating the failure. A new global
417417
parameter, ``network.router.enableservicemonitoring``, has been
@@ -430,7 +430,7 @@ an unexpected reason. For example:
430430
.. note::
431431
Only those services with daemons are monitored. The services that are
432432
failed due to errors in the service/daemon configuration file cannot
433-
be restarted by the Monitoring tool.
433+
be restarted by the Monitoring tool. VPC Networks are supported (as of CloudStack 4.14).
434434

435435
The following services are monitored in a VR:
436436

@@ -453,11 +453,212 @@ The following networks are supported:
453453
This feature is supported on the following hypervisors: XenServer,
454454
VMware, and KVM.
455455

456-
Log file /var/log/routerServiceMonitor.log contains the actions undertaken/attempted by the service monitoring script (i.e. trying to restart a stopped service).
456+
Log file /var/log/routerServiceMonitor.log contains the actions undertaken/attempted
457+
by the service monitoring script (i.e. trying to restart a stopped service).
457458

458-
As of CloudStack 4.14, the internval at which the service monitoring script runs is no more hardcoded to 3 minutes, but is instead controlled via global setting router.health.checks.basic.interval.
459+
As of CloudStack 4.14, the interval at which the service monitoring script runs
460+
is no longer hardcoded to 3 minutes, but is instead controlled via the
461+
global setting ``router.health.checks.basic.interval``.
459462

460463

464+
Health checks for Virtual Router
465+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
466+
467+
In addition to monitoring services, as of 4.14 CloudStack adds a framework
468+
for more extensive health checks. The health checks are split into two
469+
categories, basic and advanced, each with its own admin-definable
470+
interval. The split exists because the advanced health checks
471+
are considerably more expensive. The health checks are available on demand
472+
via the API as well as on a schedule.
473+
474+
The following tests are covered:
475+
- Basic connectivity from the management server to the virtual router
476+
477+
- Basic connectivity from the virtual router to its interfaces' gateways
478+
479+
- Free disk space on virtual router's disk
480+
481+
- CPU and memory usage
482+
483+
- Basic VR Sanity checks:
484+
485+
#. ssh/dnsmasq/haproxy/httpd services running
486+
487+
- Advanced VR Sanity checks:
488+
489+
#. DHCP/DNS configuration matches mgmt server DB
490+
491+
#. iptables rules match management server records
492+
493+
#. HAProxy config matches mgmt server DB records
494+
495+
#. VR Version against current version
496+
497+
498+
This happens in the following steps:
499+
500+
1. Management server periodically pushes data to each running virtual router
501+
including schedule intervals, tests to skip, some configuration for LB, VMs,
502+
Gateways, etc.
503+
504+
2. Basic and advanced tests are scheduled as per the intervals in the data sent
505+
by the management server. Each run of checks populates its results and saves them
506+
within the router at '/root/basic_monitor_results.json' and
507+
'/root/advance_monitor_results.json'. Each run of checks also keeps
508+
track of the start time, end time, and duration of the test run for better
509+
understanding.
510+
511+
3. Each test is also available on demand via the 'getRouterHealthCheckResults'
512+
API added with the patch. The API can be executed from the CLI and the UI. Performing
513+
fresh checks is expensive and causes the management server to do the following:
514+
515+
a. Refresh the data from Management server records on the router for
516+
verification (repeat of step 1),
517+
518+
b. Run all the checks of both basic and advanced type,
519+
520+
c. Fetch the result of the health check from router to be sent back in response.
521+
522+
4. The patch also supports custom health checks with custom systemVM templates.
523+
This is achieved as follows:
524+
525+
a. Each executable script placed in '/root/health_checks/' is considered an
526+
independent health check and is executed on each scheduled or on-demand health check run.
527+
528+
b. The health check script can be written in any language but must be executable (use 'chmod a+x')
529+
and placed within the '/root/health_checks/' directory. The script must do the following:
530+
531+
#. Accept a command line parameter for check type (basic or advanced) - this
532+
parameter is sent by the internal cron job in the VR (/etc/cron.d/process)
533+
534+
#. Proceed and perform checks as per the check type - basic or advanced
535+
536+
#. In order to be recognized as a health check and displayed in the list of health
537+
checks results, it must print some message to STDOUT which is passed back as message
538+
to management server - if the script doesn’t return anything on its STDOUT, it
539+
will not be registered as a health check/displayed in the list of the health check results
540+
541+
#. Exit with a status of 0 if the check was successful and with a status of 1 if
542+
the check has failed
543+
544+
.. code:: bash
545+
546+
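   #!/bin/bash
   # $1 carries the check type: 'basic' or 'advanced'
   if [ "$1" == "advanced" ]; then
       # do advanced checks and print any message to STDOUT
       echo "advanced checks done"
   elif [ "$1" == "basic" ]; then
       # do basic checks and print any message to STDOUT
       echo "basic checks done"
   fi
   # exit 0 if the checks pass, exit 1 on failure
   exit 0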
548+
549+
#. For example, if the script is intended to be a basic check, it must check
550+
for the presence of 'basic' as the first parameter sent to it, execute the
551+
wanted commands and print some output to STDOUT; if it instead receives 'advanced'
552+
as the first parameter, it should not execute any commands/logic nor print anything to STDOUT
553+
554+
5. There are 9 health check scripts in the default systemvm template, in the '/root/health_checks/'
555+
folder. These implement the health checks listed at the start of this section.
556+
557+
6. The management server will connect periodically to each virtual router to confirm that the
558+
checks are running as scheduled, and retrieve the results of those checks. Any failing checks
559+
present in ``router.health.checks.failures.to.recreate.vr`` will cause the VR to be recreated.
560+
On each fetch, the management server persists only the most recently executed check results in its database.
561+
562+
7. The UI parses the returned health check results and shows 'Failed' in the router
563+
'Health Check' column if there are health check failures of any type, and 'Passed' otherwise.
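The contract for custom health checks described in step 4 can be sketched in Python, the language of the default check scripts. This is only an illustrative sketch: the marker file path, the ``run_check`` function and what it checks are hypothetical and are not part of CloudStack. It is written as a basic check, so it stays silent when called with any other check type.

```python
#!/usr/bin/env python3
"""Hypothetical custom health check: verifies that a marker file exists."""
import os
import sys

# Assumption: this path is invented purely for illustration.
MARKER_FILE = "/tmp/custom_check_marker"


def run_check(check_type, marker=MARKER_FILE):
    """Return (message, exit_code) for the given check type.

    The sketch is meant to be a basic check, so it returns an empty
    message (nothing to print) for any other check type.
    """
    if check_type != "basic":
        return "", 0
    if os.path.exists(marker):
        return "marker file present", 0
    return "marker file missing: " + marker, 1


if __name__ == "__main__" and len(sys.argv) > 1:
    # The check type arrives as the first command line parameter.
    message, code = run_check(sys.argv[1])
    if message:
        # Printing to STDOUT is what gets the check listed in the results.
        print(message)
    sys.exit(code)
```

Saved under '/root/health_checks/' and made executable with 'chmod a+x', a script of this shape would print a message and exit with 0 or 1 when invoked with ``basic``, and print nothing when invoked with ``advanced``.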
564+
565+
The following global configs have been added for configuring health checks:
566+
567+
- ``router.health.checks.enabled`` - If true, router health checks are allowed
568+
to be executed and read. If false, all scheduled checks and API calls for on
569+
demand checks are disabled. Default is true.
570+
571+
- ``router.health.checks.basic.interval`` - Interval in minutes at which basic
572+
router health checks are performed. If set to 0, no tests are scheduled. Default
573+
is 3 mins as per the pre 4.14 monitor services.
574+
575+
- ``router.health.checks.advanced.interval`` - Interval in minutes at which
576+
advanced router health checks are performed. If set to 0, no tests are scheduled.
577+
Default value is 10 minutes.
578+
579+
- ``router.health.checks.config.refresh.interval`` - Interval in minutes at which
580+
router health checks config (such as scheduling intervals, excluded checks, etc.)
581+
is updated on virtual routers by the management server. This value should be
582+
sufficiently higher (for example 2x) than ``router.health.checks.basic.interval`` and
583+
``router.health.checks.advanced.interval`` so that there is time for new results
584+
to be generated from the passed data. Default is 10 mins.
585+
586+
- ``router.health.checks.results.fetch.interval`` - Interval in minutes at which
587+
router health check results are fetched by the management server. On each result fetch, the
588+
management server evaluates the need to recreate the VR as per the configuration of
589+
``router.health.checks.failures.to.recreate.vr``. This value should be sufficiently
590+
higher (for example 2x) than ``router.health.checks.basic.interval`` and
591+
``router.health.checks.advanced.interval`` so that there is time between new
592+
results generation and fetch.
593+
594+
- ``router.health.checks.failures.to.recreate.vr`` - Health check failures listed
595+
in this config are the ones that should cause router recreation. If empty,
596+
recreation is not attempted for any health check failure. Possible values are comma
597+
separated script names from the systemvm's '/root/health_checks/' folder (namely cpu_usage_check.py,
598+
dhcp_check.py, disk_space_check.py, dns_check.py, gateways_check.py, haproxy_check.py,
599+
iptables_check.py, memory_usage_check.py, router_version_check.py), connectivity.test
600+
or services (namely loadbalancing.service, webserver.service, dhcp.service).
601+
602+
- ``router.health.checks.to.exclude`` - Health checks that should be excluded when
603+
executing scheduled checks on the router. This can be a comma separated list of
604+
script names placed in the '/root/health_checks/' folder. Currently the following
605+
scripts are placed in default systemvm template - cpu_usage_check.py,
606+
disk_space_check.py, gateways_check.py, iptables_check.py, router_version_check.py,
607+
dhcp_check.py, dns_check.py, haproxy_check.py, memory_usage_check.py.
608+
609+
- ``router.health.checks.free.disk.space.threshold`` - Free disk space threshold
610+
(in MB) on VR below which the check is considered a failure. Default is 100MB.
611+
612+
- ``router.health.checks.max.cpu.usage.threshold`` - Max CPU Usage threshold as
613+
% above which check is considered a failure.
614+
615+
- ``router.health.checks.max.memory.usage.threshold`` - Max Memory Usage threshold
616+
as % above which check is considered a failure.
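How the two comma separated list settings could be interpreted is sketched below. The source does not show how the management server implements this, so the function names and the matching logic are assumptions made for illustration; only the setting semantics (an empty value means no recreation, and excluded scripts are skipped) come from the descriptions above.

```python
def parse_check_list(config_value):
    """Split a comma separated config value into a set of check names."""
    return {name.strip() for name in config_value.split(",") if name.strip()}


def should_recreate_router(failed_checks, failures_to_recreate):
    """Return True if any failed check is listed in the recreate setting.

    An empty setting never triggers recreation.
    """
    return bool(parse_check_list(failures_to_recreate) & set(failed_checks))


def checks_to_run(available_scripts, to_exclude):
    """Drop the excluded script names, keeping the original order."""
    excluded = parse_check_list(to_exclude)
    return [script for script in available_scripts if script not in excluded]
```

For example, with ``router.health.checks.failures.to.recreate.vr`` set to ``dhcp_check.py, dns_check.py``, a failing ``dns_check.py`` would match, while a failing ``cpu_usage_check.py`` would not.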
617+
618+
The scripts for the following health checks are provided in '/root/health_checks/'. They
619+
are not exhaustive and can be modified to cover other scenarios.
620+
Details of individual checks:
621+
622+
1. Basic checks:
623+
624+
a. Services check (ssh, dnsmasq, httpd, haproxy) – this check is still done as
625+
per the existing monitorServices.py script, and a restart is attempted for any
626+
service that is not running.
627+
628+
b. Disk space check against a threshold – Python's 'statvfs' module is used to
629+
retrieve statistics and compare with the configured threshold given by
630+
the management server.
631+
632+
c. CPU usage check against a threshold – the 'top' utility is used to retrieve idle
633+
CPU and compare that with the configured max CPU usage threshold given by the management
634+
server.
635+
636+
d. Memory usage check against a threshold – the 'free' utility is used to get the
637+
used memory and compare that with the configured max memory usage threshold.
638+
639+
e. Router template and scripts version check – is done by comparing the contents
640+
of the '/etc/cloudstack-release' and '/var/cache/cloud/cloud-scripts-signature'
641+
with the data given by management server.
642+
643+
f. Connectivity to the gateways from router – this is done by analysing the success
644+
or failure of ping to the gateway IPs given by management server.
645+
646+
2. Advanced checks:
647+
648+
a. DNS config match against MS – this is checked by comparing entries of '/etc/hosts'
649+
on the VR and VM records passed by management server.
650+
651+
b. DHCP config match against MS – this is checked by comparing entries of
652+
'/etc/dhcphosts.txt' on the VR with the VM entries passed by management server.
653+
654+
c. HA Proxy config match against MS (internal LB and public LB) - this is checked
655+
by verifying the max connections, and entries for each load balancing rule in the
656+
'/etc/haproxy/haproxy.cfg' file. We do not check for stickiness properties yet.
657+
658+
d. Port forwarding match against MS in iptables - this is checked by verifying
659+
IPs and ports in the 'iptables-save' command output against an expected list of
660+
entries from management server.
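The threshold comparisons behind the disk space and CPU usage checks can be sketched as follows. This is not the code shipped in the systemvm template: the function names, messages and the use of ``os.statvfs`` (rather than the 'statvfs' module named above) are assumptions, and the idle CPU percentage is taken as an input instead of being parsed from 'top'.

```python
import os


def free_space_mb(path="/"):
    """Free space available to unprivileged users on path, in MB."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize / (1024 * 1024)


def disk_space_check(threshold_mb, path="/", free_mb=None):
    """Fail (exit code 1) when free space is below the threshold in MB."""
    if free_mb is None:
        free_mb = free_space_mb(path)
    if free_mb < threshold_mb:
        return "Insufficient free space: %.0f MB" % free_mb, 1
    return "Sufficient free space: %.0f MB" % free_mb, 0


def cpu_usage_check(idle_percent, max_usage_percent):
    """Fail when usage (100 minus idle) is above the max usage threshold."""
    usage = 100.0 - idle_percent
    if usage > max_usage_percent:
        return "CPU usage %.1f%% above threshold" % usage, 1
    return "CPU usage %.1f%% within threshold" % usage, 0
```

With the documented default of 100 MB for ``router.health.checks.free.disk.space.threshold``, ``disk_space_check(100)`` would report a failure only when less than 100 MB is free on the root filesystem.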
661+
461662
462663
Enhanced Upgrade for Virtual Routers
463664
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -639,7 +840,3 @@ same Debian 9 based templates.
639840
Non-Alphanumeric characters (metacharacters) are not allowed for this parameter
640841
except for the “-“ and the “.”. Any metacharacter supplied will immediately result
641842
in an immediate termination of the command and report back to the operator that an illegal character was passed
642-
643-
644-
645-
