Auto-healing

Instance Groups regularly runs health checks on the instances in your instance group. If an instance has stopped or an application is taking too long to respond, Instance Groups tries to recover the instance: it either restarts the instance or creates a new one, depending on the deployment policy.

Types of health checks

For automatic recovery, Instance Groups performs two types of health checks:

  • Instance operability check
  • Application health check on the instance

Don't confuse these checks with the load balancer health check, which doesn't trigger automatic instance recovery. It only affects the deployment process: when an instance switches to the OPENING_TRAFFIC status at startup, Instance Groups waits until the instance status in the load balancer is HEALTHY. After that, Instance Groups stops monitoring the instance status in the load balancer.

Instance operability check

Instance Groups checks the instance status in Compute Cloud every few seconds. If the instance has stopped or an error has occurred (the STOPPED, ERROR, or CRASHED status), Instance Groups tries to restart the instance or create a new one, if the deployment policy allows it.
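
The status rule above can be written as a small predicate. This is only an illustrative Python sketch, not the service's implementation: the name needs_recovery is hypothetical, and only the three status names come from the text.

```python
# Statuses that the text lists as triggering recovery.
FAILED_STATUSES = frozenset({"STOPPED", "ERROR", "CRASHED"})


def needs_recovery(status: str) -> bool:
    """Return True if an instance in this status should be recovered."""
    return status in FAILED_STATUSES
```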

Application health check on the instance

This check detects whether the application running on your instance has frozen, terminated, or is taking too long to respond. You can enable the application health check when creating or modifying an instance group.

If you enabled this check, Instance Groups periodically queries the status of the application on the instance for as long as the instance group is in the ACTIVE status.

Recommendations for instance groups with a load balancer

If you created a group with a load balancer, then use less strict settings for checks in Instance Groups than for the load balancer health checks. The load balancer allocates the load on the app, while Instance Groups only monitors the app performance.

For example, if you set a 1-second response timeout in the load balancer, set 30 seconds in Instance Groups. If the application doesn't respond for 3 to 5 seconds, it might not be able to handle the current traffic. On the other hand, if the application doesn't respond for more than 30 seconds, it probably isn't working at all and the instance needs to be recovered.
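
The difference between the two timeouts can be sketched as follows. This is a minimal Python illustration that assumes the example values of 1 and 30 seconds; the function name and the outcome labels are hypothetical and are not part of any Instance Groups API.

```python
def classify_unresponsive_app(seconds_without_response: float,
                              lb_timeout: float = 1.0,
                              ig_timeout: float = 30.0) -> str:
    """Map how long the app has been unresponsive to a hypothetical outcome label."""
    if seconds_without_response > ig_timeout:
        # Fails the Instance Groups check too: the instance gets recovered.
        return "recover"
    if seconds_without_response > lb_timeout:
        # Fails only the stricter load balancer check: overloaded, not dead.
        return "overloaded"
    return "healthy"
```

With these defaults, an application that has been silent for 4 seconds is treated as overloaded but is left running, while one that has been silent for 45 seconds is recovered.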

Automatic recovery process

Recovery process and deployment policies

For the purpose of instance recovery, Instance Groups can restart and create new instances, depending on the deployment policy settings:

  • For Instance Groups to create new instances to replace those that failed the health check, set max_expansion — the maximum number of instances you can add to the target size of the instance group.

    Instance Groups will first create a new instance, wait until it passes all the checks, and then delete the instance that failed the check.

  • For Instance Groups to restart the instances that failed the check, set max_unavailable, the maximum number of instances that can be unavailable at the same time. Instance Groups tries not to exceed this limit during automatic recovery.

    This restriction does not apply to the instances in the statuses STOPPED, ERROR, and CRASHED, because they imply that the instance is already unavailable and must be restarted immediately.

  • For Instance Groups to employ all the recovery methods in parallel, set both max_expansion and max_unavailable.

    Let's say you specified max_unavailable = 1 and max_expansion = 1. When one instance fails the check, Instance Groups will begin restarting this instance and creating a new one in parallel. The instance that passes all the checks successfully will continue running and the other one will be deleted.

  • To limit the speed of recovery and deployment you can also use the following parameters:

    • max_creating — Limits the number of instances that are deployed at the same time, meaning the created and started instances in the statuses CREATING and STARTING.
    • max_deleting — Limits the number of instances undeployed at the same time, meaning instances in the STOPPING status. When deleting an instance, Instance Groups always stops it first, hence this status is used to limit the workload.
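
The way the two main parameters select a recovery method can be modeled as below. This is a hedged sketch, not the service's logic: the DeployPolicy class and the recovery_methods function are hypothetical, it assumes that a value of 0 means the parameter is not set, and it covers only instances that failed the application health check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeployPolicy:
    """Hypothetical model of the two deployment policy parameters."""
    max_expansion: int = 0    # instances that may be added above the target size
    max_unavailable: int = 0  # instances that may be unavailable at the same time


def recovery_methods(policy: DeployPolicy) -> list[str]:
    """List the recovery methods that the policy makes available."""
    methods = []
    if policy.max_unavailable > 0:
        methods.append("restart")
    if policy.max_expansion > 0:
        methods.append("create_replacement")
    return methods
```

With max_unavailable = 1 and max_expansion = 1, both methods are returned, which corresponds to the parallel restart and replacement described in the list.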

Instance status changes during recovery

Instance Groups will not restore the instance if it is no longer needed.

For example, if all 10 instances in an instance group of 10 are unavailable and max_unavailable = 3, Instance Groups restarts the first three instances. If the remaining seven instances become operable again in the meantime, Instance Groups won't restart them.

If max_expansion = 3, Instance Groups starts creating three new instances. The old instances are not deleted until the new ones are created. If all instances of the instance group become operable again during the creation process, Instance Groups cancels the creation of new instances.
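
The max_unavailable = 3 case in this section can be traced with a short sketch. It is an illustration only: next_restart_batch and the vm-N names are hypothetical, and the model assumes the instances failed the application health check.

```python
def next_restart_batch(failed: list[str], max_unavailable: int) -> list[str]:
    """Pick the instances to restart now, never more than max_unavailable."""
    return failed[:max_unavailable]


failed = [f"vm-{i}" for i in range(1, 11)]  # all 10 instances fail the check
first_batch = next_restart_batch(failed, max_unavailable=3)

# The other seven become operable on their own, so nothing is left to restart.
still_failed: list[str] = []
second_batch = next_restart_batch(still_failed, max_unavailable=3)
```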

Recovery while updating instance configurations

Instance recovery has a higher priority than instance configuration update.

Let's say you have a group of 100 instances and max_unavailable = 1. When you update the instance configuration in an instance group, Instance Groups will restart the instances one by one, updating their configuration.

If one of the instances fails the application health check at that point, Instance Groups makes it the first one in the restart queue.
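
The priority rule can be pictured as a reordering of the restart queue. The sketch below is a simplified model with hypothetical names (restart_queue, vm-1 and so on); it only shows that failed instances move ahead of healthy ones that are waiting for a configuration update.

```python
def restart_queue(instances: list[tuple[str, bool]]) -> list[str]:
    """Order instances for restart: those that failed the health check go first.

    Each item is (instance_name, passed_health_check). sorted() is stable,
    so instances with the same check result keep their original order.
    """
    ordered = sorted(instances, key=lambda item: item[1])  # False sorts before True
    return [name for name, _passed in ordered]
```

For example, restart_queue([("vm-1", True), ("vm-2", False), ("vm-3", True)]) places vm-2 at the head of the queue.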

Recovery on instance group size change

When the target size of the instance group is reduced, the instances that failed the check are deleted first (if any).

If you increase the target size of the instance group, new instances are created in parallel with the recovery of instances that failed the check, provided that max_creating and max_expansion allow it.

Let's say 2 out of 4 instances in an instance group failed the application health check. At the same time, the target size of the instance group is increased to 6 instances. You have two instances to create and another two to recover.

If max_expansion = 1 and max_creating is not set, then Instance Groups will start creating three instances in parallel: two under the instance group expansion, and one under the recovery process.
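
The arithmetic behind this example can be checked with a short sketch. It is one reading of the example, not a statement of the service's exact algorithm: the function name is hypothetical, max_expansion is assumed to cap only the replacements created for failed instances, and max_creating is assumed to cap the total number of instances created at once.

```python
from typing import Optional


def instances_created_in_parallel(current_size: int,
                                  target_size: int,
                                  failed: int,
                                  max_expansion: int,
                                  max_creating: Optional[int] = None) -> int:
    """Estimate how many instances start being created at the same time."""
    growth = max(target_size - current_size, 0)  # instances added to reach the new size
    replacements = min(failed, max_expansion)    # new instances created for recovery
    total = growth + replacements
    if max_creating is not None:
        total = min(total, max_creating)         # optional cap on parallel creation
    return total
```

For the case above, instances_created_in_parallel(4, 6, failed=2, max_expansion=1) gives 3: two instances for the expansion and one for recovery.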

Auto-healing preemptible instances

Preemptible instances can only be auto-healed if the computing resources in the availability zone allow for this. If the resources are insufficient, Instance Groups will resume auto-healing as soon as the resources become available, but this may take a long time.
