© 2022 Yandex.Cloud LLC

Auto-healing

Written by
Yandex Cloud
  • Types of health checks
    • Instance operability check
    • Application health check on the instance
  • Auto-healing specifics
    • Auto-healing and deployment policies
    • Changing instance status during auto-healing
    • Healing while updating instance configurations
    • Healing on instance group size change
    • Auto-healing preemptible instances

Instance Groups runs regular health checks on the instances in your group. If an instance has stopped or an app is taking too long to respond, Instance Groups tries to heal the instance: it either restarts it or creates a new one, depending on the deployment policy.

Note

If processes in an instance group are paused (the group status is PAUSED), its instances aren't healed.

Types of health checks

To ensure auto-healing, Instance Groups performs two types of health checks:

  • Instance operability check.
  • Application health check on the instance.

Don't confuse these checks with the network load balancer health check, which doesn't auto-heal the instance. It only affects the deployment process: when an instance switches to the OPENING_TRAFFIC status at startup, Instance Groups waits until the instance status in the load balancer is HEALTHY. After that, Instance Groups stops monitoring the instance status in the load balancer.

Instance operability check

Instance Groups checks the instance status in Compute Cloud every few seconds. If the instance has stopped or an error has occurred (the STOPPED, ERROR, or CRASHED status), Instance Groups tries to restart the instance or create a new one, provided that the deployment policy allows this.
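As a rough illustration of this check, the sketch below decides whether an instance is a healing candidate from its Compute Cloud status. The helper name `needs_healing` and the constant are hypothetical and not part of any Yandex Cloud SDK; only the three status names come from the text above.

```python
# Statuses that the text above lists as triggering auto-healing.
UNHEALTHY_STATUSES = frozenset({"STOPPED", "ERROR", "CRASHED"})


def needs_healing(status: str) -> bool:
    """Return True if an instance with this status is a healing candidate.

    Hypothetical helper for illustration only; whether the instance is
    actually restarted or re-created still depends on the deployment policy.
    """
    return status.upper() in UNHEALTHY_STATUSES
```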

Application health check on the instance

This check detects whether the app running on your instance has frozen, terminated, or is taking too long to respond. You can enable the application health check when creating or editing an instance group.

If this check is enabled, Instance Groups polls the application status on the instance at preset intervals while the instance group is in the ACTIVE status.

Recommendations for instance groups with a load balancer

If you created an instance group with a network load balancer, use less strict settings for checks in Instance Groups than for the load balancer health checks. The load balancer distributes the load across app instances, while Instance Groups only monitors whether the app is working.

For example, if you set a 1-second response timeout in the network load balancer, set a 30-second timeout in Instance Groups. If the application doesn't respond for 3-5 seconds, it might not be able to handle the current traffic. On the other hand, if the application doesn't respond for more than 30 seconds, it probably isn't working at all and you need to heal your instance.
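The reasoning behind the two timeouts can be sketched as a simple classification of response delays. This is a simplified model, assuming a single probe and the example values of 1 and 30 seconds; the function and constant names are hypothetical, and real health checks may also involve intervals and thresholds that are not modeled here.

```python
LB_TIMEOUT_S = 1.0   # example load balancer response timeout from the text
IG_TIMEOUT_S = 30.0  # example Instance Groups response timeout from the text


def classify_response(delay_s: float) -> str:
    """Classify one application response delay (hypothetical helper)."""
    if delay_s <= LB_TIMEOUT_S:
        return "serving"      # passes both checks
    if delay_s <= IG_TIMEOUT_S:
        return "overloaded"   # fails only the stricter load balancer check
    return "heal"             # fails the Instance Groups check as well
```

With these values, a 4-second delay is treated as overload rather than as a reason to heal the instance.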

Auto-healing specifics

Auto-healing and deployment policies

To auto-heal instances, Instance Groups may restart instances or create new ones. The healing method is set in the deployment policies.

  • Creating new instances
    Instance Groups creates new instances to replace any that fail their health checks, provided the deployment policy permits expanding the group beyond its target size. Use the max_expansion parameter to set the maximum number of instances by which the group can exceed its target size. Acceptable values: from 0 to 100. With this method, Instance Groups first creates a new instance, waits until it passes all the checks, and then undeploys the instance that failed the check.

  • Restarting an instance
    Instance Groups will restart instances that failed their health check if the deployment policy permits reducing the target group size. You can use the max_unavailable parameter to set the maximum number of instances that can be made unavailable at the same time. Acceptable values: from 0 to 100. Instance Groups will try not to exceed this value during auto-healing.

    This restriction does not apply to instances with the statuses CRASHED, ERROR, and STOPPED, because in these cases the instance is already unavailable and must be restarted immediately.

If you set both max_expansion and max_unavailable, Instance Groups will use both auto-healing methods.

For example, suppose you set max_expansion = 1 and max_unavailable = 1. When one instance fails the check, Instance Groups starts restarting that instance and creating a new one at the same time. The instance that passes all the checks keeps running, and the other one is undeployed.
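The relationship between the two parameters and the healing methods can be summarized in a small sketch. It assumes that a method is available whenever its parameter is greater than 0, which is an interpretation of the text above rather than a documented rule; `healing_methods` is a hypothetical name.

```python
from typing import List


def healing_methods(max_expansion: int, max_unavailable: int) -> List[str]:
    """List the healing methods a deployment policy permits (illustrative).

    Assumption: max_unavailable > 0 permits restarting an instance and
    max_expansion > 0 permits creating a replacement instance.
    """
    methods = []
    if max_unavailable > 0:
        methods.append("restart")
    if max_expansion > 0:
        methods.append("create")
    return methods
```

For the example above, `healing_methods(1, 1)` returns both methods, so the restart and the creation of a new instance proceed in parallel.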

To limit the speed of auto-healing and deployment, you can also set:

  • The maximum number of instances that are deployed at the same time, using the max_creating parameter. This counts instances that are being created or started, that is, those with the CREATING and STARTING statuses.

    Acceptable values: from 0 to 100. A value of 0 means any number of instances within the allowed range.

  • The maximum number of instances that are undeployed at the same time, using the max_deleting parameter. This counts instances that are being stopped (the STOPPING status), since Instance Groups always stops instances before undeploying them.

    Acceptable values: from 0 to 100. A value of 0 means any number of instances within the allowed range.
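The special meaning of 0 for max_creating and max_deleting can be sketched as follows. The function is a hypothetical helper and reads "any number of instances" as "no additional cap", which is an assumption about the behavior.

```python
def effective_limit(limit: int, pending: int) -> int:
    """How many of `pending` operations may run at once under a
    max_creating- or max_deleting-style limit (illustrative only).

    Assumption: a limit of 0 imposes no cap on the number of operations.
    """
    if not 0 <= limit <= 100:
        raise ValueError("limit must be between 0 and 100")
    if pending < 0:
        raise ValueError("pending must not be negative")
    return pending if limit == 0 else min(limit, pending)
```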

Changing instance status during auto-healing

Instance Groups won't try to heal an instance if it is no longer needed.

For example, if all 10 instances in a group of 10 are unavailable and max_unavailable = 3, Instance Groups restarts the first three instances. If the remaining seven instances become operable again in the meantime, Instance Groups won't restart them.

If max_expansion = 3, Instance Groups starts creating three new instances. The old instances are not deleted until the new ones are created. If all instances of the instance group become operable again during the creation process, Instance Groups cancels the creation of new instances.
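The restart example above can be modeled with a short sketch: each healing pass re-reads the current health of the group and picks at most max_unavailable instances that are still failing. The function name, the instance names, and the dictionary representation are hypothetical, and the model deliberately ignores every other policy setting.

```python
from typing import Dict, List


def next_restart_batch(healthy: Dict[str, bool], max_unavailable: int) -> List[str]:
    """Pick up to max_unavailable instances that are still failing.

    `healthy` maps a (hypothetical) instance name to its latest check result.
    """
    failing = [name for name, ok in healthy.items() if not ok]
    return failing[:max_unavailable]


# All 10 instances fail at first, so only three are restarted.
group = {f"vm-{i}": False for i in range(10)}
first_batch = next_restart_batch(group, max_unavailable=3)

# If every instance has recovered by the next pass, nothing else is restarted.
for name in group:
    group[name] = True
second_batch = next_restart_batch(group, max_unavailable=3)
```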

Healing while updating instance configurations

Instance healing has a higher priority than instance configuration update.

Let's say you have a group of 100 instances and max_unavailable = 1. When you update the instance configuration in an instance group, Instance Groups will restart the instances one by one, updating their configuration.

If one of the instances fails the application health check at that point, Instance Groups makes it the first one in the restart queue.
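One way to picture this priority is as a stable reordering of the restart queue, with instances that failed the check moved to the front. This is only a sketch of the idea; the function and instance names are hypothetical, and the real scheduling logic is not described in this article.

```python
from typing import Iterable, List, Set


def restart_order(queue: Iterable[str], failed: Set[str]) -> List[str]:
    """Move failed instances to the front, keeping the relative order otherwise.

    sorted() is stable and False sorts before True, so instances in `failed`
    come first without reshuffling the rest of the queue.
    """
    return sorted(queue, key=lambda name: name not in failed)
```

For a queue of `["a", "b", "c", "d"]` where `c` fails its health check, the resulting order is `["c", "a", "b", "d"]`.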

Healing on instance group size change

When the target size of the instance group is reduced, the instances that failed the check are deleted first (if any).

If you increase the target size of the instance group, new instances are created in parallel with replacements for the instances that failed the check, provided that max_creating and max_expansion allow this.

Let's say 2 out of 4 instances in an instance group failed the application health check. At that point, the target size of the instance group is increased to 6 instances. You now have two instances to create and another two to heal.

If max_expansion = 1 and max_creating is not set, then Instance Groups will start creating three instances in parallel: two under the instance group expansion, and one under the auto-healing process.
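The arithmetic in this example can be written out as a small model. It assumes that the instances needed for group expansion are not limited by max_expansion, that healing replacements are, and that max_creating = 0 stands for "not set"; these assumptions match the example above but are not a documented algorithm, and the function name is hypothetical.

```python
def parallel_creations(current: int, target: int, failed: int,
                       max_expansion: int, max_creating: int = 0) -> int:
    """Estimate how many instances start being created at once (illustrative)."""
    for_expansion = max(target - current, 0)   # new instances for the larger group
    for_healing = min(failed, max_expansion)   # replacements allowed above target
    total = for_expansion + for_healing
    # Assumption: max_creating == 0 means no cap on simultaneous creations.
    return total if max_creating == 0 else min(total, max_creating)


# 4 instances, 2 of them failed, target raised to 6, max_expansion = 1.
created = parallel_creations(current=4, target=6, failed=2, max_expansion=1)
```

Here `created` is 3: two instances for the expansion and one for auto-healing.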

Auto-healing preemptible instances

Preemptible instances can only be auto-healed if the computing resources in the availability zone allow for this. If the resources are insufficient, Instance Groups will resume auto-healing as soon as the resources become available, but this may take a long time.

Preemptible VMs are terminated no later than 24 hours after their launch. Because of this, there is a risk that the entire instance group restarts at the same time and stops handling the load of running applications. To avoid this, Instance Groups stops preemptible instances in the group not exactly after 24 hours, but after a random interval from 22 to 24 hours.
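The effect of the random interval can be illustrated with a short sketch. The uniform distribution is an assumption, since the article only states the 22 to 24 hour range, and the function name is hypothetical.

```python
import random


def preemptible_stop_delay_hours(rng: random.Random) -> float:
    """Draw a stop delay between 22 and 24 hours (uniform by assumption)."""
    return rng.uniform(22.0, 24.0)


# Each instance gets its own delay, so the group does not stop all at once.
rng = random.Random()
delays = [preemptible_stop_delay_hours(rng) for _ in range(5)]
```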

See also

  • Configuring application health check on the VM.
