Executive Overview:

The site resilience features of Exchange 2013 enable an organization to deploy a messaging solution that is able to withstand a site-wide outage. Note that configuring Exchange for site resiliency is but a part of the greater redundancy and high-availability features provided in Exchange 2013. Site resiliency features can be classified under:

  • Storage/Database Architecture
  • Client Access Architecture
  • Transport/Routing Architecture

Notable Features:

  • Multi-site DAG configuration
  • Datacenter Activation Coordination (DAC)
  • Lagged Replication Copies
  • Single Global namespace (DNS)
  • Safety Net
  • Multi-site SSL

Architecture/Components:

Exchange 2013 has undergone significant architectural changes to enhance its site resilience capabilities. I’ve listed a number of multi-site-specific features in Exchange 2013 below:

  • Multi-site DAG: A Database Availability Group (DAG) is the unit of high availability and site resilience in Exchange 2013. A DAG comprises a number of Mailbox servers, which can be spread across multiple sites (with a round-trip latency of up to 500 ms between members), that host copies of mailbox databases. For each database, only one DAG member holds the active copy at any time; it sends updates to the passive copies in the DAG via a log-shipping mechanism. In the event of a server failure, a database can fail over to another member only while a majority of the voting members of the DAG are reachable, that is, while the cluster has quorum. The DAG relies on Windows Server failover clustering and utilizes a witness to act as tie-breaker when the vote would otherwise be even. The witness is a file share hosted on a Windows server, known as a File Share Witness.
  • Datacenter Activation Coordination: A DAG mode that prevents DAG members from automatically mounting databases after an outage that spans multiple datacenters. Without it, the outage might leave groups of Exchange Mailbox servers running independently in each datacenter, each activating the same database; the result is multiple active copies of the same database, known as ‘split-brain syndrome’.
  • Lagged Database Copies: A lagged database copy delays log replay so that it remains a preset time period behind the active copy (up to 14 days) and provides a recovery option in the event that the active copy encounters logical corruption. An organization can enhance the resiliency of its database solution by employing a combination of non-lagged and lagged database copies in a DAG.
  • Safety Net: When a user sends a message in Exchange 2013, the Mailbox Transport Submission service on a Mailbox server submits it to the Transport service for processing (routing and categorization). Safety Net is a queue that stores copies of messages that the server has successfully processed, for two days by default, so that they can be redelivered if a mailbox database fails over to a copy that has not yet received them. A Shadow Safety Net is a redundant copy of the Primary Safety Net, stored on another Mailbox server to provide further redundancy.
  • Single Global namespace (multiple Virtual IP (VIP)-to-name mappings): A client can now receive multiple IP addresses from DNS for a given FQDN, all of which can be used reliably to connect to a service. Since almost all client access in Exchange 2013 now relies on HTTP, if the first IP address the HTTP stack tries fails, the HTTP client will try the next, and so on. If the Virtual IP in front of one set of Client Access servers were to fail, the client can automatically connect to one of the other IPs to access the same service in a matter of seconds, instead of waiting minutes for a DNS change to take effect.
  • Multi-site SSL Certificate Considerations: Microsoft recommends using a single Subject Alternative Name (SAN) certificate in each datacenter and including multiple host names in the certificate. The same Certificate Principal Name should be used on each certificate, and the Outlook Provider Configuration object in Active Directory should be configured with that same Principal Name.
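
To see how the features above are configured in an existing deployment, a few read-only queries from the Exchange Management Shell are usually enough. This is a minimal sketch, not a complete health check: DAG1 is the DAG name used elsewhere in this post, and DB1 is a hypothetical database name chosen for illustration.

```powershell
# Read-only checks; nothing here changes the configuration.
# DAG1 is the DAG name used in this post; DB1 is a hypothetical database name.

# DAG membership, witness settings and DAC mode.
Get-DatabaseAvailabilityGroup -Identity DAG1 -Status |
    Format-List Name, Servers, WitnessServer, AlternateWitnessServer, DatacenterActivationMode

# Replay lag configured for each copy of a database (a non-zero value means a lagged copy).
Get-MailboxDatabase -Identity DB1 | Format-List Name, ReplayLagTimes

# How long Safety Net keeps processed messages, and whether shadow redundancy is on.
Get-TransportConfig | Format-List SafetyNetHoldTime, ShadowRedundancyEnabled

# Host names covered by each installed certificate.
Get-ExchangeCertificate | Format-List Subject, CertificateDomains, Services, NotAfter
```

Comparing the output across datacenters is a quick way to spot a member that is missing a witness setting or a certificate that lacks one of the shared host names.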

Common Administrative Tasks:

  1. Enabling DAC Mode: Set-DatabaseAvailabilityGroup -Identity DAG1 -DatacenterActivationMode DagOnly
  2. Configure DAG for Multisite deployments:
    Enable manual configuration of the DAG:
    Configure DAG IP Addresses for each site: New-DatabaseAvailabilityGroup -Name DAG1 -DatabaseAvailabilityGroupIPAddresses 192.168.1.10,192.168.2.10
  3. Configure an alternate File Share Witness Server: Set-DatabaseAvailabilityGroup -Identity DAG1 -AlternateWitnessServer AltServer
  4. Enable a lagged database copy of 1 day: Add-MailboxDatabaseCopy -Identity name -MailboxServer MBX1 -ReplayLagTime 1.00:00:00 -SeedingPostponed
  5. Monitor the status of a DAG: Get-MailboxDatabaseCopyStatus -Identity name | Format-List
  6. Recover a DAG after the primary site goes down:
    Mark failed servers/site as down: Stop-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite Site1
    Stop the Cluster service on each remaining DAG member: Stop-Service ClusSvc
    Activate DAG members in remaining site: Restore-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite Site2
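
Once the primary site is available again, the stopped members have to be brought back into the DAG before databases are moved home. The following is a hedged sketch of that failback step, assuming DAG1 is in DAC mode, the Site1 servers are reachable again, and MBX1 (the server name used in task 4) is one of them; adapt the names to your environment.

```powershell
# Sketch only: reincorporate the recovered Site1 members into DAG1.

# Move the recovered members from the stopped list back to the started list.
Start-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite Site1

# Re-apply the DAG configuration so the cluster quorum model matches the member count.
Set-DatabaseAvailabilityGroup -Identity DAG1

# Confirm that the copies on a recovered server are healthy before moving databases back.
Get-MailboxDatabaseCopyStatus -Server MBX1 |
    Format-Table Name, Status, CopyQueueLength, ReplayQueueLength -AutoSize
```

Waiting for the copy and replay queues to drain before activating databases in the primary site avoids an unnecessary reseed.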

Top PowerShell Commands/Tools:

– Add/Set/Remove-MailboxDatabaseCopy
– New/Get/Set-DatabaseAvailabilityGroup
– Add/Remove-DatabaseAvailabilityGroupServer
– New/Get/Set/Remove-DatabaseAvailabilityGroupNetwork
– Start/Stop-DatabaseAvailabilityGroup
– Resume/Suspend/Update-MailboxDatabaseCopy
– Move-ActiveMailboxDatabase
– Restore-DatabaseAvailabilityGroup
– RedistributeActiveDatabases.ps1 script
– Get-MailboxDatabaseCopyStatus
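
As an illustration of how a few of these commands fit together during routine maintenance, here is a short sketch. DB1 and MBX2 are hypothetical names, MBX1 and DAG1 are the names used in the tasks above, and the sequence is an example rather than a prescribed procedure.

```powershell
# DB1 and MBX2 are hypothetical names used for illustration.

# Switch the active copy of DB1 over to MBX2.
Move-ActiveMailboxDatabase -Identity DB1 -ActivateOnServer MBX2 -Confirm:$false

# Suspend replication to the now-passive copy on MBX1 while the server is serviced.
Suspend-MailboxDatabaseCopy -Identity DB1\MBX1 -SuspendComment "Maintenance" -Confirm:$false

# Resume replication once maintenance is finished.
Resume-MailboxDatabaseCopy -Identity DB1\MBX1

# Rebalance active copies across the DAG by activation preference.
# The script ships with Exchange in the Scripts folder ($exscripts in the Exchange Management Shell).
Set-Location $exscripts
.\RedistributeActiveDatabases.ps1 -DagName DAG1 -BalanceDbsByActivationPreference -Confirm:$false
```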

Reference/Links:

Technet: High Availability and Site Resilience in Exchange 2013
Blogpost: Site Resilience and High Availability
Cmdlets: High Availability
Blogpost: Global Namespaces in Exchange 2013