postgresql HA框架对比

文章内容来自Managing High Availability in PostgreSQL® – Part III: Patroni

Patroni 对于 PostgreSQL 数据库管理员（DBA）来说是一个宝贵的工具，因为它执行 PostgreSQL 集群的端到端设置和监控。选择分布式一致性服务（DCS）和备用创建的灵活性对最终用户来说是一个优势，因为他们可以选择自己熟悉的方法。

REST API、HaProxy 集成、WatchDog支持、回调以及其功能丰富的管理使得 Patroni 成为 PostgreSQL 高可用性管理的最佳解决方案。

Standby Server Tests¶


Test Scenario	PostgreSQL Automatic Failover (PAF)	Replication Manager (repmgr)	Patroni
Kill the PostgreSQL process	Pacemaker brought the PostgreSQL process back to running state.There was no disruption of the writer application.	Standby server was marked as failed. Manual intervention was required to start the PostgreSQL process again.There was no disruption of the writer application.	Patroni brought the PostgreSQL process back to running state.There was no disruption of the writer application.
Stop the PostgreSQL process	Pacemaker brought the PostgreSQL process back to running state.There was no disruption of the writer application.	Standby server was marked as failed. Manual intervention was required to start the PostgreSQL process again.There was no disruption of the writer application.	Patroni brought the PostgreSQL process back to running state.There was no disruption of the writer application.
Reboot the server	Standby server was marked offline initially. Once the server came up after reboot, PostgreSQL was started by Pacemaker and the server was marked as online. If fencing was enabled then node wouldn’t have been added automatically to cluster.There was no disruption of the writer application.	Standby server was marked as failed. Once the server came up after reboot, PostgreSQL was started manually and server was marked as running.There was no disruption of the writer application.	Patroni needs to be started after reboot, unless configured to not start on reboot. Once Patroni was started, it started the PostgreSQL process and setup the standby configuration.There was no disruption of the writer application.
Stop the framework agent process	Agent: pacemakerThe PostgreSQL process was stopped and was marked offline.There was no disruption of the writer application.	Agent: repmgrdThe standby server will not be part of automated failover situation.PostgreSQL service was found to be running.There was no disruption of the writer application.	Agent: patroniIt did not stop the PostgreSQL process.patronictl list did not display this server.There was no disruption of the writer application.

备机测试¶


测试场景	PostgreSQL 自动故障转移 (PAF)	复制管理器 (repmgr)	Patroni
终止 PostgreSQL 进程	Pacemaker 将 PostgreSQL 进程恢复为运行状态。写入应用没有中断。	备用服务器被标记为失败。需要手动干预重新启动 PostgreSQL 进程。写入应用没有中断。	Patroni 将 PostgreSQL 进程恢复为运行状态。写入应用没有中断。
停止 PostgreSQL 进程	Pacemaker 将 PostgreSQL 进程恢复为运行状态。写入应用没有中断。	备用服务器被标记为失败。需要手动干预重新启动 PostgreSQL 进程。写入应用没有中断。	Patroni 将 PostgreSQL 进程恢复为运行状态。写入应用没有中断。
重启服务器	初始时备用服务器被标记为离线。重启后，服务器由 Pacemaker 启动 PostgreSQL，并将服务器标记为在线。如果启用了隔离，则节点不会自动添加到集群中。写入应用没有中断。	备用服务器被标记为失败。重启后，需要手动启动 PostgreSQL，服务器被标记为运行中。写入应用没有中断。	重启后需要启动 Patroni，除非配置为重启时不启动。启动 Patroni 后，它启动了 PostgreSQL 进程并设置了备用配置。写入应用没有中断。
停止框架代理进程	代理：pacemaker PostgreSQL 进程被停止并标记为离线。写入应用没有中断。	代理：repmgrd 备用服务器将不参与自动故障转移。PostgreSQL 服务运行中。写入应用没有中断。	代理：patroni 它没有停止 PostgreSQL 进程。patronictl list 没有显示此服务器。写入应用没有中断。

Master/Primary Server Tests¶


Test Scenario	PostgreSQL Automatic Failover (PAF)	Replication Manager (repmgr)	vPatroni
Kill the PostgreSQL process	Pacemaker brought the PostgreSQL process back to running state. Primary got recovered within the threshold time and hence election was not triggered.There was downtime in the writer application.	repmgrd started health check for primary server connection on all standby servers for a fixed interval. When all retries failed, an election was triggered on all the standby servers. As a result of the election, the standby which had the latest received LSN got promoted. The standby servers which lost the election will wait for the notification from the new master node and will follow it once they receive the notification.Manual intervention was required to start the postgreSQL process again.There was downtime in the writer application.	Patroni brought the PostgreSQL process back to running state. Patroni running on that node had primary lock and hence election was not triggered.There was downtime in the writer application.
Stop the PostgreSQL process and bring it back immediately after health check expiry	Pacemaker brought the PostgreSQL process back to running state. Primary got recovered within the threshold time and hence election was not triggered.There was downtime in the writer application.	repmgrd started health check for primary server connections on all standby servers for a fixed interval. When all the retries failed, an election was triggered on all the standby nodes. However, the newly elected master didn’t notify the existing standby servers since the old master was back.Cluster was left in an indeterminate state and manual intervention was required.There was downtime in the writer application.	Patroni brought the PostgreSQL process back to running state. Patroni running on that node had primary lock and hence election was not triggered.There was downtime in the writer application.
Reboot the server	Election was triggered by Pacemaker after the threshold time for which master was not available. The most eligible standby server was promoted as the new master. Once the old master came up after reboot, it was added back to the cluster as a standby. If fencing was enabled, then node wouldn’t have been added automatically to cluster.There was downtime in the writer application.	repmgrd started election when master connection health check failed on all standby servers. The eligible standby was promoted. When this server came back, it didn’t join the cluster and was marked failed. repmgr node rejoin command was run to add the server back to the cluster.There was downtime in the writer application.	Failover happened and one of the standby servers was elected as the new master after obtaining the lock. When Patroni was started on the old master, it brought back the old master up and performed pg_rewind and started following the new master.There was downtime in the writer application.
Stop the framework agent process	Agent: pacemakerThe PostgreSQL process was stopped and it was marked offline.Election was triggered and new master was elected.There was downtime in writer application.	Agent: repmgrdThe primary server will not be part of the automated failover situation.PostgreSQL service was found to be running.There was no disruption in writer application.	Agent: patroniOne of the standby servers acquired the DCS lock and became the master by promoting itself.The old master was still running and it led to multi-master scenario. The application was still writing to the old master.Once Patroni was started on the old master, it rewound the old master (use_pg_rewind was set to true) to the new master timeline and lsn and started following the new master.

主节点测试¶


测试场景	PostgreSQL 自动故障转移 (PAF)	复制管理器 (repmgr)	Patroni
终止 PostgreSQL 进程	Pacemaker 将 PostgreSQL 进程恢复为运行状态。主服务器在阈值时间内恢复，因此未触发选举。写入应用有停机时间。	repmgrd 对所有备用服务器的主服务器连接启动了固定间隔的健康检查。当所有重试都失败时，触发了所有备用服务器上的选举。选举结果，收到最新 LSN 的备用服务器被提升。未赢得选举的备用服务器将等待来自新主节点的通知，并在收到通知后跟随它。需要手动干预再次启动 PostgreSQL 进程。写入应用有停机时间。	Patroni 将 PostgreSQL 进程恢复为运行状态。运行在该节点的 Patroni 持有主锁，因此未触发选举。写入应用有停机时间。
停止 PostgreSQL 进程并在健康检查过期后立即恢复	Pacemaker 将 PostgreSQL 进程恢复为运行状态。主服务器在阈值时间内恢复，因此未触发选举。写入应用有停机时间。	repmgrd 对所有备用服务器的主服务器连接启动了固定间隔的健康检查。当所有重试都失败时，触发了所有备用节点上的选举。然而，新选出的主服务器没有通知现有的备用服务器，因为旧主服务器已经恢复。集群处于不确定状态，需要手动干预。写入应用有停机时间。	Patroni 将 PostgreSQL 进程恢复为运行状态。运行在该节点的 Patroni 持有主锁，因此未触发选举。写入应用有停机时间。
重启服务器	Pacemaker 在主服务器不可用的阈值时间后触发选举。最合格的备用服务器被提升为新主服务器。旧主服务器重启后，被添加回集群作为备用服务器。如果启用了隔离，则节点不会自动添加到集群中。写入应用有停机时间。	当所有备用服务器上的主服务器连接健康检查失败时，repmgrd 启动选举。合格的备用被提升。当该服务器恢复时，它没有加入集群并被标记为失败。运行 repmgr node rejoin 命令将服务器重新添加到集群。写入应用有停机时间。	故障转移发生，一个备用服务器在获取锁后被选为新主服务器。旧主服务器上的 Patroni 启动时，它恢复了旧主服务器，并执行了 pg_rewind，开始跟随新主服务器。写入应用有停机时间。
停止框架代理进程	代理：pacemaker PostgreSQL 进程被停止并标记为离线。触发选举并选出新主服务器。写入应用有停机时间。	代理：repmgrd 主服务器将不参与自动故障转移情况。PostgreSQL 服务运行中。写入应用无中断。	代理：patroni 一个备用服务器获取了 DCS 锁并通过提升自己成为主服务器。旧主服务器仍在运行，导致了多主服务器情况。应用仍在向旧主服务器写入。一旦在旧主服务器上启动 Patroni，它将旧主服务器（设置了 use_pg_rewind 为真）倒带到新主服务器的时间线和 LSN，并开始跟随新主服务器。

Network Isolation Tests¶


Test Scenario	PostgreSQL Automatic Failover (PAF)	Replication Manager (repmgr)	Patroni
Network isolate the master server from other servers (split brain scenario)	Corosync traffic was blocked on the master server.PostgreSQL service was turned off and master server was marked offline due to quorum policy.A new master was elected in the majority partition.There was a downtime in the writer application.	All servers have the same value for location in repmgr configuration:repmgrd started election when master connection health check failed on all standby servers.The eligible standby was promoted, but the PostgreSQL process was still running on the old master node.There were two nodes running as master. Manual intervention was required after the network isolation was corrected.The standby servers have the same value for location but the primary had a different value for location in repmgr configuration:repmgrd started election when master connection health check failed on all standby servers.But, there was no new master elected since the standby servers had location different from that of the primary.repmgrd went into degrade monitoring mode. PostgreSQL was running on all the nodes and there was only one master in the cluster.	DCS communication was blocked for master node.PostgreSQL was demoted on the master server.A new master was elected in the majority partition.There was a downtime in the writer application.
Network-isolate the standby server from other servers	Corosync traffic was blocked on the standby server.The server was marked offline and PostgreSQL service was turned off due to quorum policy.There was no disruption in the writer application.	repmgrd went into degrade monitoring mode.The PostgreSQL process was still running on the standby node.Manual intervention was required after the network isolation was corrected.	DCS communication was blocked for the standby node.The PostgreSQL service was running, however, the node was not considered for elections.There was no disruption in the writer application.


测试场景	PostgreSQL 自动故障转移 (PAF)	复制管理器 (repmgr)	Patroni
将主服务器与其他服务器隔离（脑裂情况）	Corosync 流量在主服务器上被阻塞。PostgreSQL 服务被关闭，主服务器由于法定策略被标记为离线。在多数分区中选举出新的主服务器。写入应用有停机时间。	所有服务器在 repmgr 配置中的 location 值相同：当所有备用服务器上的主服务器连接健康检查失败时，repmgrd 启动了选举。合格的备用被提升，但旧主节点上的 PostgreSQL 进程仍在运行。有两个节点作为主服务器运行。网络隔离修复后需要手动干预。备用服务器的 location 与主服务器在 repmgr 配置中的 location 不同：当所有备用服务器上的主服务器连接健康检查失败时，repmgrd 启动了选举。但是，由于备用服务器的 location 与主服务器不同，没有新的主服务器被选举。repmgrd 进入了降级监控模式。所有节点上的 PostgreSQL 都在运行，并且集群中只有一个主服务器。	DCS 通信对主节点被阻塞。主服务器上的 PostgreSQL 被降级。在多数分区中选举出新的主服务器。写入应用有停机时间。
将备用服务器与其他服务器网络隔离	Corosync 流量在备用服务器上被阻塞。由于法定策略，服务器被标记为离线并关闭了 PostgreSQL 服务。写入应用无中断。	repmgrd 进入了降级监控模式。备用节点上的 PostgreSQL 进程仍在运行。网络隔离修复后需要手动干预。	DCS 通信对备用节点被阻塞。PostgreSQL 服务正在运行，但是，该节点没有被考虑用于选举。写入应用无中断。