Entries in database (8)

Tuesday
Apr 16, 2019

MySQL High Availability Framework Explained – Part III: Failover Scenarios

In this three-part blog series, we introduced a High Availability (HA) Framework for MySQL hosting in Part I, and discussed the details of MySQL semisynchronous replication in Part II. Now in Part III, we review how the framework handles some of the important MySQL failure scenarios and recovers to ensure high availability.

MySQL Failover Scenarios

Scenario 1 – Master MySQL Goes Down

  • The Corosync and Pacemaker framework detects that the master MySQL is no longer available. Pacemaker demotes the master resource and tries to recover with a restart of the MySQL service, if possible.
  • At this point, due to the semisynchronous nature of the replication, all transactions committed on the master have been received by at least one of the slaves.
  • Pacemaker waits until all the received transactions are applied on the slaves and lets the slaves report their promotion scores. The score calculation is done in such a way that the score is ‘0’ if a slave is completely in sync with the master, and is a negative number otherwise.
  • Pacemaker picks a slave that has reported a score of 0 and promotes it. That slave now assumes the role of master MySQL, on which writes are allowed.
  • After slave promotion, the Resource Agent triggers a DNS rerouting module. The module updates the proxy DNS entry with the IP address of the new master, so that all application writes are redirected to the new master.
  • Pacemaker also sets up the available slaves to start replicating from this new master.

Thus, whenever a master MySQL goes down (whether due to a MySQL crash, OS crash, system reboot, etc.), our HA framework detects it and promotes a suitable slave to take over the role of the master. This ensures that the system continues to be available to the applications.
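
The promotion steps above can be sketched roughly as follows. This is only an illustration, not the actual Resource Agent code: the function names, the use of a simple transaction counter as a replication position, and the tie-break by node name are all assumptions made for the example. The only rule taken from the description is that an in-sync slave scores 0 and any other slave scores a negative number.

```python
from typing import Dict, Optional


def promotion_score(master_committed: int, slave_applied: int) -> int:
    """Return 0 if the slave has applied everything the master committed,
    and a negative number (the size of the gap) otherwise.

    Positions are modeled as plain transaction counters (an assumption).
    """
    lag = master_committed - slave_applied
    return -max(lag, 0)


def pick_new_master(master_committed: int,
                    slaves: Dict[str, int]) -> Optional[str]:
    """Pick a slave whose score is 0, or None if no slave is in sync.

    `slaves` maps a (hypothetical) node name to its applied position.
    Ties are broken by node name purely to keep the example deterministic.
    """
    scores = {name: promotion_score(master_committed, position)
              for name, position in slaves.items()}
    in_sync = sorted(name for name, score in scores.items() if score == 0)
    return in_sync[0] if in_sync else None


if __name__ == "__main__":
    # S1 has applied every transaction, S2 is three transactions behind.
    print(pick_new_master(100, {"S1": 100, "S2": 97}))
```

In this sketch, a slave is only considered for promotion once its score reaches 0, which mirrors the wait for all received transactions to be applied before promotion.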

Scenario 2 – Slave MySQL Goes Down

  • The Corosync and Pacemaker framework detects that the slave MySQL is no longer available.
  • Pacemaker tries to recover the resource by restarting MySQL on the node. If MySQL comes up, the node is added back to the current master as a slave and replication continues.
  • If recovery fails, Pacemaker reports that resource as down, and alerts or notifications can be generated from that status. If necessary, the ScaleGrid support team will handle the recovery of this node.
  • In this case, there is no impact on the availability of MySQL services.

Scenario 3 – Network Partition – Network Connectivity Breaks Down Between Master and Slave Nodes

This is a classic problem in any distributed system, where each node thinks the other nodes are down while, in reality, only the network communication between the nodes is broken. This scenario is more commonly known as a split-brain scenario and, if not handled properly, can lead to more than one node claiming to be the master MySQL, which in turn leads to data inconsistencies and corruption.

Let’s use an example to review how our framework deals with split-brain scenarios in the cluster. We assume that, due to network issues, the cluster has partitioned into two groups, with the master in one group and the two slaves in the other, and we will denote this as [(M), (S1,S2)].

  • Corosync detects that the master node is not able to communicate with the slave nodes, and the slave nodes can communicate with each other, but not with the master.
  • The master node will not be able to commit any transactions, because semisynchronous replication expects an acknowledgement from at least one of the slaves before the master can commit. At the same time, Pacemaker shuts down MySQL on the master node due to lack of quorum, based on the Pacemaker setting ‘no-quorum-policy = stop’. Quorum here means a majority of the nodes, or two out of three in a 3-node cluster setup. Since the master is the only node in its partition of the cluster, that partition has no quorum, so the no-quorum-policy setting is triggered and the MySQL master is shut down.
  • Now, Pacemaker on the partition (S1,S2) detects that there is no master available in the cluster and initiates a promotion process. Assuming that S1 is up to date with the master (as guaranteed by semisynchronous replication), it is then promoted to be the new master.
  • Application traffic will be redirected to this new master MySQL node and the slave S2 will start replicating from the new master.

Thus, we see that the MySQL HA framework handles split-brain scenarios effectively, ensuring both data consistency and availability in the event the network connectivity breaks between master and slave nodes.
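
To make the quorum reasoning concrete, here is a minimal sketch of the decision each partition faces in the [(M), (S1,S2)] example. It does not use any Pacemaker API. The function names and the "stop" / "keep" / "promote" labels are hypothetical, and the only rules modeled are the ones described above: a partition without a majority stops its resources (as with ‘no-quorum-policy = stop’), and a partition with a majority but no master starts a promotion.

```python
from typing import Dict, List, Tuple


def has_quorum(partition_size: int, cluster_size: int) -> bool:
    """A partition has quorum only if it holds a strict majority of nodes."""
    return partition_size > cluster_size // 2


def plan_actions(partitions: List[List[str]],
                 master: str) -> Dict[Tuple[str, ...], str]:
    """Map each partition to the (hypothetical) action taken inside it."""
    cluster_size = sum(len(nodes) for nodes in partitions)
    actions: Dict[Tuple[str, ...], str] = {}
    for nodes in partitions:
        if not has_quorum(len(nodes), cluster_size):
            actions[tuple(nodes)] = "stop"      # no quorum: shut MySQL down
        elif master in nodes:
            actions[tuple(nodes)] = "keep"      # quorum and master present
        else:
            actions[tuple(nodes)] = "promote"   # quorum but no master
    return actions


if __name__ == "__main__":
    print(plan_actions([["M"], ["S1", "S2"]], master="M"))
```

With three nodes, the lone master holds one of three votes and is stopped, while the two slaves hold two of three and can safely promote one of themselves, so at most one partition ever accepts writes.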

This concludes our 3-part blog series on the MySQL High Availability (HA) framework using semisynchronous replication and the Corosync plus Pacemaker stack. At ScaleGrid, we offer highly available hosting for MySQL on AWS and MySQL on Azure that is implemented based on the concepts explained in this blog series. Please visit the ScaleGrid Console for a free trial of our solutions.

Tuesday
Feb 19, 2019

Intro to Redis Cluster Sharding – Advantages, Limitations, Deploying & Client Connections

Redis Cluster is the native sharding implementation available within Redis that allows you to automatically distribute your data across multiple nodes without having to rely on external tools and utilities. At ScaleGrid, we recently added support for Redis Clusters on our platform through our fully managed Redis hosting plans. In this post, we’re going to introduce you to the advanced sharding opportunities of Redis Cluster, discuss its advantages and limitations, explain when you should deploy it, and show how to connect to your Redis Cluster.

Sharding with Redis Cluster
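
As a rough illustration of how Redis Cluster decides where a key lives, the sketch below computes a key's hash slot. The details come from the Redis Cluster specification rather than from this post: keys are mapped to one of 16384 hash slots using a CRC16 (XMODEM variant) checksum, and a non-empty "hash tag" in braces restricts hashing to the tagged part. The function name key_hash_slot is our own, and Python's binascii.crc_hqx is used here as a stand-in for the CRC16 routine.

```python
import binascii

HASH_SLOTS = 16384  # number of hash slots in a Redis Cluster


def key_hash_slot(key: str) -> int:
    """Return the hash slot for a key, honoring {hash tags}.

    If the key contains '{' followed later by '}' with at least one
    character in between, only that substring is hashed.
    """
    data = key.encode("utf-8")
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    # crc_hqx with an initial value of 0 is CRC-CCITT in its XMODEM form.
    return binascii.crc_hqx(data, 0) % HASH_SLOTS


if __name__ == "__main__":
    print(key_hash_slot("foo"))
    # Keys sharing a hash tag land in the same slot.
    print(key_hash_slot("{user1000}.following") ==
          key_hash_slot("{user1000}.followers"))
```

Each node in the cluster serves a subset of the slots, so a client can route a command by computing the slot and looking up which node currently owns it.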

Click to read more ...

Tuesday
Jan 08, 2019

Slow MySQL Start Time in GTID mode? Binary Log File Size May Be The Issue

Have you been experiencing slow MySQL startup times in GTID mode? We recently ran into this issue on one of our MySQL hosting deployments and set out to solve the problem. In this blog, we break down the issue that could be slowing down your MySQL restart times, how to debug for your deployment, and what you can do to decrease your start time and improve your understanding of GTID-based replication.

How We Found The Problem

Click to read more ...

Monday
Apr 18, 2016

Hadoop and Salesforce Integration: the Ultimate Successful Database Merger  

How can we transfer Salesforce data to Hadoop? It is a big challenge for everyday users. What are the different features of data transfer tools?

Click to read more ...

Wednesday
Feb 03, 2016

A Case Study: WordPress Migration for Shift.ms

This case study involves a migration from a custom database to WordPress. The company that took on the task is Valet, which has a vast portfolio of previous jobs, including database-to-WordPress, multisite-to-multisite, and multisite-to-single-site migrations, among others. The client is Shift.ms.

Problem

The client, Shift.ms, presented a taxing problem to the team. Shift.ms had a custom database that they needed migrated to WordPress. They had installed WordPress/BuddyPress and wanted their data moved into this new installation. All this may seem rather simple. However, there was one problem: the client had some data in the newly installed WordPress that they intended to keep.

Challenges

The main problem was that the schema of the custom database and that of WordPress are structured very differently. The following issues arose in an effort to deal with the problem:

Click to read more ...

Monday
Dec 16, 2013

22 Recommendations for Building Effective High Traffic Web Software

This is a guest post by Ashwanth Fernando, Software Engineer from the trenches at large scale internet companies.

Inspired by the book "Effective Java" by Joshua Bloch, I wanted to share my holistic recommendations on building high traffic web software (i.e. web applications/services that serve high traffic loads). Some of these items may not be just about software design but also around surrounding areas such as the engineering organization, culture etc.

Two disclaimers up front:

1) This is my opinion.
2) There will be real-world situations where the principles below are wrong, as in all things "software". Please use common sense at all times.


Consider using more than one datacenter

There have been numerous horror stories about businesses, ahem, going out of business because they had just a single datacenter. It's really important to have more than one datacenter if you want to protect yourself from natural disasters or electrical supply failures. Run all your datacenters in an active-active configuration. It may cost extra money, but it's well worth it compared with having an active-passive configuration and then finding out in the end that, for some pieces of data, your passive hardware was not consistent with the active one.

Consider a sparse datacenter deployment

Click to read more ...

Thursday
Feb 10, 2011

Database Isolation Levels And Their Effects on Performance and Scalability

Some of us are not aware of the tremendous job databases perform, particularly their efforts to maintain the Isolation aspect of ACID. For example, some people believe that transactions are only related to data manipulation and not to queries, which is an incorrect assumption. Transaction Isolation is all about queries, and the consistency and completeness of the data retrieved by queries. This is how it works:

Click to read more ...

Thursday
Dec 17, 2009

Oracle and IBM databases: Disk-based vs In-memory databases 

Current disk-based RDBMSs can run out of steam when processing large data sets. Can these problems be solved by migrating from a disk-based RDBMS to an in-memory database (IMDB)? Are there any limitations? To find out, I tested one of each from the two leading vendors, who together hold 70% of the market share: Oracle's 11g and TimesTen 11g, and IBM's DB2 v9.5 and solidDB 6.3.

read more at BigDataMatters.com