Quite frequently I see customers look at recovery as simply the ability to restore data from backup, which can be far from enough to restore the whole system to an operating state, especially for complex systems.
Instead of looking only at the data restore process, you should look at the whole process required to bring the system back to a working state, including data consistency requirements and recovery times. This has to be considered for each of the different data loss scenarios which may happen.
Let us look at a simple example - a master with a 1TB database replicating to 50 servers in 5 different data centers, via a single replication relay server in each. Forget about the single point of failure for a second and just think about what problems we may have to deal with.
First let's look at the master. What may happen to it? The master can have a soft crash, in which case it will be unavailable for some time but we can get all the data back... or sort of. In practice you have to be very careful, such as using sync_binlog=1 and innodb_flush_log_at_trx_commit=1 and only using InnoDB tables, to be OK in most cases. There are still some edge cases, such as modifying metadata stored in MyISAM tables, which can get the master out of sync with its slaves after a soft crash. Unless you hit one of these rare cases, the slaves should be able to continue once the master is back online.
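For reference, here is a minimal my.cnf sketch of the durability settings just mentioned (all other settings are omitted; the storage engine line is just one way to make InnoDB the default):

    # my.cnf fragment - settings that make a soft crash of the master
    # recoverable in most cases
    [mysqld]
    sync_binlog                    = 1
    innodb_flush_log_at_trx_commit = 1
    # use InnoDB for all tables so committed transactions survive a crash
    default-storage-engine         = InnoDB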
Do you have to wait for the master to recover? This is where your data consistency requirements come into play. Remember, replication is asynchronous, so whenever you switch to a slave after a master failure you may lose transactions. The Google semi-synchronous replication patches can help with this... but they are not yet in stock MySQL. Yet another way is using DRBD to get a standby MySQL server, or at least synchronously replicated master binary logs. If you can't afford to lose a single transaction, you can't simply switch to a slave.
What if you can? The switch to a slave in this case is not very easy either - the slaves can all be at different positions relative to the master, and you need to pick the most up-to-date one to promote. You then need to recompute positions as they should be on the promoted slave, and the slave should have --log-slave-updates enabled so that it has something resembling a copy of the master's logs. In many cases I've seen people skip this and simply point the slaves at the starting position of the promoted master - this is dangerous because you risk all slaves becoming inconsistent with each other, and if a slave was seriously behind you risk major inconsistency because its relay logs are lost when you just re-point it. So at the very least you should wait for each slave to process all of its relay logs before re-pointing it.
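A minimal sketch of draining a slave's relay logs before re-pointing it (positions come from SHOW SLAVE STATUS):

    -- On each slave: stop fetching new events but let the SQL thread
    -- apply everything already sitting in the relay logs.
    STOP SLAVE IO_THREAD;

    -- Poll until the SQL thread has caught up with what was fetched:
    -- Relay_Master_Log_File / Exec_Master_Log_Pos should reach
    -- Master_Log_File / Read_Master_Log_Pos.
    SHOW SLAVE STATUS\G

    -- Only then compare positions across slaves and pick the most
    -- up-to-date one for promotion.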
Interestingly enough, Google has a solution for us here again, in the form of "log mirroring" patches which make sure the slaves have a copy of the logs exactly as they are on the master.
Now, what do you do in case of a hard crash, when the data on the master is lost? This is when master data is gone for good, for example because of a RAID or disk failure, though it can also be something like InnoDB corruption, or a soft crash you can't recover from promptly enough.
In this case you would most typically plan recovery by switching to a slave (as described above) or to a standby server via DRBD or SAN.
As you can see, we have not mentioned recovering from backup so far. It will be needed in the worst case of data loss - when the bad data is propagated to all slaves via replication. This can be caused by a user or application error, or by a security breach.
What choices do you have in this case? Your main options are using a backup, or a slave with delayed replication (which you could already have set up with mk-slave-delay from Maatkit).
A delayed slave is especially helpful if the application can operate with just the master, as in this case you can switch very quickly (just skipping the bad statements and catching up).
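As a sketch, keeping a dedicated slave an hour behind with mk-slave-delay might look like this (the hostname is a placeholder; verify the option names against the Maatkit documentation for your version):

    # keep this slave about one hour behind its master, checking once a
    # minute, so a bad statement can be caught before it is applied
    mk-slave-delay --delay 1h --interval 1m h=delayed_slave_host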
The main challenge in this kind of failure is the fact that you have many trashed copies to deal with. If you have just one or a few small tables corrupted, you can reload them. One option is to reload them on the master (and they will be replicated down to all slaves); the faster way, however (especially if you have many tiers of replication), is to bring all slaves to the same point in time and load the data locally on each, with SQL_LOG_BIN=0 set for the session.
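A minimal sketch of such a local reload (the table name and file path are placeholders for illustration):

    -- on each slave: keep the local reload out of the binary log so it
    -- is not replicated further down the chain
    SET SESSION SQL_LOG_BIN = 0;
    TRUNCATE TABLE orders;
    LOAD DATA INFILE '/var/backups/orders.txt' INTO TABLE orders;
    SET SESSION SQL_LOG_BIN = 1;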
If a large portion of the data is trashed, you may need to recover the full database on all slaves, which for large data sets is best done in binary mode. Such a global recovery can also put very high stress on your network and backup storage and take a lot of time. It may also be extremely difficult to pull a large backup over a long distance network in a timely fashion, which means it is best to have a local backup (and a delayed slave, if you use one) in each data center.
The complexity of recovery is another liability of a complex replication tree setup. By contrast, sharded master-master pairs (or a master with a few slaves) are much easier to deal with.
When recovering data with replication, you always have to keep replication positions in mind. For example, if you recover the master, you need to recover the slaves to a matching snapshot - either it has to be the exact same state (which is hard to manage) or you need to know the master position to which the backup corresponds. This becomes more complex with a deep replication hierarchy, because a slave only knows its position on its own master, not on the "root" master.
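One common way (among several) to record the master position a backup corresponds to is to briefly lock the server while noting the position and starting the snapshot:

    FLUSH TABLES WITH READ LOCK;
    SHOW MASTER STATUS;   -- record File and Position together with the backup
    -- start the LVM snapshot or file copy here, then release the lock
    UNLOCK TABLES;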
Note there are also some solutions based on the "Continuous Data Protection" class of backup which can be very helpful for going back in time with your data. One of the vendors offering such a solution for MySQL is R1Soft, though I have not had a chance to look at it in detail.
What about slave loss? Losing a slave is normally less of a problem. You can re-clone the slave from the master, from another slave, or restore it from backup. So this is just a question of decent capacity planning (such as being able to shut off 2 slaves and still operate normally), having LVM set up if you want to avoid shutting off a slave or the master to clone the data, and making sure the logs on the master go back far enough that you can restore from several backup generations and do point-in-time recovery.
Timing the recovery is also important. Especially in write-intensive environments it may take many days to catch up from a weekly backup by replaying binary logs, so make sure to measure it properly.
In real life, environments can be even more complicated - one may use partial replication, replication to a different storage engine, or extra tables on the slaves beyond those being replicated, all of which has to be accounted for in the recovery process.
It is also worth noting that beyond these 3 main recovery scenarios there are a number of other cases you have to deal with (which can often be resolved by following one of these 3 protocols, though shortcuts are sometimes possible) - for example master or relay binary log corruption, the master or a slave running out of space, a slave crashing (and losing its position on the master), or replication breaking (or getting out of sync) due to MySQL bugs or incorrect use.
Interestingly enough, very few people have their data recovery practices ironed out well enough to answer how they would handle at least these 3 data loss cases for each of the servers they have deployed. Even fewer have gone beyond theory and have tested the processes, or have regular testing in place.
Entry posted by peter
There is a famous business book called Built to Last (《基业长青》). It summarizes eleven principles that allow excellent companies to stay excellent over the long term. I read it once when I had just joined BeBeyond and did not understand it, but looking back now, it really makes a lot of sense. One principle struck me particularly deeply: "Try a lot of stuff and keep what works." This is the third lesson I learned in practice at BeBeyond. I also call it: don't know which way to go? Then just start walking!
I was not like this before. My typical reaction used to be: when I hit a difficulty, I would either agonize over it alone, or frantically look things up and ask people, hoping to make a good decision once I had investigated everything. In other words: "Don't know which way to go? Then investigate thoroughly and only move once you have found the best path." Later I found this does not work. Why? Many reasons. For example: every problem is different, so past experience may not apply; books offer theory, and turning theory into a plan that fits your actual situation is a completely different matter; and if you are creating something new (like BeBeyond), there is no correct answer to find in the first place.
But that is not the interesting part. What is really interesting is that even knowing this, I kept making the same mistake in my behavior again and again. Why? I eventually figured it out. I think there are a few reasons.
First, when you know nothing (for example, when you have no experience), you simply cannot know how to go about finding a solution. An old Chinese saying goes, "think thrice before you act." That is true, but I later realized it has a precondition: you need experience, especially experience of failure. Think about it - what is "thinking thrice" based on? Past experience and general principles, right? But when you have no experience and the general principles are still just theory in your head, what basis for thinking thrice do you have? In fact, "thinking thrice" addresses an emotional problem: it keeps your emotions from undermining the thoroughness of your thinking. It cannot make up for a lack of experience, can it?
So where does experience come from? From practice, from making mistakes. You don't know which way to go, so start walking. You might be right, you might be wrong. If you are right, you now have the experience of being right; if you are wrong, you now have experience in searching for the right path. So when I had no experience at all, what good did agonizing over it do?
Second, and this follows directly from the first: the fear of the uncertainty that comes with groping around and of the frustration of possible failure - or, put differently, the wishful hope of succeeding in one stroke. Imagine a scene: you don't know how to cook a dish, but you have to, and you are afraid of ruining it; you have read the books and asked expert cooks, yet you still have no confidence doing it yourself. What do you do - give up, or keep mulling it over? Heh... many people would say: just cook it, and deal with a ruined dish later! Exactly. But change the scene: ruining a dish costs nothing, so replace it with something whose consequences are more serious, say a decision about your own career. Do your legs feel a bit weak now?
That is perfectly normal, because people are born disliking failure and mistakes, even tiny ones. Why did I agonize, read books, and ask people, yet keep putting off acting? Very simple: thinking never brings a sense of failure, but the results of acting can. If someone asked me, I might say "I am looking for a good plan"; but if you accept the first reason above, you know I could not possibly have come up with a good plan. What I was really doing was just postponing the frustration of failure, wasn't it?
So it seems trying is unavoidable, failure and frustration are unavoidable, and disliking failure and frustration is even more unavoidable. What, then, can we do?
Using willpower to overcome the dislike of failure and frustration seems like one way to solve the problem. But psychology tells us that the harder you suppress it, the harder it may bounce back. Heh... Here is a lesson I learned at BeBeyond: where do you put your attention, and what do you praise? Attention can go to two places: one is the result, that is, success or failure; the other is the attempt you made. Which do you choose? The second? Unfortunately - and this is the third reason - our educational culture criticizes results too much and does not praise the process of trying. As a result we turn out the same way: in the end we only do things that are certain, and for uncertain things we long for the outside world to hand us an answer. A simple example: if you write an essay in classical Chinese, what will our teachers usually say - praise you for trying a completely new approach, or tell you that what you wrote is no good (or that it will hurt you on the college entrance exam)? You can come up with plenty of examples yourself, right?
These ways of educating have left their mark on us. But I want to stress: now that we know, we have the ability to change. Are you encouraging yourself to try? Do you praise your own effort to try, regardless of the result? I used not to, but now I have learned to.
Finally, a footnote: I am not denying that you should research and think before making a decision. What I am saying is: always reflect on whether the research and thinking have become an excuse driven by fear of failure. If you have researched and thought and still cannot figure it out, then act immediately, the sooner the better. Whatever the result, it will help you accumulate experience; and with experience, you can research and think better.
A while ago I exchanged experiences with a friend. Half a year earlier she had jumped from a well-known fast-moving consumer goods company to a small company, and her role changed from senior finance manager to project manager. It was a huge shift: from having an established system (a finance system) to having none; from assigning tasks and having people execute them, to having to handle everything herself from task definition to staffing; from knowing for certain what a decision would lead to, to not knowing whether her decisions were right or what impact they would have... All of this brought her tremendous challenge and growth. Now she has to run a recruiting project - exploring a "management trainee recruiting" program for the company - and she is quite troubled by it! She asked for my view. I asked: "Have you been thinking about it for a long time without finding a good approach?" She said: "Yes." I said: "When there is no road and you cannot find one, then charge out and crash into things; once you have crashed and it hurts, perhaps the road will appear!"
This is the third Lesson I learned at BeBeyond.
Have you ever seen replication stopped with a message like this:
Last_Error: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave.
This is relay log corruption, and you can check the details in the MySQL error log file. The error message describes a few reasons, and indeed, because there is little validation (i.e. no checksums) in replication, there are multiple ways a bad event can show up in the relay logs.
Really, this is only one of various error messages you could see if the relay log is corrupted. You could also see malformed queries (containing junk), complaints about an event being too big, etc. if there is garbage in the relay logs.
If the relay logs are corrupted it is certainly worth checking what could have caused it - it could be the network (especially if replicating over unreliable long distance links), MySQL bugs on the master or slave, hardware problems, and a few other things. In any case it is worth investigating.
Investigating is what you do later, though - how do you fix the problem first? The important question you need answered is: are the logs corrupted on the master? If the logs on the master are OK, you can just run SHOW SLAVE STATUS on the slave experiencing the error and use CHANGE MASTER TO to re-point replication to Relay_Master_Log_File:Exec_Master_Log_Pos.
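A sketch of that re-pointing, with placeholder values (take the real file name and position from SHOW SLAVE STATUS on the broken slave):

    STOP SLAVE;
    CHANGE MASTER TO
        MASTER_LOG_FILE = 'mysql-bin.000123',  -- value of Relay_Master_Log_File
        MASTER_LOG_POS  = 4097;                -- value of Exec_Master_Log_Pos
    START SLAVE;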
This will purge the existing relay logs and re-fetch all events which have not been executed yet. Before running this command, make sure your master is operational and still has all the binary logs needed to re-fetch those events.
How would you know if the logs are OK on the master? Well, in this case there were probably 5 other slaves which did not have the problem, which means the master is most likely OK. In any case there is little harm in trying to restart from the same position - if the logs are bad on the master you will get the same error message again and can continue with the investigation.
What if the logs on the master are corrupted? In this case you have a couple of choices (and you also potentially have multiple slaves to deal with). You can use mysqlbinlog (or your favorite hex editor if mysqlbinlog does not work) to find the start of the next event, and potentially recover the "corrupted" event so it can be executed manually on the slaves.
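A rough sketch of that inspection (the log file name and offset are placeholders):

    # dump events starting just after the suspect position to see where
    # the next parseable event begins
    mysqlbinlog --start-position=4097 mysql-bin.000123 | less

    # if mysqlbinlog refuses to parse the region, look at the raw bytes
    # to locate the next event header by hand
    hexdump -C -s 4097 mysql-bin.000123 | less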
Skipping over an event makes the master and slave potentially inconsistent, and you should assess the risks: depending on the application (and on the number of corrupted events), you may want to let replication continue from the new position, or resync the slaves against the master.
How can you recover the slave? As all slaves are likely to be affected in this case, you can't clone another slave. You also can't use the classical method of recovery from backup, because you would need the relay logs to roll forward and they are corrupted. You can either re-clone the data from the master (this is where LVM or similar techniques can help you a lot), or skip the bad events as described and then use Maatkit mk-table-checksum to check which tables are out of sync and mk-table-sync to resync them.
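As a sketch (hostnames are placeholders; verify the option names against the Maatkit documentation for your version):

    # checksum tables on the master and let the results flow down to the
    # slaves through replication
    mk-table-checksum --replicate=test.checksum h=master_host

    # on a slave that shows differences, bring the out-of-sync tables
    # back in line with its master
    mk-table-sync --sync-to-master --execute h=slave_host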
The last method works particularly well if you can afford to run for a while with slaves that are a bit out of sync, which is quite often better than having only the master available (with the extra load of data being copied from it).
Entry posted by peter