21:17 More Chips Means Less Salsa » High Scalability - Building bigger, faster, more reliable websites.

Yes, I just got through watching the Super Bowl, so chips and salsa are on my mind and in my stomach. In recreational eating, more chips require downing more salsa. With multicore chips it turns out that as cores go up, salsa goes down -- salsa obviously being a metaphor for speed.

Sandia National Laboratories found in their simulations: a significant increase in speed going from two to four multicores, but an insignificant increase from four to eight multicores. Exceeding eight multicores causes a decrease in speed. Sixteen multicores perform barely as well as two, and after that, a steep decline is registered as more cores are added. The problem is the lack of memory bandwidth as well as contention between processors over the memory bus available to each processor.

The implication for those following a diagonal scaling strategy is to work like heck to make your system fit within eight cores. After that you'll need to consider some sort of partitioning strategy. What will be interesting to watch is where further research puts that cutoff point.

08:28 January Idea of the Month: China Clone 2.0 » 大学小容>善用网络,助益成长!

Spring Festival will soon be over. This year 小容 spent it in New York, and also took care of a major life event; the details will be shared here later.

In the new year this blog is starting a small new column, "Idea of the Month". This first installment explains the background of the column and then shares the first monthly idea.


· Why start an "Idea of the Month" column?

After reviewing the lessons of 2008, 小容 has also been looking back over four years of the Swordi studio. All along, Swordi has existed only in virtual form; 小容 treats it as a brand for personal spare-time work. Over the past four years Swordi has designed logos for a number of public-interest and startup projects, offering advice along the way.

In October 2008, 小容 designed a logo for a small commercial project and tried a new model: the design was free, but the client was invited to donate whatever amount they chose. After the work was finished, the Swordi studio received its first donation, which 小容 then passed on to other projects of interest.

That experiment became a turning point, prompting 小容 to rethink how the Swordi studio should operate from now on:

· provide free design/consulting services for public-interest causes;
· accept donations from all quarters;
· use the donations received to incubate interesting new projects.

小容 calls this model "open design / open consulting / open research". In the past the Swordi studio served existing public-interest projects; in the future it will look to initiate and incubate new ones.

Collecting and developing ideas is the starting point for this new way of working.

· Sharing ideas is itself a practice worth spreading

In traditional Chinese culture people love to criticize those who talk big; the saying "it's easy to talk when you're not the one whose back aches" (站着说话不腰疼) is one example. Against that background, people speak very cautiously in public, and rather than encouraging other people's ideas and proposals they tend to criticize and mock them, often ridiculing the speaker for lacking the ability to carry the idea out themselves.
  
In a sufficiently open culture, however, if your idea is good enough, people will not laugh at you even if you cannot execute it yourself. They will say: that's a great idea, let's see whether someone can help make it happen.

In the sharing spaces shaped by social networks, a piece of work can be broken into many small steps, none more important than another, so everyone can contribute their own value at the step that suits them best.

When you have a good idea, you can choose to share it in the public space. Sharing ideas is itself a practice worth spreading.

You can donate your idea, place it in the public domain, and an organization with the ability to execute it may turn it into reality. It is just like putting a Creative Commons (CC) license on a photo you happened to take: one day you may find that someone has turned it into another work of art, or used it to design an ad for a public-interest project.

· Idea of the Month: you're part of it too!

In 2004, after the logo 小容 designed for OOPS was chosen by popular vote, 小容 made a wish:

From now on, the Swordi studio will design a logo for at least one public-interest project every year.

At the end of 2008, 小容 made another wish:

From now on, the Swordi studio will help at least one idea grow into a concrete project every year.

小容 hereby invites friends old and new to join this band of idea-makers, so that each of us can contribute our strengths, pool our efforts, and change the world around us bit by bit.

Throughout 2009, 小容 will share one idea per month on this blog: twelve ideas in a year. If ten people did the same, a year later we would have one hundred and twenty ideas, and if two or three of them evolved into concrete projects, then small actions like these, accumulated over many years, could give rise to enormous value.

· January's idea: China Clone 2.0

Idea name: China Clone 2.0 (中国克隆2.0)

Idea in brief: build a website whose main purpose is to translate and introduce public-interest campaigns and civic initiatives from abroad, encourage internet users in mainland China to clone these foreign projects one by one, and provide resources to help those clones grow.

Notes on the idea:

People have already cloned Flickr, YouTube, MySpace, Facebook, 43things and plenty of other foreign Web 2.0 sites. So why not clone social-good 2.0 sites like the following?

Change.org
SocialActions.com
CauseCast.org
WeDesignChange.org

Many good things never happen, perhaps because the people with the will and the ability to do them have never heard of the good idea, while the people who know the idea lack the ability to carry it out. So: collect enough good ideas that have already been tried abroad, spread them widely enough, and let the people capable of acting on them encounter them. That is a good start toward making change happen.

This idea has actually been around for a while; some time ago 小容 set up a group on 译言 (Yeeyan) called Project2.0:

We keep running into very creative websites. This group collects interesting online projects and translates their ideas for the Chinese-language web, so that people here can do equally creative things in the same way. Let good ideas flatten the world!

Recently 小容 has also discussed similar topics with some friends who are preparing something along the same lines: a database of public-interest projects that organizes, as case studies, how people abroad have conceived and run such projects, as a reference for friends in mainland China.

This January idea therefore also serves as an announcement on those friends' behalf. If you are interested in turning the idea into a concrete project, please leave a comment here and get in touch, so we can pool our efforts, take action, and make the change happen.

—-

If you start sharing your own Idea of the Month on your blog, please send a Trackback here, or leave a comment :)

Ext2 vs. Ext3 vs. Ext4 Performance Showdown - 知道分子 » 车东's shared items in Google Reader
The Linux kernel officially supports the new Ext4 filesystem as of 2.6.28. Ext4 is an improved version of Ext3 that modifies some of Ext3's important on-disk data structures, rather than merely bolting on a journal the way Ext3 did on top of Ext2. Ext4 offers better performance and reliability, along with a richer feature set:

1. Compatible with Ext3. By running a few commands you can migrate from Ext3 to Ext4 online, without reformatting the disk or reinstalling the system. Existing Ext3 data structures are left in place and Ext4 applies only to new data, yet the whole filesystem still gains the larger capacity that Ext4 supports.
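
A minimal sketch of such a migration (the device and mount point are hypothetical; back up first, unmount the filesystem, and note that files written before the conversion keep their old indirect-block mapping):

tune2fs -O extents,uninit_bg,dir_index /dev/sdb1   # enable the Ext4 on-disk features
e2fsck -fp /dev/sdb1                               # required after changing feature flags
mount -t ext4 /dev/sdb1 /mnt/data                  # mount it as ext4 from now on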

2. Larger filesystems and larger files. Where Ext3 currently tops out at 16 TB filesystems and 2 TB files, Ext4 supports filesystems of up to 1 EB (1,048,576 TB; 1 EB = 1024 PB, 1 PB = 1024 TB) and files of up to 16 TB.

3. Unlimited subdirectories. Ext3 currently allows only 32,000 subdirectories; Ext4 allows an unlimited number.

4. Extents. Ext3 uses indirect block mapping, which is extremely inefficient for large files: a 100 MB file, for example, needs a mapping table with 25,600 entries in Ext3 (one per 4 KB data block). Ext4 adopts the extents concept found in most modern filesystems: an extent is a contiguous run of data blocks, so the file above can be described simply as "this file's data occupies the next 25,600 blocks", which is a good deal more efficient.

5. Multiblock allocation. When data is written to an Ext3 filesystem, Ext3's block allocator hands out only one 4 KB block per call, so writing a 100 MB file means 25,600 calls into the allocator. Ext4's multiblock allocator ("mballoc") can allocate many blocks in a single call.

6. Delayed allocation. Ext3's policy is to allocate data blocks as early as possible, whereas Ext4, like other modern filesystems, delays allocation as long as possible: blocks are only allocated and written to disk once the file has been fully written in the cache. That lets the allocator optimize over the whole file, and combined with the previous two features it can improve performance noticeably.

7. Fast fsck. The first pass of fsck used to be very slow because it had to scan every inode. Ext4 now keeps a list of unused inodes in each block group's inode table, so fsck on an Ext4 filesystem can skip those and check only the inodes actually in use.

8. Journal checksumming. The journal is the most heavily used part of the disk, which also makes its blocks especially exposed to hardware faults, and recovering from a corrupted journal can corrupt even more data. Ext4's journal checksumming makes it easy to tell whether journal data is damaged, and it also merges Ext3's two-phase commit into a single phase, improving performance while adding safety.

9. "No Journaling" mode. Journaling always carries some overhead, so Ext4 allows the journal to be switched off entirely for users with special needs who want to squeeze out extra performance.
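
For instance (a hedged sketch with hypothetical device names; the filesystem must be unmounted), the journal can be removed from an existing Ext4 volume, or a new volume can be created without one:

tune2fs -O ^has_journal /dev/sdb1     # drop the journal from an existing filesystem
mkfs.ext4 -O ^has_journal /dev/sdb2   # or build a journal-less Ext4 from scratch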

10. Online defragmentation. Although delayed allocation, multiblock allocation and extents all help reduce fragmentation, fragments are still unavoidable over time. Ext4 supports online defragmentation and will ship an e4defrag tool for defragmenting individual files or an entire filesystem.

11. Inode improvements. Ext4 supports larger inodes: compared with Ext3's default of 128 bytes, Ext4 defaults to 256-byte inodes so that more extended attributes (such as nanosecond timestamps or inode versions) fit inside the inode itself. Ext4 also supports fast extended attributes and inode reservation.

12. Persistent preallocation. To make sure a download has enough room, P2P software often pre-creates an empty file the same size as the file being downloaded, so that the download won't fail hours or days later for lack of disk space. Ext4 implements persistent preallocation at the filesystem level and exposes it through an API (posix_fallocate() in libc), which is more efficient than applications doing it themselves.

13. Barriers on by default. Disks have internal write caches that reorder batched writes to improve performance, so the filesystem must make sure the journal data has reached the platters before it writes the commit record; if the commit record lands first and the journal is then damaged, data integrity suffers. Ext4 enables barriers by default: data after a barrier can only be written once everything before the barrier is on disk. (The feature can be disabled with "mount -o barrier=0".)
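
To make that permanent, the same option can go in /etc/fstab; the line below is a hypothetical example (device and mount point invented here), and disabling barriers is only sensible when the disk cache is battery-backed:

/dev/sdb1  /data  ext4  defaults,barrier=0  0  2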


Ext4 has been out with Linux kernel 2.6.28 for a few weeks now, but I couldn't find a spare disk to test it on. Just before the new year Intel happened to send over a few SSD samples, so over the past couple of days I tested the SSDs while I was at it. The kernel used for testing was 2.6.28.2 and the benchmark tool was IOzone 3.318.

The IOzone command used for the tests:
time /opt/iozone/bin/iozone -a -s 4G -q 256 -y 4 >|/root/ext4-iozone-stdout.txt

IOzone reports the command's settings as follows:
Auto Mode
File size set to 4194304 KB
Using Maximum Record Size 256 KB
Using Minimum Record Size 4 KB
Command line used: /opt/iozone/bin/iozone -a -s 4G -q 256 -y 4
Output is in Kbytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.

Besides showing that the Intel SSD's read and write speeds are jaw-droppingly fast, the results show that Ext4 beats the previous-generation Ext3 in every respect, and in most cases is even noticeably faster than Ext2, which has no journal at all:


reclen  write  rewrite  read  reread  random-read  random-write  bkwd-read  record-rewrite  stride-read  fwrite  frewrite  fread  freread   (reclen in KB; all other columns in Kbytes/sec)
Ext2

real 28m12.718s
user 0m10.725s
sys 5m8.265s
4 218,680 216,596 630,248 245,802 88,700 138,065 106,112 1,882,623 73,538 214,175 218,364 566,570 247,381
8 215,308 218,690 556,064 246,260 154,680 150,052 188,397 2,462,367 130,896 217,157 216,647 583,808 248,397
16 216,457 216,843 575,046 245,701 258,660 158,750 306,842 2,654,320 220,939 216,061 218,140 598,174 246,581
32 217,925 214,289 537,976 243,081 394,013 167,002 464,240 2,397,831 340,775 217,434 219,353 583,463 246,341
64 215,460 219,256 527,919 244,362 503,227 162,917 609,546 2,546,079 456,243 216,875 217,692 571,707 244,264
128 219,081 216,173 540,831 242,526 609,750 161,442 721,701 2,656,729 551,122 217,780 217,427 579,271 242,291
256 216,091 217,631 565,111 245,157 654,274 173,955 870,547 2,574,261 634,835 216,638 219,693 563,735 247,101
Ext3

real 27m42.449s
user 0m11.529s
sys 7m17.049s
4 218,242 213,039 482,132 243,986 88,007 156,926 105,557 1,540,739 75,010 216,028 216,432 522,704 243,385
8 218,390 217,915 544,892 244,979 152,424 190,454 181,486 1,945,603 130,737 218,364 216,431 530,853 243,222
16 218,083 217,683 561,038 244,506 255,244 200,032 300,212 2,096,495 221,329 216,930 216,661 514,177 244,069
32 216,258 217,013 569,246 243,811 389,745 198,275 446,462 1,934,853 338,785 216,809 219,296 530,634 243,446
64 218,850 217,711 577,529 243,725 497,689 201,693 589,535 2,036,412 450,449 219,387 214,900 514,353 244,809
128 220,234 215,687 530,519 241,615 608,244 199,619 714,295 1,992,168 553,022 217,828 218,454 513,596 241,510
256 216,011 220,188 592,578 242,548 642,341 199,408 834,240 2,092,959 624,043 217,682 218,165 529,358 242,878
Ext4

real 27m3.485s
user 0m10.847s
sys 6m9.578s
4 221,823 216,992 532,488 273,668 85,210 183,195 103,036 1,862,817 74,781 225,841 220,620 523,799 272,848
8 226,028 218,580 561,960 272,036 154,972 216,505 178,482 2,135,372 132,506 227,423 215,766 641,021 271,328
16 222,241 217,746 547,548 270,895 260,899 223,895 295,288 2,095,966 223,135 226,055 216,210 621,287 273,475
32 220,121 213,025 240,426 247,628 345,210 175,977 451,631 2,145,351 342,236 225,796 213,427 598,331 269,759
64 223,983 214,437 308,696 551,577 754,941 225,897 523,130 2,218,016 448,086 227,030 214,706 582,795 272,323
128 222,576 217,816 624,636 271,293 644,500 224,997 720,468 2,308,315 582,943 225,971 217,373 552,335 274,237
256 221,202 222,238 541,685 270,898 671,748 228,085 845,494 2,215,381 643,715 225,411 219,166 580,066 273,342

[Chart omitted: throughput comparison of Ext2 / Ext3 / Ext4, in Kbytes/sec]

Notes:
1. For the IOzone methodology, see Ben Martin's article: IOzone for filesystem performance benchmarking
2. For more on Ext4, see the Kernel Newbies page: http://kernelnewbies.org/Ext4

06:13 Faster MySQL failover with SELECT mirroring » MySQL Performance Blog

One of my favorite MySQL configurations for high availability is master-master replication, which is just like normal master-slave replication except that you can fail over in both directions. Aside from MySQL Cluster, which is more special-purpose, this is probably the best general-purpose way to get fast failover and a bunch of other benefits (non-blocking ALTER TABLE, for example).
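
For context, here's a rough sketch of the wiring involved (the hostnames, replication user and log coordinates are hypothetical placeholders, not from this post; each server also needs log-bin enabled and a unique server-id): each master is simply configured as a slave of the other, so writes on either side flow to its peer.

CODE:
mysql -h master-a -e "CHANGE MASTER TO MASTER_HOST='master-b',
    MASTER_USER='repl', MASTER_PASSWORD='secret',
    MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=4;
  START SLAVE;"
mysql -h master-b -e "CHANGE MASTER TO MASTER_HOST='master-a',
    MASTER_USER='repl', MASTER_PASSWORD='secret',
    MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=4;
  START SLAVE;"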

The benefit is that you have another server with all the same data, up and running, ready to serve queries. In theory, it's a truly hot standby (stay with me -- that's not really guaranteed). You don't get this with shared storage or DRBD, although those provide stronger guarantees against data loss if mysqld crashes. And you can use the standby (passive) master for serving some SELECT queries, taking backups, etc as usual. However, if you do this you actually compromise your high-availability plan a little, because you can mask the lack of capacity that will result when one of the servers is down and you have to rely on just one server to keep everything on its feet.

If you need really high availability, you can't load the pair of servers more than a single server can handle. (You can always use the passive server for non-essential needs -- it doesn't have to be completely dead weight.) As a result, some people choose to make the passive server truly passive, handling none of the application's queries. It just sits there replicating and doing nothing else.

The problem is that the passive server's caches start to get skewed to handle the write workload from replication, and not the read workload it will have to handle if there's a planned or unplanned failover. This isn't a big problem on small systems, but with buffer pools in the dozens of gigabytes (which is arguably "small" these days), it starts to matter a lot. Warming up a system so it's actually responsive can take hours. As a result, the passive master isn't truly hot anymore. It needs to handle the workload it's supposed to be ready to take over. If you fail over to it, it might perform very badly -- get unresponsive, cause tons of I/O, etc. In reality, it can be completely unusable for a long time.

To measure how much this really matters, I did some tests for a customer who was having troubles with this type of scenario. I used mk-query-digest (with some new features) to watch the traffic on the active master and replay SELECT queries against the passive one. I timed the results and ran them through the analysis part of mk-query-digest. A simple key lookup ran in tens of milliseconds on the active master, but executed for up to dozens of seconds on the passive one.

After a couple of hours of handling SELECT traffic, these same queries were responding nicely on the passive master, too.

Is that all? "Buffer pool warmed up, performance is better, case closed!" No. This isn't as simple as it sounds on the surface. There are two things happening and both are important to understand.

The first, most obvious phenomenon is that the buffer pool gets skewed to handle the write workload. Since we're running Percona's patched server, we can actually measure what's in the buffer pool. I measured the active master's buffer pool with the following query:

SQL:
  SELECT table_schema, table_name, page_type, COUNT(*)
  FROM information_schema.innodb_buffer_pool_content
  GROUP BY 1, 2, 3
  INTO OUTFILE '/tmp/buffer-pool-contents.txt';
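
Loading one of those samples back into a table for later analysis might look something like this (a hypothetical sketch; the table layout and names are mine, not from the post, with one table per sample):

CODE:
mysql test <<'EOF'
CREATE TABLE bp_active (
  table_schema VARCHAR(64), table_name VARCHAR(64),
  page_type    VARCHAR(32), pages INT
);
-- SELECT ... INTO OUTFILE writes tab-separated values, which is LOAD DATA's default
LOAD DATA LOCAL INFILE '/tmp/buffer-pool-contents.txt' INTO TABLE bp_active;
EOF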

I loaded this file into a table on my laptop with LOAD DATA INFILE and kept it for later. I did the same on the slave. Then I used mk-query-digest to watch the traffic on the active master:

CODE:
  mk-query-digest --processlist h=active \
    --filter '$event->{arg} =~ m/^SELECT/i' \
    --execute h=passive

After a bit I CTRL-C'ed it and it printed out the analysis of the time taken to run the queries against the passive master. I restarted it and after a few hours of this I did the same thing; the query timings were dramatically better now. Then I just let it keep running without any aggregation options to avoid any overhead of storing and analyzing queries. (I added --mirror and --daemonize options so it can run in the background and follow along when the passive/active roles switch.)

After a day or so of doing this, I re-sampled the buffer pool contents on the passive server. With all three samples stored in tables on my laptop, I wrote a query against these three sets of stats to find the top tables on the active server and left-join those against the tables on the passive server, with both a mixed workload from my mirrored SELECT statements and with the "pure" replication workload. I totaled the pages up into gigabytes. Here's the result:

db_table                  active (GB)   passive + SELECT (GB)   passive only (GB)
site.benefits                    8.30                    5.73                1.32
.                                3.13                    0.94                0.50
site.user_actions                2.55                    4.09                6.29
site.user_achievements           1.36                    1.20                0.35
site.clicks                      1.26                    3.05                5.13
site.actions_finished            1.14                    0.46                0.74
site.ratings                     0.91                    0.89                0.48

The difference is clear. The buffer pool contains over 8G of data for the site.benefits table on the active master, but if you just put a replication workload on the server, that falls to 1.32G. Other tables are similar. The mixed workload with some SELECT queries mirrored is somewhere between the two.
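
For the curious, a query roughly along these lines would produce that kind of per-table comparison (the table names and the pages column refer to the hypothetical load sketch above, so they are assumptions; InnoDB pages are 16 KB, hence the multiplication by 16384):

CODE:
mysql test <<'EOF'
SELECT a.db_table,
       ROUND(a.pages * 16384 / POW(1024,3), 2)              AS active_gb,
       ROUND(COALESCE(s.pages, 0) * 16384 / POW(1024,3), 2) AS passive_select_gb,
       ROUND(COALESCE(p.pages, 0) * 16384 / POW(1024,3), 2) AS passive_gb
FROM      (SELECT CONCAT(table_schema, '.', table_name) AS db_table, SUM(pages) AS pages
           FROM bp_active GROUP BY 1) a
LEFT JOIN (SELECT CONCAT(table_schema, '.', table_name) AS db_table, SUM(pages) AS pages
           FROM bp_passive_select GROUP BY 1) s USING (db_table)
LEFT JOIN (SELECT CONCAT(table_schema, '.', table_name) AS db_table, SUM(pages) AS pages
           FROM bp_passive GROUP BY 1) p USING (db_table)
ORDER BY a.pages DESC
LIMIT 10;
EOF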

One thing we don't know is which pages are in the pool. Same table, same size of data doesn't mean same buffer pool contents. An insert-only workload will probably fill the buffer pool with the most recent data; a mixed workload will usually have some different hot spot or mixture of hot spots, so it'll bring different parts of the table into memory.

So that's the first thing that's happening. The second is the insert buffer. Notice the pages with no database or table name -- the second row in the table above. Those are a mixture of things, but it's overwhelmingly the insert buffer.

As Peter explained in his recent post on the insert buffer, the other thing the SELECTs do is keep the insert buffer in its production steady state. The buffered records are forced to be merged by the SELECTs, and far more of the insert buffer's pages stay in the buffer pool rather than on disk. So it's not just the buffer pool that gets skewed by a write-only workload: the insert buffer can also cause terrible performance. There are some subtleties about exactly what's happening in this particular case that I'm still investigating and may write more about later.

So what can we conclude from this? Simply this: if you have a standby server that's not under a realistic workload, you won't be able to get good performance after a failover. You need to use some technique to mirror the read-only workload to the passive server. It doesn't have to be the tools I used -- it could be MySQL Proxy or a TCP sniffer or anything else. But if you need fast failover, you need some way to at least partially emulate a production workload on the standby machine.

PS: I see Robert Hodges just published an article on warm standby for PostgreSQL. Link love for interested readers.


Entry posted by Baron Schwartz | No comment



