21:17 More Chips Means Less Salsa » High Scalability - Building bigger, faster, more reliable websites.

Yes, I just got through watching the Super Bowl, so chips and salsa are on my mind and in my stomach. In recreational eating, more chips require downing more salsa. With multicore chips it turns out that as cores go up, salsa goes down -- salsa obviously being a metaphor for speed.

Sandia National Laboratories found in their simulations: a significant increase in speed going from two to four multicores, but an insignificant increase from four to eight multicores. Exceeding eight multicores causes a decrease in speed. Sixteen multicores perform barely as well as two, and after that, a steep decline is registered as more cores are added. The problem is the lack of memory bandwidth as well as contention between processors over the memory bus available to each processor.

The implication for those following a diagonal scaling strategy is to work like heck to make your system fit within eight cores. After that you'll need to consider some sort of partitioning strategy. What will be interesting to watch is where further research puts that cutoff point.

08:28 January Idea of the Month: China Clone 2.0 » 大学小容>善用网络,助益成长!

Spring Festival will soon be over. This year 小容 spent it in New York, and also took care of a major life event; the details will be shared here later.

In the new year this blog is starting a small new column, "Idea of the Month". This first installment explains the background of the column and then shares the first monthly idea.


· Why start an "Idea of the Month" column?

After reviewing the lessons of 2008, 小容 has also been looking back over four years of the Swordi studio. All along, Swordi has existed only in virtual form; 小容 treats it as a brand for personal spare-time work. Over the past four years Swordi has designed logos for a number of public-interest and startup projects, offering advice along the way.

In October 2008, 小容 designed a logo for a small commercial project and tried a new model: the design was free, but the client was invited to donate whatever amount they chose. After the work was finished, the Swordi studio received its first donation, which 小容 then passed on to other projects of interest.

That experiment became a turning point, prompting 小容 to rethink how the Swordi studio should operate from now on:

· provide free design/consulting services for public-interest causes;
· accept donations from all quarters;
· use the donations received to incubate interesting new projects.

小容 calls this model "open design / open consulting / open research". In the past the Swordi studio served existing public-interest projects; in the future it will look to initiate and incubate new ones.

Collecting and developing ideas is the starting point for this new way of working.

· Sharing ideas is itself a practice worth spreading

In traditional Chinese culture people love to criticize those who talk big; the saying "it's easy to talk when you're not the one whose back aches" (站着说话不腰疼) is one example. Against that background, people speak very cautiously in public, and rather than encouraging other people's ideas and proposals they tend to criticize and mock them, often ridiculing the speaker for lacking the ability to carry the idea out themselves.
  
In a sufficiently open culture, however, if your idea is good enough, people will not laugh at you even if you cannot execute it yourself. They will say: that's a great idea, let's see whether someone can help make it happen.

In the sharing spaces shaped by social networks, a piece of work can be broken into many small steps, none more important than another, so everyone can contribute their own value at the step that suits them best.

When you have a good idea, you can choose to share it in the public space. Sharing ideas is itself a practice worth spreading.

You can donate your idea, place it in the public domain, and an organization with the ability to execute it may turn it into reality. It is just like putting a Creative Commons (CC) license on a photo you happened to take: one day you may find that someone has turned it into another work of art, or used it to design an ad for a public-interest project.

· Idea of the Month: you're part of it too!

In 2004, after the logo 小容 designed for OOPS was chosen by popular vote, 小容 made a wish:

From now on, the Swordi studio will design a logo for at least one public-interest project every year.

At the end of 2008, 小容 made another wish:

From now on, the Swordi studio will help at least one idea grow into a concrete project every year.

小容 hereby invites friends old and new to join this band of idea-makers, so that each of us can contribute our strengths, pool our efforts, and change the world around us bit by bit.

Throughout 2009, 小容 will share one idea per month on this blog: twelve ideas in a year. If ten people did the same, a year later we would have one hundred and twenty ideas, and if two or three of them evolved into concrete projects, then small actions like these, accumulated over many years, could give rise to enormous value.

· January's idea: China Clone 2.0

Idea name: China Clone 2.0 (中国克隆2.0)

Idea in brief: build a website whose main purpose is to translate and introduce public-interest campaigns and civic initiatives from abroad, encourage internet users in mainland China to clone these foreign projects one by one, and provide resources to help those clones grow.

Notes on the idea:

People have already cloned Flickr, YouTube, MySpace, Facebook, 43things and plenty of other foreign Web 2.0 sites. So why not clone social-good 2.0 sites like the following?

Change.org
SocialActions.com
CauseCast.org
WeDesignChange.org

Many good things never happen, perhaps because the people with the will and the ability to do them have never heard of the good idea, while the people who know the idea lack the ability to carry it out. So: collect enough good ideas that have already been tried abroad, spread them widely enough, and let the people capable of acting on them encounter them. That is a good start toward making change happen.

This idea has actually been around for a while; some time ago 小容 set up a group on 译言 (Yeeyan) called Project2.0:

We keep running into very creative websites. This group collects interesting online projects and translates their ideas for the Chinese-language web, so that people here can do equally creative things in the same way. Let good ideas flatten the world!

Recently 小容 has also discussed similar topics with some friends who are preparing something along the same lines: a database of public-interest projects that organizes, as case studies, how people abroad have conceived and run such projects, as a reference for friends in mainland China.

This January idea therefore also serves as an announcement on those friends' behalf. If you are interested in turning the idea into a concrete project, please leave a comment here and get in touch, so we can pool our efforts, take action, and make the change happen.

—-

If you start sharing your own Idea of the Month on your blog, please send a Trackback here, or leave a comment :)

Ext2 vs. Ext3 vs. Ext4 Performance Showdown - 知道分子 » 车东's shared items in Google Reader
The Linux kernel officially supports the new Ext4 filesystem as of 2.6.28. Ext4 is an improved version of Ext3 that modifies some of Ext3's important on-disk data structures, rather than merely bolting on a journal the way Ext3 did on top of Ext2. Ext4 offers better performance and reliability, along with a richer feature set:

1. Compatible with Ext3. By running a few commands you can migrate from Ext3 to Ext4 online, without reformatting the disk or reinstalling the system. Existing Ext3 data structures are left in place and Ext4 applies only to new data, yet the whole filesystem still gains the larger capacity that Ext4 supports.
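
A minimal sketch of such a migration (the device and mount point are hypothetical; back up first, unmount the filesystem, and note that files written before the conversion keep their old indirect-block mapping):

tune2fs -O extents,uninit_bg,dir_index /dev/sdb1   # enable the Ext4 on-disk features
e2fsck -fp /dev/sdb1                               # required after changing feature flags
mount -t ext4 /dev/sdb1 /mnt/data                  # mount it as ext4 from now on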

2. Larger filesystems and larger files. Where Ext3 currently tops out at 16 TB filesystems and 2 TB files, Ext4 supports filesystems of up to 1 EB (1,048,576 TB; 1 EB = 1024 PB, 1 PB = 1024 TB) and files of up to 16 TB.

3. Unlimited subdirectories. Ext3 currently allows only 32,000 subdirectories; Ext4 allows an unlimited number.

4. Extents. Ext3 uses indirect block mapping, which is extremely inefficient for large files: a 100 MB file, for example, needs a mapping table with 25,600 entries in Ext3 (one per 4 KB data block). Ext4 adopts the extents concept found in most modern filesystems: an extent is a contiguous run of data blocks, so the file above can be described simply as "this file's data occupies the next 25,600 blocks", which is a good deal more efficient.

5. Multiblock allocation. When data is written to an Ext3 filesystem, Ext3's block allocator hands out only one 4 KB block per call, so writing a 100 MB file means 25,600 calls into the allocator. Ext4's multiblock allocator ("mballoc") can allocate many blocks in a single call.

6. Delayed allocation. Ext3's policy is to allocate data blocks as early as possible, whereas Ext4, like other modern filesystems, delays allocation as long as possible: blocks are only allocated and written to disk once the file has been fully written in the cache. That lets the allocator optimize over the whole file, and combined with the previous two features it can improve performance noticeably.

7. Fast fsck. The first pass of fsck used to be very slow because it had to scan every inode. Ext4 now keeps a list of unused inodes in each block group's inode table, so fsck on an Ext4 filesystem can skip those and check only the inodes actually in use.

8. Journal checksumming. The journal is the most heavily used part of the disk, which also makes its blocks especially exposed to hardware faults, and recovering from a corrupted journal can corrupt even more data. Ext4's journal checksumming makes it easy to tell whether journal data is damaged, and it also merges Ext3's two-phase commit into a single phase, improving performance while adding safety.

9. "No Journaling" mode. Journaling always carries some overhead, so Ext4 allows the journal to be switched off entirely for users with special needs who want to squeeze out extra performance.
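
For instance (a hedged sketch with hypothetical device names; the filesystem must be unmounted), the journal can be removed from an existing Ext4 volume, or a new volume can be created without one:

tune2fs -O ^has_journal /dev/sdb1     # drop the journal from an existing filesystem
mkfs.ext4 -O ^has_journal /dev/sdb2   # or build a journal-less Ext4 from scratch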

10. Online defragmentation. Although delayed allocation, multiblock allocation and extents all help reduce fragmentation, fragments are still unavoidable over time. Ext4 supports online defragmentation and will ship an e4defrag tool for defragmenting individual files or an entire filesystem.

11. Inode improvements. Ext4 supports larger inodes: compared with Ext3's default of 128 bytes, Ext4 defaults to 256-byte inodes so that more extended attributes (such as nanosecond timestamps or inode versions) fit inside the inode itself. Ext4 also supports fast extended attributes and inode reservation.

12. Persistent preallocation. To make sure a download has enough room, P2P software often pre-creates an empty file the same size as the file being downloaded, so that the download won't fail hours or days later for lack of disk space. Ext4 implements persistent preallocation at the filesystem level and exposes it through an API (posix_fallocate() in libc), which is more efficient than applications doing it themselves.

13. Barriers on by default. Disks have internal write caches that reorder batched writes to improve performance, so the filesystem must make sure the journal data has reached the platters before it writes the commit record; if the commit record lands first and the journal is then damaged, data integrity suffers. Ext4 enables barriers by default: data after a barrier can only be written once everything before the barrier is on disk. (The feature can be disabled with "mount -o barrier=0".)
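
To make that permanent, the same option can go in /etc/fstab; the line below is a hypothetical example (device and mount point invented here), and disabling barriers is only sensible when the disk cache is battery-backed:

/dev/sdb1  /data  ext4  defaults,barrier=0  0  2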


Ext4 has been out with Linux kernel 2.6.28 for a few weeks now, but I couldn't find a spare disk to test it on. Just before the new year Intel happened to send over a few SSD samples, so over the past couple of days I tested the SSDs while I was at it. The kernel used for testing was 2.6.28.2 and the benchmark tool was IOzone 3.318.

The IOzone command used for the tests:
time /opt/iozone/bin/iozone -a -s 4G -q 256 -y 4 >|/root/ext4-iozone-stdout.txt

IOzone reports the command's settings as follows:
Auto Mode
File size set to 4194304 KB
Using Maximum Record Size 256 KB
Using Minimum Record Size 4 KB
Command line used: /opt/iozone/bin/iozone -a -s 4G -q 256 -y 4
Output is in Kbytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.

Besides showing that the Intel SSD's read and write speeds are jaw-droppingly fast, the results show that Ext4 beats the previous-generation Ext3 in every respect, and in most cases is even noticeably faster than Ext2, which has no journal at all:


reclen  write  rewrite  read  reread  random-read  random-write  bkwd-read  record-rewrite  stride-read  fwrite  frewrite  fread  freread   (reclen in KB; all other columns in Kbytes/sec)
Ext2

real 28m12.718s
user 0m10.725s
sys 5m8.265s
4 218,680 216,596 630,248 245,802 88,700 138,065 106,112 1,882,623 73,538 214,175 218,364 566,570 247,381
8 215,308 218,690 556,064 246,260 154,680 150,052 188,397 2,462,367 130,896 217,157 216,647 583,808 248,397
16 216,457 216,843 575,046 245,701 258,660 158,750 306,842 2,654,320 220,939 216,061 218,140 598,174 246,581
32 217,925 214,289 537,976 243,081 394,013 167,002 464,240 2,397,831 340,775 217,434 219,353 583,463 246,341
64 215,460 219,256 527,919 244,362 503,227 162,917 609,546 2,546,079 456,243 216,875 217,692 571,707 244,264
128 219,081 216,173 540,831 242,526 609,750 161,442 721,701 2,656,729 551,122 217,780 217,427 579,271 242,291
256 216,091 217,631 565,111 245,157 654,274 173,955 870,547 2,574,261 634,835 216,638 219,693 563,735 247,101
Ext3

real 27m42.449s
user 0m11.529s
sys 7m17.049s
4 218,242 213,039 482,132 243,986 88,007 156,926 105,557 1,540,739 75,010 216,028 216,432 522,704 243,385
8 218,390 217,915 544,892 244,979 152,424 190,454 181,486 1,945,603 130,737 218,364 216,431 530,853 243,222
16 218,083 217,683 561,038 244,506 255,244 200,032 300,212 2,096,495 221,329 216,930 216,661 514,177 244,069
32 216,258 217,013 569,246 243,811 389,745 198,275 446,462 1,934,853 338,785 216,809 219,296 530,634 243,446
64 218,850 217,711 577,529 243,725 497,689 201,693 589,535 2,036,412 450,449 219,387 214,900 514,353 244,809
128 220,234 215,687 530,519 241,615 608,244 199,619 714,295 1,992,168 553,022 217,828 218,454 513,596 241,510
256 216,011 220,188 592,578 242,548 642,341 199,408 834,240 2,092,959 624,043 217,682 218,165 529,358 242,878
Ext4

real 27m3.485s
user 0m10.847s
sys 6m9.578s
4 221,823 216,992 532,488 273,668 85,210 183,195 103,036 1,862,817 74,781 225,841 220,620 523,799 272,848
8 226,028 218,580 561,960 272,036 154,972 216,505 178,482 2,135,372 132,506 227,423 215,766 641,021 271,328
16 222,241 217,746 547,548 270,895 260,899 223,895 295,288 2,095,966 223,135 226,055 216,210 621,287 273,475
32 220,121 213,025 240,426 247,628 345,210 175,977 451,631 2,145,351 342,236 225,796 213,427 598,331 269,759
64 223,983 214,437 308,696 551,577 754,941 225,897 523,130 2,218,016 448,086 227,030 214,706 582,795 272,323
128 222,576 217,816 624,636 271,293 644,500 224,997 720,468 2,308,315 582,943 225,971 217,373 552,335 274,237
256 221,202 222,238 541,685 270,898 671,748 228,085 845,494 2,215,381 643,715 225,411 219,166 580,066 273,342

[Chart omitted: throughput comparison of Ext2 / Ext3 / Ext4, in Kbytes/sec]

Notes:
1. For the IOzone methodology, see Ben Martin's article: IOzone for filesystem performance benchmarking
2. For more on Ext4, see the Kernel Newbies page: http://kernelnewbies.org/Ext4

06:13 Faster MySQL failover with SELECT mirroring » MySQL Performance Blog

One of my favorite MySQL configurations for high availability is master-master replication, which is just like normal master-slave replication except that you can fail over in both directions. Aside from MySQL Cluster, which is more special-purpose, this is probably the best general-purpose way to get fast failover and a bunch of other benefits (non-blocking ALTER TABLE, for example).
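
For context, here's a rough sketch of the wiring involved (the hostnames, replication user and log coordinates are hypothetical placeholders, not from this post; each server also needs log-bin enabled and a unique server-id): each master is simply configured as a slave of the other, so writes on either side flow to its peer.

CODE:
mysql -h master-a -e "CHANGE MASTER TO MASTER_HOST='master-b',
    MASTER_USER='repl', MASTER_PASSWORD='secret',
    MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=4;
  START SLAVE;"
mysql -h master-b -e "CHANGE MASTER TO MASTER_HOST='master-a',
    MASTER_USER='repl', MASTER_PASSWORD='secret',
    MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=4;
  START SLAVE;"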

The benefit is that you have another server with all the same data, up and running, ready to serve queries. In theory, it's a truly hot standby (stay with me -- that's not really guaranteed). You don't get this with shared storage or DRBD, although those provide stronger guarantees against data loss if mysqld crashes. And you can use the standby (passive) master for serving some SELECT queries, taking backups, etc as usual. However, if you do this you actually compromise your high-availability plan a little, because you can mask the lack of capacity that will result when one of the servers is down and you have to rely on just one server to keep everything on its feet.

If you need really high availability, you can't load the pair of servers more than a single server can handle. (You can always use the passive server for non-essential needs -- it doesn't have to be completely dead weight.) As a result, some people choose to make the passive server truly passive, handling none of the application's queries. It just sits there replicating and doing nothing else.

The problem is that the passive server's caches start to get skewed to handle the write workload from replication, and not the read workload it will have to handle if there's a planned or unplanned failover. This isn't a big problem on small systems, but with buffer pools in the dozens of gigabytes (which is arguably "small" these days), it starts to matter a lot. Warming up a system so it's actually responsive can take hours. As a result, the passive master isn't truly hot anymore. It needs to handle the workload it's supposed to be ready to take over. If you fail over to it, it might perform very badly -- get unresponsive, cause tons of I/O, etc. In reality, it can be completely unusable for a long time.

To measure how much this really matters, I did some tests for a customer who was having troubles with this type of scenario. I used mk-query-digest (with some new features) to watch the traffic on the active master and replay SELECT queries against the passive one. I timed the results and ran them through the analysis part of mk-query-digest. A simple key lookup ran in tens of milliseconds on the active master, but executed for up to dozens of seconds on the passive one.

After a couple of hours of handling SELECT traffic, these same queries were responding nicely on the passive master, too.

Is that all? "Buffer pool warmed up, performance is better, case closed!" No. This isn't as simple as it sounds on the surface. There are two things happening and both are important to understand.

The first, most obvious phenomenon is that the buffer pool gets skewed to handle the write workload. Since we're running Percona's patched server, we can actually measure what's in the buffer pool. I measured the active master's buffer pool with the following query:

SQL:
  SELECT table_schema, table_name, page_type, COUNT(*)
  FROM information_schema.innodb_buffer_pool_content
  GROUP BY 1, 2, 3
  INTO OUTFILE '/tmp/buffer-pool-contents.txt';
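
Loading one of those samples back into a table for later analysis might look something like this (a hypothetical sketch; the table layout and names are mine, not from the post, with one table per sample):

CODE:
mysql test <<'EOF'
CREATE TABLE bp_active (
  table_schema VARCHAR(64), table_name VARCHAR(64),
  page_type    VARCHAR(32), pages INT
);
-- SELECT ... INTO OUTFILE writes tab-separated values, which is LOAD DATA's default
LOAD DATA LOCAL INFILE '/tmp/buffer-pool-contents.txt' INTO TABLE bp_active;
EOF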

I loaded this file into a table on my laptop with LOAD DATA INFILE and kept it for later. I did the same on the slave. Then I used mk-query-digest to watch the traffic on the active master:

CODE:
  mk-query-digest --processlist h=active \
    --filter '$event->{arg} =~ m/^SELECT/i' \
    --execute h=passive

After a bit I CTRL-C'ed it and it printed out the analysis of the time taken to run the queries against the passive master. I restarted it and after a few hours of this I did the same thing; the query timings were dramatically better now. Then I just let it keep running without any aggregation options to avoid any overhead of storing and analyzing queries. (I added --mirror and --daemonize options so it can run in the background and follow along when the passive/active roles switch.)

After a day or so of doing this, I re-sampled the buffer pool contents on the passive server. With all three samples stored in tables on my laptop, I wrote a query against these three sets of stats to find the top tables on the active server and left-join those against the tables on the passive server, with both a mixed workload from my mirrored SELECT statements and with the "pure" replication workload. I totaled the pages up into gigabytes. Here's the result:

db_table                  active (GB)   passive + SELECT (GB)   passive only (GB)
site.benefits                    8.30                    5.73                1.32
.                                3.13                    0.94                0.50
site.user_actions                2.55                    4.09                6.29
site.user_achievements           1.36                    1.20                0.35
site.clicks                      1.26                    3.05                5.13
site.actions_finished            1.14                    0.46                0.74
site.ratings                     0.91                    0.89                0.48

The difference is clear. The buffer pool contains over 8G of data for the site.benefits table on the active master, but if you just put a replication workload on the server, that falls to 1.32G. Other tables are similar. The mixed workload with some SELECT queries mirrored is somewhere between the two.
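
For the curious, a query roughly along these lines would produce that kind of per-table comparison (the table names and the pages column refer to the hypothetical load sketch above, so they are assumptions; InnoDB pages are 16 KB, hence the multiplication by 16384):

CODE:
mysql test <<'EOF'
SELECT a.db_table,
       ROUND(a.pages * 16384 / POW(1024,3), 2)              AS active_gb,
       ROUND(COALESCE(s.pages, 0) * 16384 / POW(1024,3), 2) AS passive_select_gb,
       ROUND(COALESCE(p.pages, 0) * 16384 / POW(1024,3), 2) AS passive_gb
FROM      (SELECT CONCAT(table_schema, '.', table_name) AS db_table, SUM(pages) AS pages
           FROM bp_active GROUP BY 1) a
LEFT JOIN (SELECT CONCAT(table_schema, '.', table_name) AS db_table, SUM(pages) AS pages
           FROM bp_passive_select GROUP BY 1) s USING (db_table)
LEFT JOIN (SELECT CONCAT(table_schema, '.', table_name) AS db_table, SUM(pages) AS pages
           FROM bp_passive GROUP BY 1) p USING (db_table)
ORDER BY a.pages DESC
LIMIT 10;
EOF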

One thing we don't know is which pages are in the pool. Same table, same size of data doesn't mean same buffer pool contents. An insert-only workload will probably fill the buffer pool with the most recent data; a mixed workload will usually have some different hot spot or mixture of hot spots, so it'll bring different parts of the table into memory.

So that's the first thing that's happening. The second is the insert buffer. Notice the pages with no database or table name -- the second row in the table above. Those are a mixture of things, but it's overwhelmingly the insert buffer.

As Peter explained in his recent post on the insert buffer, the other thing the SELECTs do is keep the insert buffer in its production steady state. The buffered records are forced to be merged by the SELECTs, and far more of the insert buffer's pages stay in the buffer pool rather than on disk. So it's not just the buffer pool that gets skewed by a write-only workload: the insert buffer can also cause terrible performance. There are some subtleties about exactly what's happening in this particular case that I'm still investigating and may write more about later.

So what can we conclude from this? Simply this: if you have a standby server that's not under a realistic workload, you won't be able to get good performance after a failover. You need to use some technique to mirror the read-only workload to the passive server. It doesn't have to be the tools I used -- it could be MySQL Proxy or a TCP sniffer or anything else. But if you need fast failover, you need some way to at least partially emulate a production workload on the standby machine.

PS: I see Robert Hodges just published an article on warm standby for PostgreSQL. Link love for interested readers.


Entry posted by Baron Schwartz | No comment



