Yes, I just got through watching the Super Bowl, so chips and salsa are on my mind and in my stomach. In recreational eating, more chips require downing more salsa. With multicore chips it turns out that as cores go up, salsa goes down, salsa obviously being a metaphor for speed.
Sandia National Laboratories found in their simulations a significant increase in speed going from two to four cores, but an insignificant increase from four to eight. Beyond eight cores, speed actually decreases: sixteen cores perform barely as well as two, and after that a steep decline sets in as more cores are added. The problem is a lack of memory bandwidth, along with contention between processors over the memory bus available to each.
The implication for those following a diagonal scaling strategy is to work like heck to make your system fit within eight cores. After that you'll need to consider some sort of partitioning strategy. What's interesting is the research on where the cutoff point will be.
Spring Festival is already over. This year 小容 spent it in New York, and also took care of a major life event; details will be shared here later.
In the new year, this blog will start a new little column: "Monthly Ideas" (每月创想). This first installment explains the background of the column and shares the first monthly idea.
· Why start the "Monthly Ideas" column?
After reviewing the lessons of 2008, 小容 has recently also been reviewing four years of the Swordi studio. All along, Swordi has existed only in virtual form; 小容 treats it as a brand for personal spare-time work. Over the past four years, Swordi has designed logos for a number of non-profit and startup projects, offering advice along the way.
In October 2008, 小容 designed a logo for a small commercial project under a new model: the design was free, but the client was invited to donate whatever they chose. When the job was done, the Swordi studio received its first donation, which 小容 then passed on to other projects of interest.
That experiment became a turning point, prompting 小容 to rethink how the Swordi studio will operate from now on:
· provide free design/consulting services for public-interest causes;
· accept donations from all quarters;
· use the donations received to incubate interesting new projects.
小容 calls this model "open design / open consulting / open research". In the past, the Swordi studio served existing non-profit projects; in the future, it will seek to initiate and incubate new ones.
Collecting and developing ideas is the starting point of this new mode of operation.
· Sharing ideas is itself a practice worth spreading
Traditional Chinese culture is fond of criticizing people who talk big; the proverb 站着说话不腰疼 ("standing while talking doesn't make your back ache", roughly "talk is cheap") is one example. Against this background, people speak very cautiously in public, and instead of encouraging other people's concepts, creative ideas and suggestions, they criticize and mock them, often taunting the speaker over whether they are capable of carrying the idea out themselves.
In a sufficiently open culture, however, a good enough idea will not be laughed at even if you cannot realize it yourself. People will say: that's a great idea; let's see whether someone can help make it happen.
In the sharing spaces shaped by social networks, a task can be divided into many small steps, with no one step's contribution ranking above another's, and everyone can apply their strengths where they fit best.
So when you have a good idea, you can choose to share it in the public space. Sharing ideas is itself a practice worth spreading.
You can donate your idea to the public domain, where an organization with the means to implement it may turn it into reality, just as you might tag a casually taken photo with a Creative Commons (CC) license and discover one day that someone has turned it into another work of art, or used it to design an ad for some charity project.
· Monthly Ideas: you're part of this too!
In 2004, after the logo 小容 designed for OOPS was chosen by popular vote, 小容 made a silent wish:
From now on, the Swordi studio will design a logo for at least one non-profit project every year.
At the end of 2008, 小容 made another:
From now on, the Swordi studio will help at least one idea grow into a concrete project every year.
小容 hereby invites all friends, old and new, to join this band of idea-makers; let's each do what we do best, pool our wisdom and effort, and change the world around us bit by bit.
Throughout 2009, 小容 will share one idea per month on this blog, twelve ideas in a year. If ten people do the same thing, then a year later we will have one hundred and twenty ideas. And if, along the way, two or three of them evolve into concrete projects, such small actions, accumulated over many years, will give rise to enormous value.
· January's idea: China Clone 2.0
Idea name: China Clone 2.0
Summary: build a website whose main purpose is to translate and introduce foreign public-interest campaigns and civic initiatives, to encourage internet users in mainland China to clone these foreign projects one by one, and to provide resources that help the clones grow.
Notes:
People have cloned Flickr, YouTube, MySpace, Facebook, 43things and plenty of other foreign Web 2.0 sites. Why not also clone public-interest 2.0 sites such as the following?
Change.org,
SocialActions.com,
CauseCast.org,
WeDesignChange.org
Many good things never happen, perhaps because the people with the will and the ability to do them have never heard of the good idea, while the people who know the idea lack the ability to carry it out. So: collect enough good ideas that have already been tried abroad, spread them across a wide enough field, and let the people capable of acting on them encounter them. That is a good way to start making change happen.
This idea has actually been around for a while; some time ago 小容 set up a group called Project2.0 on 译言 (Yeeyan):
"We often come across very creative websites. Here we collect interesting online projects and translate these ideas for the Chinese-language web, so that people can do the same creative things in the same way. Let good ideas flatten the world!"
Recently 小容 has also discussed similar topics with some friends, who are planning something along the same lines: a database of public-interest projects that organizes, as case studies, the ideas and concrete models behind public-interest projects abroad, as a reference for friends in mainland China.
This January idea also serves as an announcement on those friends' behalf: if you are interested in turning it into a concrete project, please leave a comment here and get in touch with us, so that we can pool our efforts, take action, and make change happen.
---
If you also start sharing your own monthly ideas on your blog, please send a Trackback here, or leave a comment :)
time /opt/iozone/bin/iozone -a -s 4G -q 256 -y 4 >|/root/ext4-iozone-stdout.txt
Auto Mode
File size set to 4194304 KB
Using Maximum Record Size 256 KB
Using Minimum Record Size 4 KB
Command line used: /opt/iozone/bin/iozone -a -s 4G -q 256 -y 4
Output is in Kbytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.
| fs (time) | reclen | write | rewrite | read | reread | random read | random write | bkwd read | record rewrite | stride read | fwrite | frewrite | fread | freread |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ext2 (real 28m12.718s, user 0m10.725s, sys 5m8.265s) | 4 | 218,680 | 216,596 | 630,248 | 245,802 | 88,700 | 138,065 | 106,112 | 1,882,623 | 73,538 | 214,175 | 218,364 | 566,570 | 247,381 |
| | 8 | 215,308 | 218,690 | 556,064 | 246,260 | 154,680 | 150,052 | 188,397 | 2,462,367 | 130,896 | 217,157 | 216,647 | 583,808 | 248,397 |
| | 16 | 216,457 | 216,843 | 575,046 | 245,701 | 258,660 | 158,750 | 306,842 | 2,654,320 | 220,939 | 216,061 | 218,140 | 598,174 | 246,581 |
| | 32 | 217,925 | 214,289 | 537,976 | 243,081 | 394,013 | 167,002 | 464,240 | 2,397,831 | 340,775 | 217,434 | 219,353 | 583,463 | 246,341 |
| | 64 | 215,460 | 219,256 | 527,919 | 244,362 | 503,227 | 162,917 | 609,546 | 2,546,079 | 456,243 | 216,875 | 217,692 | 571,707 | 244,264 |
| | 128 | 219,081 | 216,173 | 540,831 | 242,526 | 609,750 | 161,442 | 721,701 | 2,656,729 | 551,122 | 217,780 | 217,427 | 579,271 | 242,291 |
| | 256 | 216,091 | 217,631 | 565,111 | 245,157 | 654,274 | 173,955 | 870,547 | 2,574,261 | 634,835 | 216,638 | 219,693 | 563,735 | 247,101 |
| Ext3 (real 27m42.449s, user 0m11.529s, sys 7m17.049s) | 4 | 218,242 | 213,039 | 482,132 | 243,986 | 88,007 | 156,926 | 105,557 | 1,540,739 | 75,010 | 216,028 | 216,432 | 522,704 | 243,385 |
| | 8 | 218,390 | 217,915 | 544,892 | 244,979 | 152,424 | 190,454 | 181,486 | 1,945,603 | 130,737 | 218,364 | 216,431 | 530,853 | 243,222 |
| | 16 | 218,083 | 217,683 | 561,038 | 244,506 | 255,244 | 200,032 | 300,212 | 2,096,495 | 221,329 | 216,930 | 216,661 | 514,177 | 244,069 |
| | 32 | 216,258 | 217,013 | 569,246 | 243,811 | 389,745 | 198,275 | 446,462 | 1,934,853 | 338,785 | 216,809 | 219,296 | 530,634 | 243,446 |
| | 64 | 218,850 | 217,711 | 577,529 | 243,725 | 497,689 | 201,693 | 589,535 | 2,036,412 | 450,449 | 219,387 | 214,900 | 514,353 | 244,809 |
| | 128 | 220,234 | 215,687 | 530,519 | 241,615 | 608,244 | 199,619 | 714,295 | 1,992,168 | 553,022 | 217,828 | 218,454 | 513,596 | 241,510 |
| | 256 | 216,011 | 220,188 | 592,578 | 242,548 | 642,341 | 199,408 | 834,240 | 2,092,959 | 624,043 | 217,682 | 218,165 | 529,358 | 242,878 |
| Ext4 (real 27m3.485s, user 0m10.847s, sys 6m9.578s) | 4 | 221,823 | 216,992 | 532,488 | 273,668 | 85,210 | 183,195 | 103,036 | 1,862,817 | 74,781 | 225,841 | 220,620 | 523,799 | 272,848 |
| | 8 | 226,028 | 218,580 | 561,960 | 272,036 | 154,972 | 216,505 | 178,482 | 2,135,372 | 132,506 | 227,423 | 215,766 | 641,021 | 271,328 |
| | 16 | 222,241 | 217,746 | 547,548 | 270,895 | 260,899 | 223,895 | 295,288 | 2,095,966 | 223,135 | 226,055 | 216,210 | 621,287 | 273,475 |
| | 32 | 220,121 | 213,025 | 240,426 | 247,628 | 345,210 | 175,977 | 451,631 | 2,145,351 | 342,236 | 225,796 | 213,427 | 598,331 | 269,759 |
| | 64 | 223,983 | 214,437 | 308,696 | 551,577 | 754,941 | 225,897 | 523,130 | 2,218,016 | 448,086 | 227,030 | 214,706 | 582,795 | 272,323 |
| | 128 | 222,576 | 217,816 | 624,636 | 271,293 | 644,500 | 224,997 | 720,468 | 2,308,315 | 582,943 | 225,971 | 217,373 | 552,335 | 274,237 |
| | 256 | 221,202 | 222,238 | 541,685 | 270,898 | 671,748 | 228,085 | 845,494 | 2,215,381 | 643,715 | 225,411 | 219,166 | 580,066 | 273,342 |
One of my favorite MySQL configurations for high availability is master-master replication, which is just like normal master-slave replication except that you can fail over in both directions. Aside from MySQL Cluster, which is more special-purpose, this is probably the best general-purpose way to get fast failover and a bunch of other benefits (non-blocking ALTER TABLE, for example).
The benefit is that you have another server with all the same data, up and running, ready to serve queries. In theory, it's a truly hot standby (stay with me -- that's not really guaranteed). You don't get this with shared storage or DRBD, although those provide stronger guarantees against data loss if mysqld crashes. And you can use the standby (passive) master for serving some SELECT queries, taking backups, etc. as usual. However, if you do this you actually compromise your high-availability plan a little, because it can mask the lack of capacity you'll face when one of the servers is down and a single server has to keep everything on its feet.
If you need really high availability, you can't load the pair of servers more than a single server can handle. (You can always use the passive server for non-essential needs -- it doesn't have to be completely dead weight.) As a result, some people choose to make the passive server truly passive, handling none of the application's queries. It just sits there replicating and doing nothing else.
The problem is that the passive server's caches start to get skewed to handle the write workload from replication, and not the read workload it will have to handle if there's a planned or unplanned failover. This isn't a big problem on small systems, but with buffer pools in the dozens of gigabytes (which is arguably "small" these days), it starts to matter a lot. Warming up a system so it's actually responsive can take hours. As a result, the passive master isn't truly hot anymore. It needs to handle the workload it's supposed to be ready to take over. If you fail over to it, it might perform very badly -- get unresponsive, cause tons of I/O, etc. In reality, it can be completely unusable for a long time.
To measure how much this really matters, I did some tests for a customer who was having troubles with this type of scenario. I used mk-query-digest (with some new features) to watch the traffic on the active master and replay SELECT queries against the passive one. I timed the results and ran them through the analysis part of mk-query-digest. A simple key lookup ran in tens of milliseconds on the active master, but executed for up to dozens of seconds on the passive one.
After a couple of hours of handling SELECT traffic, these same queries were responding nicely on the passive master, too.
Is that all? "Buffer pool warmed up, performance is better, case closed!" No. This isn't as simple as it sounds on the surface. There are two things happening and both are important to understand.
The first, most obvious phenomenon is that the buffer pool gets skewed to handle the write workload. Since we're running Percona's patched server, we can actually measure what's in the buffer pool. I measured the active master's buffer pool with the following query:
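The query itself didn't survive in this copy of the post. A sketch of the kind of measurement described, assuming the Percona patches' `INNODB_BUFFER_POOL_PAGES_INDEX` table in `INFORMATION_SCHEMA` (the exact table and column names vary between releases; `schema_name`, `table_name`, the host name, and the output path here are assumptions, not the original):

```shell
# Hypothetical reconstruction: count buffer-pool pages per table on the
# active master and save the result to a file for later analysis.
# Requires a Percona-patched server; names are placeholders.
mysql -h active-master -e "
    SELECT schema_name, table_name, COUNT(*) AS pages
    FROM INFORMATION_SCHEMA.INNODB_BUFFER_POOL_PAGES_INDEX
    GROUP BY schema_name, table_name
    ORDER BY pages DESC;
" > /tmp/bp-active.txt
```

The saved file is what the next paragraph loads into a table with LOAD DATA INFILE.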
I loaded this file into a table on my laptop with LOAD DATA INFILE and kept it for later. I did the same on the slave. Then I used mk-query-digest to watch the traffic on the active master:
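The watch-and-replay command was likewise lost here. Roughly, mk-query-digest can read a tcpdump capture of the active master's MySQL traffic and re-execute only the SELECTs against the passive master; something along these lines (host, interface, and credentials are placeholders):

```shell
# Sketch: sniff MySQL traffic on the active master and replay the SELECT
# statements against the passive master; mk-query-digest prints its timing
# analysis when interrupted with CTRL-C. All connection details are
# placeholders, not the customer's actual setup.
tcpdump -i eth0 port 3306 -s 65535 -x -nn -q -tttt \
  | mk-query-digest --type tcpdump \
      --filter '$event->{arg} =~ m/\A\s*SELECT/i' \
      --execute h=passive-master
```

The long-running background form mentioned below would add the --mirror and --daemonize options and drop the aggregation/reporting.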
After a bit I CTRL-C'ed it and it printed out the analysis of the time taken to run the queries against the passive master. I restarted it and after a few hours of this I did the same thing; the query timings were dramatically better now. Then I just let it keep running without any aggregation options to avoid any overhead of storing and analyzing queries. (I added --mirror and --daemonize options so it can run in the background and follow along when the passive/active roles switch.)
After a day or so of doing this, I re-sampled the buffer pool contents on the passive server. With all three samples stored in tables on my laptop, I wrote a query against these three sets of stats to find the top tables on the active server and left-join those against the tables on the passive server, with both a mixed workload from my mirrored SELECT statements and with the "pure" replication workload. I totaled the pages up into gigabytes. Here's the result:
| db_table | active (GB) | passive + SELECT (GB) | passive (GB) |
|---|---|---|---|
| site.benefits | 8.30 | 5.73 | 1.32 |
| . | 3.13 | 0.94 | 0.50 |
| site.user_actions | 2.55 | 4.09 | 6.29 |
| site.user_achievements | 1.36 | 1.20 | 0.35 |
| site.clicks | 1.26 | 3.05 | 5.13 |
| site.actions_finished | 1.14 | 0.46 | 0.74 |
| site.ratings | 0.91 | 0.89 | 0.48 |
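For the curious, the aggregation behind this table could look roughly like the following, assuming each buffer-pool sample was loaded into its own `(db_table, pages)` table in a `stats` schema on the laptop. All table and column names are invented for illustration; dividing 16KB pages by 65536 converts them to gigabytes:

```shell
# Hypothetical sketch of the final report query, run against the laptop
# copies of the three buffer-pool samples. Names are made up.
mysql -e "
    SELECT a.db_table,
           ROUND(a.pages / 65536, 2)  AS active_gb,
           ROUND(ps.pages / 65536, 2) AS passive_select_gb,
           ROUND(p.pages / 65536, 2)  AS passive_gb
    FROM bp_active AS a
    LEFT JOIN bp_passive_select AS ps ON ps.db_table = a.db_table
    LEFT JOIN bp_passive        AS p  ON p.db_table  = a.db_table
    ORDER BY a.pages DESC;
" stats
```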
The difference is clear. The buffer pool contains over 8G of data for the site.benefits table on the active master, but if you just put a replication workload on the server, that falls to 1.32G. Other tables are similar. The mixed workload with some SELECT queries mirrored is somewhere between the two.
One thing we don't know is which pages are in the pool. Same table, same size of data doesn't mean same buffer pool contents. An insert-only workload will probably fill the buffer pool with the most recent data; a mixed workload will usually have some different hot spot or mixture of hot spots, so it'll bring different parts of the table into memory.
So that's the first thing that's happening. The second is the insert buffer. Notice the pages with no database or table name -- the second row in the table above. Those are a mixture of things, but it's overwhelmingly the insert buffer.
As Peter explained in his recent post on the insert buffer, the other thing the SELECTs do is keep the insert buffer in a production steady state. The SELECTs force the buffered records to be merged, and far more of the insert buffer's pages stay in the buffer pool rather than on disk. So it's not just the buffer pool that gets skewed by a write-only workload; the insert buffer can also cause terrible performance. There are some subtleties about exactly what's happening in this particular case that I'm still investigating and may write more about later.
So what can we conclude from this? Simply this: if you have a standby server that's not under a realistic workload, you won't be able to get good performance after a failover. You need to use some technique to mirror the read-only workload to the passive server. It doesn't have to be the tools I used -- it could be MySQL Proxy or a TCP sniffer or anything else. But if you need fast failover, you need some way to at least partially emulate a production workload on the standby machine.
PS: I see Robert Hodges just published an article on warm standby for PostgreSQL. Link love for interested readers.
Entry posted by Baron Schwartz