Tim O'Reilly
2008-07-31
I've been worried for some years that the open source movement might fall prey to the problem that Kim Stanley Robinson so incisively captured in Green Mars: "History is a wave that moves through time slightly faster than we do." Innovators are left behind, as the world they've changed picks up on their ideas, runs with them, and takes them in unexpected directions.

In essays like The Open Source Paradigm Shift and What is Web 2.0?, I argued that the success of the internet as a non-proprietary platform built largely on commodity open source software could lead to a new kind of proprietary lock-in in the cloud. What good are free and open source licenses, all based on the act of software distribution, when software is no longer distributed but merely performed on the global network stage? How can we preserve the freedom to innovate when the competitive advantage of online players comes from massive databases created via user contribution, which literally get better the more people use them, raising seemingly insuperable barriers to new competition?

I was heartened by the program at this year's Open Source Convention. Over the past couple of years, open source programs aimed at the Web 2.0 and cloud computing problem space have been proliferating, and I'm seeing clear signs that the values of open source are being reframed for the network era. Sessions like Beyond REST? Building Data Services with XMPP PubSub; Cloud Computing with BigData; Hypertable: An Open Source, High Performance, Scalable Database; Supporting the Open Web; and Processing Large Data with Hadoop and EC2 were all full. (Due to enforcement of fire regulations at the Portland Convention Center, many of them had people turned away, as SRO was not allowed. Brian Aker's Drizzle talk was so popular that he had to give it three times!)

But just "paying attention" to cloud computing isn't the point. The point is to rediscover what makes open source tick, but in the new context. It's important to recognize that open source has several key dimensions that contribute to its success:
This is far from a complete list, but it gives food for thought. As outlined above, I don't believe we've figured out what kinds of licenses will allow forking of Web 2.0 and cloud applications, especially because the lock-in provided by many of these applications comes from their data rather than their code. However, there are hopeful signs, like Yahoo! Boss, that companies are beginning to understand that in the era of the cloud, open source without open data is only half the application.

But even open data is fundamentally challenged by the idea of utility computing in the cloud. Jesse Vincent, the guy who's brought out some of the best hacker t-shirts ever (as well as RT), put it succinctly: "Web 2.0 is digital sharecropping." (Googling, I discover that Nick Carr seems to have coined this meme back in 2006!) If this is true of many Web 2.0 success stories, it's even more true of cloud computing as infrastructure. I'm ever mindful of Microsoft Windows Live VP Debra Chrapaty's dictum that "In the future, being a developer on someone's platform will mean being hosted on their infrastructure." The New York Times dubbed bandwidth providers OPEC 2.0. How much more will that become true of cloud computing platforms?

That's why I'm interested in peer-to-peer approaches to delivering internet applications. Jesse Vincent's talk, Prophet: Your Path Out of the Cloud, describes a system for federated sync; Evan Prodromou's Open Source Microblogging describes identi.ca, a federated open source approach to lifestreaming applications.

We can talk all we like about open data and open services, but frankly, it's important to realize just how much of what is possible is dictated by the architecture of the systems we use. Ask yourself, for example, why the PC wound up with an ecosystem of binary freeware, while Unix wound up with an ecosystem of open source software.
It wasn't just ideology; it was that the fragmented hardware architecture of Unix required source so users could compile the applications for their machines. Why did the WWW end up with hundreds of millions of independent information providers while centralized sites like AOL and MSN faltered?

Take note: all of the platform-as-a-service plays, from Amazon's S3 and EC2 and Google's AppEngine to Salesforce's force.com, not to mention Facebook's social networking platform, have a lot more in common with AOL than they do with internet services as we've known them over the past decade and a half. Will we have to spend a decade backtracking from centralized approaches? The interoperable internet should be the platform, not any one vendor's private preserve. (Neil McAllister provides a look at just how one-sided most platform-as-a-service contracts are.)

So here's my first piece of advice: if you care about open source for the cloud, build on services that are designed to be federated rather than centralized. Architecture trumps licensing any time.

But peer-to-peer architectures aren't as important as open standards and protocols. If services are required to interoperate, competition is preserved. Despite all of Microsoft's and Netscape's efforts to "own" the web during the browser wars, they failed because Apache held the line on open standards. This is why the Open Web Foundation, announced last week at OSCON, is putting an important stake in the ground. It's not just open source software for the web that we need, but open standards that will ensure that dominant players still have to play nice.

The "internet operating system" that I'm hoping to see evolve over the next few years will require developers to move away from thinking of their applications as endpoints, and more as re-usable components. For example, why does every application have to try to recreate its own social network? Shouldn't social networking be a system service?
This isn't just a "moral" appeal, but strategic advice. The first provider to build a reasonably open, re-usable system service in any particular area is going to get the biggest uptake. Right now, there's a lot of focus on low-level platform subsystems like storage and computation, but I continue to believe that many of the key subsystems in this evolving OS will be data subsystems: identity, location, payment, product catalogs, music, and so on. And eventually, these subsystems will need to be reasonably open and interoperable, so that a developer can build a data-intensive application without having to own all the data his application requires. This is what John Musser calls the programmable web.

Note that I said "reasonably open." Google Maps isn't open source by any means, but it was open enough (considerably more so than any preceding web mapping service), and so it became a key component of a whole generation of new applications that no longer needed to do their own mapping. A quick look at programmableweb.com shows Google Maps with about a 90% share of mapping mashups. Google Maps is proprietary, but it is reusable. A key test of whether an API is open is whether it is used to enable services that are not hosted by the API provider, and are distributed across the web. Facebook's APIs enable applications on Facebook; Google Maps is a true programmable web subsystem.

That being said, even though the cloud platforms themselves are mostly proprietary, the software stacks running on them are not. Thorsten von Eicken of RightScale pointed out in his talk, Scale Into the Cloud, that almost all of the software stacks running on cloud computing platforms are open source, for the simple reason that proprietary software licenses have no provisions for cloud deployment. Even though open source licenses don't prevent lock-in by cloud providers, they do at least allow developers to deploy their work on the cloud.
In that context, it's important to recognize that even proprietary cloud computing provides one of the key benefits of open source: low barriers to entry. Derek Gottfried's Processing Large Data with Hadoop and EC2 talk was especially sweet in demonstrating this point. Derek described how, armed with a credit card, a sliver of permission, and his hacking skills, he was able to put the NY Times historical archive online for free access, ramping up from 4 instances to nearly 1,000. Open source is about enabling innovation and re-use, and at their best, Web 2.0 and cloud computing can be bent to serve those same aims.

Yet another benefit of open source, try-before-you-buy viral marketing, is also possible for cloud application vendors. During one venture pitch, I asked the company how they'd avoid the high sales costs typically associated with enterprise software. Open source has solved this problem by letting companies build a huge pipeline of free users, whom they can then upsell with follow-on services. The cloud answer isn't quite as good, but at least there's an answer: some number of application instances are free, and you charge after that. While this business model loses some virality, and transfers some costs from the end user to the application provider, it has a benefit that open source now lacks: a much stronger upgrade path to paid services. Only time will tell whether open source or cloud deployment is a better distribution vector, but it's clear that both are miles ahead of traditional proprietary software in this regard.

In short, we're a long way from having all the answers, but we're getting there.
Despite all the possibilities for lock-in that we see with Web 2.0 and cloud computing, I believe that the benefits of openness and interoperability will eventually prevail, and we'll see a system made up of cooperating programs that aren't all owned by the same company: an internet platform that, like Linux on the commodity PC architecture, is assembled from the work of thousands. Those who are skeptical of the idea of the internet operating system argue that we're missing the kinds of control layers that characterize a true operating system. I like to remind them that much of the software that is today assembled into a Linux system already existed before Linus wrote the kernel. Like LA, "72 suburbs in search of a city," today's web is 72 subsystems in search of an operating system kernel. When we finally get that kernel, it had better be open source.
JOINs are expensive, and most typically the fewer tables (from the same database) you join, the better performance you will get. As with any rule, however, there are exceptions.
The one I'm speaking about stems from the MySQL optimizer stopping the use of further index key parts as soon as there is a range condition on a preceding key part. So if you have INDEX(A,B) and a WHERE clause like A BETWEEN 5 AND 10 AND B=6, only the first part (A) of the index will be used, which can seriously affect performance. Of course, in this example you could use INDEX(B,A), but there are many similar cases where that is not possible.
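As a minimal sketch of the gotcha (the table and column names here are made up for illustration, not taken from the original post):

```sql
-- Hypothetical table with a two-part composite index
CREATE TABLE t (
  a INT NOT NULL,
  b INT NOT NULL,
  KEY ab (a, b)
);

-- Because of the range on `a`, the optimizer uses only the first
-- key part of `ab`; the `b = 6` condition is checked row by row
-- rather than resolved through the index.
SELECT * FROM t WHERE a BETWEEN 5 AND 10 AND b = 6;
```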
I have described a couple of solutions to this problem before: using an IN list instead of a range, or using UNION. These, however, require rather serious application changes and can also result in huge IN lists and suboptimal execution for large ranges.
Let's take a look at a very typical reporting query, which fetches data for a date range across multiple groups (these can be devices, pages, users, etc.):
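A sketch of the kind of query being described, assuming a hypothetical `info(day, group_id, cnt)` table with a composite index on (day, group_id); the names are illustrative only:

```sql
-- Hypothetical reporting table: one row per group per day
CREATE TABLE info (
  day      DATE NOT NULL,
  group_id INT  NOT NULL,
  cnt      INT  NOT NULL,
  KEY day_group (day, group_id)
);

-- Date range plus a list of groups: the range on `day` prevents the
-- optimizer from using the second key part for `group_id`.
SELECT group_id, SUM(cnt)
FROM info
WHERE day BETWEEN '2008-07-01' AND '2008-07-31'
  AND group_id IN (1,2,3,4,5,6,7,8,9,10)
GROUP BY group_id;
```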
As you can see from the EXPLAIN, this query is expected to examine over 300,000 rows, which is relatively fast for this (in-memory) table but will become unacceptable as soon as you start doing random disk I/O.
Note this is also an interesting case of EXPLAIN being wrong: it shows key_len=7, which corresponds to the full key, while only the first key part is actually used.
Let us now replace the range with an IN list in this query:
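A sketch of the rewritten query (again assuming a hypothetical `info(day, group_id, cnt)` table with KEY(day, group_id)): expanding the date range into an explicit list of days makes every key part an equality, so both index parts can be used.

```sql
-- Every condition is now an equality over a list, so the optimizer
-- can use both parts of the (day, group_id) index.
SELECT group_id, SUM(cnt)
FROM info
WHERE day IN ('2008-07-01','2008-07-02', /* ...all 31 days... */ '2008-07-31')
  AND group_id IN (1,2,3,4,5,6,7,8,9,10)
GROUP BY group_id;
```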
So we get the same result, but approximately 50 times faster. In this report we had just one month's worth of data; what if you had a year? Five years? What if you query, say, thousands of groups at the same time? Performing such a query, MySQL has to build (and do lookups for) all combinations, which is 31*10=310 in this case. But once it gets to hundreds of thousands of combinations this method starts to break down (and newer MySQL versions will stop using this optimization if there are too many combinations to check).
Instead, you could use a JOIN to get the list of days matching the range from a pre-generated table, and use that join to retrieve the rows from the original table:
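A sketch of this approach, using the same hypothetical names as assumptions (`info(day, group_id, cnt)` for the fact table, plus a pre-generated calendar table):

```sql
-- Pre-generated table of calendar days
CREATE TABLE days (
  day DATE NOT NULL PRIMARY KEY
);

-- Drive the lookups from the days table: for each matching day,
-- the intent is an equality lookup into info's (day, group_id) index.
SELECT i.group_id, SUM(i.cnt)
FROM days d
JOIN info i ON i.day = d.day
WHERE d.day BETWEEN '2008-07-01' AND '2008-07-31'
  AND i.group_id IN (1,2,3,4,5,6,7,8,9,10)
GROUP BY i.group_id;
```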
As you can see, it does not work here, even though I know I have used exactly this trick to optimize some nasty queries.
It looks like equality propagation is at work here (note that the number of rows for the second table in the join is estimated the same as in the original query), and we get the range condition on the "info" table instead of a nested-loops join, exactly what we tried to avoid.
It is easy to block equality propagation by using some trivial function:
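The exact function used in the original post is not preserved here; one common choice (an assumption on my part) is a no-op date expression in the join condition, which hides the plain column-to-column equality from the optimizer:

```sql
-- `+ INTERVAL 0 DAY` does not change the value, but the join
-- condition is no longer a bare equality between two columns,
-- so equality propagation cannot rewrite it into a range on info.
SELECT i.group_id, SUM(i.cnt)
FROM days d
JOIN info i ON i.day = d.day + INTERVAL 0 DAY
WHERE d.day BETWEEN '2008-07-01' AND '2008-07-31'
  AND i.group_id IN (1,2,3,4,5,6,7,8,9,10)
GROUP BY i.group_id;
```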
So we stopped equality propagation, but now we have another problem: for some reason MySQL decides to do a "ref" lookup on the date alone, instead of using a range over the day and the list of groups for each join iteration. This does not make sense, but this is how it is. I also tried to increase cardinality by giving every row a different group_id, and it still does not work.
The trick does work, however, if you have just one group_id (and in that case you do not even need to trick around equality propagation to make it work).
For the original query form with a single group_id, the query was taking 0.95 sec. The query with the BETWEEN range replaced by an IN list was instant (0.00 sec), the same as the query using a join with the day-list table.
So we finally managed to get better performance by joining the data to yet another table, though why it does not work for multiple groups remains a question to check with the MySQL Optimizer team.
UPDATE: I just heard back from Igor Babaev, who says it was designed this way (because the first component can run through very many values): the second component is simply not considered for a range unless it is an equality. You always have something to learn about MySQL Optimizer gotchas.
At the same time, I figured out how to make the MySQL optimizer do what we want: just add yet another table to the join, so the info table gets nothing but ref lookups:
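A sketch of this final form, with the same assumed names (`info(day, group_id, cnt)`, the `days` calendar table, and a second hypothetical helper table holding the wanted group ids):

```sql
-- Second helper table: the group ids we want to report on
CREATE TABLE groups (
  group_id INT NOT NULL PRIMARY KEY
);

-- With both the day and the group_id supplied by joined tables,
-- every access to info is an equality (ref) lookup on the full
-- (day, group_id) index, one per day/group combination.
SELECT g.group_id, SUM(i.cnt)
FROM days d
CROSS JOIN groups g
JOIN info i ON i.day = d.day AND i.group_id = g.group_id
WHERE d.day BETWEEN '2008-07-01' AND '2008-07-31'
GROUP BY g.group_id;
```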
This query looks very scary, but in fact it performs much better than the original one. In real queries you can use a table of ids, just as we had a table of days, with a WHERE clause instead of a pre-created table.
Entry posted by peter | 4 comments
July 31st, 2006 was my last day working for MySQL, and on August 1st I started what was later incorporated as Percona, with Vadim joining me on September 1st as co-founder.
Two years is a significant anniversary for any startup: surviving (and being profitable) for two years can be seen as validation of our business model and strategy, and we're quite happy about this.
So what is our strategy? I left MySQL with the idea of building a company that would be fair in rewarding its employees for their contributions, in particular the engineers who do a lot of the heavy lifting in technology companies. I really liked many of the ideas Monty implemented during the early years of MySQL (you can see many of these same ideas described in the Hacking Companies article). We're not exactly like that, but we're very close in spirit, which you could describe as letting smart engineers gather and do cool stuff together.
The second part of our strategy is being fair to customers and providing them with great service at fair prices. We decided from the start that we would make money as a consulting company, being paid for the work it takes to deliver a service, rather than focusing on maximizing leverage by selling software or subscriptions.
We develop software to be able to provide better services at lower cost to the client. This makes sense because we can help more people, and it builds efficiency as our competitive advantage.
The third part, which is important for us as founders (and we try to hire people who share our values), is giving back to the community. It works as a great marketing vehicle for us, but it also just feels right. We feel open source software is a great way for a technology company to give back to the community. We've sponsored MMM and Maatkit, released the InnoDB Recovery Tools (we probably could have made a lot of money keeping these in-house, but it just does not feel right to leave people in need without a tool to get their data back if they can't pay), and sponsored some Sphinx development. We have also published a variety of patches for MySQL. Our giving back to the community does not stop there, though: on the technical landscape we try to provide a lot of information via the blog, forums, and presentations. We also contribute to other worthy causes, like gathering money for Ivan's surgery.
Where do we plan to go? We're helping customers build and maintain high-quality applications. Currently our focus is on MySQL and surrounding technologies, but that is because it is the "pick of the web". We're constantly looking at emerging technologies to see what can be used for building large-scale web applications, which is where the core of our interest lies. We see what other challenges our customers have, and we have consultants joining us with different backgrounds, which allows us to provide additional services such as capacity planning, migrations, web-layer optimizations, MySQL customizations/optimizations, etc. We want people with their own great ideas to join us and develop them in an entrepreneur-friendly atmosphere.
In these two years we've grown from a 2-person company to one employing over 20 full-time employees in Europe and the US. We're still a virtual company, with no office where people would work. MySQL was a great school for showing how this is possible.
We've stayed profitable the whole time, attracting no external money such as venture funding or loans. This allows us to develop the company at our own pace, with no obligation to deliver huge returns to anyone. We believe that as a consulting company we do not need these to maintain a comfortable growth pace, without putting undue pressure on our employees, while retaining our team values.
For Vadim and me the change was a serious one. When we started, delivering high-quality services was our main challenge, and as engineers this was something we knew pretty well how to do. As the company grew, our roles changed to include a lot of challenges in organizing the administrative and sales processes, ensuring we're paid and that we pay our consultants, managing people, and leading the company. We're learning a lot as we go, and we're listening to the advice of the mentors we can find. We're also growing the team by looking not only for great engineers but also for people with great management and administrative skills.
Yesterday Monty visited us for dinner, and I told him it is the two-year anniversary of my leaving MySQL. He asked us if we're happy with the choice or have regrets: we have none, and we're looking forward to the next two years. Getting your own company up and running is a lot of hard work, but it is a lot of fun too.
Entry posted by peter | 9 comments
Brady Forrest
2008-08-01
The geoweb is going 3D. Google is bringing Google Earth into the browser via a plug-in. Photosynth, the 3D photo collection creator and viewer, is moving into Microsoft's Virtual Earth team (this was posted about on July 26th; the post was removed, but is still findable in the caches of both Google and Live). Google's Panoramio, a location-oriented photo-sharing site, has released its own 3D-ish photo viewer (see the Sydney Opera House and launch coverage on Google Earth Blog). And the geo teams of both Google and Microsoft have their own 3D modelers, SketchUp and trueSpace (more info) respectively.

However, the imagery that you see in VE or Google is not 3D. That is where Earthmine, a Berkeley-based startup, is hoping to come in (Radar post). They are currently mapping four cities with NASA technology and a custom-designed camera rig. Each pixel in an image is assigned a 3D coordinate. Capturing this data allows for a multitude of future applications.

Their current, private environment is dogfooding their own API. In it you can see some of the promise of 3D mapping (and some beautiful imagery). They enable you to tag a location or add a virtual object. You can also select points and measure the real-world distance between them (as shown in the screenshot).

There will be two versions of the API. A Flash Viewer API is available for public sites and can be programmed with JavaScript and ActionScript. It includes coordinate search and the ability to merge existing data sets. The Direct Data API will be a low-level REST API and will provide direct access to images (by lat-long coordinates) and 3D lookups. It can also support 3D models (hopefully the ones you've created in the free SketchUp and trueSpace programs).

Earthmine's four-city beta is going to be launching in the fall. They are currently looking for launch partners. Personally, I am hoping that at least one iPhone app is made with their data.
If you have an Augmented Reality app for the iPhone in you (or any other app that could use rich 3D data), contact them via their Beta Signup page. Earthmine spoke about their vision and product at Where 2.0 2008; you can see the video after the jump.