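The documentary The Rise of the Great Powers (《大国崛起》) has sparked a lot of discussion since it aired. I downloaded it and have been watching a little each day; it adds up, and I have nearly finished it.
Despite the title "The Rise of the Great Powers," for many of the countries the story is simply one of growing from small to large and then becoming strong; Spain and Britain are examples. While watching I noticed a few interesting things. First, in the rise of each of these former great powers there inevitably appear one or more heroic figures who decide the course of history: Louis XIV of France, Bismarck of Germany, Peter the Great of Russia. The traditional education we received always taught us that "the masses make history," yet at certain crossroads of history the role of such individuals cannot be written off. Second, the rise of almost every great power was driven by its economy; it would not be an exaggeration to call the whole documentary a history of economic development. And what protects and sustains the economy is a sound system of institutions. A sound system! That is probably the theme the production team was trying to convey.
After watching it I found I had picked up quite a bit of history. We have a set of Modern World History (《世界近代史》) at home, one of those so-called "internal reference" materials from years ago. History really is a complicated thing: read it from a different angle, or start from a different entry point, and what you take away is different.
The topic of "the rise of great powers" inevitably gets entangled with ideology. I think The Rise of the Great Powers is a very good documentary. Aimed at an audience of hundreds of millions, it could never please everyone, nor could it be free of flaws. It simply reflects the newer historical outlook of a group of historians; there is no need to put it on a pedestal or to hold it to an excessively high standard.
Watch it and let it provoke some thought; that is enough.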
--EOF--
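Author: 陈朝晖, engineer at Yahoo! US
Background: search engine quality is generally measured along four dimensions: relevance, freshness, comprehensiveness, and usability. Index size, today's topic, falls under comprehensiveness.
First, note that for a search engine the number of pages indexed and the number of pages crawled are different concepts. The crawl is generally far larger than the index, because crawled pages include many low-quality ones such as duplicates and spam. The search engine has to use its algorithms to separate the wheat from the chaff and pick the valuable pages to index. For users, therefore, it is the size of the index that matters more.
Second, growing the index without limit does not necessarily improve search quality. On one hand, comprehensiveness depends not only on index size but also on the quality of the pages included and the distribution of page types. On the other hand, a search engine's quality metrics have to advance in balance across all four dimensions; a breakthrough on a single metric will not improve overall quality. At present the mainstream Chinese search engines, Yahoo! China included, each index on the order of 2 billion pages, which is basically enough to meet users' everyday query needs.
However, because the absolute size of a search engine's index cannot be measured directly from outside, many search providers like to inflate their published page counts as a marketing gimmick. Beginning in 1998, Krishna Bharat and Andrei Broder studied how a third party could objectively compare the index sizes of different search engines. Eight years later, at the WWW2006 conference in May this year, Ziv Bar-Yossef and Maxim Gurevich of Israel won the conference's only best paper award for their outstanding work in this area. Their study computed the relative index sizes of the mainstream English-language search engines: Yahoo! is 1.28 times the size of Google, and Google is 1.36 times the size of MSN. How did they arrive at these numbers? Below we introduce the algorithm for search engine enthusiasts and discuss how it can be applied to Chinese search engines.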
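Overview
A search engine's index size, or coverage, profoundly affects the relevance, freshness, and recall of its results. For marketing reasons the major web search engines announce from time to time how many documents they index, but these figures are usually padded to some degree and their credibility is questionable. How to measure index size fairly objectively and accurately through the search engine's public interface, that is, the familiar search box, has therefore become a question of real interest.
Figure 1: Sampling a search engine's index
Each search engine's index covers a subset of all documents on the web. If we treat the measurement as sampling from that set, the crux is how to achieve approximately uniform random sampling (a uniform search engine URL sampler); see Figure 1. Concretely, if search engine S indexes |D| documents in total, we want the probability of drawing any particular document to be 1/|D|.
Once we can sample the index uniformly at random through the search box, we can estimate the relative sizes of search engine indexes with reasonable statistical confidence, as the figure below shows:
Figure 2: Comparing the relative sizes of search engine indexes
We first draw a random sample of N1 URLs from engine S1. Then, by querying for each URL, we learn that engine S2 indexes N12 of them and does not index the other N10; in other words, N1 = N10 + N12. Likewise, if we draw a random sample of N2 URLs from engine S2 and find that N21 of them are indexed by S1 and N20 are not, then N2 = N20 + N21. We can then estimate the relative size of S1 and S2 as: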
$$
\frac{|D_1|}{|D_2|} \approx \frac{N_{12}+N_{10}}{N_{12}+N_{12}N_{20}/N_{21}} = \frac{N_1 N_{21}}{N_2 N_{12}} = \frac{N_{21}}{N_{12}} \quad (\text{if } N_1 = N_2)
$$
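The estimate works because N12/N1 approximates the share of S1's index that S2 also holds, and N21/N2 approximates the share of S2's index that S1 also holds; dividing the second by the first cancels the unknown overlap. A minimal Python sketch of that arithmetic follows. The function name and the sample counts are illustrative assumptions, not figures from the study, and the sketch assumes the two URL samples are already uniform.

```python
def relative_index_size(n12: int, n10: int, n21: int, n20: int) -> float:
    """Estimate |D1|/|D2| from two cross-checked URL samples.

    n12 / n10: URLs sampled from S1 that S2 does / does not index.
    n21 / n20: URLs sampled from S2 that S1 does / does not index.
    """
    n1 = n12 + n10  # size of the sample drawn from S1
    n2 = n21 + n20  # size of the sample drawn from S2
    if n12 == 0 or n21 == 0:
        # With no observed overlap the ratio is undefined or meaningless.
        raise ValueError("each sample needs at least one URL found in the other engine")
    return (n1 * n21) / (n2 * n12)


# Hypothetical counts: 1000 URLs sampled from each engine.
ratio = relative_index_size(n12=400, n10=600, n21=500, n20=500)
print(ratio)  # 1.25, i.e. S1's index is estimated at 1.25 times S2's
```

With equal sample sizes the result reduces to N21/N12, matching the last step of the formula.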
To be continued...
chedong posted a photo:
Judging from this month's statistics for my site, IE7 users already account for over 6%. Geeks upgrade fast...
Msie 7.0 6.6 %
Msie 6.0 59.7 %
Msie 5.5 0.6 %
Msie 5.01 0.2 %
Msie 5.0 0.2 %
Another figure is Firefox users:
Firefox 2.0 12.60%
Firefox 1.5.0.8 3.10%
Firefox 1.5.0.7 0.30%
Firefox 1.5.0.6 0.30%
Firefox 1.5.0.5 0.10%
Firefox 1.5.0.3 0.60%
It looks like most of them have upgraded to 2.
nabaztag just came out with a new rabbit version that has a 'belly button'. next to its normal features (e.g. ear position communication is quite impressive), the wireless bunny now is able to 'listen', allowing voice messages to be easily sent, podcasts & web radio to be played etc. it now can also RFID-wise 'sniff' physical objects & react to them (for instance, holding door keys in front of it will urge it to send "I'm home" messages to your friends).
as an 'old generation' nabaztag owner, I can only welcome its increased intelligence & the extended sound features. actually, my ambient display rabbit is quite funny & useful, if only it could refrain from loudly announcing the arrival of spam email messages during some meetings...
[link: nabaztag.com & nabaztag.com|via engadget.com]
a snowboard jacket that allows people to transmit images to an embedded display. wearers can also receive location-specific information (e.g. ideal slopes, location of friends, weather information) & biometric body data (e.g. dehydration, altitude sickness) can be captured & monitored. currently, the prototype is limited to a WiFi PDA that displays new images as they arrive from camera phone emails.
[link: moondial.com]
Just in time for the holidays, here's another post about our statistics, and this time we'll describe how we deal with metrics issues, how we think we can improve the kinds of statistics we provide, and admit that despite all this number crunching, we still don't know how many dribs are in a drab (but we know that the answer involves Planck's constant).
With over 500,000 feeds now managed, we deal with statistics anomalies like spiked/tanked subscriber counts, podcast counts, and click counts on a weekly, if not daily basis. Some of these are larger issues than others, obviously. We're sure that the good people at ComScore, HitWise, and other CamelCase-named statistics companies would agree that there are always issues and anomalies popping up that have to be beaten back with gusto like so many zombies in Dawn (or Shawn) of the Dead.
The goal we always set for ourselves is to try to maintain apples-to-apples comparisons across all types of counting and aggregator/client treatment. In other words, we try to say that regardless of what bucket some metric goes in, it should always result in the ability to look at a couple different pieces of the data (feeds, aggregators, podcatchers, etc.) and say "these make sense relative to one another." You set up some heuristics and algorithms, try to apply them as universally as possible, and take your lumps. It's like the never-ending "uniques" debate that the web stats community has: you try to plant some stakes in the ground that get you to reasonable conclusions when you consider all the data, and then jump off the next bridge when you come to it.
Some of the metrics issues that we are continually addressing include:
Across the board, we're seeing more and more distinct kinds of user-agents requesting feeds. Here's a quick chart of the growth in unique user-agents we've seen polling feeds just in the last six months.
Caveat Emptor: These chart numbers don't include user-agents with spammy identifiers that are obviously just long random strings, and hundreds of agents like "Shmucky-bot/1.0" and "Shmucky-bot/2.0" are only counted as one distinct user-agent. All of this data excludes the millions of requests a day we capture from clients with completely blank identifiers. Still, you can see the current count is well over 8,000 different kinds of feed reading entities. Everything from aggregators and search crawlers to thousands of mobile feed readers, hundreds of podcatchers, loads of language specific agents, specialty browser toolbars and more.
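To make the counting rules in that caveat concrete, here is a rough Python sketch of one way to collapse versions, skip blank identifiers, and drop random-looking strings. It is only an illustration: the function names, the split on "/" or whitespace, and the 24-character threshold for "spammy" identifiers are all assumptions made for this example, not the heuristics actually used in production.

```python
import re
from typing import Iterable, Optional


def normalize_agent(ua: str) -> Optional[str]:
    """Reduce a raw user-agent string to a product name, or None to exclude it."""
    ua = ua.strip()
    if not ua:
        return None  # blank identifiers are excluded from the count
    # Keep only the product token, so "Shmucky-bot/1.0" and
    # "Shmucky-bot/2.0" both become "shmucky-bot".
    name = re.split(r"[/\s]", ua, maxsplit=1)[0]
    # Assumed spam heuristic: a long, purely alphanumeric token is
    # treated as a random string and dropped.
    if len(name) >= 24 and name.isalnum():
        return None
    return name.lower()


def count_distinct_agents(agents: Iterable[str]) -> int:
    """Count distinct user-agents after normalization."""
    names = (normalize_agent(ua) for ua in agents)
    return len({name for name in names if name is not None})


sample = [
    "Shmucky-bot/1.0",
    "Shmucky-bot/2.0",
    "",
    "x8Kq2LmZp0Rt7VbNc3Wd5Yh1Ua",
    "FeedReader/3.1 (Windows)",
]
print(count_distinct_agents(sample))  # 2
```

A real classifier would need far more care (for example, browser strings that all start with the same product token), so treat the threshold and the split rule as placeholders.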
One of the questions we bounce around here is "what can we do to help people get more information about their statistics in order to better understand how their content is being distributed?" (although we don't speak to ourselves so eloquently). There are a few things we're always working on in this department:
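Today I got a note from a friend saying the text on my site is too small. I checked Analytics: among its web design parameters is a screen resolution metric. According to the Analytics statistics for my site, more than 95% of visitors use a resolution of 1024 or higher (myself included), so why keep such a small font? I edited the style and changed every 12px font on the home page to 14px (fixed pixel sizes really should be avoided where possible; relative sizes are better). I picked 14px because roughly 1/6 of my visitors use Firefox, and odd-numbered font sizes do not suit them.
If the home page still shows the small font, press F5 to force a refresh.
If you are not happy with it, you can also cast a vote: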