23:57 Welcome to Hadoop! » del.icio.us/chedong
The Hadoop project has been split off from the Nutch project as a low-level framework for distributed development. The wiki is not very usable yet.
23:23 Information Fingerprints and De-duplication Algorithms » 搜索引擎研究

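At the recent search engine salon, during the discussion of Lucene, Luo Gang (罗刚), who works on the Lietu (猎兔) word segmenter, brought up information fingerprints. I'll take the opportunity to introduce information fingerprints and de-duplication.

Information fingerprint: extract the features of a piece of information, usually a set of words or a set of words plus weights, then run a specific algorithm such as MD5 over that set to turn it into a code. That code becomes the fingerprint that identifies the information.

In theory, any two different texts have different features, so the resulting codes should differ too, just like human fingerprints.

When building an index, a search engine needs to identify and de-duplicate pages with duplicate content, and that is where information fingerprints come in.

For example, a search engine usually first de-noises a page, that is, cleans it by stripping out template material, useless ads, and the like. The preprocessed page is then vectorized: put simply, it is segmented into words, the words are counted, and a list is generated in order of term frequency.

For example:
网页 (web page) 12
搜索 (search) 10
引擎 (engine) 7
...
...
Then take the top N keywords as the information's vector. These can be MD5-hashed directly, or reordered by some other rule and then MD5-hashed. In this example, taking the top 3 keywords and hashing them gives the fingerprint: a7eb9d92a83cf438881915e0bc2df70b

a7eb9d92a83cf438881915e0bc2df70b then serves as this document's fingerprint and is compared against those of earlier documents. If a match is found, the two documents look identical at the fingerprint level, and the document can go on to de-duplication handling.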
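The pipeline described above (count terms, keep the top N, hash them with MD5) can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names `term_frequencies` and `fingerprint`, the regex tokenizer, the alphabetical tie-break, and the space-joined serialization are all choices made for this sketch and are not taken from the post. Because the digest depends on exactly how the keywords are serialized, the sketch is not expected to reproduce the sample hash quoted above.

```python
import hashlib
import re
from collections import Counter


def term_frequencies(text):
    """Count terms in a page that has already been cleaned of boilerplate.

    The regex split is a placeholder; Chinese text would need a real
    word segmenter instead.
    """
    return Counter(re.findall(r"\w+", text.lower()))


def fingerprint(term_counts, n=3):
    """Return the MD5 hex digest of the n most frequent terms.

    Terms are ranked by descending count, ties broken alphabetically
    (an assumption), then joined with single spaces before hashing.
    """
    ranked = sorted(term_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    top_terms = [term for term, _ in ranked[:n]]
    return hashlib.md5(" ".join(top_terms).encode("utf-8")).hexdigest()


# Two pages whose top-3 terms agree get the same fingerprint,
# even though the raw counts differ.
page_a = {"网页": 12, "搜索": 10, "引擎": 7, "索引": 3}
page_b = {"网页": 20, "搜索": 11, "引擎": 9, "广告": 1}
print(fingerprint(page_a) == fingerprint(page_b))
```

Hashing only the top N terms is what lets pages with the same main content but different surrounding material collide on purpose; a larger N makes the match stricter.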

22:43 Pretty Text » Blog on 27th Floor
           _   _
          | | | |
  ___ __ _| |_| |__   __ _ _   _  __ _ _ __
 / __/ _` | __| '_ \ / _` | | | |/ _` | '_ \
| (_| (_| | |_| | | | (_| | |_| | (_| | | | |
 \___\__,_|\__|_| |_|\__,_|\__, |\__,_|_| |_|
                            __/ |
                           |___/

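I have never had much of an artistic streak. I can't even pick colors well, and the colors on this blog were in fact copied from elsewhere. So my standards are not high, and I think the lettering above looks pretty good.

The web is booming and text files are everywhere: HTML, CSS, XML. Writing a blog is really just typing too, with very little formatting, and Word has probably been tossed far aside.

The other day someone asked how plain text can be made to look good. Thinking it over, I could recall only two traditional conventions: wrapping a word in * for *emphasis*, and wrapping it in _ for _underline_. Beyond that I had no idea how to make text pretty. But yesterday someone pointed to this page, a plain-text résumé. Firefox automatically renders it in a monospaced font, and it looks great, with all the spaces and indentation intact. Its entries are clear and readable and its content is neatly ordered; a rare piece of fine work! (If you know other fine examples, please recommend them. Works by cracking groups need not apply.)

The character art above was actually made with a tool, Jave, written in java and released under the GPL. It bundles several tools, including Figlet, which specializes in lettering. I strongly recommend it; it really is write once, run anywhere.

Compared with pure HTML (no images or similar elements), plain text is still somewhat less expressive and needs a lot of spaces or other characters, though HTML of course needs tags and CSS. For plain text to look good, Western text needs a monospaced font, whereas HTML needs a browser.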
22:15 Review of Selected Demo by TechCrunch » Wangjianshuo's blog
Isaac shared a link to the ideasfactorychina mailing list: TechCrunch » A Taste of DEMO 2006. I posted back my review of several demos. According to Michael, "70 companies gather at a hotel in Phoenix, Arizona to compete head on for our attention. $15,000 buys you 5 minutes in front of 700 people, and a chance to make history." How expensive those 5 minutes are is interesting in itself. Review of Selected Demo by TechCrunch Credit: The demo was quoted...
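21:28 Unsubscribing from Mobile Services: Remember to Confirm Again with a Human Agent » 车东[Blog^2]

China Unicom: call 10109696 for a human agent. You cannot unsubscribe on the spot; you can only resubmit the unsubscribe request.
China Mobile: call 1860 for a human agent, or unsubscribe automatically by sending 00000 to 186201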

21:25 Dissect Spam Karma » Xerdoc Together
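Spam Karma is an excellent WP plugin for filtering spam out of WP comments. Anyone who has used WP has been hit, to some degree, by floods of spam, and SK goes a long way toward getting you out of that mess.

Karma ['ka:ma]: glossed as 卡马 (a precision-grade nickel-chromium wire), a unit of measure. In SK, Karma expresses a comment's "spam level", and SK decides whether a comment is spam from its final Karma value.

SK itself is a WP plugin, and for better extensibility SK2 also works through plugins of its own. Its plugin mechanism is, of course, much simpler than WP's.

Plugin frameworks come in roughly the following styles:

One, like Eclipse, is "microkernel + plugins". Apart from a tiny core (which includes the plugin framework), everything is a plugin. This requires leaving as many extension points as possible; every plugin author has to keep in mind where others might want to extend it.

Another, like Firefox and WordPress, has a relatively independent main program that exposes some interfaces on top of a standalone application. This is also highly extensible, provided the standalone program exposes a rich enough set of extension points.

Both styles are highly extensible and suit large frameworks such as Eclipse RCP. Both also usually have the concept of an "Extension Point".

The remaining style is much simpler: the system defines a few interfaces, and a plugin only has to implement them. Such a plugin framework is easy to build, but its extensibility is also very limited, so it suits small, fairly specialized programs. Its core concept is the "Interface". SK2 is an example of this style.

First, the part of SK that extends WP. SK hooks into the following extension points:

<?php
add_action('comment_form', 'sk2_form_insert');
add_action('admin_menu', 'sk2_add_options');
add_action('admin_head', 'sk2_output_admin_css');
add_filter('pre_comment_approved', 'sk2_fix_approved');
add_action('comment_post', 'sk2_filter_comment');
add_action('wp_footer', 'sk2_insert_footer', 3);
?>

Of these, "admin_menu" adds an Options entry to the menu, "admin_head" adds SK's own CSS to the page head, and "wp_footer" adds an SK copyright notice at the bottom of the blog.

The other three matter more:

"comment_form" is extended by the function "sk2_form_insert", which, as the name suggests, adds code when the comment form loads.
"pre_comment_approved" is a trick that stops WP from sending comment notifications automatically.
"comment_post" is the most important extension point. It is called after a comment is submitted, and this is where SK does its processing.

Now for the SK2 plugins. As noted, they follow the simpler third style. Here is how the interface is defined. The class has three core interface functions:

<?php
function filter_this(&$cmt_object)
{
    // override this to do your own filtering
    log_msg (__("Default filter (no action) called for plugin: ", 'sk2') . $name, 3, $cmt_object->ID);
}

function treat_this(&$cmt_object)
{
    // override this to do your own treatment
    log_msg (__("Default [...]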
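To make the interface-based plugin style concrete, here is a minimal sketch of the pattern. It is written in Python rather than SK2's PHP, and it is not SK2's actual code: only the method names `filter_this` and `treat_this` come from the excerpt above, while `Comment`, `Plugin`, `LinkCountPlugin`, `score_comment`, and the "karma below zero means spam" rule are hypothetical names and assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Comment:
    """Hypothetical stand-in for the comment object handed to plugins."""
    id: int
    body: str
    karma: float = 0.0


class Plugin:
    """Base class: the host program only ever calls these two methods."""
    name = "base"

    def filter_this(self, cmt: Comment) -> None:
        # Default: no action. Subclasses override this to adjust cmt.karma.
        pass

    def treat_this(self, cmt: Comment) -> None:
        # Default: no action. Subclasses override this to act on the result.
        pass


class LinkCountPlugin(Plugin):
    """Toy filter: every link in the body costs one karma point."""
    name = "link_count"

    def filter_this(self, cmt: Comment) -> None:
        cmt.karma -= cmt.body.lower().count("http://")


def score_comment(cmt: Comment, plugins: List[Plugin]) -> bool:
    """Run all filters, then all treatments; return True if judged spam."""
    for plugin in plugins:
        plugin.filter_this(cmt)
    for plugin in plugins:
        plugin.treat_this(cmt)
    return cmt.karma < 0  # assumed threshold, not SK2's real rule
```

The host never needs to know what a plugin does internally; it only relies on the fixed set of methods, which is why this style is easy to build but offers few places to extend.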
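19:29 What Sustains Interpersonal Relationships » 刻录事@上海
Apart from family ties and the ever-elusive thing called love, the building and sustaining of any relationship in this society ultimately comes down to a third-party connection. That third party may be a "thing", an "event", or a "person". Social networks can broadly be divided along these lines into two classes: SNS linked by things and events, and SNS linked by people. Take Del.icio.us as an example of a thing/event SNS. In such SNS, users bookmark and share the objects they read on the Web. On Del.icio.us people can easily find the subject areas that interest them, then trace back to the users who discovered, bookmarked, and shared those items. By watching these users' bookmarks, they find that some of them regularly supply material of interest, and so they establish a connection with them (subscribing to their del.icio.us bookmarks). This connection is one-sided. The pattern of the whole process is: (person → thing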
16:19 Hung Huang, Chen Kaige and the Steamed Bun » Danwei RSS 1.0
Hung Huang and her mother Zhang Hanzhi, who was once Mao's English teacher

Film director Chen Kaige's (陈凯歌) awful but awfully expensive movie The Promise (无极) has been widely panned by Chinese moviegoers and critics alike. A man named Hu Ge (胡戈) made a twenty-minute spoof of the film called The Bloody Case That Started From A Steamed Bun (一个馒头引发的血案), which was copied on many different websites and Internet forums.

Chen Kaige was enraged, and has started proceedings to sue the author of the spoof, earning the famous director even more ridicule. (There are links to the spoof and commentary at the bottom of this post.)

Now his ex-wife, Hung Huang (or Hong Huang) has stepped into the fray, on her new blog on Sina.com.

Your correspondent used to work for her at her media company CIMG. After I left the company I was roundly cursed by her, for various reasons, in a book she wrote. She was nice enough not to name me, although it was clear to anyone who knew me that I was the target. Some of the curses were justified, some not, but like many other people, I have learned that Hung is not someone you want angry at you if you cannot deal with highly barbed but very funny mockery. (I can deal.)

This is a rough translation of what she had to say about Chen Kaige and the Steamed Bun debacle:

My ex-husband and the steamed bun

Because of this business with my ex-husband and a steamed bun, for a week now all my friends have been making fun of me, teasing and mocking me and criticizing my judgement and taste in men. Yesterday evening was the climax: a table of eight people, originally all quite restrained, exploded with mockery. This morning I looked at some blogs and found many comments about me. It seems women should be very careful about selecting a husband. Well, in this life it's already too late for me; I'll pay more attention to it in the next life.

This affair makes me feel that I have been treated very unjustly: Why do people have to connect me to a person I haven't seen in more than ten years, to whom I haven't said a word? We haven't even bumped into each other. This is really unfair. The police often say to comrades who have committed offences: if you correct the mistake everything will be OK. Even if you have a previous conviction, people shouldn't continue to talk about it all the time; that completely lacks a spirit of generosity.



I always try to be or pretend to be a decent person, with a magnanimous attitude, with some reluctance to speak [and judge others]. But this affair is just too hilarious, I am nearly going crazy not speaking about it. And also, no matter how hard I try, nobody is ever going to mistake me for a lady, so I might as well just say a few words about this. 



We Chinese have a saying: you can navigate a boat in the stomach of a prime minister [i.e. a great leader should be able to deal with all kinds of problems and annoyances]. If a steamed bun can't even go in his stomach, then it's obvious that he has become a chicken with a small stomach [i.e. very narrow-minded].

Moreover, this steamed bun has clearly taken some coarse grain and turned it into high quality flour; not a good thing. If it was me, I would see what stale food I had at home and immediately take it out and get a talented person to turn it into wheat-flour and rice, immediately get some face, maybe make some money and get fame and fortune [note, this paragraph is badly translated, please send corrections to jeremy -at- danwei.org]. 
 


Self-mockery is a weapon of all intelligent people. Especially when they meet difficulties, self-mockery can instruct, and help them out of a predicament. Being ridiculed by other people is a painful thing, but people like Lu Xun who are merciless with bad people even when they are down are rare; most people will forgive a wrongdoer, give them a way out, and laugh the problem away.



This post is the end of a vow of silence. So now our LE magazine editors have no mercy on me, not about this: they have already urged me to write a 'Steamed Bun Q&A' for the March issue of the magazine. I ask everyone to read the March issue of the magazine. If you feel that we are coarse grain, you are welcome to trample us into steamed buns.



Lastly, I must apologize to the parties concerned, but if I restrained myself from talking I think I'd get cancer from the effort.

Links and Sources


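14:05 Japan's Economy Returns to the Forefront » blog中文翻译
Japan's economy grew unexpectedly strongly over the past three months, pushing Japan back to the crest among the world's economic powers and signaling the end of its 15-year economic slump.

Japan, the world's second-largest economy, grew 1.4% in the fourth quarter (translator's note: the fourth quarter of 2005), far ahead of 0.3% for the United States and 0.4% for the European Union.

Thanks to growth in the final three months that beat market expectations, Japan's economy expanded 2.8% for 2005 as a whole. 2005 was also the third year of healthy growth driven by domestic demand.

Economists believe that, with its finances no longer weighed down by bad debts, Japan now has its best chance since 1991 of sustaining economic growth.

Some Bank of Japan officials believe the loose monetary policy should end (translator's note: on February 9, 2006, the Bank of Japan's Policy Board resolved to maintain the current loose monetary policy), and the high growth rate appears to lend strong support to that view. With this in mind, Japan's ministers (translator's note: called "相" in Japan; for example, 大藏相 is the Finance Minister) have lately been deliberately playing down the figures. Yesterday, however, Japan's minister for fiscal and financial affairs, Kaoru Yosano (与谢野馨), broke with that habit and called the growth figures "a very positive economic indicator".

Analysts believe the strong growth signals that Japan has finally emerged from its years of economic stagnation.

The mainstay of Japan's economic growth in 2005 was domestic demand. Household consumption rose 2.2%, while non-residential business investment jumped 8.4%. Net exports, meanwhile, contributed only 0.2 percentage points of the 2.8% annual growth.

Still, Japan has a few hurdles to clear before it can announce the good news of economic recovery to the world. The shadow of deflation remains. The gross domestic product (translator's note: GDP, Gross Domestic Product) statistics indicate that the recent rise in the consumer price index (translator's note: CPI, Consumer Price Index) stems in large part from higher oil prices.

At the same time, Japan's growth has done little to reduce global trade imbalances (seen most concretely in the US trade deficit). Imports are not growing fast enough to reduce Japan's trade surplus.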
09:23 Gantt Project » Jan's Tech Blog
Gantt Project is an open source project management tool that can also import Microsoft Project files. ...
08:00 2006/02/18 08:00:00 TQ洽谈通 Search Power Index Ranking » TQ洽谈通搜索力指数
     Search engine  Search power index  Rank change  Share
 1.  Baidu                   175582274               60.30%
 2.  Google                   39403862               13.53%
 3.  3721                     31633898               10.86%
 4.  Yahoo                    24219838                8.32%
 5.  163                       7132050                2.45%
 6.  Sogou                     5961318                2.05%
 7.  QQ                        3911546                1.34%
 8.  iAsk                      1411534                0.48%
 9.  China                      944034                0.32%
10.  Zhongsou                   532278                0.18%
11.  Tom                        399242                0.14%
07:55 Response to the DoJ motion » Official Google Blog




In August, Google was served with a subpoena from the U.S. Department of Justice demanding disclosure of two full months' worth of search queries that Google received from its users, as well as all the URLs in Google's index. We objected to the subpoena, which started a set of legal procedures that puts the issue before the Federal courts. Below is the introduction to our response to the Department of Justice's motion to the court to force us to comply with the subpoena. You can find the entire response here. (This is a 25-page PDF file.)





I. INTRODUCTION

Google users trust that when they enter a search query into a Google search box, not only will they receive back the most relevant results, but that Google will keep private whatever information users communicate absent a compelling reason. The Government's demand for disclosure of untold millions of search queries submitted by Google users and for production of a million Web page addresses or "URLs" randomly selected from Google's proprietary index would undermine that trust, unnecessarily burden Google, and do nothing to further the Government's case in the underlying action.



Fortunately, the Court has multiple, independent bases to reject the Government's Motion. First, the Government's presentation falls woefully short of demonstrating that the requested information will lead to admissible evidence. This burden is unquestionably the Government's. Rather than meet it, the Government concedes that Google's search queries and URLs are not evidence to be used at trial at all. Instead, the Government says, the data will be "useful" to its purported expert in developing some theory to support the Government's notion that a law banning materials that are harmful to minors on the Internet will be more effective than a technology filter in eliminating it.



Google is, of course, concerned about the availability of materials harmful to minors on the Internet, but that shared concern does not render the Government's request acceptable or relevant. In truth, the data demanded tells the Government absolutely nothing about either filters or the effectiveness of laws. Nor will the data tell the Government whether a given search would return any particular URL. Nor will the URL returned, by its name alone, tell the Government whether that URL was a site that contained material harmful to minors.



But, the Government's request would tell the world much about Google's trade secrets and proprietary systems. This is the second independent ground upon which the Court should reject the subpoena. Google avidly protects every aspect of its search technology from disclosure, even including the total number of searches conducted on any given day. Moreover, to know whether a given search would return any given URL in Google's database, a complete knowledge of how Google's search engine operates is required, inevitably further entangling Google in the underlying litigation. No assurances, no promises, and no confidentiality order, can protect Google's trade secrets from scrutiny and disclosure during the course of discovery and trial.



Finally, the Government's subpoena imposes an undue burden on Google without a sufficiently countervailing justification. Perhaps the Government can be forgiven its glib rejection of this point because it is unfamiliar with Google's system architecture. If the Government had that familiarity, it would know that its request will take over a week of engineer time to complete. But the burden is not mechanical alone; it includes legal risks as well. A real question exists as to whether the Government must follow the mandatory procedures of the Electronic Communications Privacy Act in seeking Google users' search queries. The privacy of Google users matters, and Google has promised to disclose information to the Government only as required by law. Google should not bear the burden of guessing what the law requires in regard to disclosure of search queries to the Government, or the risk of guessing wrong.



For all of these reasons, the Court must reject the Government's Motion.

^==Back Home: www.chedong.com