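At last week's search engine salon, during the discussion of Lucene, Luo Gang (罗刚), who works on the Lietu (猎兔) word segmenter, brought up information fingerprints. I'll take this opportunity to introduce information fingerprints and deduplication.
Information fingerprint: extract the features of a piece of information, usually a set of words or a set of words plus weights, then apply a specific algorithm such as MD5 to that set of words to turn it into a code. That code becomes the fingerprint that identifies the information.
In theory, any two different texts have different features, so the resulting codes should differ as well, just like human fingerprints.
When a search engine builds its index, it needs to identify and deduplicate web pages with repeated content, and this is where information fingerprints come in.
Typically the search engine first denoises the page, that is, cleans it by stripping out template material, useless advertisements, and the like. The preprocessed page is then vectorized. Put simply, this means segmenting the text into words, counting them, and producing a list ordered by word frequency.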
For example:
网页 (web page): 12
搜索 (search): 10
引擎 (engine): 7
...
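Next, take the top N keywords as the vector that represents the information. You can MD5-hash them directly, or reorder them according to some other rule and then MD5-hash them. In this example, taking the top 3 keywords and hashing them gives the information fingerprint: a7eb9d92a83cf438881915e0bc2df70b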
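The steps above can be sketched in Python roughly as follows. This is a minimal sketch, not the method of any particular search engine: it assumes the page has already been denoised and segmented into a list of tokens, and the names `top_keywords` and `fingerprint`, the alphabetical tie-break, and the separator used to join keywords are all assumptions made for illustration. Because the post does not say exactly how the keywords were combined before hashing, the digest produced here should not be expected to match the value quoted above.

```python
import hashlib
from collections import Counter


def top_keywords(tokens, n):
    """Return the n most frequent tokens.

    Ties are broken by the token itself so the result is deterministic
    (an assumption; the post does not specify a tie-break rule).
    """
    counts = Counter(tokens)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]


def fingerprint(tokens, n=3, reorder=True):
    """MD5 hex digest of the top-n keywords of a tokenized document."""
    keywords = top_keywords(tokens, n)
    if reorder:
        # One possible "other rule": sort the keywords so that documents
        # with the same keyword set get the same fingerprint.
        keywords = sorted(keywords)
    # The unit-separator character is a hypothetical choice of delimiter.
    joined = "\x1f".join(keywords)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


# Token counts mirror the example list: 网页 x12, 搜索 x10, 引擎 x7.
tokens = ["网页"] * 12 + ["搜索"] * 10 + ["引擎"] * 7 + ["指纹"] * 2
print(top_keywords(tokens, 3))
print(fingerprint(tokens))
```

Sorting the keywords before hashing is a design choice: it makes the fingerprint depend only on which keywords are in the top N, not on their exact frequency order, which tolerates small differences in word counts between copies of the same page.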
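The value a7eb9d92a83cf438881915e0bc2df70b then serves as this document's fingerprint and is compared against the fingerprints of previously processed documents. If an identical one is found, the documents look the same as far as the fingerprint is concerned, and the document can move on to deduplication processing.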
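The comparison step could look something like the sketch below, which keeps an in-memory table from fingerprint to the documents already seen with it. The class `FingerprintIndex`, the helper `keyword_fingerprint`, and the document ids are hypothetical names for illustration; a real index would presumably use persistent storage. As in the description above, a match only flags candidates for deduplication processing, since it shows the documents agree on their top keywords, not that they are identical.

```python
import hashlib
from collections import defaultdict


class FingerprintIndex:
    """Maps a fingerprint to the ids of documents already seen with it."""

    def __init__(self):
        self._seen = defaultdict(list)

    def check_and_add(self, doc_id, fp):
        """Return ids of earlier documents sharing fp, then record doc_id."""
        candidates = list(self._seen[fp])
        self._seen[fp].append(doc_id)
        return candidates


def keyword_fingerprint(keywords):
    """MD5 over the sorted keyword set (joining with a space is an assumption)."""
    joined = " ".join(sorted(keywords))
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


index = FingerprintIndex()
documents = [
    ("doc-a", ["网页", "搜索", "引擎"]),
    ("doc-b", ["引擎", "网页", "搜索"]),  # same keywords, different order
    ("doc-c", ["指纹", "消重", "索引"]),
]
for doc_id, keywords in documents:
    matches = index.check_and_add(doc_id, keyword_fingerprint(keywords))
    if matches:
        # Same fingerprint: hand the pair over to deduplication processing.
        print(doc_id, "is a duplicate candidate of", matches)
```

Only `doc-b` is flagged here, because it shares its keyword set with `doc-a`.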
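China Unicom: call 10109696 for a human operator. Unsubscribing does not take effect immediately; you can only resend the unsubscribe request.
China Mobile: call 1860 for a human operator, or unsubscribe automatically by sending 00000 to 186201.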
Chen Kaige was enraged, and has started proceedings to sue the author of the spoof, earning the famous director even more ridicule. (There are links to the spoof and commentary at the bottom of this post.)
Now his ex-wife, Hung Huang (or Hong Huang) has stepped into the fray, on her new blog on Sina.com.
Your correspondent used to work for her at her media company CIMG. After I left the company I was roundly cursed by her, for various reasons, in a book she wrote. She was nice enough not to name me, although it was clear to anyone who knew me that I was the target. Some of the curses were justified, some not, but like many other people, I have learned that Hung is not someone you want angry at you if you cannot deal with highly barbed but very funny mockery. (I can deal.)
This is a rough translation of what she had to say about Chen Kaige and the Steamed Bun debacle:
My ex-husband and the steamed bun
Because of this business with my ex-husband and a steamed bun, for a week now all my friends have been making fun of me, teasing and mocking me and criticizing my judgement and taste in men. Yesterday evening was the climax: a table of eight people, originally all quite restrained, exploded with mockery. This morning I looked at some blogs and found many comments about me. It seems women should be very careful about selecting a husband. Well in this life it's already too late for me, I'll pay more attention to it in the next life.
This affair makes me feel that I have been treated very unjustly: why do people have to connect me to a person I haven't seen in more than ten years, to whom I haven't said a word? We haven't even bumped into each other. This is really unfair. The police often say to comrades who have committed offences: if you correct the mistake everything will be OK. Even if you have a previous conviction, people shouldn't continue to talk about it all the time; that completely lacks a spirit of generosity.
I always try to be or pretend to be a decent person, with a magnanimous attitude, with some reluctance to speak [and judge others]. But this affair is just too hilarious, I am nearly going crazy not speaking about it. And also, no matter how hard I try, nobody is ever going to mistake me for a lady, so I might as well just say a few words about this.
We Chinese have a saying: you can navigate a boat in the stomach of a prime minister [i.e. a great leader should be able to deal with all kinds of problems and annoyances]. If a steamed bun can't even go in his stomach, then it's obvious that he has become a chicken with a small stomach [i.e. very narrow-minded].
Moreover, this steamed bun has clearly taken some coarse grain and turned it into high quality flour; not a good thing. If it was me, I would see what stale food I had at home and immediately take it out and get a talented person to turn it into wheat-flour and rice, immediately get some face, maybe make some money and get fame and fortune [note, this paragraph is badly translated, please send corrections to jeremy -at- danwei.org].
Self-mockery is a weapon of all intelligent people. Especially when they meet difficulties, self-mockery can instruct, and help them out of a predicament. Being ridiculed by other people is a painful thing, but people like Lu Xun who are merciless with bad people even when they are down are rare; most people will give a wrongdoer a way out and laugh the problem away.
This post is the end of a vow of silence. So now our LE magazine editors have no mercy on me, not about this: they have already urged me to write a 'Steamed Bun Q&A' for the March issue of the magazine. I ask everyone to read the March issue of the magazine. If you feel that we are coarse grain, you are welcome to trample us into steamed buns.
Lastly, I must apologize to the parties concerned, but if I restrained myself from talking I think I'd get cancer from the effort.