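My first attempt at MagpieRSS failed because iconv and mbstring were not installed. Today I installed iconv and mbstring support on the server and took a closer look at how lilina uses rss_fetch. The most important point is to set the RSS output encoding by defining MAGPIE_OUTPUT_ENCODING as 'UTF-8'.
Sample code: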
<?php
// $Id$
// Load the MagpieRSS fetch API
require_once("rss_fetch.inc");
// Specify the output encoding (the default is ISO-8859-1)
define('MAGPIE_OUTPUT_ENCODING', 'UTF-8');
// Fetch timeout, in seconds
define('MAGPIE_FETCH_TIME_OUT', 60 * 180);
// Feed URL passed in the query string
$url = isset($_GET['url']) ? $_GET['url'] : '';
$rss = fetch_rss($url);
print_r($rss);
?>
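While designing gRaSSland I looked at the print_r output and found that RSS 1.0, RSS 2.0 and Atom differ somewhat in how they expose the author and the article summary.
The three dumps are listed below: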
RSS 1.0:
magpierss Object (
    [parser] => Resource id #10
    [current_item] => Array ( )
    [items] => Array (
        [0] => Array (
            [about] =>
            [title] => another story
            [link] =>
            [description] => foo
            [dc] => Array (
                [subject] => foo
                [creator] => chedong
                [date] => 2004-12-17T22:24:41+08:00
            )
            [date_timestamp] => 1103293440
        )
        [1] => Array (
            [about] =>
            [title] => GrassLand demo
            [link] =>
            [description] => body text with <b>html</b>
            [dc] => Array (
                [subject] => foo
                [creator] => chedong
                [date] => 2004-12-17T22:18:17+08:00
            )
            [date_timestamp] => 1103293080
        )
    )
    [channel] => Array
    [title] => demo
    [link] => http://blog.cnblog.org/demo/
    ( [date] => 2004-12-17T22:24:41+08:00 )
    [items_seq] =>
    [tagline] =>
    [textinput] => Array
    [image] => Array
    [feed_type] => RSS
    [_KNOWN_ENCODINGS] => Array
    [stack] => Array
    [inchannel] =>
    [etag] => "c2ab4-615-41c2ed3e"
)

RSS 2.0:
magpierss Object (
    [parser] => Resource id #10
    [current_item] => Array ( )
    [items] => Array (
        [0] => Array (
            [title] => another story
            [description] => foo
            [link] =>
            [guid] =>
            [category] => foo
            [pubdate] => Fri, 17 Dec 2004 22:24:41 +0800
            [summary] => foo
            [date_timestamp] => 1103293481
        )
        [1] => Array (
            [title] => GrassLand demo
            [description] => body text with <b>html</b>
            [link] =>
            [guid] =>
            [category] => foo
            [pubdate] => Fri, 17 Dec 2004 22:18:17 +0800
            [summary] => body text with <b>html</b>
            [date_timestamp] => 1103293097
        )
    )
    [channel] => Array
    [title] => demo
    [link] => http://blog.cnblog.org/demo/
    [copyright] => Copyright 2004
    [lastbuilddate] => Fri, 17 Dec 2004 22:24:41 +0800
    [generator] => http://www.movabletype.org/?v=3.11
    [docs] => http://blogs.law.harvard.edu/tech/rss
    [tagline] =>
    [textinput] => Array
    [image] => Array
    [feed_type] => RSS
    [_KNOWN_ENCODINGS] => Array
    [stack] => Array
    [inchannel] =>
    [etag] => "c2ab3-40f-41c2ed3f"
)

Atom:
magpierss Object (
    [parser] => Resource id #10
    [current_item] => Array ( )
    [items] => Array (
        [0] => Array (
            [title] => another story
            [link] =>
            [modified] => 2004-12-17T14:27:38Z
            [issued] => 2004-12-17T14:24:41Z
            [id] => tag:blog.cnblog.org,2004:/demo/6.3690
            [created] => 2004-12-17T14:24:41Z
            [summary] => foo
            [author] =>
            [author_name] => chedong
            [author_url] => http://www.chedong.com
            [author_email] => chedong@hotmail.com
            [dc] => Array (
                [subject] => foo
            )
            [content] => Array (
                [encoded] =>
            )
        )
        [1] => Array (
            [title] => GrassLand demo
            [link] =>
            [modified] => 2004-12-17T14:27:39Z
            [issued] => 2004-12-17T14:18:17Z
            [id] => tag:blog.cnblog.org,2004:/demo/6.3689
            [created] => 2004-12-17T14:18:17Z
            [summary] => body text with html
            [author] =>
            [author_name] => chedong
            [author_url] => http://www.chedong.com
            [author_email] => chedong@hotmail.com
            [dc] => Array (
                [subject] => foo
            )
            [content] => Array (
                [encoded] =>
            )
        )
    )
    [channel] => Array
    [title] => demo
    [link] => http://blog.cnblog.org/demo/
    [modified] => 2004-12-17T14:27:38Z
    [generator] => Movable Type
    [copyright] => Copyright (c) 2004, chedong
    [description] =>
    [textinput] => Array
    [image] => Array
    [feed_type] => Atom
    [_KNOWN_ENCODINGS] => Array
    [stack] => Array
    [inchannel] =>
    [etag] => "c2ab0-776-41c2ed3f"
)
So when fetching RSS, the author field has to be mapped differently depending on the feed type and version:
foreach ($rss->items as $item) {
    // Map the author according to the feed type and version
    if ($rss->feed_type == "RSS" && $rss->feed_version == "1.0") {
        $item['author'] = $item['dc']['creator'];
    } else if ($rss->feed_type == "Atom") {
        $item['author'] = $item['author_name'];
    } else {
        // RSS 2.0 and everything else: fall back to the channel title
        $item['author'] = $rss->channel['title'];
    }
    // print_r($item);
    // Note: the mysql_* functions are the old PHP 4/5 API and were removed in PHP 7.
    // Every value, including the link, is escaped before it goes into the SQL string.
    $sql = "INSERT INTO `grassland` ( `url` , `title` , `author` , `content` , `pubdate` , `author_url` , `author_rss` ) VALUES ('" .
        mysql_escape_string($item['link']) .
        "' , '" . mysql_escape_string($item['title']) .
        "' , '" . mysql_escape_string($item['author']) .
        "' , '" . mysql_escape_string($item['description']) .
        "' , '" . mysql_escape_string($item['date_timestamp']) .
        "' , '" . mysql_escape_string($rss->channel['link']) .
        "' , '" . mysql_escape_string($url) . "') ";
    // print $sql;
    $result = mysql_query($sql);
}
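The mapping above can also be pulled out into small pure functions, which makes it easy to check against the three dumps without a database. This is only a sketch: the names grassland_item_author and grassland_item_summary are hypothetical (they are not part of MagpieRSS), and unlike the loop above they also fall back when the expected field is empty. The summary helper assumes, based on the dumps, that RSS items carry [description] while the Atom items only carry [summary].

```php
<?php
// Hypothetical helpers, not part of MagpieRSS. They work on plain arrays
// shaped like the items in the print_r() dumps.

// Pick an author for one item, falling back to the channel title.
function grassland_item_author(array $item, $feedType, $feedVersion, array $channel)
{
    if ($feedType === 'RSS' && (string) $feedVersion === '1.0' && !empty($item['dc']['creator'])) {
        return $item['dc']['creator'];      // RSS 1.0: Dublin Core creator
    }
    if ($feedType === 'Atom' && !empty($item['author_name'])) {
        return $item['author_name'];        // Atom: author name
    }
    return isset($channel['title']) ? $channel['title'] : '';
}

// Pick the summary text: [description] if present, otherwise [summary].
function grassland_item_summary(array $item)
{
    foreach (array('description', 'summary') as $key) {
        if (isset($item[$key]) && $item[$key] !== '') {
            return $item[$key];
        }
    }
    return '';
}
```

With a MagpieRSS result this would be called as grassland_item_author($item, $rss->feed_type, $rss->feed_version, $rss->channel) inside the foreach loop.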
Let's look at how Steve solved the problem. Translated from:
http://minutillo.com/steve/weblog/2004/6/17/php-xml-and-character-encodings-a-tale-of-sadness-rage-and-data-loss
....
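The question: how do you know which character set an XML document uses? The only answer is to scan the XML declaration yourself and work out the character set. The code: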
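// The quotes inside the pattern are escaped for the PHP string,
// and the literal "?" characters are escaped for the regex.
$rx = '/<\?xml.*?encoding=[\'"](.*?)[\'"].*?\?>/m';
if (preg_match($rx, $xml, $m)) {
    $encoding = strtoupper($m[1]);
} else {
    $encoding = "UTF-8";
}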
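The regular expression finds the character set declared by the XML itself. If one is found it is stored in $encoding; otherwise the document is treated as UTF-8 (which is also the XML default). The complete code: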
$rx = '/<\?xml.*?encoding=[\'"](.*?)[\'"].*?\?>/m';
if (preg_match($rx, $xml, $m)) {
    $encoding = strtoupper($m[1]);
} else {
    $encoding = "UTF-8";
}
// Parse using the declared source encoding, and emit UTF-8
$parser = xml_parser_create($encoding);
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");
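As a minimal sketch of the detection step on its own, the regex can be wrapped in a small helper. The function name detect_xml_encoding is hypothetical (it is not part of PHP or MagpieRSS), and the pattern is a slightly stricter variant of the one above: it is case-insensitive and only accepts characters that can appear in an encoding name.

```php
<?php
// Hypothetical helper: return the encoding named in the XML declaration,
// upper-cased, or "UTF-8" (the XML default) when none is declared.
function detect_xml_encoding($xml)
{
    // [^>]*? keeps the match inside the <?xml ... ?> declaration
    $rx = '/<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']/i';
    if (preg_match($rx, $xml, $m)) {
        return strtoupper($m[1]);
    }
    return 'UTF-8';
}

// Usage: inspect the declaration of a feed before choosing how to parse it.
$declared = detect_xml_encoding('<?xml version="1.0" encoding="gb2312"?><rss/>');
```

Normalising the name to upper case keeps the later comparisons against "UTF-8", "US-ASCII" and "ISO-8859-1" simple.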
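That did it: every feed was converted to UTF-8. Then came BIG-5. After checking the PHP documentation and source code again, it turned out that PHP 4.x only supports UTF-8, ISO-8859-1 and US-ASCII, so BIG5 or SHIFT-JIS feeds still come out garbled. PHP 5 does not help either (translator's note: I tried PHP 5 as well); the official PHP 5 release is supposed to include BIG5 and GB2312, the two main Chinese encodings. A search of the PHP documentation turned up a potential solution: mbstring. The mbstring family of functions supports a very long list of character sets and can convert between them.
The final solution: use a regex to detect the character set of the source; if PHP cannot handle it natively, convert the source to UTF-8 with mb_convert_encoding before parsing and then parse it as UTF-8, although there is still a fair chance that the parse fails. I tried the following code: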
$rx = '/<\?xml.*?encoding=[\'"](.*?)[\'"].*?\?>/m';
if (preg_match($rx, $source, $m)) {
    $encoding = strtoupper($m[1]);
} else {
    $encoding = "UTF-8";
}
if ($encoding == "UTF-8" || $encoding == "US-ASCII" || $encoding == "ISO-8859-1") {
    // Encodings the PHP XML parser handles natively
    $parser = xml_parser_create($encoding);
} else {
    // Anything else: convert to UTF-8 with mbstring before parsing
    $encoded_source = NULL;
    if (function_exists('mb_convert_encoding')) {
        $encoded_source = @mb_convert_encoding($source, "UTF-8", $encoding);
    }
    if ($encoded_source != NULL) {
        // Rewrite the XML declaration so it matches the converted text
        $source = str_replace($m[0], '<?xml version="1.0" encoding="utf-8"?>', $encoded_source);
    }
    $parser = xml_parser_create("UTF-8");
}
xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, "UTF-8");
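The same idea can be sketched as one self-contained function that leaves the parser out of the picture, so the conversion can be checked by itself. The name xml_source_to_utf8 and its return convention are assumptions made for this example, not something taken from Steve's post or from MagpieRSS. It assumes the source encoding is ASCII-compatible (as GB2312 and BIG5 are), so that the declaration looks the same before and after conversion.

```php
<?php
// Hypothetical helper: returns array($source, $parserEncoding), where $source
// is ready to hand to an XML parser created with $parserEncoding, or null
// when the declared encoding cannot be converted.
function xml_source_to_utf8($source)
{
    $rx = '/<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\'][^>]*\?>/i';
    if (!preg_match($rx, $source, $m)) {
        return array($source, 'UTF-8');          // no declaration: XML default
    }
    $encoding = strtoupper($m[1]);
    if (in_array($encoding, array('UTF-8', 'US-ASCII', 'ISO-8859-1'), true)) {
        return array($source, $encoding);        // handled natively by the parser
    }
    if (!function_exists('mb_convert_encoding')) {
        return null;                             // mbstring is not available
    }
    try {
        // Older PHP versions warn and return false for an unknown encoding,
        // PHP 8 throws instead; both cases end up returning null.
        $converted = @mb_convert_encoding($source, 'UTF-8', $encoding);
    } catch (\Throwable $e) {
        return null;
    }
    if (!is_string($converted) || $converted === '') {
        return null;
    }
    // Rewrite the declaration so it matches the converted bytes.
    $pos = strpos($converted, $m[0]);
    if ($pos === false) {
        return null;
    }
    $decl = '<?xml version="1.0" encoding="utf-8"?>';
    return array(substr_replace($converted, $decl, $pos, strlen($m[0])), 'UTF-8');
}
```

A caller would unpack the result with list($source, $parserEncoding) = xml_source_to_utf8($source), pass $parserEncoding to xml_parser_create() and set XML_OPTION_TARGET_ENCODING to "UTF-8" as in the code above. Returning null instead of silently parsing unconverted bytes makes the failure case explicit.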
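:) The hack works: ISO-8859-15, BIG-5 and even GB2312 all parse without problems, and everything can be converted to UTF-8 and shown on the same page. I hereby declare this a most artful solution for handling XML character set declarations in PHP; ideally it would be added to PHP 4.x.
When it comes to multilingual character set support, Java and other commercial software do a better job. I ran some experiments in the past that illustrate how Java's character set support works. With iconv and mbstring available, PHP applications can finally be programmed in terms of characters rather than bytes...
The plan is to use MagPieRSS as the RSS parser for GrassLand. The next step is page-fetch synchronization and incremental data updates.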
2004-12-24
Statistics on RSS versions and character sets
Using MT (Movable Type) as the publishing system usually produces three feed files: RSS 1.0, RSS 2.0 and Atom 0.3.
http://blog.cnblog.org/index.xml RSS 2.0 GBK
http://blog.cnblog.org/index.rdf RSS 1.0 GBK
http://blog.cnblog.org/atom.xml Atom 0.3 GBK
Statistics over the several hundred feeds currently registered with gRaSSland show that RSS is overwhelmingly dominant:
357 RSS
7
The corresponding RSS versions are:
222 1.0
122 2.0
11 0.92
7
4 0.91
The character set used for publishing, however, is mostly GB2312:
233 GB2312
79 UTF-8
54
2 ISO-8859-1
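This makes the fault tolerance of the RSS parser very important.
Author: 车东 (Che Dong). Posted: 2004-12-12 22:12. Last updated: 2007-04-15 19:04. Copyright notice: this article may be reproduced, provided the reproduction states, as a hyperlink, the original source of "Analysis of UTF-8 and GBK RSS parsing in MagPieRSS (with a detailed look at character-oriented programming in PHP)", together with the author information and this copyright notice.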
http://www.chedong.com/blog/archives/000598.html
Comments
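I tried magpierss today, but Chinese RSS feeds always come out garbled, even though the iconv functions are enabled on my server.
I don't know why.
Posted by huangam on July 30, 2007, 04:51 PM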