16:38 Converting Character Sets » MySQL Performance Blog

The web is going the way of utf8.  Drizzle has chosen it as the default character set, most back-ends to websites use it to store text data, and those who are still using latin1 have begun to migrate their databases to utf8.  Googling for "mysql convert charset to utf8" results in a plethora of sites, each with a slightly different approach, and each broken in some respect.  I'll outline those approaches here and show why they don't work, and then present a script that can generically be used to convert a database (or set of tables) to a target character set and collation.

Approach #1:

SQL:
  1. ALTER TABLE `t1` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;

Take the following table as an example why this approach will not work:

SQL:
  1. mysql> CREATE TABLE `t1` (
  2. ->   `c1` text NOT NULL
  3. -> ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
  4. Query OK, 0 rows affected (0.02 sec)
  5.  
  6. mysql> ALTER TABLE `t1` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;
  7. Query OK, 0 rows affected (0.02 sec)
  8. Records: 0  Duplicates: 0  Warnings: 0
  9.  
  10. mysql> SHOW CREATE TABLE `t1`\G
  11. *************************** 1. row ***************************
  12. TABLE: t1
  13. CREATE TABLE: CREATE TABLE `t1` (
  14. `c1` mediumtext NOT NULL
  15. ) ENGINE=MyISAM DEFAULT CHARSET=utf8
  16. 1 row IN SET (0.01 sec)

Notice the implicit conversion of c1 from text to mediumtext. This approach can result in modified data types and silent data truncation, which makes it unacceptable for our purposes.

Approach #2 (outlined here):

This approach avoids the issue of implicit conversions by changing each data type to it's binary counterpart before conversion. Due to implementation limitations, however, it also converts any pre-existing binary columns to their text counterpart. Additionally, this approach will fail because a binary column cannot be part of a FULLTEXT index. Even if these limitations are overcome, this process is inherently unsuitable for large databases because it requires multiple alter statements to be run on each table:

1) Drop FULLTEXT indexes
2) Convert target columns to their binary counterparts
3) Convert the table to the target character set
4) Convert target columns to their original data types
5) Add FULLTEXT indexes back

For those of us routinely waiting hours, if not days, for a single alter statement to finish, this is unacceptable.

Approach #3:

Dumping the entire database and re-importing it with the appropriate server & client character sets.

This is a three-step process, where one must first dump only the schema and then edit it by hand to have the appropriate character sets and the dump the data separately. After which, the schema must be re-created and data imported. If you're using replication, this usually isn't even an option because you'll have a ridiculous amount of binary logs and force a reload of data on every server in the replication chain (very time/bandwidth/disk space consuming).

Except for Approach #1, these approaches are much more difficult than they need to be. Consider the following ALTER statement against the table in Approach #1:

SQL:
  1. ALTER TABLE `t1`
  2. DEFAULT CHARSET=utf8,
  3. MODIFY COLUMN `c1` text CHARACTER SET utf8;

This approach will both change the default character set for the table and target column, while leaving in place any FULLTEXT indexes. It also requires only a single ALTER statement for a given table. A perl script has been put together to parallel-ize the ALTER statements and is available at:

CODE:
  1. %> wget http://www.pablowe.net/convert_charset

It will be added to Percona Tools on Launchpad (or perhaps maatkit, if it proves useful enough) once it is feature complete. Outstanding issues include:

- Proper handling of string foreign keys (currently fails, but you probably shouldn't be using strings as foreign keys anyway ...)
- Allow throttling of the number of threads created (currently creates one per table)

CODE:
  1. Usage:
  2. convert_charset --database=database [options]
  3.  
  4. Options:
  5. --askpass        Prompt for a MySQL password
  6. --charset        The target character set to convert to
  7. --collate        The target collation to convert to
  8. --database|d     The target database
  9. --help|?         Display this help and exit
  10. --host|h         The target host
  11. --ignore-columns Columns to ignore, useful if you want to
  12. keep the existing charset for a target column
  13. Comma-separated.  NO SPACES.
  14. table.column
  15. --ignore-tables  A comma-separated list of tables to ignore
  16. --password|p     The MySQL password to use
  17. --port           The target port
  18. --tables         A comma-separated list of tables to convert.
  19. All non-named tables will be ignored
  20. --test           Print the ALTER statements that would be executed
  21. without executing them.
  22. --user|u         The MySQL user
  23. --version|V      Display version information and exit
  24.  
  25. defaults are:
  26.  
  27. ATTRIBUTE                  VALUE
  28. -------------------------- ------------------
  29. askpass                    FALSE
  30. charset                    utf8
  31. collate                    No Default Value
  32. database                   No Default Value
  33. help                       FALSE
  34. host                       localhost
  35. ignore-columns             No Default Value
  36. ignore-tables              No Default Value
  37. password                   No Default Value
  38. port                       3306
  39. tables                     No Default Value
  40. test                       FALSE
  41. user                       Current User
  42. version                    FALSE


Entry posted by Ryan Lowe | No comment

Add to: delicious | digg | reddit | netscape | Google Bookmarks

04:34 Books: Web 2.0 Architectures and Cloud Application Architectures » High Scalability - Building bigger, faster, more reliable websites.

I am excited about the upcoming release of two books on Web 2.0 and Cloud Application Architectures by O'Reilly.

Web 2.0 Architectures (estimated release in May 2009)
What entrepreneurs and information architects need to know

Using several high-profile Web 2.0 companies as examples, authors Duane Nickull, Dion Hinchcliffe, and James Governor have distilled the core patterns of Web 2.0 coupled with an abstract model and reference architecture. The result is a base of knowledge that developers, business people, futurists, and entrepreneurs can understand and use as a source of ideas and inspiration. Featured architectures include Google, Flickr, BitTorrent, MySpace, Facebook, and Wikipedia.

Cloud Application Architectures (estimated release in April 2009)
Building Applications and Infrastructure in the Cloud

This book by George Reese offers tested techniques for creating web applications on cloud computing infrastructures and for migrating existing systems to these environments. Specifically, you'll learn about the programming and system administration necessary for supporting transactional web applications in the cloud -- mission-critical activities that include orders and payments to support customers.

The second book is available online at O'Reilly as a Rough Cuts Version so you might already had a chance to check it out. If so, do you like it?

软件项目管理的十条忠告于今夕之何夕,唯读书以自省 » 车东's shared items in Google Reader
Shared by 车东
每两周要读完一本书

Joe Marasco是Rational Software的一名已经退休的事业部经理和高级副总裁。他将自己多年软件开发与管理经验的精华萃取成《The Software Development Edge:Essays on Managing Successful Projects》一书,该书中文名为《软件开发的边界——管理成功的项目》(个人觉得这个名字译得不好)。

软件开发的边界-管理成功的项目

此书着实是近期有关软件开发项目的一本精品书籍,在“第四章——管理”中,Joe提出了软件项目团队管理的十条忠告。

  1. 专注于构建一个强有力的团队,这一团队能够解决困难的问题,并为客户创造真正的价值。
  2. 领导者鼓舞;管理者授权。要同时成为优秀的领导者和管理者,你需要就愿景进行沟通并理解其细节。
  3. 对可能出现的障碍有所准备,防微杜渐,在这些障碍尚未壮大时就清除它们。
  4. 花时间来仔细倾听别人的意见,但不要过于担心其他人的想法。
  5. 专注于事实。
  6. 充当一个衰减器,而不是放大器,为团队提供稳定性。
  7. 永远不要将不能解释的事情归咎为蓄意破坏。
  8. 培养幽默意识来作为严肃认真的一种平衡:对工作一丝不苟,对自己轻松自如。
  9. 除了工作,还应该懂得享受生活,而且每年要读25本书。
  10. 相信你的直觉:如果你感觉不妙,那么很可能预感就会成真。

随机文章:

酱油诗一首 2009-02-04
无跟帖,不新闻 2009-01-15
最好的一本厕上杂志没了…… 2009-01-11
Hand in Hand 2008-10-29
敏捷软件开发中的TOC 2008-08-24

收藏到:Del.icio.us


《城客》:第一本中文互动杂志!



^==Back Home: www.chedong.com

^==Back Digest Home: www.chedong.com/digest/

<== 2009-03-16
  三月 2009  
            1
2 3 4 5 6 7 8
9 10 11 12 13 14 15
16 17 18 19 20 21 22
23 24 25 26 27 28 29
30 31          
==> 2009-03-18