SimRank: A Measure of Structural-Context Similarity | 19 Apr 2009

22:34 Talking MySQL to Sphinx » MySQL Performance Blog

The recently released Sphinx version 0.9.9-rc2 adds support for the MySQL wire protocol and SphinxQL, an SQL-like language for querying Sphinx indexes. This support is currently in an early preview stage, but it is still fun to play with.

One thing to mention: unlike MySQL storage engines, some of which (such as InfoBright or KickFire) take over execution after parsing, Sphinx's MySQL support has nothing to do with the MySQL server itself. It is an implementation of the wire protocol from scratch.

For this test I was not interested in full text search performance; we already know Sphinx is much faster than MySQL's built-in full text search. I was rather interested in the performance of other queries that do not use full text search.

[root@r27 sp]# mysql --host 127.0.0.1 --port 3307
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 0.9.9-id64-rc2 (r1785)
Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

For the tests I used the table from the forum search engine, leaving just a bunch of IDs in it and removing everything else:

CREATE TABLE `sptest` (
`id` bigint(20) UNSIGNED NOT NULL,
`site_id` int(10) UNSIGNED NOT NULL,
`forum_id` int(10) UNSIGNED NOT NULL,
`author_id` int(10) UNSIGNED NOT NULL,
`num_links` smallint(5) UNSIGNED NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8

This table contained some 25 million rows and had no indexes defined. Sphinx does not support explicit indexes, and it is clear that when MySQL can use an index for sorting it will be a lot faster.

First, sorting. Sphinx is smart about sorting: it does not try to sort everything you ask for, but rather only the number of rows it needs to reach the LIMIT.
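To make the idea concrete, here is a minimal sketch of a LIMIT-aware sort that keeps only N rows in memory while scanning. It is an illustration of the general technique (a bounded min-heap), not Sphinx's actual code; the function name `top_n_desc` and the sample rows are made up for the example.

```python
import heapq

def top_n_desc(rows, key, n):
    """Return the n rows with the largest key, largest first.

    Only n rows are held in memory at any time, so the cost is one pass
    over the data plus O(log n) work per row that makes it into the heap.
    """
    if n <= 0:
        return []
    heap = []  # min-heap of (key, seq, row); the smallest kept key is heap[0]
    for seq, row in enumerate(rows):
        # seq is unique, so tuple comparison never falls through to the row
        item = (key(row), seq, row)
        if len(heap) < n:
            heapq.heappush(heap, item)
        elif item[0] > heap[0][0]:
            # New row beats the weakest kept row: swap it in
            heapq.heapreplace(heap, item)
    ordered = sorted(heap, key=lambda t: t[0], reverse=True)
    return [row for _, _, row in ordered]

# Hypothetical data shaped like the test table (ORDER BY author_id DESC LIMIT 10)
rows = [{"forum_id": i % 5, "author_id": (i * 37) % 101} for i in range(1000)]
top = top_n_desc(rows, key=lambda r: r["author_id"], n=10)
print([r["author_id"] for r in top])
```

The point of the design is that the full 25 million rows never need to be sorted or even held at once; a full sort followed by a cut at 10 rows does far more work for the same answer.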

Sphinx

mysql> SELECT forum_id AS f FROM sptest ORDER BY author_id DESC LIMIT 10;
+------------+--------+----------+
| id | weight | forum_id |
+------------+--------+----------+
| 6739362135 | 1 | 2736983 |
| 6739362391 | 1 | 2736983 |
| 6739338327 | 1 | 1024599 |
| 6739357527 | 1 | 1023063 |
| 6739359063 | 1 | 1024599 |
| 6739305559 | 1 | 2558807 |
| 6739336791 | 1 | 2558807 |
| 6739300695 | 1 | 208215 |
| 6739297111 | 1 | 2736471 |
| 6739296855 | 1 | 2736471 |
+------------+--------+----------+
10 rows in set (7.92 sec)

MySQL

mysql> SELECT forum_id AS f FROM sptest ORDER BY author_id DESC LIMIT 10;
+---------+
| f |
+---------+
| 2736983 |
| 2736983 |
| 1024599 |
| 1023063 |
| 1024599 |
| 2558807 |
| 2558807 |
| 208215 |
| 2736471 |
| 2736471 |
+---------+
10 rows in set (17.91 sec)

As you can see, Sphinx adds a couple of extra columns (id and weight) to the result set even if you have not asked for them.

Another thing to try is GROUP BY. Sphinx executes GROUP BY in fixed memory, which means results may be approximate; this is geared towards full text search applications where the exact number is not important.
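Why would fixed memory make results approximate? The sketch below is one way to see it, and it is only an illustration: I am assuming a simple "evict the smallest group when the table is full" policy, which is not necessarily what Sphinx does internally. The function name `bounded_group_count` is made up for the example.

```python
def bounded_group_count(keys, max_groups):
    """COUNT(*) ... GROUP BY key, holding at most max_groups groups in memory.

    When the table is full and a new key shows up, the group with the
    smallest count is evicted and its count is lost. If that key comes
    back later, it restarts from 1, so the final counts can undercount.
    """
    counts = {}
    for k in keys:
        if k in counts:
            counts[k] += 1
        elif len(counts) < max_groups:
            counts[k] = 1
        else:
            victim = min(counts, key=counts.get)  # assumed eviction policy
            del counts[victim]
            counts[k] = 1
    return counts

keys = ["a", "b", "a", "c", "a", "b"]
print(bounded_group_count(keys, max_groups=10))  # enough room: exact counts
print(bounded_group_count(keys, max_groups=2))   # too little room: "b" is undercounted
```

With enough room the counts are exact; with room for only two groups, "b" is evicted once and comes back with a count that is too low. The big groups, which are usually the ones a search application shows, survive intact.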

Sphinx

mysql> SELECT max(forum_id) AS m,author_id AS a FROM sptest GROUP BY author_id ORDER BY m DESC LIMIT 10;
+------------+--------+----------+-----------+---------+
| id | weight | forum_id | author_id | m |
+------------+--------+----------+-----------+---------+
| 6739362135 | 1 | 2736983 | 139452247 | 2736983 |
| 6738995287 | 1 | 1762135 | 134125655 | 2736727 |
| 6739296855 | 1 | 2736471 | 139450967 | 2736471 |
| 6739297111 | 1 | 2736471 | 139451223 | 2736471 |
| 6739227479 | 1 | 2736215 | 139449687 | 2736215 |
| 6739227735 | 1 | 2736215 | 139449943 | 2736215 |
| 6739226967 | 1 | 2735959 | 139449175 | 2735959 |
| 6739227223 | 1 | 2735959 | 139449431 | 2735959 |
| 6739223383 | 1 | 2735703 | 139448663 | 2735703 |
| 6739223639 | 1 | 2735703 | 139448919 | 2735703 |
+------------+--------+----------+-----------+---------+
10 rows in set (32.47 sec)

MySQL

mysql> SELECT max(forum_id) AS m,author_id AS a FROM sptest GROUP BY author_id ORDER BY m DESC LIMIT 10;
+---------+-----------+
| m | a |
+---------+-----------+
| 2736983 | 139452247 |
| 2736727 | 134125655 |
| 2736471 | 139450967 |
| 2736471 | 139451223 |
| 2736215 | 139449687 |
| 2736215 | 139449943 |
| 2735959 | 139449175 |
| 2735959 | 139449431 |
| 2735703 | 139448663 |
| 2735703 | 139448919 |
+---------+-----------+
10 rows in set (1 min 15.03 sec)

Another optimization I wanted to check is the "early block reject", which should allow Sphinx to quickly throw away large blocks of attributes if they do not contain any matching data:
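A rough sketch of how such a block reject can work is below. I am assuming the common approach of storing a minimum and maximum per block of attribute values and skipping any block whose range cannot contain the filter value; the block layout and the names `build_blocks` and `count_equal` are illustrative, not taken from Sphinx.

```python
def build_blocks(values, block_size):
    """Split values into blocks and store (min, max, rows) for each block."""
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        blocks.append((min(chunk), max(chunk), chunk))
    return blocks

def count_equal(blocks, target):
    """Count rows equal to target; return (matches, rows_actually_scanned)."""
    matches = 0
    scanned = 0
    for lo, hi, chunk in blocks:
        if target < lo or target > hi:
            continue  # whole block rejected without touching its rows
        scanned += len(chunk)
        matches += sum(1 for v in chunk if v == target)
    return matches, scanned

# Hypothetical num_links column: mostly zeros, a few larger values at the end
num_links = [0] * 1000 + [5, 6, 7, 8]
blocks = build_blocks(num_links, block_size=4)
print(count_equal(blocks, 1))  # no block can contain 1, so nothing is scanned
print(count_equal(blocks, 6))  # only the last block is scanned
```

When the filter value falls outside every block's range, as with a WHERE clause that matches nothing, the query should finish after looking only at the per-block summaries.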

Sphinx

SELECT max(author_id) AS a ,forum_id AS f FROM sptest WHERE num_links=1;
Empty set (2.70 sec)

MySQL

mysql> SELECT max(author_id) AS a ,forum_id AS f FROM sptest WHERE num_links=1;
+------+------+
| a    | f    |
+------+------+
| NULL | NULL |
+------+------+
1 row in set (4.29 sec)

I would expect a much larger lead in this case because of this optimization, but it seems to be broken in the tested version.

Also note the difference in result sets: Sphinx finds no rows and creates no groups, while MySQL reports a NULL group as the result.

SphinxQL at this point is rather picky: it wants AS for all expressions, and it also failed to parse some queries for no apparent reason, though I expect these things to be polished in the near future. The good thing is that queries map to the same execution engine, which is quite stable, so SphinxQL will likely stabilize soon.

Sphinx also offers a number of SQL extensions which are helpful for search use cases. For example, WITHIN GROUP ORDER BY lets you select which item to pick within a given group (say, if you want to show the most recent document, or the most relevant one), and there are others.
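To show what WITHIN GROUP ORDER BY means semantically, here is a small sketch that emulates it on the client side: for each group, keep the single row that ranks first under the chosen ordering. The column names (`forum_id`, `posted_at`, `title`) and the function name are hypothetical, and this says nothing about how Sphinx implements it.

```python
def within_group_best(rows, group_key, order_key):
    """For each group, keep the row with the largest order_key.

    Roughly: GROUP BY group_key WITHIN GROUP ORDER BY order_key DESC.
    """
    best = {}
    for row in rows:
        g = row[group_key]
        if g not in best or row[order_key] > best[g][order_key]:
            best[g] = row
    return best

# Hypothetical documents: pick the most recent post per forum
docs = [
    {"forum_id": 1, "posted_at": 100, "title": "old post"},
    {"forum_id": 1, "posted_at": 250, "title": "new post"},
    {"forum_id": 2, "posted_at": 180, "title": "only post"},
]
for forum_id, row in within_group_best(docs, "forum_id", "posted_at").items():
    print(forum_id, row["title"])
```

Swapping `posted_at` for a relevance score gives the "most relevant document per group" variant.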

You might find the native API more full-featured at this point, but the command line language is very helpful for testing and debugging. It also means Sphinx can be accessed from languages which do not have a native Sphinx API implementation; everyone seems to be able to talk to MySQL these days.

Now on to performance: for this class of queries Sphinx was just 1.5-2 times faster. I honestly hoped for more, though I carefully picked queries which are reasonably good for both systems. It is easy to "break" MySQL, for example by making it do a GROUP BY with an on-disk temporary table, which would make Sphinx much faster, and there are a few other such cases.
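For reference, the ratios can be recomputed from the timings reported in the runs above. This is plain arithmetic on the single-run numbers in this post, so treat it as a rough indication and not as a benchmark; the query labels are shortened for readability.

```python
# (sphinx_seconds, mysql_seconds) as reported in the runs above
timings = {
    "ORDER BY author_id DESC LIMIT 10": (7.92, 17.91),
    "GROUP BY author_id ORDER BY m DESC LIMIT 10": (32.47, 75.03),  # 1 min 15.03 sec
    "WHERE num_links=1": (2.70, 4.29),
}

# How many times faster Sphinx was for each query
speedups = {name: mysql / sphinx for name, (sphinx, mysql) in timings.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
```

The individual ratios land close to the range quoted above, slightly higher for the two LIMIT queries and at the low end for the filter query.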

The true gain from Sphinx, however, comes from its ability to scale almost linearly across multiple CPU cores and multiple nodes. The raw scan speed was almost 10 million rows per second (on the rather outdated CPU I used for testing), which means you should be able to scan through 100M+ rows per second on a single modern 8-core server, which is quite a number.
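A back-of-the-envelope version of that estimate is sketched below. It rests on assumptions that are mine for the sake of the arithmetic: that the filter query above scanned all of the roughly 25 million rows in its 2.70 seconds, and that scaling across 8 cores is perfectly linear.

```python
rows_in_table = 25_000_000   # "some 25 million rows", so only approximate
scan_seconds = 2.70          # Sphinx time for the WHERE num_links=1 query
cores = 8                    # the "modern 8-core server" from the text

# Single-core scan rate on the test machine
rows_per_second = rows_in_table / scan_seconds

# Ideal linear scaling across cores, same per-core speed as the test CPU
projected = rows_per_second * cores

# How much faster each modern core would need to be to reach 100M rows/sec
needed_per_core_gain = 100_000_000 / projected

print(f"single core: {rows_per_second / 1e6:.1f}M rows/sec")
print(f"{cores} cores, linear scaling: {projected / 1e6:.1f}M rows/sec")
print(f"per-core gain needed for 100M rows/sec: {needed_per_core_gain:.2f}x")
```

Under these assumptions, reaching 100M+ rows per second needs both the linear scaling and a moderately faster core than the outdated test CPU.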

Entry posted by peter | No comment


19:45 Has-Patch Marathon Results » WordPress Development Blog

As promised, here are the results of the 24-hour has-patch marathon that was announced, begun and completed over the course of a few days last week (more on timing after the results). Results include activity from 8am Pacific time on Thursday, April 16, 2009 to 9am Pacific time on Friday, April 17, 2009.

Total number of patches committed to core: 44

Contributors whose old patches were committed: 9

Marathon contributors whose patches were committed: 13

Tickets closed: 102 (breakdown below)

Fixed – 45
Dupe – 16
Wontfix – 10
Invalid – 19
Worksforme – 12

Tickets created: 20 [I guess not everyone got the memo that we were trying to close tickets.]

Tickets reopened: 4

Number of testers who left comments in ticket threads: 10

Number of testing-specific comments: 18

These numbers are based on opening each ticket that registered activity during the marathon hours and counting the actual comments that indicated some testing of a patch. Contributions to philosophical discussions without a patch, while important, weren’t counted for this purpose. Nor were Trac notices that simply noted a ticket was being closed because it was a dupe, invalid, etc.

Top five contributors (committed patches): Denis-de-Bernardy, filosofo, nbachiyski, scohoust, simonwheatley

Top five testing feedback providers: shanef, Nicholas91, Denis-de-Bernardy, sivel, williamsba, mrmist (tie)

Given the short notice/last-minute nature of the marathon, I think we did pretty well. Granted, there were people who complained that two days wasn’t enough notice to clear their schedules, but let’s be honest, the 24-hour has-patch marathon was more of a rallying cry to help clean out Trac than a deadline based on anything specific. Patches are always welcome/encouraged, and now that the big features for 2.8 are mostly done, the lead devs will be able to spend more time reviewing Trac tickets and patches. Still, not too many people tested existing patches (or if they did, they failed to leave the requisite comment in the ticket threads). Testing patches is one of the easiest things you can do to help further development, since patches won’t be committed or rejected until they’ve been tested by several people.

As we get closer to the 2.8 release, jump into Trac any time and test a few patches (don’t forget to leave the feedback!) if you have time. If there’s a ticket you’re sick of seeing there, write a patch and ask your fellow contributors to test it and comment on the ticket thread. We’ll announce an official bug hunt soon (and yes, there will be more than two days’ notice), but the fact remains that addressing new bugs is easier if Trac isn’t clogged with old tickets. If you spot a duplicate ticket, mark it a dupe, note the other ticket number in the comments and close the ticket. If you see one that is no longer relevant because the current code base fixes a problem reported several versions ago, mark it invalid, leave a comment and close the ticket. These simple housekeeping tasks may not seem like much, but they do help. Special props to Denis-de-Bernardy, who in addition to writing a couple of patches during the marathon and testing a few others, did a bunch of ticket maintenance like this, and cleared out a number of tickets.

Thank you to everyone who participated, and until the next marathon, happy patching and testing!

Active Facebook Users By Country » O'Reilly Radar - Insight, analysis, and research about emerging technologies. » 车东's shared items in Google Reader

Since I last posted numbers on Facebook's user base six weeks ago, the company has added close to 20 million active users.

I've had a few requests for detailed numbers by country, so I quickly assembled an update for each of the regions shown above.

Among countries with at least a million users, the fastest-growing are Indonesia and the Philippines. According to Alexa, Facebook is now the 2nd most-popular site in Indonesia, displacing Friendster as the country's leading social network. The company now has close to 13M active users in Asia.

For more details, you can view or download regional numbers below:

Facebook Demo 20090415


I had to shrink the chart for Europe to fit into the slides; here is a larger version of that image:

[The equivalent chart for N. America & Other Regions can be found here.]
