MySQL分组查询TOP N的实践和踩坑

写在前边


昨天同事问我了一个问题:
假设有表 三个字段 课程 学生 成绩,如何取每门课程成绩 top3 的学生
当然,凭我的知识储备,并没有很好的思路去解决,所以自然是掏出一个神器来寻找答案。
一番查阅,使我明白了许多,所以记录一下。
以下是我对网上各种方法的实践&测试&总结。

分组以后求 TOP N


准备工作

创建两个表 test1test2,表结构相同,数据不同。

创建表

1
2
3
4
5
6
7
8
CREATE TABLE `test1` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(20) DEFAULT NULL,
`course` varchar(20) DEFAULT NULL,
`score` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `course` (`course`,`score`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4;

插入数据

插入表 test1

test1 是一个有 1.8W 数据的表,有 180 个学生,分别有 100 门课的成绩

a1 a2 a3 三名学生分别拥有 b0-b99100 个课程的最高分 99 89 79

其他学生的成绩都是 60

1
2
3
4
5
6
7
8
9
10
11
insert into test1(name,course,score)
values
('a1','b0',99),
('a2','b0',89),
('a3','b0',79),
...
...
('a177','b99',60),
('a178','b99',60),
('a179','b99',60)
;

插入表 test2

test2 插入了 9 条数据

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
mysql> INSERT INTO `test2`(name,course,score) VALUES
('a1', 'a', 50),
('a2', 'a', 50),
('a3', 'a', 50),
('a4', 'a', 40),
('a5', 'a', 40),
('a6', 'a', 40),
('a7', 'a', 30),
('a8', 'a', 30),
('a9', 'a', 30);
mysql> select * from test2;
+----+------+--------+-------+
| id | name | course | score |
+----+------+--------+-------+
| 1 | a1 | a | 50 |
| 2 | a2 | a | 50 |
| 3 | a3 | a | 50 |
| 4 | a4 | a | 40 |
| 5 | a5 | a | 40 |
| 6 | a6 | a | 40 |
| 7 | a7 | a | 30 |
| 8 | a8 | a | 30 |
| 9 | a9 | a | 30 |
+----+------+--------+-------+
9 rows in set (0.06 sec)

解决方案


以下是我对网上的解决方案的查询和总结

使用 union all

已知课程数目和课程,并且结果集比较小,可以使用 union all

1
2
3
4
5
mysql>(select name,course,score from test1 where course='b0' order by score desc limit 3)
-> union all
-> (select name,course,score from test1 where course='b1' order by score desc limit 3)
-> union all
-> (select name,course,score from test1 where course='b2' order by score desc limit 3);

实际应用中,显然我们可以提前查课程表得知课程数目和课程情况,然后需要循环课程表的结果拼接 sql 查询成绩表,由于 course 有索引,所以速度也不会很慢,麻烦是很显然的(不优雅)。
但是如果课程很多,N 也比较大,还是不合理的。

自身的左连接

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
mysql> select a.name,a.course,a.score
from test2 a left join test2 b on a.course=b.course and a.score<b.score
group by a.name,a.course,a.score
having count(b.id)<3
order by a.course,a.score desc;
+------+--------+-------+
| name | course | score |
+------+--------+-------+
| a2 | a | 50 |
| a1 | a | 50 |
| a3 | a | 50 |
+------+--------+-------+
3 rows in set (0.07 sec)


mysql> select a.name,a.course,a.score,b.name as b_name,b.score as b_score
from test2 a left join test2 b on a.course=b.course and a.score<b.score
where a.name = 'a4';
+------+--------+-------+--------+---------+
| name | course | score | b_name | b_score |
+------+--------+-------+--------+---------+
| a4 | a | 40 | a1 | 50 |
| a4 | a | 40 | a2 | 50 |
| a4 | a | 40 | a3 | 50 |
+------+--------+-------+--------+---------+
3 rows in set (0.06 sec)

自身的左连接筛选,在左右表课程相同的情况下,每一条记录都会关联到所有右表中相同课程的成绩大于当前记录的记录(第二条 sql 的结果十分显然),然后再进行筛选。
所以想象一下在实际的应用中,成绩的重复度是很低的,成绩最低的那条记录会匹配到所有成绩更高的记录,那么关联的数据量将会十分可观。

子查询

1
2
3
4
5
6
7
8
9
10
11
12
mysql> select *
from test2 a
where 3>(select count(*) from test2 where course=a.course and score>a.score)
order by a.course,a.score desc;
+----+------+--------+-------+
| id | name | course | score |
+----+------+--------+-------+
| 1 | a1 | a | 50 |
| 2 | a2 | a | 50 |
| 3 | a3 | a | 50 |
+----+------+--------+-------+
3 rows in set (0.05 sec)

子查询和自身做链接的解决思路差不多,实际查询的数据量也是十分可观的。

使用临时变量

先按照课程和成绩倒序排,那么这个时候的结果集中,相同课程的成绩一定是倒序排的,所以用变量@group 记录一下上一条的课程 ID,如果@group 与本条记录中的 course 相等,就让变量@num 自增 1,否则让@num 等于 1,这样一来,@num 就代表了每门课程中成绩的名次。然后对结果集中的@num 变量进行筛选,就可以得到想要的结果。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
select name, course, score
from (
select name, course, score,
@num := if(@group = course, @num + 1, 1) as row_number,
@group := course
from test2
order by course, score desc
) as data where data.row_number <= 3;
+------+--------+-------+
| name | course | score |
+------+--------+-------+
| a1 | a | 50 |
| a2 | a | 50 |
| a3 | a | 50 |
+------+--------+-------+
3 rows in set (0.06 sec)

这个过程中,第二遍筛选的结果集是在执行器调用存储引擎接口返回给数据库 server 层以后进行处理的,所以速度自然是非常的快。

以为这个方法的介绍这样就结束了?再往下看

1
2
3
4
5
6
7
8
9
select name, course, score
from (
select name, course, score,
@num := if(@group = course, @num + 1, 1) as row_number,
@group := course
from test2
order by course, score desc
) as data where data.row_number <= 3;
Empty set

同样的 sql 执行过后,结果却是的。
为什么呢?
回看上边对表 test2 的数据展示,这个表只有一种课程。所以造成这个问题的原因就十分显然了。
第一遍执行过后,@group@num 是我们本次会话创建的临时变量,@group 存储的值是第一遍查询的结果集中最后一条记录的 course 值。由于单次会话内的临时变量在会话结束后才会销毁,那么第二次查询问题就产生了,@num 的值只有当@group 与记录中的 course 不同时,才会变为 1,如果数据库中的数据只有一种课程,那么单次会话中就无法触发这个条件,导致后续的查询得不到我们想要的结果
怎么解决呢?

MySQL 中有mysql_reset_connection()方法,在不关闭本次会话的情况下,重置本次的链接,这个过程包含了销毁临时变量。
那么我们每次执行这个 sql 都要先执行这个方法吗?
非也,网友给我们了一个更加方便的答案,灵活运用交叉链接(cross join)。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
mysql> select name, course, score
from
(
select name, course, score,
(@num:=if(@group = course, @num +1, if(@group := course, 1, 1))) row_number
from test2 t
CROSS JOIN (select @num:=0, @group:=null) c
order by course, score desc, name
) as data
where data.row_number <= 3;
+------+--------+-------+
| name | course | score |
+------+--------+-------+
| a1 | a | 50 |
| a2 | a | 50 |
| a3 | a | 50 |
+------+--------+-------+
3 rows in set (0.05 sec)

mysql> select name, course, score
from
(
select name, course, score,
(@num:=if(@group = course, @num +1, if(@group := course, 1, 1))) row_number
from test2 t
CROSS JOIN (select @num:=0, @group:=null) c
order by course, score desc, name
) as data
where data.row_number <= 3;
+------+--------+-------+
| name | course | score |
+------+--------+-------+
| a1 | a | 50 |
| a2 | a | 50 |
| a3 | a | 50 |
+------+--------+-------+
3 rows in set (0.06 sec)

连续执行两次,结果一样,问题解决。

性能测试

上边已经介绍过 test1 是一个有 1.8W 数据的表,有 180 个学生,分别有 100 门课的成绩。

高重复度


现在 test1 的数据重复度很高,在较高重复度的情况下,自身左连接和子查询会关联相对更少的数据,应该会快一点。

1
2
3
4
5
6
7
8
9
10
mysql> select score,count(*) from test1 group by score;
+-------+----------+
| score | count(*) |
+-------+----------+
| 60 | 17700 |
| 79 | 100 |
| 89 | 100 |
| 99 | 100 |
+-------+----------+
4 rows in set (0.05 sec)

union all

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
mysql> (select name,course,score from test1 where course='b0' order by score desc limit 3)
union all
(select name,course,score from test1 where course='b1' order by score desc limit 3)
union all
(select name,course,score from test1 where course='b2' order by score desc limit 3)
...
...
(select name,course,score from test1 where course='b97' order by score desc limit 3)
union all
(select name,course,score from test1 where course='b98' order by score desc limit 3)
union all
(select name,course,score from test1 where course='b99' order by score desc limit 3)

// 省略结果
300 rows in set (0.40 sec)

自身左连接

1
2
3
4
5
6
7
8
mysql> select a.name,a.course,a.score
from test1 a left join test1 b on a.course=b.course and a.score<b.score
group by a.name,a.course,a.score
having count(b.id)<3
order by a.course,a.score desc;

// 省略结果
300 rows in set (1.44 sec)

子查询

1
2
3
4
5
6
7
mysql> select *
from test1 a
where 3>(select count(*) from test1 where course=a.course and score>a.score)
order by a.course,a.score desc;

// 省略结果
300 rows in set (1.29 sec)

临时变量

1
2
3
4
5
6
7
8
9
10
11
12
13
mysql> select name, course, score
from
(
select name, course, score,
(@num:=if(@group = course, @num +1, if(@group := course, 1, 1))) row_number
from test1 t
CROSS JOIN (select @num:=0, @group:=null) c
order by course, score desc, name
) as data
where data.row_number <= 3;

// 省略结果
300 rows in set (0.39 sec)

降低成绩的重复度


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
mysql> update test1 set score = 1 where name = 'a101';
update test1 set score = 2 where name = 'a102';
update test1 set score = 3 where name = 'a103';
update test1 set score = 4 where name = 'a104';
update test1 set score = 5 where name = 'a105';
update test1 set score = 6 where name = 'a106';
update test1 set score = 7 where name = 'a107';
update test1 set score = 8 where name = 'a108';
update test1 set score = 9 where name = 'a109';
update test1 set score = 11 where name = 'a111';
update test1 set score = 12 where name = 'a112';
update test1 set score = 13 where name = 'a113';
update test1 set score = 14 where name = 'a114';
update test1 set score = 15 where name = 'a115';
update test1 set score = 16 where name = 'a116';
update test1 set score = 17 where name = 'a117';
update test1 set score = 18 where name = 'a118';
update test1 set score = 19 where name = 'a119';

//情况如下
mysql> select score,count(*) from test1 group by score;
+-------+----------+
| score | count(*) |
+-------+----------+
| 0 | 100 |
| 1 | 100 |
| 2 | 100 |
...
...
| 58 | 100 |
| 59 | 100 |
| 60 | 11700 |
| 79 | 100 |
| 89 | 100 |
| 99 | 100 |
+-------+----------+
64 rows in set (0.11 sec)

union all

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
mysql> (select name,course,score from test1 where course='b0' order by score desc limit 3)
union all
(select name,course,score from test1 where course='b1' order by score desc limit 3)
union all
(select name,course,score from test1 where course='b2' order by score desc limit 3)
...
...
(select name,course,score from test1 where course='b97' order by score desc limit 3)
union all
(select name,course,score from test1 where course='b98' order by score desc limit 3)
union all
(select name,course,score from test1 where course='b99' order by score desc limit 3)

// 省略结果
300 rows in set (0.44 sec)

自身左连接

1
2
3
4
5
6
7
8
mysql> select a.name,a.course,a.score
from test1 a left join test1 b on a.course=b.course and a.score<b.score
group by a.name,a.course,a.score
having count(b.id)<3
order by a.course,a.score desc;

// 省略结果
300 rows in set (2.29 sec)

子查询

1
2
3
4
5
6
7
mysql> select *
from test1 a
where 3>(select count(*) from test1 where course=a.course and score>a.score)
order by a.course,a.score desc;

// 省略结果
300 rows in set (1.42 sec)

临时变量

1
2
3
4
5
6
7
8
9
10
11
12
13
mysql> select name, course, score
from
(
select name, course, score,
(@num:=if(@group = course, @num +1, if(@group := course, 1, 1))) row_number
from test1 t
CROSS JOIN (select @num:=0, @group:=null) c
order by course, score desc, name
) as data
where data.row_number <= 3;

// 省略结果
300 rows in set (0.41 sec)

总结


方法 优点
union all 在有索引的情况下,速度并不慢,而且从存储引擎返回到 server 层的数据量最小。然后 sql 的拼接十分麻烦,没有索引就 GG 了。
自身左连接 数据的重复度越低,生成的结果集越大,自然非常的慢
子查询 同样数据的重复度越低,生成的结果集越大,速度相对自身左连接快一点,不过依然很慢。
临时变量 速度快,数据的筛选是在存储引擎返回到 server 层以后进行了筛选,数据量比较大

综上所述,我选择使用临时变量查询。

进阶

这个问题呢,其实已经在 leetcode 上有的。
部门工资前三高的员工
在这个问题中,工资相同的员工被认为名次相同。如果部门相同的情况下A 100,B 90,C 90,D 80,E 70,那么要求A B C D都要出现在结果集中

1
2
3
4
5
6
7
8
9
10
11
mysql> select Department.Name as Department, data.name as Employee, Salary
from
(
select Name , DepartmentId, Salary,
(@num:=if(@group = DepartmentId , if(@SSalary = Salary,@num,@num+1), if(@group := DepartmentId, 1, 1))) row_number,
@SSalary := Salary
from Employee t
CROSS JOIN (select @num:=0, @group:=null,@SSalary := 0) c
order by DepartmentId, Salary desc, name
) as data join Department on Department.Id = data.DepartmentId
where data.row_number <= 3;

问题解决,收下开心。

End


如果有兴趣,可以制造更多的数据量观察分析,并可以测试一下内存占用的情况。
如果有问题,欢迎评论。

参考链接


文章不错,你都不请我喝杯茶,就是说你呀!
0%
upyun