PostgreSQL 全表全字段模糊查询的毫秒级高效实现

您所在的位置：网站首页 › hit5组合 › PostgreSQL 全表全字段模糊查询的毫秒级高效实现

PostgreSQL 全表全字段模糊查询的毫秒级高效实现

2023-09-27 03:16| 来源: 网络整理| 查看: 265

标签

PostgreSQL , 分词 , 全文检索 , 全字段检索 , 任意字段检索 , 下拉框选择 , 搜索引擎

背景

在一些应用程序中，可能需要对表的所有字段进行检索，有些字段可能需要精准查询，有些字段可能需要模糊查询或全文检索。

比如一些前端页面下拉框的勾选和选择。

这种需求对于应用开发人员来说，会很蛋疼，因为写SQL很麻烦，例子：

之前写过一篇文章来解决这个问题

《PostgreSQL 行级全文检索》

使用的是全文检索，而当用户的需求为模糊查询时？如何来解决呢？

不难想到我之前写过的一系列文章

《PostgreSQL 百亿数据秒级响应正则及模糊查询》

《PostgreSQL 1000亿数据量正则匹配速度与激情》

《中文模糊查询性能优化 by PostgreSQL trgm》

《从难缠的模糊查询聊开 - PostgreSQL独门绝招之一 GIN , GiST , SP-GiST , RUM 索引原理与技术背景》

《PostgreSQL 全文检索加速快到没有朋友 - RUM索引接口(潘多拉魔盒)》

全表，所有字段的模糊查询，应该如何做呢 ?

关键技术还是pg_trgm。

《PostgreSQL 9.3 pg_trgm imporve support multi-bytes char and gist,gin index for reg-exp search》

全表全字段模糊查询的实现例子

比如有一张这样的表，有若干个字段，然后前端设计了一个页面，允许用户进行模糊搜索，但是搜索的范围是所有字段。

这样做用户体验是好了，但是对于程序来说有点蛋疼，因为我们并不知道用户想要搜索的是哪个或哪些字段。

那么怎么能做到高效的匹配呢？

创建测试表，生成测试数据

postgres=# create table t(phonenum text, info text, c1 int, c2 text, c3 text, c4 timestamp); CREATE TABLE postgres=# insert into t values ('13888888888','i am digoal, a postgresqler',123,'china','中华人民共和国，阿里巴巴，阿',now()); INSERT 0 1 postgres=# select * from t; phonenum | info | c1 | c2 | c3 | c4 -------------+-----------------------------+-----+-------+------------------------------+---------------------------- 13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国，阿里巴巴，阿 | 2016-04-19 11:15:55.208658 (1 row)

首先，被搜索的字段中如果有中文或者其他mutli-bytes字符，那么不能使用collate, ctype=c的数据库。

还好阿里云RDS PostgreSQL默认就不是C的，很棒。

如果不是，你可以这么指定collate 和 ctype postgres=# create database test with template template0 lc_collate 'zh_CN.utf8' lc_ctype 'zh_CN.utf8';

那么接下来，我们要做的是，建立支持模糊查询的函数索引，

create extension pg_trgm; create or replace function record_to_text(anyelement) returns text as $$ select $1::text; $$ language sql strict immutable; test=# create index idx_t_1 on t using gin (record_to_text(t) gin_trgm_ops) ; CREATE INDEX 当需要使用分页，或者结果集很大时，建议使用gist test=# create index idx_t_2 on t using gist (record_to_text(t) gist_trgm_ops) ; CREATE INDEX

查询测试

test=# explain select * from t where record_to_text(t) ~ 'digoal'; QUERY PLAN ------------------------------------------------------------------- Index Scan using idx_t_2 on t (cost=0.38..8.39 rows=1 width=140) Index Cond: (record_to_text(t.*) ~ 'digoal'::text) (2 rows)

查询性能测试

先插一堆数据进去 postgres=# insert into t select * from t; INSERT 0 4194304 test=# select count(*) from t; count --------- 4194304 (1 row) 然后插几条不一样的 insert into t values ('13888889999','i am dege, a postgresqler',123,'china','德歌德哥刘德华彭德怀',now()); insert into t values ('13888889999','i am dege, a postgresqler',123,'china','德歌德哥刘德华彭德怀',now()); vacuum analyze t;

查询速度杠杠的

test=# explain (analyze,verbose,timing,costs,buffers) select * from t where record_to_text(t) ~ 'dege'; QUERY PLAN -------------------------------------------------------------------------------------------------------------------- Index Scan using idx_t_2 on public.t (cost=0.41..2.43 rows=1 width=101) (actual time=0.236..0.254 rows=2 loops=1) Output: phonenum, info, c1, c2, c3, c4 Index Cond: (record_to_text(t.*) ~ 'dege'::text) Buffers: shared hit=5 Planning time: 0.349 ms Execution time: 0.301 ms (6 rows)

测试查询包含刘德华的行(因为行很少，所以建议使用GIN索引)

test=# /*+ BitmapScan(t idx_t_1) */ explain (analyze,verbose,timing,costs,buffers) select * from t where record_to_text(t) ~ '刘德华' limit 10; LOG: available indexes for BitmapScan(t): idx_t_1 LOG: pg_hint_plan: used hint: BitmapScan(t idx_t_1) not used hint: duplication hint: error hint: LOG: pg_hint_plan: used hint: not used hint: BitmapScan(t idx_t_1) duplication hint: error hint: QUERY PLAN ------------------------------------------------------------------------------------------------------------------------ Limit (cost=441.00..442.26 rows=1 width=101) (actual time=0.239..0.255 rows=2 loops=1) Output: phonenum, info, c1, c2, c3, c4 Buffers: shared hit=4 -> Bitmap Heap Scan on public.t (cost=441.00..442.26 rows=1 width=101) (actual time=0.238..0.252 rows=2 loops=1) Output: phonenum, info, c1, c2, c3, c4 Recheck Cond: (record_to_text(t.*) ~ '刘德华'::text) Heap Blocks: exact=1 Buffers: shared hit=4 -> Bitmap Index Scan on idx_t_1 (cost=0.00..441.00 rows=1 width=0) (actual time=0.086..0.086 rows=2 loops=1) Index Cond: (record_to_text(t.*) ~ '刘德华'::text) Buffers: shared hit=3 Planning time: 0.494 ms Execution time: 0.313 ms (13 rows) test=# /*+ BitmapScan(t idx_t_1) */ select * from t where record_to_text(t) ~ '刘德华' limit 10; LOG: available indexes for BitmapScan(t): idx_t_1 LOG: pg_hint_plan: used hint: BitmapScan(t idx_t_1) not used hint: duplication hint: error hint: LOG: pg_hint_plan: used hint: not used hint: BitmapScan(t idx_t_1) duplication hint: error hint: phonenum | info | c1 | c2 | c3 | c4 -------------+---------------------------+-----+-------+-------------------------+---------------------------- 13888889999 | i am dege, a postgresqler | 123 | china | 德歌德哥刘德华彭德怀 | 2017-01-06 17:04:42.19215 13888889999 | i am dege, a postgresqler | 123 | china | 德歌德哥刘德华彭德怀 | 2017-01-06 17:04:42.514895 (2 rows) Time: 1.225 ms 语句超时

通常这种索引命中，根据返回的结果集数量，响应时间可能是 0.几毫秒到几十毫秒不等。

不过有些时候，可能因为用户输入的信息量太少，比如输入了2个字符，那么被匹配到的token信息会很多，导致变慢。

使用GiST可以缓解。

那么应用层可以做一个保护，比如超过1秒，则报语句超时。

postgres=# set statement_timeout = '1s'; SET or test=# /*+ Set(statement_timeout 1s) */ select * from t where record_to_text(t) ~ 'd' limit 10; LOG: pg_hint_plan: used hint: Set(statement_timeout 1s) not used hint: duplication hint: error hint: LOG: pg_hint_plan: used hint: Set(statement_timeout 1s) not used hint: duplication hint: error hint: phonenum | info | c1 | c2 | c3 | c4 -------------+-----------------------------+-----+-------+------------------------------+---------------------------- 13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国，阿里巴巴，阿 | 2017-01-06 16:51:16.840941 13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国，阿里巴巴，阿 | 2017-01-06 16:51:16.840941 13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国，阿里巴巴，阿 | 2017-01-06 16:51:16.840941 13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国，阿里巴巴，阿 | 2017-01-06 16:51:16.840941 13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国，阿里巴巴，阿 | 2017-01-06 16:51:16.840941 13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国，阿里巴巴，阿 | 2017-01-06 16:51:16.840941 13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国，阿里巴巴，阿 | 2017-01-06 16:51:16.840941 13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国，阿里巴巴，阿 | 2017-01-06 16:51:16.840941 13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国，阿里巴巴，阿 | 2017-01-06 16:51:16.840941 13888888888 | i am digoal, a postgresqler | 123 | china | 中华人民共和国，阿里巴巴，阿 | 2017-01-06 16:51:16.840941 (10 rows) hint的使用

使用规则很简单

当使用游标返回时，使用gist

当输入的字符少于3个时，使用GIST

当评估行很少时，使用GIN

其他情况都是要GIN

有了以上规则，你就可以通过HINT，强制使用哪个索引了。

hint使用方法参考：

《从难缠的模糊查询聊开 - PostgreSQL独门绝招之一 GIN , GiST , SP-GiST , RUM 索引原理与技术背景》

其他优化

业务层面也可以做出一些优化，比如可以先用全文检索，如果没有匹配到，再用模糊查询。

又比如gist, gin, rum索引应该如何选择，可以参考这篇文档

《从难缠的模糊查询聊开 - PostgreSQL独门绝招之一 GIN , GiST , SP-GiST , RUM 索引原理与技术背景》

只读实例

按照前面的测试，通常来说一个查询的响应应该在1毫秒以内，

对于一个32核的机器，这种模糊查询能达到的QPS估计在8万左右。

如果你发现单节点，在已优化的情况下，已经不能满足查询的并发，那么可以构建只读实例。

构建只读实例的方法也很简单，请参考

https://yq.aliyun.com/articles/7255

https://github.com/aliyun/rds_dbsync

参考

《PostgreSQL 行级全文检索》

《PostgreSQL 百亿数据秒级响应正则及模糊查询》

《PostgreSQL 1000亿数据量正则匹配速度与激情》

《中文模糊查询性能优化 by PostgreSQL trgm》

搜索引擎引以为豪的rank排序, phrase已经不是什么秘密，在PostgreSQL里面已经攻破。如下文章有介绍。

《从难缠的模糊查询聊开 - PostgreSQL独门绝招之一 GIN , GiST , SP-GiST , RUM 索引原理与技术背景》

《PostgreSQL 全文检索加速快到没有朋友 - RUM索引接口(潘多拉魔盒)》

《PostgreSQL 9.3 pg_trgm imporve support multi-bytes char and gist,gin index for reg-exp search》

【本文地址】

PostgreSQL 全表全字段模糊查询的毫秒级高效实现

PostgreSQL 全表全字段模糊查询的毫秒级高效实现

今日新闻

推荐新闻

PostgreSQL 全表 全字段 模糊查询的毫秒级高效实现

PostgreSQL 全表 全字段 模糊查询的毫秒级高效实现

今日新闻

推荐新闻

PostgreSQL 全表全字段模糊查询的毫秒级高效实现

PostgreSQL 全表全字段模糊查询的毫秒级高效实现