Severe performance degradation of Knex select queries to PostgreSQL under many parallel requests
TL;DR
I'm working on a game (a dream project), and my backend stack is Node.js with Knex and PostgreSQL (9.6). I keep all the player data there and need to request it frequently. One of the requests needs to do 10 simple selects to pull the data, and that is where the problem starts: these queries are quite fast (~1 ms) if the server is serving only 1 request at a time. But if the server is serving many requests in parallel (100-400), query execution time degrades dramatically (up to several seconds per query).
Details
To be more objective, I will describe the server's request goal, the select queries, and the results I got.
About the system
I run the Node code and Postgres on Digital Ocean 4 CPU / 8 GB droplets (2 separate droplets, identical configuration).
About the request
It needs to do some game actions, for which it selects data for 2 players from the database.
DDL
Player data is represented by 5 tables:
CREATE TABLE public.player_profile(
    id integer NOT NULL DEFAULT nextval('player_profile_id_seq'::regclass),
    public_data integer NOT NULL,
    private_data integer NOT NULL,
    current_active_deck_num smallint NOT NULL DEFAULT '0'::smallint,
    created_at bigint NOT NULL DEFAULT '0'::bigint,
    CONSTRAINT player_profile_pkey PRIMARY KEY (id),
    CONSTRAINT player_profile_private_data_foreign FOREIGN KEY (private_data)
        REFERENCES public.profile_private_data (id) MATCH SIMPLE
        ON UPDATE NO ACTION ON DELETE NO ACTION,
    CONSTRAINT player_profile_public_data_foreign FOREIGN KEY (public_data)
        REFERENCES public.profile_public_data (id) MATCH SIMPLE
        ON UPDATE NO ACTION ON DELETE NO ACTION
);

CREATE TABLE public.player_character_data(
    id integer NOT NULL DEFAULT nextval('player_character_data_id_seq'::regclass),
    owner_player integer NOT NULL,
    character_id integer NOT NULL,
    experience_counter integer NOT NULL,
    level_counter integer NOT NULL,
    character_name character varying(255) COLLATE pg_catalog."default" NOT NULL,
    created_at bigint NOT NULL DEFAULT '0'::bigint,
    CONSTRAINT player_character_data_pkey PRIMARY KEY (id),
    CONSTRAINT player_character_data_owner_player_foreign FOREIGN KEY (owner_player)
        REFERENCES public.player_profile (id) MATCH SIMPLE
        ON UPDATE NO ACTION ON DELETE NO ACTION
);

CREATE TABLE public.player_cards(
    id integer NOT NULL DEFAULT nextval('player_cards_id_seq'::regclass),
    card_id integer NOT NULL,
    owner_player integer NOT NULL,
    card_level integer NOT NULL,
    first_deck boolean NOT NULL,
    consumables integer NOT NULL,
    second_deck boolean NOT NULL DEFAULT false,
    third_deck boolean NOT NULL DEFAULT false,
    quality character varying(10) COLLATE pg_catalog."default" NOT NULL DEFAULT 'none'::character varying,
    CONSTRAINT player_cards_pkey PRIMARY KEY (id),
    CONSTRAINT player_cards_owner_player_foreign FOREIGN KEY (owner_player)
        REFERENCES public.player_profile (id) MATCH SIMPLE
        ON UPDATE NO ACTION ON DELETE NO ACTION
);

CREATE TABLE public.player_character_equipment(
    id integer NOT NULL DEFAULT nextval('player_character_equipment_id_seq'::regclass),
    owner_character integer NOT NULL,
    item_id integer NOT NULL,
    item_level integer NOT NULL,
    item_type character varying(20) COLLATE pg_catalog."default" NOT NULL,
    is_equipped boolean NOT NULL,
    slot_num integer,
    CONSTRAINT player_character_equipment_pkey PRIMARY KEY (id),
    CONSTRAINT player_character_equipment_owner_character_foreign FOREIGN KEY (owner_character)
        REFERENCES public.player_character_data (id) MATCH SIMPLE
        ON UPDATE NO ACTION ON DELETE NO ACTION
);

CREATE TABLE public.player_character_runes(
    id integer NOT NULL DEFAULT nextval('player_character_runes_id_seq'::regclass),
    owner_character integer NOT NULL,
    item_id integer NOT NULL,
    slot_num integer,
    decay_start_timestamp bigint,
    CONSTRAINT player_character_runes_pkey PRIMARY KEY (id),
    CONSTRAINT player_character_runes_owner_character_foreign FOREIGN KEY (owner_character)
        REFERENCES public.player_character_data (id) MATCH SIMPLE
        ON UPDATE NO ACTION ON DELETE NO ACTION
);
With these indexes:
knex.raw('create index "player_cards_owner_player_first_deck_index" on "player_cards"("owner_player") WHERE first_deck = TRUE');
knex.raw('create index "player_cards_owner_player_second_deck_index" on "player_cards"("owner_player") WHERE second_deck = TRUE');
knex.raw('create index "player_cards_owner_player_third_deck_index" on "player_cards"("owner_player") WHERE third_deck = TRUE');
knex.raw('create index "player_character_equipment_owner_character_is_equipped_index" on "player_character_equipment" ("owner_character") WHERE is_equipped = TRUE');
knex.raw('create index "player_character_runes_owner_character_slot_num_not_null_index" on "player_character_runes" ("owner_character") WHERE slot_num IS NOT NULL');
Code
First query
async.parallel([
    cb => tx('player_character_data')
        .select('character_id', 'id')
        .where('owner_player', playerId)
        .limit(1)
        .asCallback(cb),
    cb => tx('player_character_data')
        .select('character_id', 'id')
        .where('owner_player', enemyId)
        .limit(1)
        .asCallback(cb)
], callbackFn);
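Since the same columns are fetched for both players, one way to halve the round-trips per step is a single `whereIn` select. This is a sketch with a hypothetical helper, not the question's original code; it assumes one character row per player (the original's `.limit(1)` is dropped):

```javascript
// Hypothetical helper: fetch both players' character rows in ONE query
// instead of two parallel selects. Table and column names come from the
// DDL above; owner_player is selected too, to tell the rows apart.
function selectBothCharacters(tx, playerId, enemyId) {
  return tx('player_character_data')
    .select('character_id', 'id', 'owner_player')
    .whereIn('owner_player', [playerId, enemyId]);
}
```

With 10 selects per request, cutting each pair down to one query also halves the pressure on the connection pool.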
Second query
async.parallel([
    cb => tx('player_profile')
        .select('current_active_deck_num')
        .where('id', playerId)
        .asCallback(cb),
    cb => tx('player_profile')
        .select('current_active_deck_num')
        .where('id', enemyId)
        .asCallback(cb)
], callbackFn);
Third query
playerQ = { first_deck: true }
enemyQ = { first_deck: true }
MAX_CARDS_IN_DECK = 5
async.parallel([
    cb => tx('player_cards')
        .select('card_id', 'card_level')
        .where('owner_player', playerId)
        .andWhere(playerQ)
        .limit(MAX_CARDS_IN_DECK)
        .asCallback(cb),
    cb => tx('player_cards')
        .select('card_id', 'card_level')
        .where('owner_player', enemyId)
        .andWhere(enemyQ)
        .limit(MAX_CARDS_IN_DECK)
        .asCallback(cb)
], callbackFn);
Fourth query
MAX_EQUIPPED_ITEMS = 3
async.parallel([
    cb => tx('player_character_equipment')
        .select('item_id', 'item_level')
        .where('owner_character', playerCharacterUniqueId)
        .andWhere('is_equipped', true)
        .limit(MAX_EQUIPPED_ITEMS)
        .asCallback(cb),
    cb => tx('player_character_equipment')
        .select('item_id', 'item_level')
        .where('owner_character', enemyCharacterUniqueId)
        .andWhere('is_equipped', true)
        .limit(MAX_EQUIPPED_ITEMS)
        .asCallback(cb)
], callbackFn);
Fifth query
runeSlotsMax = 3
async.parallel([
    cb => tx('player_character_runes')
        .select('item_id', 'decay_start_timestamp')
        .where('owner_character', playerCharacterUniqueId)
        .whereNotNull('slot_num')
        .limit(runeSlotsMax)
        .asCallback(cb),
    cb => tx('player_character_runes')
        .select('item_id', 'decay_start_timestamp')
        .where('owner_character', enemyCharacterUniqueId)
        .whereNotNull('slot_num')
        .limit(runeSlotsMax)
        .asCallback(cb)
], callbackFn);
EXPLAIN (ANALYZE)
Only index scans, with planning and execution times under 1 ms. I can post them if needed (not posted here to save space).
The timings
(total is the number of requests; min/max/avg/median are response times)
- 4 concurrent requests:
{ "total": 300, "avg": 1.81, "median": 2, "min": 1, "max": 6 }
- 400 concurrent requests:
{ "total": 300, "avg": 209.57666666666665, "median": 176, "min": 9, "max": 1683 }
- 400 concurrent requests, last select:
{ "total": 300, "avg": 2105.9, "median": 2005, "min": 1563, "max": 4074 }
I tried logging queries slower than 100 ms: nothing showed up. I also tried increasing the connection pool size up to the number of parallel requests; that didn't help either.
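For context, the pool size mentioned above is set when initializing knex. A minimal sketch of such a configuration, where all connection details are placeholders rather than values from the question:

```javascript
// Sketch of a knex configuration with an enlarged connection pool.
// Host and credentials are placeholders, not the question's real values.
const knexConfig = {
  client: 'pg',
  connection: {
    host: 'db-droplet-host',
    user: 'game',
    password: 'secret',
    database: 'game'
  },
  // Pool bounds; max raised toward the number of parallel requests,
  // which (as the accepted answer shows) did not fix the root cause.
  pool: { min: 2, max: 400 }
};
// const knex = require('knex')(knexConfig);
```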
I can see three potential problems here:
- 400 concurrent requests is actually quite a lot, and your machine specs are nothing to get excited about. Maybe this is my MSSQL background speaking, but I would imagine this is a case where you may need to beef up the hardware.
- The communication between the two servers should be fast, but it may account for some of the delay you are seeing. One powerful server might be a better solution.
- I assume you have a reasonable amount of data (400 concurrent connections should imply plenty of storage). It might be useful to post some of the actual generated SQL. A lot depends on the SQL that Knex produces, and there may be optimizations available. Indexing comes to mind, but one would need to see the SQL to be sure.
Your tests don't seem to include network latency from the clients, so that may be an additional issue you haven't accounted for yet.
A solution was found quickly, but I forgot to reply here (I was busy, sorry).
There was no magic behind the slow queries, just Node's event-loop nature:
- All similar requests were run in parallel;
- I had a block of code with a very slow execution time (~150-200 ms);
- With ~800 parallel requests, a 150 ms code block turns into ~10000 ms of event-loop lag;
- All you observe is the appearance of slow requests, but it is only the callbacks lagging, not the database;
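The effect described above is easy to reproduce: a synchronous chunk blocks the event loop, so the delays of queued work add up. A minimal self-contained demonstration, with smaller numbers than the ~150 ms / ~800 requests above:

```javascript
// Busy-wait to simulate a CPU-bound synchronous chunk of code.
function blockFor(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) { /* spin */ }
}

// Simulate N "requests" that each run a small synchronous chunk back to back.
const N = 20;
const BLOCK_MS = 5;
const start = Date.now();
for (let i = 0; i < N; i++) blockFor(BLOCK_MS);
const lag = Date.now() - start;
// The last request observes roughly N * BLOCK_MS of accumulated lag,
// even though each individual chunk is only BLOCK_MS long.
```

This is exactly why no individual query looked slow: the database answered quickly, but the callbacks reporting the results were delayed by the total blocked time.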
Conclusion: use pgBadger to detect slow queries, and the isBusy module to detect event-loop lag.
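The event-loop lag such tools report can also be measured by hand as timer drift. This is only a sketch of the principle; the isBusy module's actual API may differ:

```javascript
// Measure event-loop lag as the drift between when a timer was scheduled
// to fire and when it actually fired.
function sampleEventLoopLag(intervalMs, onSample) {
  const due = Date.now() + intervalMs;
  setTimeout(() => onSample(Date.now() - due), intervalMs);
}

// Example: warn if the loop is running far behind schedule.
sampleEventLoopLag(50, (lag) => {
  if (lag > 100) console.warn(`event loop lagging by ${lag} ms`);
});
```

If a synchronous chunk like the one in my answer is running when the timer comes due, the measured drift grows by the blocked time, which is what made the requests in the question look slow.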