Extreme performance degradation of Knex select queries on PostgreSQL under many parallel requests

In short

I'm working on a game (a dream project), and my backend stack is Node.js with Knex and PostgreSQL (9.6). I keep all the player data there and need to request it frequently. One of the requests has to run 10 simple selects to pull that data, and this is where the problem starts: if the server handles only 1 request at a time, these queries are quite fast (~1 ms each). But if the server handles many parallel requests (100-400), query execution time degrades dramatically (up to several seconds per query).

Details

To make this more concrete, I'll describe what the request does, the select queries themselves, and the results I measured.

About the system

I run the Node code on a Digital Ocean 4 CPU / 8 GB droplet and Postgres on the same conf (2 different droplets, same configuration).

About the request

It performs some game operations, for which it selects data for 2 players from the database.

DDL

Player data is represented by 5 tables:

    CREATE TABLE public.player_profile (
        id integer NOT NULL DEFAULT nextval('player_profile_id_seq'::regclass),
        public_data integer NOT NULL,
        private_data integer NOT NULL,
        current_active_deck_num smallint NOT NULL DEFAULT '0'::smallint,
        created_at bigint NOT NULL DEFAULT '0'::bigint,
        CONSTRAINT player_profile_pkey PRIMARY KEY (id),
        CONSTRAINT player_profile_private_data_foreign FOREIGN KEY (private_data)
            REFERENCES public.profile_private_data (id) MATCH SIMPLE
            ON UPDATE NO ACTION ON DELETE NO ACTION,
        CONSTRAINT player_profile_public_data_foreign FOREIGN KEY (public_data)
            REFERENCES public.profile_public_data (id) MATCH SIMPLE
            ON UPDATE NO ACTION ON DELETE NO ACTION
    );

    CREATE TABLE public.player_character_data (
        id integer NOT NULL DEFAULT nextval('player_character_data_id_seq'::regclass),
        owner_player integer NOT NULL,
        character_id integer NOT NULL,
        experience_counter integer NOT NULL,
        level_counter integer NOT NULL,
        character_name character varying(255) COLLATE pg_catalog."default" NOT NULL,
        created_at bigint NOT NULL DEFAULT '0'::bigint,
        CONSTRAINT player_character_data_pkey PRIMARY KEY (id),
        CONSTRAINT player_character_data_owner_player_foreign FOREIGN KEY (owner_player)
            REFERENCES public.player_profile (id) MATCH SIMPLE
            ON UPDATE NO ACTION ON DELETE NO ACTION
    );

    CREATE TABLE public.player_cards (
        id integer NOT NULL DEFAULT nextval('player_cards_id_seq'::regclass),
        card_id integer NOT NULL,
        owner_player integer NOT NULL,
        card_level integer NOT NULL,
        first_deck boolean NOT NULL,
        consumables integer NOT NULL,
        second_deck boolean NOT NULL DEFAULT false,
        third_deck boolean NOT NULL DEFAULT false,
        quality character varying(10) COLLATE pg_catalog."default" NOT NULL DEFAULT 'none'::character varying,
        CONSTRAINT player_cards_pkey PRIMARY KEY (id),
        CONSTRAINT player_cards_owner_player_foreign FOREIGN KEY (owner_player)
            REFERENCES public.player_profile (id) MATCH SIMPLE
            ON UPDATE NO ACTION ON DELETE NO ACTION
    );

    CREATE TABLE public.player_character_equipment (
        id integer NOT NULL DEFAULT nextval('player_character_equipment_id_seq'::regclass),
        owner_character integer NOT NULL,
        item_id integer NOT NULL,
        item_level integer NOT NULL,
        item_type character varying(20) COLLATE pg_catalog."default" NOT NULL,
        is_equipped boolean NOT NULL,
        slot_num integer,
        CONSTRAINT player_character_equipment_pkey PRIMARY KEY (id),
        CONSTRAINT player_character_equipment_owner_character_foreign FOREIGN KEY (owner_character)
            REFERENCES public.player_character_data (id) MATCH SIMPLE
            ON UPDATE NO ACTION ON DELETE NO ACTION
    );

    CREATE TABLE public.player_character_runes (
        id integer NOT NULL DEFAULT nextval('player_character_runes_id_seq'::regclass),
        owner_character integer NOT NULL,
        item_id integer NOT NULL,
        slot_num integer,
        decay_start_timestamp bigint,
        CONSTRAINT player_character_runes_pkey PRIMARY KEY (id),
        CONSTRAINT player_character_runes_owner_character_foreign FOREIGN KEY (owner_character)
            REFERENCES public.player_character_data (id) MATCH SIMPLE
            ON UPDATE NO ACTION ON DELETE NO ACTION
    );

With these indexes:

    knex.raw('create index "player_cards_owner_player_first_deck_index" on "player_cards" ("owner_player") WHERE first_deck = TRUE');
    knex.raw('create index "player_cards_owner_player_second_deck_index" on "player_cards" ("owner_player") WHERE second_deck = TRUE');
    knex.raw('create index "player_cards_owner_player_third_deck_index" on "player_cards" ("owner_player") WHERE third_deck = TRUE');
    knex.raw('create index "player_character_equipment_owner_character_is_equipped_index" on "player_character_equipment" ("owner_character") WHERE is_equipped = TRUE');
    knex.raw('create index "player_character_runes_owner_character_slot_num_not_null_index" on "player_character_runes" ("owner_character") WHERE slot_num IS NOT NULL');

Code

First query

    async.parallel([
        cb => tx('player_character_data')
            .select('character_id', 'id')
            .where('owner_player', playerId)
            .limit(1)
            .asCallback(cb),
        cb => tx('player_character_data')
            .select('character_id', 'id')
            .where('owner_player', enemyId)
            .limit(1)
            .asCallback(cb)
    ], callbackFn);

Second query

    async.parallel([
        cb => tx('player_profile')
            .select('current_active_deck_num')
            .where('id', playerId)
            .asCallback(cb),
        cb => tx('player_profile')
            .select('current_active_deck_num')
            .where('id', enemyId)
            .asCallback(cb)
    ], callbackFn);

Third query

    const playerQ = { first_deck: true };
    const enemyQ = { first_deck: true };
    const MAX_CARDS_IN_DECK = 5;

    async.parallel([
        cb => tx('player_cards')
            .select('card_id', 'card_level')
            .where('owner_player', playerId)
            .andWhere(playerQ)
            .limit(MAX_CARDS_IN_DECK)
            .asCallback(cb),
        cb => tx('player_cards')
            .select('card_id', 'card_level')
            .where('owner_player', enemyId)
            .andWhere(enemyQ)
            .limit(MAX_CARDS_IN_DECK)
            .asCallback(cb)
    ], callbackFn);

Fourth query

    const MAX_EQUIPPED_ITEMS = 3;

    async.parallel([
        cb => tx('player_character_equipment')
            .select('item_id', 'item_level')
            .where('owner_character', playerCharacterUniqueId)
            .andWhere('is_equipped', true)
            .limit(MAX_EQUIPPED_ITEMS)
            .asCallback(cb),
        cb => tx('player_character_equipment')
            .select('item_id', 'item_level')
            .where('owner_character', enemyCharacterUniqueId)
            .andWhere('is_equipped', true)
            .limit(MAX_EQUIPPED_ITEMS)
            .asCallback(cb)
    ], callbackFn);

Fifth query

    const runeSlotsMax = 3;

    async.parallel([
        cb => tx('player_character_runes')
            .select('item_id', 'decay_start_timestamp')
            .where('owner_character', playerCharacterUniqueId)
            .whereNotNull('slot_num')
            .limit(runeSlotsMax)
            .asCallback(cb),
        cb => tx('player_character_runes')
            .select('item_id', 'decay_start_timestamp')
            .where('owner_character', enemyCharacterUniqueId)
            .whereNotNull('slot_num')
            .limit(runeSlotsMax)
            .asCallback(cb)
    ], callbackFn);

EXPLAIN (ANALYZE)

Index scans only, with planning and execution times under 1 ms. I can post them if needed (omitted here to save space).
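For anyone reproducing the check: the plans can be pulled straight through Knex with a raw EXPLAIN. A minimal sketch, assuming `knex` is an initialized instance and the player id is a placeholder:

    // Sketch: run EXPLAIN (ANALYZE) through Knex to inspect one of the plans.
    async function explainFirstQuery(knex, playerId) {
        const result = await knex.raw(
            `EXPLAIN (ANALYZE, BUFFERS)
             SELECT character_id, id
             FROM player_character_data
             WHERE owner_player = ?
             LIMIT 1`,
            [playerId]
        );
        // With the pg client, raw() resolves to a pg Result; each row holds
        // one line of the plan in the "QUERY PLAN" column.
        result.rows.forEach(row => console.log(row['QUERY PLAN']));
    }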

The timings

Total is the number of requests; min / max / avg / median are response times in milliseconds.

  • 4 concurrent requests: { "total": 300, "avg": 1.81, "median": 2, "min": 1, "max": 6 }
  • 400 concurrent requests:
    • { "total": 300, "avg": 209.57666666666665, "median": 176, "min": 9, "max": 1683 }
    • { "total": 300, "avg": 2105.9, "median": 2005, "min": 1563, "max": 4074 } – for the last select

I tried logging queries slower than 100 ms on the database side – nothing showed up. I also tried increasing the connection pool size up to the number of parallel requests – again nothing.
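For completeness, both knobs look roughly like this (a sketch; the pool values are illustrative, not exactly what I used):

    // Sketch: Knex pool sized toward the number of parallel requests
    // (values illustrative).
    const knex = require('knex')({
        client: 'pg',
        connection: process.env.DATABASE_URL,
        pool: { min: 2, max: 400 }
    });

    // Slow-query logging is a server-side setting in postgresql.conf:
    //   log_min_duration_statement = 100   # log statements slower than 100 ms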

I can see three potential problems here:

  1. 400 concurrent requests is actually quite a lot, and your machine specs are nothing to get excited about. Maybe it's my MSSQL background speaking, but I can imagine this being a case where the hardware needs beefing up.
  2. Communication between the two servers should be fast, but it could still add some latency. One powerful server might be a better solution.
  3. I assume you have a reasonable amount of data (400 concurrent connections suggests plenty of it). It might be useful to post some of the SQL that actually gets generated – a lot depends on what Knex produces, and there may be optimizations available. Indexes come to mind, but one would need to see the SQL to confirm (see the sketch after this list).
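Knex can print the SQL it generates without touching the database; a minimal sketch for the deck query (the id and limit values are placeholders):

    // Sketch: inspect the SQL Knex generates (id and limit are placeholders).
    const query = knex('player_cards')
        .select('card_id', 'card_level')
        .where('owner_player', 1)
        .andWhere({ first_deck: true })
        .limit(5);

    console.log(query.toString());   // fully bound SQL string
    console.log(query.toSQL().sql);  // parameterized SQL with placeholders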

Your testing doesn't seem to include network latency from the clients, so that may be yet another issue you haven't tackled.

A solution was found quickly, but I forgot to reply here (I was busy, sorry).

There was no magic behind the slow queries, just the nature of Node's event loop:

  • all these similar requests ran in parallel;
  • I had one code block with a very slow execution time (~150-200 ms);
  • with ~800 parallel requests, a 150 ms code block turns into ~10,000 ms of event loop lag (see the sketch after this list);
  • all you see is the appearance of slow requests, but it's just a delayed callback, not database lag.
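Here is a contrived sketch of the effect (plain Node, no database; the 150 ms busy loop stands in for my slow code block): ten queued "requests" each block synchronously, so the last callback fires ~1.5 s after the start even though no single piece of work is slow.

    // Contrived sketch: a synchronous ~150 ms block per "request".
    // Queue several at once and the later callbacks must wait for all
    // earlier blocks to finish first.
    function blockFor(ms) {
        const end = Date.now() + ms;
        while (Date.now() < end) {} // synchronous: nothing else can run
    }

    const start = Date.now();
    for (let i = 0; i < 10; i++) {
        setImmediate(() => {
            blockFor(150); // stands in for my slow code block
            console.log(`request ${i} done after ${Date.now() - start} ms`);
        });
    }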

Conclusion: use pgBadger to detect slow queries, and use the isBusy module to detect event loop lag.
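For the event-loop side you don't strictly need a module; a minimal hand-rolled lag probe works too (the interval and threshold below are arbitrary choices):

    // Minimal event-loop lag probe: a timer that should fire every 100 ms.
    // If it fires late, the event loop was blocked for roughly the difference.
    let last = Date.now();
    setInterval(() => {
        const lag = Date.now() - last - 100;
        if (lag > 50) console.warn(`event loop lag: ~${lag} ms`);
        last = Date.now();
    }, 100);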