
Ken started Quickjs backed indexer processes freeze #5342

Open

aduchate opened this issue Nov 25, 2024 · 3 comments

Comments


aduchate commented Nov 25, 2024

Description

On a cluster (3.4.2) of 6 nodes with a fairly large number of databases (~1500, between 1 GB and 150 GB each), we recently added and modified about 20 design docs per database (roughly 30,000 design docs in total). We have set up ken with a concurrency of 5 to let the indexing happen.

About every 10 minutes, we see one indexer process that stops being updated. It basically stays stuck forever (we let a few linger for 24 hours). Killing all couchjs_mainjs processes has no impact on the stuck indexer. The only way to get rid of it is to issue exit(Pid, kill). in remsh (a sketch follows below). Pid here is the pid field of /_active_tasks, not indexer_pid.
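For illustration, a minimal remsh sketch of how we kill a stuck task; the pid string below is just an example value, the real one comes from the pid field of /_active_tasks:

Pid = list_to_pid("<0.83257742.0>").  % example value; use the pid field from /_active_tasks
exit(Pid, kill).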

Steps to Reproduce

Create a lot of databases with a lot of data, create a few design documents per database, and start ken.

Expected Behaviour

The indexers shouldn't get stuck

Your Environment

  • CouchDB version used: 3.4.2
  • Browser name and version: irrelevant
  • Operating system and version: Ubuntu 22.04 (CouchDB compiled with the Dockerfile below)
FROM ubuntu:22.04

# Create app directory
WORKDIR /root

# Install dependencies
RUN apt-get update
RUN ln -fs /usr/share/zoneinfo/Europe/Brussels /etc/localtime
RUN DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends tzdata
RUN dpkg-reconfigure --frontend noninteractive tzdata
RUN apt-get install -y gnupg2 wget curl vim git build-essential pkg-config libicu-dev libmozjs-78-dev libcurl4-openssl-dev libncurses-dev node-gyp npm libssl-dev help2man openjdk-21-jdk-headless
RUN git clone https://github.com/erlang/otp otp_src_27.1.2
WORKDIR /root/otp_src_27.1.2
RUN git checkout -b 27.1.2 44ffe8811dfcf3d2fe04d530c6e8fac5ca384e02
RUN bash -c 'export ERL_TOP=`pwd`; export LANG=C; ./configure; make; make release_tests; cd release/tests/test_server; /root/otp_src_27.1.2/bin/erl -s ts install -s ts smoke_test batch -s init stop; cd ..; tar zcvf otp-tests.tgz test_server; cd /root/otp_src_27.1.2; make install'
WORKDIR /root
RUN git clone https://github.com/apache/couchdb.git #9
WORKDIR /root/couchdb
RUN git checkout -b 3.4.2 6e5ad2a5c5479cb09722b4a7d13b3d59b7bb2a23
RUN bash -c 'curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.1/install.sh | bash ; . /root/.nvm/nvm.sh; nvm install 18'
RUN ./configure  --disable-docs --spidermonkey-version 78 chdir=/media/data/src/couchdb
RUN bash -c '. /root/.nvm/nvm.sh; nvm use 18; make release'
WORKDIR /root/couchdb/rel
RUN tar zcvf couchdb.jammy-jellyfish.3.4.2.tgz couchdb

Additional Context

We can give you access to the infrastructure that causes the problem to happen if needed.

aduchate changed the title from "Ken started Quickjs backed indexer processes freezes" to "Ken started Quickjs backed indexer processes freeze" on Nov 25, 2024

nickva commented Nov 25, 2024

Thanks for reaching out, Antoine.

  • Do you see any crashes or errors in the logs?

  • QuickJS can build indexes a bit faster, but that could mean putting more pressure on the disk or other resources. Do you see increased CPU, disk I/O (perhaps an I/O quota being reached?) or memory usage during that time?

  • If your design documents or documents are large, it's sometimes possible to hit the maximum stack size in the JS engine. For QuickJS that can be adjusted with memory_limit_bytes (see the sketch after this list). Though in that case I'd expect to see repeated crashes in the logs with a segfault or a memory limit error.

  • Before killing the process, try to gather some stats if you can remsh in:

    • erlang:process_info(Pid). in the remsh shell should give us some idea of what the process is doing.
    • These three commands will show the top 3 processes by reductions (busiest), message queue length and memory usage:
recon:proc_window(reductions, 3, 5000).
recon:proc_window(message_queue_len, 3, 5000).
recon:proc_window(memory, 3, 5000).
  • Another option: if the number of couchjs processes has reached the os_process_limit on any node, you could consider increasing that a bit (also shown in the sketch after this list).
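A hedged remsh sketch of those two config tweaks. os_process_limit lives under [query_server_config]; the [quickjs] section name for memory_limit_bytes is an assumption here, and both values are purely illustrative, so adjust them to your setup:

%% Assumed section name for the QuickJS memory limit; check the docs for your version
config:set("quickjs", "memory_limit_bytes", "1073741824").  % 1 GiB, illustrative value
%% Raise the couchjs OS process limit a bit if it is being hit
config:set("query_server_config", "os_process_limit", "120").  % illustrative value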

@aduchate (Author)

Hi nickva,

  • No crashes in the logs at all.
  • The parallelism of ken was limited to 5, which ensured that CPU usage stayed very low on the 48-core machine.
  • I'll try raising memory_limit_bytes, though I doubt it will improve things, because subsequent indexations of the same view succeed.

Below is the requested information from remsh:

([email protected])2> erlang:process_info(<0.83257742.0>).
[{current_function,{gen,do_call,4}},
 {initial_call,{erlang,apply,2}},
 {status,waiting},
 {message_queue_len,0},
 {links,[<0.83311580.0>]},
 {dictionary,
     [{task_status_props,
          [{changes_done,622766},
           {database,
               <<"shards/a0000000-bfffffff/icure-ms-wgc-nieuwegent-614fc45"...>>},
           {design_document,<<"_design/Contact-DataOwner">>},
           {indexer_pid,<<"<0.4441283.0>">>},
           {progress,50},
           {started_on,1732624104},
           {total_changes,1223768},
           {type,indexer},
           {updated_on,1732624464}]},
      {io_priority,
          {view_update,
              <<"shards/a0000000-bfffffff/icure-ms-wgc-nieuwegent-614fc455-b5"...>>,
              <<"_design/Contact-DataOwner">>}},
      {task_status_update,{{0,0,0},500000}}]},
 {trap_exit,false},
 {error_handler,error_handler},
 {priority,normal},
 {group_leader,<0.252.0>},
 {total_heap_size,833025},
 {heap_size,318187},
 {stack_size,16},
 {reductions,86130223},
 {garbage_collection,
     [{max_heap_size,
          #{error_logger => true,include_shared_binaries => false,
            kill => true,size => 0}},
      {min_bin_vheap_size,46422},
      {min_heap_size,233},
      {fullsweep_after,65535},
      {minor_gcs,1217}]},
 {suspending,[]}]
([email protected])3> recon:proc_window(reductions, 3, 5000).
[{<0.172400993.0>,64192268,
  [{current_function,{gen,do_call,4}},
   {initial_call,{couch_index_updater,update,3}}]},
 {<0.170530515.0>,61520321,
  [{current_function,{gen,do_call,4}},
   {initial_call,{couch_index_updater,update,3}}]},
 {<0.172131409.0>,60832785,
  [{current_function,{gen,do_call,4}},
   {initial_call,{couch_index_updater,update,3}}]}]
([email protected])4> recon:proc_window(message_queue_len, 3, 5000).
[{<0.554738.0>,1,
  [{current_function,{gen_server,loop,7}},
   {initial_call,{proc_lib,init_p,5}}]},
 {<0.21585.0>,1,
  [{current_function,{erts_internal,dirty_nif_finalizer,1}},
   {initial_call,{proc_lib,init_p,5}}]},
 {<0.3.0>,1,
  [{current_function,
       {erts_dirty_process_signal_handler,msg_loop,0}},
   {initial_call,
       {erts_dirty_process_signal_handler,start,0}}]}]
([email protected])5> recon:proc_window(memory, 3, 5000).
[{<0.172416530.0>,4721072,
  [{current_function,{gen,do_call,4}},
   {initial_call,{couch_bt_engine_compactor,start,4}}]},
 {<0.172600591.0>,4720504,
  [{current_function,{gen_server,loop,7}},
   {initial_call,{proc_lib,init_p,5}}]},
 {<0.4328.0>,4032976,
  [{current_function,{gen_server,loop,7}},
   {initial_call,{proc_lib,init_p,5}}]}]


nickva commented Nov 26, 2024

Thanks @aduchate.

Hmm, so far everything looks good.

couch_index_updater:update/3 processes seem to be the busiest on the cluster, but they are sitting in gen:do_call, so they are waiting on a response from another process.

  • Try to get erlang:process_info(Pid, current_stacktrace). along with erlang:process_info(Pid). for the stuck indexer, and maybe for the indexer updater process as well (see the sketch after this list).

  • What about the {progress,50}, {total_changes,1223768}, {type,indexer}, {updated_on,1732624464} values on the stuck indexer: do they get updated periodically? What about disk I/O and CPU usage: does it seem like anything is being used significantly?

  • Is it always the same db/index that gets stuck, or is it a random one each time?

  • Regarding RUN git clone https://github.com/erlang/otp otp_src_27.1.2: we do support and test with Erlang 27, however our packages still ship with 25. Would it be possible to see if 25 behaves the same way?

  • Does running the QuickJS scanner show any discrepancies or crashes in the logs against the cluster, specifically for the stuck indexes? https://docs.couchdb.org/en/stable/best-practices/jsdevel.html#scanning-for-quickjs-incompatibilities

  • Do the indexes build with the SpiderMonkey engine? (Make sure to restart the nodes after toggling the config setting, as in 3.4.2 the engine setting doesn't update at runtime.)
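A minimal remsh sketch of the stack-trace gathering suggested above. The pid strings are just the ones from your earlier output, used as examples; substitute whatever /_active_tasks reports for the stuck task:

Pid = list_to_pid("<0.83257742.0>").            % pid field from /_active_tasks
erlang:process_info(Pid, current_stacktrace).
erlang:process_info(Pid).
IndexerPid = list_to_pid("<0.4441283.0>").      % indexer_pid from the task status
erlang:process_info(IndexerPid, current_stacktrace).
erlang:process_info(IndexerPid).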
