  1. OpenDJ
  2. OPENDJ-5611

Change number indexing can lag behind replication under extreme load

    Details

    • Story Points:
      3

      Description

      While replication is able to replay updates at around 25k ops/sec, change number indexing currently maxes out at around 12-13k changes/s.

      Under very high throughput sustained for a very long time, the indexer therefore falls continuously further behind the replayed updates. This prevents purging (only indexed changes are purged) and contributes to filling the disks.
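
As a rough illustration of the gap described above, the sketch below estimates how quickly the un-indexed backlog could grow when replay outpaces indexing. It uses the approximate peak figures quoted in this ticket and assumes both rates stay constant, which is a simplification; the function name `backlog_after` is made up for this example. The result should be read as an upper bound, not a measurement.

```python
def backlog_after(hours, replay_rate=25_000, index_rate=12_500):
    """Changes replayed but not yet indexed after `hours` of sustained load.

    Assumes both rates (changes per second) are constant. The defaults are
    the approximate peak figures quoted in the ticket, not measured averages.
    """
    growth_per_sec = max(replay_rate - index_rate, 0)
    return int(growth_per_sec * hours * 3600)


if __name__ == "__main__":
    for h in (1, 6, 12):
        print(f"{h:>2}h of load -> ~{backlog_after(h):,} un-indexed changes")
```

Since purging only removes indexed changes, a backlog of this kind translates directly into changelog data that cannot yet be purged.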

      That said, using the change number indexer is not a valid use case for AM CTS: the CTS does not need updates to tokens to be indexed.

      Global information

      Problem seen when running the four replicas system test (scenario3).
      The test ran for 12h, with instances being stopped and restarted.

      Topology

      4 DJ instances running on AWS VM instances
      https://wikis.forgerock.org/confluence/display/QA/Four+replicas#Fourreplicas-Topology

      Scenario

      https://wikis.forgerock.org/confluence/display/QA/Four+replicas#Fourreplicas-Scenario3:backup-restoreinstances
      DS2 is stopped for more than 3 hours

      Product information

      • revision : c0eaa20c9df
      • build id : 2018-10-18 16:12:47

      Observations

      Test durations

      • test starts: 2018-10-22 16:43:26
      • load stops : 2018-10-23 05:34:00
      • test stops : 2018-10-23 07:35:39
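
For reference, the elapsed times implied by the timestamps above can be worked out with a few lines of Python. This is only arithmetic on the three values listed; the variable names are chosen for this example.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps copied from the "Test durations" list above.
test_start = datetime.strptime("2018-10-22 16:43:26", FMT)
load_stop = datetime.strptime("2018-10-23 05:34:00", FMT)
test_stop = datetime.strptime("2018-10-23 07:35:39", FMT)

load_duration = load_stop - test_start    # time spent under LDAP load
drain_window = test_stop - load_stop      # time left to catch up after load
total_duration = test_stop - test_start   # whole test run

if __name__ == "__main__":
    print("load duration :", load_duration)
    print("drain window  :", drain_window)
    print("total duration:", total_duration)
```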

      Grafana dashboards

      Replayed update

      • During the run, the replayed update Grafana graphs show throughput of up to 25k ops/sec
      • There is some activity between 5:34 and 7:00 (after the LDAP load stopped)
      • No more activity after 7:00

      Current delay

      • shows activity up to 6:50
      • no more activity after 6:50

      Last change numbers

      10:00 am

      • DS1 : host1 1389 : lastchangenumber: 369950257
      • DS2 : host2 1389 : lastchangenumber: 307872848
      • DS3 : host3 1389 : lastchangenumber: 369950257
      • DS4 : host4 1389 : lastchangenumber: 369950257
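
The gap between DS2 and the other replicas can be quantified from the values above. The sketch below computes each replica's distance from the highest change number and a rough catch-up time. It assumes an indexing rate of 12k changes/sec and that no new changes arrive while DS2 catches up; both are assumptions made for this estimate, and the function names are made up.

```python
# lastchangenumber values copied from the list above (10:00 am reading).
last_change_numbers = {
    "DS1": 369950257,
    "DS2": 307872848,
    "DS3": 369950257,
    "DS4": 369950257,
}


def lag_behind_leader(numbers):
    """Distance of each replica from the highest change number seen."""
    leader = max(numbers.values())
    return {name: leader - value for name, value in numbers.items()}


def catch_up_seconds(lag, index_rate=12_000):
    """Seconds needed to index `lag` changes at an assumed constant rate."""
    return lag / index_rate


if __name__ == "__main__":
    for name, lag in lag_behind_leader(last_change_numbers).items():
        minutes = catch_up_seconds(lag) / 60
        print(f"{name}: {lag:,} changes behind (~{minutes:.0f} min at 12k/s)")
```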

      Manual monitoring suggests that indexing manages about 12k ops/sec.
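
A manual measurement of this kind can be reduced to sampling the last change number at intervals and dividing the difference by the elapsed time. The sketch below shows that calculation. `read_last_change_number` is a hypothetical callable standing in for however the lastchangenumber value is actually read from a server; it is not an OpenDJ API, and the helper names are made up for this example.

```python
import time


def collect_samples(read_last_change_number, count, interval_s=10.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll `read_last_change_number()` `count` times, `interval_s` apart.

    Returns a list of (timestamp_seconds, last_change_number) tuples.
    `clock` and `sleep` are injectable so the function can be tested
    without waiting.
    """
    samples = []
    for i in range(count):
        samples.append((clock(), read_last_change_number()))
        if i < count - 1:
            sleep(interval_s)
    return samples


def indexing_rate(samples):
    """Average changes indexed per second between first and last sample."""
    (t0, n0), (t1, n1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    return (n1 - n0) / (t1 - t0)
```

Averaging over the first and last sample smooths out short pauses in the indexer; a longer sampling window gives a steadier figure.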

      Comments

      • Replay of updates can manage 25k ops/sec, while indexing manages only about 12k ops/sec
      • Indexing activity is not available in the Grafana dashboard

      How to run the test

      $ scripts/stress/run/run_stress.py -c perf -s four_replicas.scenario3 --duration 12h --num_users 5000000 --monitoring_interval 180 --concurrency 10000 --dj_version 6.5.0-SNAPSHOT --cfg config/config_4replicas.cfg OpenDJ

      config_4replicas_aws.cfg

       

        Attachments

        1. 700_indexing.png (194 kB)
        2. 700_nb_entries.png (61 kB)
        3. 700_replication.png (716 kB)
        4. 700_report.png (768 kB)
        5. config_4replicas_aws.cfg (7 kB)
        6. Screen Shot 2018-10-23 at 11.46.00.png (321 kB)
        7. Screen Shot 2018-10-23 at 11.47.11.png (161 kB)
        8. Screen Shot 2018-10-23 at 11.47.37.png (713 kB)
        9. Screen Shot 2018-10-23 at 14.41.24.png (510 kB)

              People

              • Assignee:
                ylecaillez Yannick Lecaillez
                Reporter:
                guillaume.andru Guillaume Andru
                QA Assignee:
                Guillaume Andru