OpenIDM / OPENIDM-10576

Performance degradation seen on managed user CUD operations


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Not a defect
    • Affects Version/s: OpenIDM 6.0.0
    • Fix Version/s: None
    • Component/s: Performance
    • Labels:
    • Environment:
      IDM master builds starting from rev b71c794, running on a CentOS 7 machine with MySQL as the repo.

      Description

      Performance degradation is seen across CUD operations on managed users. Normally we can run CRUD at a 300 users/second dispatch rate over 900 seconds with good response times (~100 ms).
      Now, with the same dispatch rate and duration, Gatling create-with-PUT requests show much higher response times:

      ---- Global Information --------------------------------------------------------
      > request count                                     270000 (OK=270000 KO=0     )
      > min response time                                    288 (OK=288    KO=-     )
      > max response time                                  41280 (OK=41280  KO=-     )
      > mean response time                                 18505 (OK=18505  KO=-     )
      > std deviation                                      18992 (OK=18992  KO=-     )
      > response time 50th percentile                      16196 (OK=16196  KO=-     )
      > response time 95th percentile                      37445 (OK=37445  KO=-     )
      > response time 99th percentile                      39531 (OK=39531  KO=-     )
      > response time 99.9th percentile                    40214 (OK=40214  KO=-     )
      > mean requests/sec                                287.234 (OK=287.234 KO=-     )
      ---- Response Time Distribution ------------------------------------------------
      > t < 800 ms                                            34 (  0%)
      > 800 ms < t < 1200 ms                                  62 (  0%)
      > t > 1200 ms                                       269904 (100%)
      > failed                                                 0 (  0%)
      

      compared with a mean response time of about 100 ms before the changes.
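
      For reference, the create phase amounts to an idempotent PUT per user against the managed user endpoint. The following is a minimal sketch of one such request; the host, credentials, and payload fields are assumptions for illustration, not the actual pyforge/Gatling configuration:

      # Minimal sketch of a single managed-user create via PUT, mirroring what each
      # Gatling virtual user does in the Create Put scenario. Host, credentials, and
      # payload fields are assumptions, not the actual test configuration.
      import requests

      BASE_URL = "http://localhost:8080/openidm"    # assumed IDM endpoint
      HEADERS = {
          "X-OpenIDM-Username": "openidm-admin",    # assumed default admin credentials
          "X-OpenIDM-Password": "openidm-admin",
          "Content-Type": "application/json",
          "If-None-Match": "*",                     # PUT acts as create-only when the id is new
      }

      def create_user(user_id: str) -> float:
          """PUT managed/user/{user_id} and return the response time in milliseconds."""
          payload = {
              "userName": f"perfuser{user_id}",
              "givenName": "Perf",
              "sn": "User",
              "mail": f"perfuser{user_id}@example.com",
              "password": "Passw0rd1",
          }
          resp = requests.put(f"{BASE_URL}/managed/user/{user_id}",
                              json=payload, headers=HEADERS, timeout=60)
          resp.raise_for_status()
          return resp.elapsed.total_seconds() * 1000

      if __name__ == "__main__":
          print(f"create PUT took {create_user('perf-0001'):.0f} ms")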

      Update and delete requests time out.

      For update, the first 72% of requests were OK; requests timed out after that:
      2018-04-03 06:37:27                                         715s elapsed
      ---- ManUsers Update Put -------------------------------------------------------
      [#####################################################-----                ] 72%
               waiting: 55984  / active: 17658  / done:196358
      ---- Requests ------------------------------------------------------------------
      > Global                                                   (OK=196358 KO=0     )
      > Update PUT Request                                       (OK=196358 KO=0     )
      ================================================================================
      
      06:37:27.565 [DEBUG] i.g.h.a.AsyncHandler - Request 'Update PUT Request' failed for user 196046
      java.util.concurrent.TimeoutException: Request timeout to 
      
      For delete, the first 15% of requests were OK, and requests timed out after that:
      2018-04-03 06:45:05                                         200s elapsed
      ---- ManUsers Delete -----------------------------------------------------------
      [###########-----                                                          ] 15%
               waiting: 210523 / active: 17436  / done:42041
      ---- Requests ------------------------------------------------------------------
      > Global                                                   (OK=42041  KO=0     )
      > Delete Request                                           (OK=42041  KO=0     )
      ================================================================================
      
      06:45:09.878 [DEBUG] i.g.h.a.AsyncHandler - Request 'Delete Request' failed for user 42693
      java.util.concurrent.TimeoutException: Request timeout to etorki.internal.forgerock.com/172.16.204.4:8080 after 60000ms
          ...
      

      Lowering the dispatch rate to 250 users/second over a 900-second duration, the create and update tests both passed with good response times, but the delete test still failed with request timeouts. All tests do pass with good response times at a dispatch rate of 225 users/second over the 900-second duration.

      2018-04-03 19:50:26                                         325s elapsed
      ---- ManUsers Delete -----------------------------------------------------------
      [#####################-----                                                ] 29%
               waiting: 144070 / active: 14674  / done:66256
      ---- Requests ------------------------------------------------------------------
      > Global                                                   (OK=66256  KO=0     )
      > Delete Request                                           (OK=66256  KO=0     )
      ================================================================================
      
      19:50:29.083 [DEBUG] i.g.h.a.AsyncHandler - Request 'Delete Request' failed for user 66493
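
      For clarity, the update and delete phases correspond to a plain PUT (with If-Match) and a DELETE per user id against the same endpoint. A hedged sketch under the same assumed host and credentials as the create example above:

      # Hedged sketch of the update (PUT with If-Match) and delete phases for a
      # single managed user. Host and credentials are assumptions, as in the
      # create example above.
      import requests

      BASE_URL = "http://localhost:8080/openidm"    # assumed IDM endpoint
      AUTH = {
          "X-OpenIDM-Username": "openidm-admin",    # assumed default admin credentials
          "X-OpenIDM-Password": "openidm-admin",
      }

      def update_user(user_id: str, changes: dict) -> None:
          """Read the current object, merge in the changes, and PUT the full object back."""
          url = f"{BASE_URL}/managed/user/{user_id}"
          current = requests.get(url, headers=AUTH, timeout=60)
          current.raise_for_status()
          resp = requests.put(
              url,
              json={**current.json(), **changes},
              headers={**AUTH,
                       "Content-Type": "application/json",
                       "If-Match": current.headers.get("ETag", "*")},  # MVCC revision check
              timeout=60,
          )
          resp.raise_for_status()

      def delete_user(user_id: str) -> None:
          """DELETE the managed user; If-Match: * skips the revision check."""
          resp = requests.delete(f"{BASE_URL}/managed/user/{user_id}",
                                 headers={**AUTH, "If-Match": "*"}, timeout=60)
          resp.raise_for_status()

      if __name__ == "__main__":
          update_user("perf-0001", {"description": "updated by perf test"})
          delete_user("perf-0001")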
      

      Notes:
      1. Tests identified that the issue was introduced by commit b71c7949f4dd08126db22f908ff4b1ccd3ab3fe5 for OPENIDM-10075.
      2. The symptom is seen with both the JDBC repo and the embedded DJ repo, with and without roles, and with differing degradation severity.
      3. A jstack trace was collected during the delete test and is attached. idm-metrics from commit 6ead915 (immediately before b71c7949) and from the b71c7949 PR are also attached.
      4. As Alin Brici and Dirk Hogan suggested, tried making trackedProperties empty in managed.json:

      "trackedProperties" : []
      

      This showed a performance improvement on update, but not on create with PUT or on delete.

      When the whole meta field was removed from managed.json:

      "meta" : {
                   "property" : "_meta",
                   "resourceCollection" : "internal/usermeta",
                   "trackedProperties" : [
                   ]
                 }
      

      The tests passed and the CUD response times returned to the normal levels for IDM master.
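
      For comparison, the stock IDM 6.0 managed.json ships with meta tracking enabled, so the configuration under test presumably looked something like the following. The tracked property names here are an assumption based on the 6.0 defaults, not taken from this ticket:

      "meta" : {
          "property" : "_meta",
          "resourceCollection" : "internal/usermeta",
          "trackedProperties" : [
              "createDate",
              "lastChanged"
          ]
      }
      (the two property names above are assumed defaults)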

      To reproduce using pyforge:
      1. Set the appropriate stress level in config/config.cfg.
      2. Run the following tests:

      nohup python -u run-pybot.py -c stress -i gatling -s *CrudManUsers -t IDM_Managed_User_Create_Put -t IDM_Managed_User_Update -t IDM_Managed_User_Delete  OpenIDM
      

      3. Observe the symptoms.

        Attachments

        1. b57-b58.csv
          7 kB
        2. idm.jstack
          2.98 MB
        3. idm-metrics.6ead915
          54 kB
        4. idm-metrics-after-create
          35 kB
        5. idm-metrics-after-delete
          58 kB
        6. idm-metrics-after-update
          58 kB
        7. openidm0.log.3
          5.01 MB


            People

            Assignee:
            brmiller Brendan Miller
            Reporter:
            Tinghua.Xu Tinghua Xu
            Votes:
            0
            Watchers:
            4
