[OPENIDM-14685] Mean response increased significantly for updating and deleting managed users in stress mode test Created: 05/May/20  Updated: 11/Aug/20

Status: Open
Project: OpenIDM
Component/s: Module - Managed Objects
Affects Version/s: 7.0.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Tinghua Xu Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: CLARK, Performance, regression
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

IDM latest master runs on a CentOS7, Java11, MySQL as repo


Attachments: File config.cfg     File delete.metrics-7.0.0     File delete.metrics-dj-6.5.0.3     File delete.metrics-dj-7.0.0     File idm-metrics-6.5.0     File update.metrics-7.0.0     File update.metrics-dj-6.5.0.3     File update.metrics-dj-7.0.0    
Target Version/s:

 Description   

The test first creates managed users by sending 300 requests per second for 900 seconds, then update the users, then delete the users, the test shows an increase in mean response times from IDM 6.5.0.3 to  7.0.0 for both update and delete:

  6.5.0.3 7.0.0 difference in percentage
Update 58 112 93.10344828
Delete 61 158 159.0163934

 

Attached a metrics collected for the update part(the metrics is not enabled before update), no delete was done after disable metrics in this test.

To reproduce using pyforge(pyforge config for 7.0.0 is attached), run command:

run-pybot.py -v -c stress -s idm.crud_on_managed_users.CrudManUsers OpenIDM 

 

Note that: 

Didn't test this for DJ as repo neither postgres

We fixed similar issue earlier: https://bugster.forgerock.org/jira/browse/OPENIDM-14312

Also see mean response time increases(throughput drop) in perf mode simple managed users update test as well:

  6.5.0.3 6.5.0.3 7.0.0 7.0.0  
  throughput mean response times throughput mean response times Throughput difference
Update 356 56 314 63 -11.79775281

 

 



 Comments   
Comment by Dirk Hogan [ 06/May/20 ]

 
A first analysis of the big differences in the metrics:

For 270,000 invocations for the below metrics:

metric 6.5 mean rate 6.5 total 7.0 mean rate 7.0 total
managed.user.update 298 11942467 140 18976190
repo.jdbc.update.managed_user 298 3602021 140 4704638
repo.jdbc.update.internal_usermeta 298 2109151 140 3092320

 

At first blush, a significant element in the increased-latency/decreased-throughput is due to repo manipulations.

HOWEVER: right now, I'm wondering if the mean rate is not indicative of the latency of the call itself, but rather the rate at which e.g. managed.user.update is itself called. This is a crucial difference: if it is the former, then it means the repo layer is the cause; if it is the latter, it means the layers above are the cause. I see this same pattern for the DS-related issues, where DS-version changes (6.5/7) made no difference. So it seems the latency results from the layers above. I see this somewhat confirmed in the metrics below: the mean latency for managed user update (6.5/7) are 40/57 and the mean latency for the update repo manipulations are 12/15 - a 30% increase in the managed layer, and a 20% increase in the repo layer.

Metrics below:

Note I've also included the get repo connection metrics, as they seem different. Also the differing metrics for the queryId->queryFilter (find-relationships-for-resource in 6.5; adhoc-filter on relationships for 7.0) as they are substantially different as well, though not as significant as the repo update differences, and a bit harder to compare, as the invocation counts don't match exactly. 

6.5

        

{             "_id": "managed.user.update",             "_type": "timer",             "count": 270000,             "duration_units": "milliseconds",             "m15_rate": 281.02036711727317,             "m1_rate": 300.0405631518886,             "m5_rate": 297.4416225630038,             "max": 124.20329,             "mean": 40.01597019916523,             "mean_rate": 298.57833047282355,             "min": 18.957044,             "p50": 36.60396,             "p75": 45.265704,             "p95": 68.43506099999999,             "p98": 81.711216,             "p99": 86.61946499999999,             "p999": 105.14191799999999,             "rate_units": "calls/second",             "stddev": 13.737272328707514,             "total": 11942467.836719         }

,

        

{             "_id": "repo.jdbc.update.managed_user",             "_type": "timer",             "count": 270000,             "duration_units": "milliseconds",             "m15_rate": 284.9852825449351,             "m1_rate": 299.8677817849258,             "m5_rate": 297.94359564963503,             "max": 83.85498,             "mean": 12.037431097543092,             "mean_rate": 298.59375233539504,             "min": 3.9970619999999997,             "p50": 10.591244999999999,             "p75": 14.177995,             "p95": 22.546666,             "p98": 26.427526,             "p99": 30.81131,             "p999": 83.85498,             "rate_units": "calls/second",             "stddev": 6.385095051573262,             "total": 3602021.125407         }

,

        

{             "_id": "repo.jdbc.update.internal_usermeta",             "_type": "timer",             "count": 270000,             "duration_units": "milliseconds",             "m15_rate": 282.9953971895374,             "m1_rate": 299.78996259751597,             "m5_rate": 297.6595132215682,             "max": 43.961616,             "mean": 7.206694196221287,             "mean_rate": 298.6014850442494,             "min": 1.915938,             "p50": 6.377695,             "p75": 8.715474,             "p95": 13.471891999999999,             "p98": 18.417154,             "p99": 22.156129999999997,             "p999": 43.961616,             "rate_units": "calls/second",             "stddev": 4.259603153574792,             "total": 2109151.156938         }

,

        

{             "_id": "repo.jdbc.get-connection",             "_type": "timer",             "count": 4625429,             "duration_units": "milliseconds",             "m15_rate": 3250.920625996827,             "m1_rate": 5131.747668626873,             "m5_rate": 4894.5480706356775,             "max": 2.274146,             "mean": 0.00459239784693734,             "mean_rate": 4875.191437146801,             "min": 0.000421,             "p50": 0.000833,             "p75": 0.0009429999999999999,             "p95": 0.001307,             "p98": 0.001645,             "p99": 0.001931,             "p999": 2.274146,             "rate_units": "calls/second",             "stddev": 0.091468507863925,             "total": 831966.414798         }

,

        

{             "_id": "repo.jdbc.query._id-find-relationships-for-resource.relationships",             "_type": "timer",             "count": 654843,             "duration_units": "milliseconds",             "m15_rate": 661.1816623271907,             "m1_rate": 731.8133133912884,             "m5_rate": 721.854581661269,             "max": 31.112122999999997,             "mean": 1.2517860494536903,             "mean_rate": 724.185200006409,             "min": 0.30539,             "p50": 0.932666,             "p75": 1.527793,             "p95": 3.132978,             "p98": 4.255998,             "p99": 4.7050089999999996,             "p999": 7.082192999999999,             "rate_units": "calls/second",             "stddev": 1.1772393987037513,             "total": 1061944.529437         }

7.0

        

{             "_id": "managed.user.update",             "_type": "timer",             "count": 270001,             "duration_units": "milliseconds",             "m15_rate": 189.35970339287923,             "m1_rate": 292.9467532704025,             "m5_rate": 283.6907516901988,             "max": 431.67091999999997,             "mean": 57.530658196385275,             "mean_rate": 140.0724115521655,             "min": 26.189621,             "p50": 54.006307,             "p75": 64.608352,             "p95": 85.91414999999999,             "p98": 96.567787,             "p99": 107.033132,             "p999": 319.089806,             "rate_units": "calls/second",             "stddev": 19.247400144626916,             "total": 18976190.207794         }

,

        

{             "_id": "repo.jdbc.update.managed_user",             "_type": "timer",             "count": 270001,             "duration_units": "milliseconds",             "m15_rate": 189.34837651132656,             "m1_rate": 292.6948578331921,             "m5_rate": 283.64112812704855,             "max": 240.75149399999998,             "mean": 15.721502305100735,             "mean_rate": 140.07686365935928,             "min": 3.273704,             "p50": 14.648826999999999,             "p75": 18.587505,             "p95": 26.714692,             "p98": 31.733908,             "p99": 36.262373,             "p999": 99.331868,             "rate_units": "calls/second",             "stddev": 8.97674323360366,             "total": 4704638.231424         }

,

        

{             "_id": "repo.jdbc.update.internal_usermeta",             "_type": "timer",             "count": 270001,             "duration_units": "milliseconds",             "m15_rate": 189.34875221230683,             "m1_rate": 292.68892010664734,             "m5_rate": 283.6411879520726,             "max": 45.421715,             "mean": 10.127665299330488,             "mean_rate": 140.07481322936985,             "min": 2.125585,             "p50": 9.3215,             "p75": 12.104956,             "p95": 18.353628999999998,             "p98": 22.658728999999997,             "p99": 32.831379,             "p999": 45.421715,             "rate_units": "calls/second",             "stddev": 5.0654846489629834,             "total": 3092320.832253         }

,

        

{             "_id": "repo.jdbc.get-connection",             "_type": "timer",             "count": 7711178,             "duration_units": "milliseconds",             "m15_rate": 3895.754795529973,             "m1_rate": 4780.818811103841,             "m5_rate": 4816.408208534911,             "max": 40.319939999999995,             "mean": 0.02359152907848073,             "mean_rate": 3994.3670684316776,             "min": 0.000388,             "p50": 0.00086,             "p75": 0.000987,             "p95": 0.001401,             "p98": 0.001926,             "p99": 0.002254,             "p999": 3.680895,             "rate_units": "calls/second",             "stddev": 0.8600997296528877,             "total": 7431461.009968         }

,

        

{             "_id": "repo.jdbc.query._adhoc-filter.relationships",             "_type": "timer",             "count": 623755,             "duration_units": "milliseconds",             "m15_rate": 437.9909950459925,             "m1_rate": 684.9538072967911,             "m5_rate": 658.1978129052998,             "max": 77.481315,             "mean": 2.8818762530480106,             "mean_rate": 323.58509121870236,             "min": 0.278535,             "p50": 2.4748799999999997,             "p75": 3.833693,             "p95": 6.5005999999999995,             "p98": 7.865603999999999,             "p99": 9.155362,             "p999": 54.551643999999996,             "rate_units": "calls/second",             "stddev": 3.015356352372434,             "total": 2466955.226292         }
Comment by Tinghua Xu [ 06/May/20 ]

Suspect throughput in livesyn update is caused by the same issue, attached are the metrics collected for livesync update.

Comment by Tinghua Xu [ 27/May/20 ]

As Dirk requested earlier, with the latest IDM master and MySQL as repo, the Gatling output for update test:

---- Global Information --------------------------------------------------------
> request count                                     270000 (OK=270000 KO=0     )
> min response time                                     27 (OK=27     KO=-     )
> max response time                                   3922 (OK=3922   KO=-     )
> mean response time                                   110 (OK=110    KO=-     )
> std deviation                                        276 (OK=276    KO=-     )
> response time 50th percentile                         63 (OK=63     KO=-     )
> response time 95th percentile                        120 (OK=120    KO=-     )
> response time 99th percentile                       1970 (OK=1970   KO=-     )
> response time 99.9th percentile                     2616 (OK=2616   KO=-     )
> mean requests/sec                                299.667 (OK=299.667 KO=-     )
---- Response Time Distribution ------------------------------------------------
> t < 800 ms                                        263678 ( 98%)
> 800 ms < t < 1200 ms                                 944 (  0%)
> t > 1200 ms                                         5378 (  2%)
> failed                                                 0 (  0%)

and for Delete:

---- Global Information --------------------------------------------------------
> request count                                     270000 (OK=270000 KO=0     )
> min response time                                     18 (OK=18     KO=-     )
> max response time                                   6662 (OK=6662   KO=-     )
> mean response time                                   254 (OK=254    KO=-     )
> std deviation                                        949 (OK=949    KO=-     )
> response time 50th percentile                         32 (OK=32     KO=-     )
> response time 95th percentile                       1910 (OK=1909   KO=-     )
> response time 99th percentile                       5424 (OK=5424   KO=-     )
> response time 99.9th percentile                     5916 (OK=5916   KO=-     )
> mean requests/sec                                299.667 (OK=299.667 KO=-     )
---- Response Time Distribution ------------------------------------------------
> t < 800 ms                                        254318 ( 94%)
> 800 ms < t < 1200 ms                                 713 (  0%)
> t > 1200 ms                                        14969 (  6%)
> failed                                                 0 (  0%) 

attached are the updated metrics for update and delete in 7.0.0

Comment by Tinghua Xu [ 27/May/20 ]

Using external DJ as repo, for 7.0.0, we have the following Gatling output:
for update, 

 ---- Global Information --------------------------------------------------------
> request count                                     270000 (OK=270000 KO=0     )
> min response time                                     26 (OK=26     KO=-     )
> max response time                                   5014 (OK=5014   KO=-     )
> mean response time                                    98 (OK=98     KO=-     )
> std deviation                                        355 (OK=355    KO=-     )
> response time 50th percentile                         41 (OK=41     KO=-     )
> response time 95th percentile                         90 (OK=90     KO=-     )
> response time 99th percentile                       2530 (OK=2530   KO=-     )
> response time 99.9th percentile                     3618 (OK=3618   KO=-     )
> mean requests/sec                                299.667 (OK=299.667 KO=-     )
---- Response Time Distribution ------------------------------------------------
> t < 800 ms                                        263808 ( 98%)
> 800 ms < t < 1200 ms                                 966 (  0%)
> t > 1200 ms                                         5226 (  2%)
> failed                                                 0 (  0%) 

For delete, we have:

---- Global Information --------------------------------------------------------
> request count                                     270000 (OK=270000 KO=0     )
> min response time                                     20 (OK=20     KO=-     )
> max response time                                   8265 (OK=8265   KO=-     )
> mean response time                                   237 (OK=237    KO=-     )
> std deviation                                        857 (OK=857    KO=-     )
> response time 50th percentile                         37 (OK=37     KO=-     )
> response time 95th percentile                       1719 (OK=1719   KO=-     )
> response time 99th percentile                       4662 (OK=4662   KO=-     )
> response time 99.9th percentile                     6143 (OK=6143   KO=-     )
> mean requests/sec                                299.667 (OK=299.667 KO=-     )
---- Response Time Distribution ------------------------------------------------
> t < 800 ms                                        254581 ( 94%)
> 800 ms < t < 1200 ms                                 774 (  0%)
> t > 1200 ms                                        14645 (  5%)
> failed                                                 0 (  0%) 
Comment by Tinghua Xu [ 27/May/20 ]

6.5.0.3 external DJ as repo:
Update:

 ---- Global Information --------------------------------------------------------
> request count                                     270000 (OK=270000 KO=0     )
> min response time                                     20 (OK=20     KO=-     )
> max response time                                   1989 (OK=1989   KO=-     )
> mean response time                                    44 (OK=44     KO=-     )
> std deviation                                         87 (OK=87     KO=-     )
> response time 50th percentile                         31 (OK=31     KO=-     )
> response time 95th percentile                         80 (OK=80     KO=-     )
> response time 99th percentile                        134 (OK=134    KO=-     )
> response time 99.9th percentile                     1318 (OK=1318   KO=-     )
> mean requests/sec                                299.667 (OK=299.667 KO=-     )
---- Response Time Distribution ------------------------------------------------
> t < 800 ms                                        268600 ( 99%)
> 800 ms < t < 1200 ms                                 908 (  0%)
> t > 1200 ms                                          492 (  0%)
> failed                                                 0 (  0%)
================================================================================ 

Delete:

 ---- Global Information --------------------------------------------------------
> request count                                     270000 (OK=270000 KO=0     )
> min response time                                     15 (OK=15     KO=-     )
> max response time                                   1493 (OK=1493   KO=-     )
> mean response time                                    30 (OK=30     KO=-     )
> std deviation                                         42 (OK=42     KO=-     )
> response time 50th percentile                         25 (OK=25     KO=-     )
> response time 95th percentile                         47 (OK=47     KO=-     )
> response time 99th percentile                         80 (OK=80     KO=-     )
> response time 99.9th percentile                      701 (OK=701    KO=-     )
> mean requests/sec                                299.667 (OK=299.667 KO=-     )
---- Response Time Distribution ------------------------------------------------
> t < 800 ms                                        269859 (100%)
> 800 ms < t < 1200 ms                                 126 (  0%)
> t > 1200 ms                                           15 (  0%)
> failed                                                 0 (  0%) 
Comment by Tinghua Xu [ 18/Jun/20 ]

The degradation is caused by commit:

commit f6a758bc9abc6a6faec516d61d3e362c2580dd49 
Author: Jason Vincent <jason.vincent@forgerock.com> 
Date:   Tue Jul 9 19:38:40 2019 -0700       
OPENIDM-11773 Modify relationship link expansion mega-query to account for DA Privilges.  

With this commit, we have Gatling output:
For update:

---- Global Information --------------------------------------------------------

> request count                                     270000 (OK=270000 KO=0     )

> min response time                                     29 (OK=29     KO=-     )

> max response time                                   2164 (OK=2164   KO=-     )

> mean response time                                   105 (OK=105    KO=-     )

> std deviation                                        150 (OK=150    KO=-     )

> response time 50th percentile                         78 (OK=78     KO=-     )

> response time 95th percentile                        134 (OK=134    KO=-     )

> response time 99th percentile                       1048 (OK=1048   KO=-     )

> response time 99.9th percentile                     1524 (OK=1524   KO=-     )

> mean requests/sec                                299.667 (OK=299.667 KO=-     )

---- Response Time Distribution ------------------------------------------------

> t < 800 ms                                        265529 ( 98%)

> 800 ms < t < 1200 ms                                2900 (  1%)

> t > 1200 ms                                         1571 (  1%)

> failed                                                 0 (  0%)

 

For Delete:

 ---- Global Information --------------------------------------------------------

> request count                                     270000 (OK=270000 KO=0     )

> min response time                                    400 (OK=400    KO=-     )

> max response time                                  48908 (OK=48908  KO=-     )

> mean response time                                 32003 (OK=32003  KO=-     )

> std deviation                                      10077 (OK=10077  KO=-     )

> response time 50th percentile                      34197 (OK=34198  KO=-     )

> response time 95th percentile                      44305 (OK=44307  KO=-     )

> response time 99th percentile                      45120 (OK=45120  KO=-     )

> response time 99.9th percentile                    46448 (OK=46448  KO=-     )

> mean requests/sec                                 286.32 (OK=286.32 KO=-     )

---- Response Time Distribution ------------------------------------------------

> t < 800 ms                                            87 (  0%)

> 800 ms < t < 1200 ms                                  73 (  0%)

> t > 1200 ms                                       269840 (100%)

> failed                                                 0 (  0%)

 

With the commit(f00490ee6b1ba)right before this commit:

 For update:

---- Global Information --------------------------------------------------------

> request count                                     270000 (OK=270000 KO=0     )

> min response time                                     22 (OK=22     KO=-     )

> max response time                                   1644 (OK=1644   KO=-     )

> mean response time                                    57 (OK=57     KO=-     )

> std deviation                                         59 (OK=59     KO=-     )

> response time 50th percentile                         47 (OK=47     KO=-     )

> response time 95th percentile                         92 (OK=92     KO=-     )

> response time 99th percentile                        199 (OK=199    KO=-     )

> response time 99.9th percentile                      922 (OK=922    KO=-     )

> mean requests/sec                                299.667 (OK=299.667 KO=-     )

---- Response Time Distribution ------------------------------------------------

> t < 800 ms                                        269415 (100%)

> 800 ms < t < 1200 ms                                 563 (  0%)

> t > 1200 ms                                           22 (  0%)

> failed                                                 0 (  0%)

 

For Delete:

---- Global Information --------------------------------------------------------

> request count                                     270000 (OK=270000 KO=0     )

> min response time                                     19 (OK=19     KO=-     )

> max response time                                   9507 (OK=9507   KO=-     )

> mean response time                                    85 (OK=85     KO=-     )

> std deviation                                        202 (OK=202    KO=-     )

> response time 50th percentile                         36 (OK=36     KO=-     )

> response time 95th percentile                        270 (OK=270    KO=-     )

> response time 99th percentile                       1146 (OK=1146   KO=-     )

> response time 99.9th percentile                     1572 (OK=1572   KO=-     )

> mean requests/sec                                299.667 (OK=299.667 KO=-     )

---- Response Time Distribution ------------------------------------------------

> t < 800 ms                                        261905 ( 97%)

> 800 ms < t < 1200 ms                                5988 (  2%)

> t > 1200 ms                                         2107 (  1%)

> failed                                                 0 (  0%)

 

The tests are done on machine fourme.

 

 

Comment by Dirk Hogan [ 17/Jul/20 ]

Tinghua Xu the commit you cite above is not viable. The PR associated with the JIRA, https://stash.forgerock.org/projects/OPENIDM/repos/openidm/pull-requests/5400/overview, explicitly states that there is a known performance regression associated  with these changes, and filed https://bugster.forgerock.org/jira/browse/OPENIDM-13632 to track these fixes. This JIRA was closed as fixed. Your last comment in this JIRA, in turn, mentions filing https://bugster.forgerock.org/jira/browse/OPENIDM-13985, which is also closed as fixed. 

So I am tempted to close this defect as cannot reproduce. I will re-assign to you, if you want to try to find a viable commit hash which introduced the regression. I would say you have to start with the commit made as part of https://stash.forgerock.org/projects/OPENIDM/repos/openidm/pull-requests/6279/overview, as that is the last commit made resulting from perf changes made as the result of https://bugster.forgerock.org/jira/browse/OPENIDM-13632. If you can find a commit that introduces a performance regression following this commit, then I will take (yet) another look.

Comment by Tinghua Xu [ 23/Jul/20 ]

Dirk Hogan, the ticket you mentioned(13632 and 13985) are about the create, this JIRA is about update and delete. When the issue with creates are fixed, we possibly didn't verify the update and delete, especially from the stress mode point of view.

Comment by Tinghua Xu [ 11/Aug/20 ]

Alin Brici, The latest mean response times from IDM 7.0.0-final and 6.5.0.3 are as below:

  6.5.0.3 7.0.0
Update 52 24
Delete 67 31

 

Generated at Sat Nov 28 10:50:46 UTC 2020 using Jira 7.13.12#713012-sha1:6e07c38070d5191bbf7353952ed38f111754533a.