[OPENIDM-14003] Recon create throughput drop introduced by changes for OPENIDM-13657 Created: 25/Oct/19  Updated: 06/Nov/19  Resolved: 06/Nov/19

Status: Closed
Project: OpenIDM
Component/s: Module - Core mapping, synchronization, reconciliation
Affects Version/s: 7.0.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Tinghua Xu Assignee: Dirk Hogan
Resolution: Cannot Reproduce Votes: 0
Labels: CLARK, Performance, regression
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

IDM master runs on CentOS7, MySQL as repo. Java 11, Latest DJ master.

Attachments: HTML File distilled     File idm-metrics.0a27773     File idm-metrics.55394c2     File idm-metrics.dirk     File idm-metrics.new_master_create     File idm-metrics.newer_master    
Target Version/s:
Story Points: 5


Noticed about 10% throughput drop with recon create from IDM to DJ. With commit 55394c2, the throughput is about 520 users/second, with commit 0a27773(from OPENIDM-13657), the throughput is about 459 users/second.

Attached are the metrics from tests using the two commits.
To reproduce using Pyforge:
1. Set Pyforge config as below:

num_users = 50
duration = 2000
concurrency = 10
max_throughput = -1

host_name = ${Default:host_name}
java_home = /usr/java/jdk11
java_args = ${Default:java_args}
version = 7.0.0-SNAPSHOT
previous_version = 6.5.0
protocol = http
repo_type = mysql
use_docker_repo = False
enable_metrics = True
jvm_debug_options = ${Default:jvm_debug_options}
cloud = ${Default:cloud}
enable_policy_enforcement = False

2. Run test:

run-pybot.py -c stress -s idm.regular_recon.ReconManUserToLDAP OpenIDM

3. Check pyforge/results/debug.txt for recon throughput.

Note that no throughput drop on recon update was seen.

Comment by Dirk Hogan [ 28/Oct/19 ]

See the attached doc named distilled, which compares the relevant metrics before, and after, the queryIds replacement. The bottom line: the generic queryFilter case is no match for the queryId-specific path of find-relationships-for-resource.  

Before - note 700,000 find-relationships-for resource, at a mean latency of .33 ms, for a total of 205 seconds.

{ "_id": "repo.raw._queryId.find-relationships-for-resource", "_type": "timer", "count": 700000, "duration_units": "milliseconds", "m15_rate": 244.67968223586703, "m1_rate": 310.0519044575764, "m5_rate": 270.4030882040544, "max": 1.660286, "mean": 0.3302382702309644, "mean_rate": 274.0017870280342, "min": 0.21412799999999999, "p50": 0.316118, "p75": 0.351884, "p95": 0.418921, "p98": 0.5024569999999999, "p99": 0.580374, "p999": 1.660286, "rate_units": "calls/second", "stddev": 0.09923688358913117, "total": 205482.878824 }

After - note 700,000 of the queryFilter=true replacements for find-relationships-for-resource, at a mean latency of 1.33 ms, 4x slowdown over find-relationships-for-resource, for a total of 885 seconds, 680 seconds slower than find-relationships-for-resource.

{ "_id": "repo.jdbc.query._adhoc-filter.relationships", "_type": "timer", "count": 700000, "duration_units": "milliseconds", "m15_rate": 235.64963952352824, "m1_rate": 203.26379515905415, "m5_rate": 246.39433298674243, "max": 3.801827, "mean": 1.3367790304241058, "mean_rate": 267.0815070631801, "min": 0.894655, "p50": 1.2940669999999999, "p75": 1.4451539999999998, "p95": 1.724916, "p98": 2.272363, "p99": 2.573868, "p999": 3.801827, "rate_units": "calls/second", "stddev": 0.2911924050791407, "total": 885445.902408 }

Note this additional metric, which measures the execution time of the query:

{ "_id": "repo.jdbc.relationship.edge.execute.edge-not-joined-to-vertex", "_type": "timer", "count": 700000, "duration_units": "milliseconds", "m15_rate": 235.71085088440245, "m1_rate": 203.2100405523411, "m5_rate": 246.38173102367307, "max": 3.245891, "mean": 0.8506782474374249, "mean_rate": 267.0827051459067, "min": 0.5551889999999999, "p50": 0.820726, "p75": 0.9376319999999999, "p95": 1.094115, "p98": 1.527715, "p99": 1.742174, "p999": 3.028692, "rate_units": "calls/second", "stddev": 0.21798858693061987, "total": 552344.090919 }

That means:

1.33 - .85 = .48 ms spent parsing the queryFilter into a PreparedStatement. We could seek to diminish this time by caching parsed queryFilter instances

 Even considering the queryFilter parsing time, .85ms for execution compares to the .33ms execution of the corresponding queryId. We may seek to diminish this latency by

  1. using the trigger-generated knowledge of relationship sidedness to exclude one of the sides of the union all.
  2. excluding the join to the object types table


The generated query looks like:

SELECT edge_fullobject, edge_objectid, edge_rev, edge_firstresourcecollection, edge_firstresourceid, edge_firstpropertyname, edge_secondresourcecollection, edge_secondresourceid, edge_secondpropertyname, edge_refResourceId, edge_refResourceCollection FROM (  SELECT edge_fullobject, edge_objectid, edge_rev, edge_firstresourcecollection, edge_firstresourceid, edge_firstpropertyname, edge_secondresourcecollection, edge_secondresourceid, edge_secondpropertyname, id, edge_refResourceId, edge_refResourceCollection FROM ( SELECT edge_of.fullobject AS edge_fullobject, edge_of.objectid AS edge_objectid, edge_of.rev AS edge_rev, edge_of.firstresourcecollection AS edge_firstresourcecollection, edge_of.firstresourceid AS edge_firstresourceid, edge_of.firstpropertyname AS edge_firstpropertyname, edge_of.secondresourcecollection AS edge_secondresourcecollection, edge_of.secondresourceid AS edge_secondresourceid, edge_of.secondpropertyname AS edge_secondpropertyname, edge_of.id AS id, edge_of.secondresourceid AS edge_refResourceId, edge_of.secondresourcecollection AS edge_refResourceCollection FROM openidm.relationships edge_of INNER JOIN openidm.objecttypes edge_ofobjecttypes ON ((edge_of.objecttypes_id = edge_ofobjecttypes.id) AND (edge_ofobjecttypes.objecttype =  'relationships')) WHERE (edge_of.firstresourcecollection = 'managed/user' AND edge_of.firstresourceid = 'A-A-244' AND edge_of.firstpropertyname = 'roles') LIMIT '2147483647' OFFSET '0' ) join0 UNION ALL  SELECT edge_fullobject, edge_objectid, edge_rev, edge_firstresourcecollection, edge_firstresourceid, edge_firstpropertyname, edge_secondresourcecollection, edge_secondresourceid, edge_secondpropertyname, id, edge_refResourceId, edge_refResourceCollection FROM ( SELECT edge_os.fullobject AS edge_fullobject, edge_os.objectid AS edge_objectid, edge_os.rev AS edge_rev, edge_os.firstresourcecollection AS edge_firstresourcecollection, edge_os.firstresourceid AS edge_firstresourceid, edge_os.firstpropertyname AS edge_firstpropertyname, edge_os.secondresourcecollection AS edge_secondresourcecollection, edge_os.secondresourceid AS edge_secondresourceid, edge_os.secondpropertyname AS edge_secondpropertyname, edge_os.id AS id, edge_os.firstresourceid AS edge_refResourceId, edge_os.firstresourcecollection AS edge_refResourceCollection FROM openidm.relationships edge_os INNER JOIN openidm.objecttypes edge_osobjecttypes ON ((edge_os.objecttypes_id = edge_osobjecttypes.id) AND (edge_osobjecttypes.objecttype =  'relationships')) WHERE (edge_os.secondresourcecollection = 'managed/user' AND edge_os.secondresourceid = 'A-A-244' AND edge_os.secondpropertyname = 'roles') LIMIT '2147483647' OFFSET '0' ) join1 ) outerSort LIMIT '2147483647' OFFSET '0'

Comment by Tinghua Xu [ 28/Oct/19 ]

The metrics from the PR-5709 build is indeed better:

            "_id": "repo.jdbc.query._adhoc-filter.relationships",
            "_type": "timer",
            "count": 700000,
            "duration_units": "milliseconds",
            "m15_rate": 245.29786379918065,
            "m1_rate": 309.7632093765477,
            "m5_rate": 270.7303190797846,
            "max": 3.389646,
            "mean": 0.4046165566996481,
            "mean_rate": 270.51292738077126,
            "min": 0.260838,
            "p50": 0.381345,
            "p75": 0.42036399999999996,
            "p95": 0.5129009999999999,
            "p98": 0.618266,
            "p99": 1.016271,
            "p999": 3.389646,
            "rate_units": "calls/second",
            "stddev": 0.15703656377391295,
            "total": 289918.056425

The recon create throughput was 503 users/second.

Comment by Dirk Hogan [ 30/Oct/19 ]

Digging into the fact that the problem is only seen on the recon create, not the recon update, which, from a relationship query perspective, should be exactly the same:

The tests work as follows:

  1. N(6000) users are written directly to the mysql database
  2. recon from managed->dj is performed - this is the entirety of the create test - situation will be absent, link will be created, and record written to dj
  3. N users are modified
  4. recon from managed->dj - this is the 'update' test - situation will be confirmed as link is present, and updated record written to dj

 Testing on my laptop reveals the same pattern which Tinghua mentioned - the query slowness appears on the recon create, not the recon create followed by update. 

Recon create:

            "_id": "repo.jdbc.query._adhoc-filter.relationships",
            "count": 18000,
            "max": 126.693012,
            "mean": 4.42350615297052,
            "min": 1.631173,
            "p50": 2.14474,
            "p75": 2.846394,
            "p95": 12.622641,
            "p98": 34.789837,
            "p99": 54.535301999999994,
            "p999": 126.693012,
            "stddev": 9.486571044335859,
            "m15_rate": 203.62531638707026,
            "m1_rate": 55.91281924201021,
            "m5_rate": 172.96463803265843,
            "mean_rate": 116.58859279757995,
            "duration_units": "milliseconds",
            "rate_units": "calls/second",
            "total": 75856.678543,
            "_type": "timer"
            "_id": "repo.jdbc.relationship.edge.execute.edge-not-joined-to-vertex",
            "count": 18000,
            "max": 64.219494,
            "mean": 2.113976611449383,
            "min": 0.932736,
            "p50": 1.162356,
            "p75": 1.3625239999999998,
            "p95": 4.426564,
            "p98": 13.686378,
            "p99": 26.040647,
            "p999": 48.435699,
            "stddev": 4.73753189808542,
            "m15_rate": 204.4716157037283,
            "m1_rate": 55.98904008557281,
            "m5_rate": 173.56992796941282,
            "mean_rate": 116.59786847860403,
            "duration_units": "milliseconds",
            "rate_units": "calls/second",
            "total": 38313.408627,
            "_type": "timer"

The relevant recon metrics:

                "sourceObjectQuery": {
                    "min": 4,
                    "max": 132,
                    "mean": 10,
                    "count": 6000,
                    "sum": 61072,
                    "stdDev": 13
Note that these are the query to read the full object from managed/user.

Recon create and update:

            "_id": "repo.jdbc.query._adhoc-filter.relationships",
            "count": 54000,
            "max": 15.386235999999998,
            "mean": 1.754353297539799,
            "min": 0.840119,
            "p50": 1.753943,
            "p75": 1.88624,
            "p95": 2.2189389999999998,
            "p98": 4.2906889999999995,
            "p99": 5.394055,
            "p999": 15.386235999999998,
            "stddev": 0.9579543780581617,
            "m15_rate": 179.5952847025916,
            "m1_rate": 2.1688201389539383,
            "m5_rate": 85.08654121106025,
            "mean_rate": 93.96073291734787,
            "duration_units": "milliseconds",
            "rate_units": "calls/second",
            "total": 110557.022468,
            "_type": "timer"
            "_id": "repo.jdbc.relationship.edge.execute.edge-not-joined-to-vertex",
            "count": 54000,
            "max": 23.592205999999997,
            "mean": 1.1138625011841703,
            "min": 0.528164,
            "p50": 1.075427,
            "p75": 1.1585619999999999,
            "p95": 1.31311,
            "p98": 2.933054,
            "p99": 4.043562,
            "p999": 11.437987,
            "stddev": 0.8888795927904168,
            "m15_rate": 180.44367901963057,
            "m1_rate": 2.1681012931595047,
            "m5_rate": 85.32244654994541,
            "mean_rate": 93.96256400786281,
            "duration_units": "milliseconds",
            "rate_units": "calls/second",
            "total": 61691.437609,
            "_type": "timer"

The relevant recon metrics:

                "sourceObjectQuery": {
                    "min": 3,
                    "max": 28,
                    "mean": 5,
                    "count": 6000,
                    "sum": 30654,
                    "stdDev": 1

See the difference in the max,min, and standard deviation for the queries for the recon create, and the recon create and update cases. The difference in the mean is due to greater latency fluctuations in the create case.

I want to conclude that the slowness of the relationship query is resulting from the link writes to the repo, which occur only during the 'create' recon. The uniformity the repo reads during the recon update almost makes me wonder if the repo will cache virtually all requests. It is like the java VM will pre-fetch uniform memory accesses so that CPU caches are never missed if memory is accessed in a regular fashion (index through array as opposed to access elements in a linked-list).

Comment by Tinghua Xu [ 06/Nov/19 ]

The throughput for recon create comes back to the original level 520 users/second after commit b103f0f7ca, the issue is somehow fixed.

Generated at Sat Nov 28 13:00:55 UTC 2020 using Jira 7.13.12#713012-sha1:6e07c38070d5191bbf7353952ed38f111754533a.