[OPENDJ-6278] export-ldif JE backend: speed improvement Created: 10/May/19  Updated: 28/Jan/20

Status: QA Backlog
Project: OpenDJ
Component/s: backends, tools
Affects Version/s: 6.5.0
Fix Version/s: None

Type: Improvement Priority: Major
Reporter: Jean-Noël Rouvignac Assignee: Jean-Noël Rouvignac
Resolution: Unresolved Votes: 0
Labels: None

Attachments: File flamegraph_alloc.svg     File flamegraph_cpu.svg    
Epic Link: Miscellaneous 7.0
Story Points: 8

 Description   

On a quick test on my laptop, exporting a 10 million entry backend takes about 2,5 minutes. So exporting a 500 million entry backend would take 125 minutes, i.e. about 2 hours.

I profiled the run of export-ldif offline with async-profiler, and here is where the time is spent:

  • 13%: cursor through id2entry
  • 10%: decode the entry
  • 25%: write to file (including serializing to string)
  • 50%: spent in native code doing garbage collection

Reducing allocations would reduce the time spent in GC which would leave more time for the export to take place.

Once this is solved, the next step may be to parallelize the code (to be confirmed). Today's code for export-ldif on JE Backend is largely single threaded (see ExportJob.exportContainer()). Using a thread pool to do the decoding + serialization to string, then having a single thread responsible for the write to disk may be beneficial (to be proven). RxJava may be a good choice for implementing it this way.



 Comments   
Comment by Matthew Swift [ 25/Nov/19 ]

Ludovic Poitou - the fixes so far have only yielded a small speed improvement. We suspect that more significant gains can be had by using a thread pipeline, but it would require a significant amount of work, 5 days. Do you want us to flag it as MVP for 7.0?

Comment by Jean-Noël Rouvignac [ 23/Dec/19 ]

Commit 2c80d920ce2 nearly doubles the perfs.
Unfortunately I intended to have two separate commits, but after a bad rebase from my part, it ended up with only a single commit only

Comment by Jean-Noël Rouvignac [ 26/Dec/19 ]

One functional test failed in CI and commit 2c80d920ce2 has been reverted.

Here are the logs:

[24/Dec/2019:10:23:12 +0000] category=CORE severity=ERROR msgID=140 msg=An uncaught exception during processing for thread Export LDIF - cursor through id2entry has caused it to terminate abnormally. The stack trace for that exception is: UndeliverableException: The exception could not be delivered to the consumer because it has already canceled/disposed the flow or the exception has nowhere to go to begin with. Further reading: https://github.com/ReactiveX/RxJava/wiki/What's-different-in-2.0#error-handling | com.sleepycat.je.ThreadInterruptedException: (JE 18.3.12) /home/jenkins/workspace/OpenDJ-build_master-CPMER56E632EMKNXZEURRUYWFHY72MGZDGZHN7BYLM5SGO6GS6PA/results/20191224-074402/replication_group3/Issue5406/DJ1/opendj/db/appData java.lang.InterruptedException THREAD_INTERRUPTED: InterruptedException may cause incorrect internal state, unable to continue. Environment is invalid and must be closed. (RxJavaPlugins.java:367 FlowableGenerate.java:146 FlowableGenerate.java:102 FlowableSubscribeOn.java:156 ExecutorScheduler.java:288 ExecutorScheduler.java:253 ThreadPoolExecutor.java:1128 ThreadPoolExecutor.java:628 Thread.java:834)	
[24/Dec/2019:10:23:12 +0000] category=CORE severity=ERROR msgID=140 msg=An uncaught exception during processing for thread Export LDIF - cursor through id2entry has caused it to terminate abnormally. The stack trace for that exception is: UndeliverableException: The exception could not be delivered to the consumer because it has already canceled/disposed the flow or the exception has nowhere to go to begin with. Further reading: https://github.com/ReactiveX/RxJava/wiki/What's-different-in-2.0#error-handling | com.sleepycat.je.ThreadInterruptedException: (JE 18.3.12) /home/jenkins/workspace/OpenDJ-build_master-CPMER56E632EMKNXZEURRUYWFHY72MGZDGZHN7BYLM5SGO6GS6PA/results/20191224-074402/replication_group3/Issue5406/DJ1/opendj/db/appData java.lang.InterruptedException THREAD_INTERRUPTED: InterruptedException may cause incorrect internal state, unable to continue. Environment is invalid and must be closed. (RxJavaPlugins.java:367 FlowableGenerate.java:146 FlowableGenerate.java:102 FlowableSubscribeOn.java:156 ExecutorScheduler.java:288 ExecutorScheduler.java:253 ThreadPoolExecutor.java:1128 ThreadPoolExecutor.java:628 Thread.java:834)	
[24/Dec/2019:10:23:12 +0000] category=CORE severity=NOTICE msgID=139 msg=The Directory Server has sent an alert notification generated by class org.opends.server.api.DirectoryThread (alert type org.opends.server.UncaughtException, alert ID org.opends.messages.core-140): An uncaught exception during processing for thread Export LDIF - cursor through id2entry has caused it to terminate abnormally. The stack trace for that exception is: UndeliverableException: The exception could not be delivered to the consumer because it has already canceled/disposed the flow or the exception has nowhere to go to begin with. Further reading: https://github.com/ReactiveX/RxJava/wiki/What's-different-in-2.0#error-handling | com.sleepycat.je.ThreadInterruptedException: (JE 18.3.12) /home/jenkins/workspace/OpenDJ-build_master-CPMER56E632EMKNXZEURRUYWFHY72MGZDGZHN7BYLM5SGO6GS6PA/results/20191224-074402/replication_group3/Issue5406/DJ1/opendj/db/appData java.lang.InterruptedException THREAD_INTERRUPTED: InterruptedException may cause incorrect internal state, unable to continue. Environment is invalid and must be closed. (RxJavaPlugins.java:367 FlowableGenerate.java:146 FlowableGenerate.java:102 FlowableSubscribeOn.java:156 ExecutorScheduler.java:288 ExecutorScheduler.java:253 ThreadPoolExecutor.java:1128 ThreadPoolExecutor.java:628 Thread.java:834)	
[24/Dec/2019:10:23:12 +0000] category=SYNC severity=ERROR msgID=76 msg=An unexpected error occurred when updating generation ID for domain "dc=example,dc=com": Other	
[24/Dec/2019:10:23:13 +0000] category=SYNC severity=ERROR msgID=73 msg=An unexpected error occurred when searching for generation id for domain "dc=example,dc=com": Other StorageRuntimeException(com.sleepycat.je.ThreadInterruptedException: (JE 18.3.12) Environment must be closed, caused by: com.sleepycat.je.ThreadInterruptedException: Environment invalid because of previous exception: (JE 18.3.12) /home/jenkins/workspace/OpenDJ-build_master-CPMER56E632EMKNXZEURRUYWFHY72MGZDGZHN7BYLM5SGO6GS6PA/results/20191224-074402/replication_group3/Issue5406/DJ1/opendj/db/appData java.lang.InterruptedException THREAD_INTERRUPTED: InterruptedException may cause incorrect internal state, unable to continue. Environment is invalid and must be closed.)

and here is the corresponding command line:

~/scripts/killall_servers.sh; ./run-pybot.py -nvf -s replication_group3.Issue5406 -t Test_5406 opendj
Comment by Jean-Noël Rouvignac [ 30/Dec/19 ]

It looks like verify-index could also be sped up by using some kind of multi-threading.

Comment by Jean-Noël Rouvignac [ 28/Jan/20 ]

I have reapplied the same change, fixing the problem noticed earlier.

Generated at Fri Nov 27 05:41:25 UTC 2020 using Jira 7.13.12#713012-sha1:6e07c38070d5191bbf7353952ed38f111754533a.