
Add the import process implementation for data loader #2462

Open · wants to merge 122 commits into base: master
Conversation

@inv-jishnu (Contributor) commented Jan 10, 2025

Description

This PR adds import processors that process the import file based on its file format, along with the related DTOs and utility classes.

Related issues and/or PRs

Please review this PR after the PRs below have been reviewed and merged, and after master has been merged into this branch with those changes.

Some more information on data chunk and transaction size
The data chunk size and transaction size are introduced in the new changes. The data chunk size is used to split the input file into data chunks of the specified size. If the ScalarDB mode is transaction, the records in each data chunk are processed as transactions: they are further split based on the transaction size, and each group is processed together as a single transaction.
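The two-level split described above can be sketched as follows. This is an illustrative helper, not the PR's actual implementation; all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the two-level split described above: the input is
// first split into data chunks, and each chunk is further split into
// transaction batches. Names here are hypothetical, not the PR's code.
public class ChunkSplitSketch {

  // Split a list into consecutive sublists of at most `size` elements.
  static <T> List<List<T>> split(List<T> records, int size) {
    List<List<T>> result = new ArrayList<>();
    for (int i = 0; i < records.size(); i += size) {
      result.add(new ArrayList<>(records.subList(i, Math.min(i + size, records.size()))));
    }
    return result;
  }

  public static void main(String[] args) {
    List<Integer> records = new ArrayList<>();
    for (int i = 0; i < 10; i++) {
      records.add(i);
    }
    // Data chunk size 4: 10 records -> chunks of 4, 4, and 2 records.
    List<List<Integer>> chunks = split(records, 4);
    // Transaction batch size 2: the first chunk (4 records) -> 2 batches,
    // each processed together as a single transaction in transaction mode.
    List<List<Integer>> batches = split(chunks.get(0), 2);
    System.out.println(chunks.size() + " chunks, " + batches.size() + " batches");
  }
}
```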

Changes made

Added classes to process the import source file based on the file format, along with related DTOs and utility classes

Checklist

The following is a best-effort checklist. If any items in this checklist are not applicable to this PR or are dependent on other, unmerged PRs, please still mark the checkboxes after you have read and understood each item.

  • I have commented my code, particularly in hard-to-understand areas.
  • I have updated the documentation to reflect the changes.
  • Any remaining open issues linked to this PR are documented and up-to-date (Jira, GitHub, etc.).
  • Tests (unit, integration, etc.) have been added for the changes.
  • My changes generate no new warnings.
  • Any dependent changes in other PRs have been merged and published.

Additional notes (optional)

Roadmap to merge the remaining data loader core files: current status

Release notes

NA

@ypeckstadt ypeckstadt marked this pull request as ready for review March 14, 2025 09:00
@ypeckstadt ypeckstadt changed the title Add import processes Add the import process implementation for data loader Mar 14, 2025
@Override
public void addOrUpdateDataChunkStatus(ImportDataChunkStatus status) {
  synchronized (importDataChunkStatusList) {
    for (int i = 0; i < importDataChunkStatusList.size(); i++) {
Contributor:

If the size of importDataChunkStatusList is large, this loop might take tens or hundreds of milliseconds while holding the lock, which could affect performance.

It seems the order of the items is not important. How about using a Set or Map instead of a List so an item can be accessed in O(1)?
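The suggestion above can be sketched as follows; the class and field names are illustrative stand-ins, not the PR's code.

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the suggestion above: keying statuses by data chunk ID in a
// ConcurrentHashMap gives O(1) add-or-update without holding a lock while
// scanning a list. All names here are illustrative, not the PR's code.
public class DataChunkStatusStore {

  // Hypothetical status object identified by its data chunk ID.
  static class ImportDataChunkStatus {
    final int dataChunkId;
    final String state;

    ImportDataChunkStatus(int dataChunkId, String state) {
      this.dataChunkId = dataChunkId;
      this.state = state;
    }
  }

  private final ConcurrentHashMap<Integer, ImportDataChunkStatus> statuses =
      new ConcurrentHashMap<>();

  // put() atomically inserts or replaces the entry for this ID, so no
  // synchronized loop over a list is needed.
  public void addOrUpdateDataChunkStatus(ImportDataChunkStatus status) {
    statuses.put(status.dataChunkId, status);
  }

  public ImportDataChunkStatus get(int dataChunkId) {
    return statuses.get(dataChunkId);
  }

  public static void main(String[] args) {
    DataChunkStatusStore store = new DataChunkStatusStore();
    store.addOrUpdateDataChunkStatus(new ImportDataChunkStatus(1, "IN_PROGRESS"));
    store.addOrUpdateDataChunkStatus(new ImportDataChunkStatus(1, "COMPLETE"));
    System.out.println(store.get(1).state);
  }
}
```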

} catch (InterruptedException e) {
Thread.currentThread().interrupt();
throw new RuntimeException(
CoreError.DATA_LOADER_DATA_CHUNK_PROCESS_FAILED.buildMessage(
Contributor:

Don't need to handle other Exception types?

Contributor Author:

Currently, only an InterruptedException may be thrown here, by dataChunkQueue.poll.
Please let me know if you feel any other exception needs to be handled here.
Thank you.

Contributor:

I see. That sounds basically reasonable, since processDataChunk is called asynchronously. (An OutOfMemoryError could still be thrown, though.)

* @return a list of {@link ImportDataChunkStatus} objects indicating the processing status of
* each data chunk
*/
public List<ImportDataChunkStatus> process(
Contributor:

It looks like this is expected to always be overridden. How about making this an abstract method?
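The suggestion above can be sketched like this; the class names and return type are illustrative, not the PR's actual signatures.

```java
import java.util.Collections;
import java.util.List;

// Sketch of the suggestion above: declaring process(...) abstract forces
// every file-format processor to provide its own implementation, instead
// of overriding a concrete base method. Names here are illustrative.
public abstract class ImportProcessorSketch {

  // Abstract: each subclass (CSV, JSON, ...) must implement this.
  abstract List<String> process(int dataChunkSize, int transactionBatchSize);

  static class CsvProcessor extends ImportProcessorSketch {
    @Override
    List<String> process(int dataChunkSize, int transactionBatchSize) {
      return Collections.singletonList("processed as CSV");
    }
  }

  public static void main(String[] args) {
    ImportProcessorSketch processor = new CsvProcessor();
    System.out.println(processor.process(100, 10).get(0));
  }
}
```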

}

} catch (TransactionException e) {
isSuccess = false;
Contributor:

I think the transaction should be aborted, since a failure could have occurred before the commit.
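The abort-on-failure pattern suggested above can be sketched as follows. The Transaction interface here is a minimal stand-in for illustration, not ScalarDB's actual API.

```java
// Sketch of the abort-on-failure pattern suggested above. The Transaction
// interface is a stand-in for illustration, not ScalarDB's actual API.
public class AbortOnFailureSketch {

  interface Transaction {
    void commit() throws Exception;

    void abort();
  }

  // Returns true on success; aborts the transaction on any failure so it
  // is not left open when an error occurs before or during the commit.
  static boolean runBatch(Transaction tx, Runnable work) {
    try {
      work.run();
      tx.commit();
      return true;
    } catch (Exception e) {
      tx.abort();
      return false;
    }
  }

  public static void main(String[] args) {
    final boolean[] aborted = {false};
    Transaction failing =
        new Transaction() {
          @Override
          public void commit() throws Exception {
            throw new Exception("simulated commit failure");
          }

          @Override
          public void abort() {
            aborted[0] = true;
          }
        };
    boolean success = runBatch(failing, () -> {});
    System.out.println("success=" + success + ", aborted=" + aborted[0]);
  }
}
```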

List<ImportTransactionBatch> transactionBatches =
splitIntoTransactionBatches(dataChunk, transactionBatchSize);
ExecutorService transactionBatchExecutor =
Executors.newFixedThreadPool(Math.min(transactionBatches.size(), numCores));
Contributor:

If tasks are computation-dominant, setting the concurrency to the number of CPU cores is reasonable. But the tasks invoked in this method are IO-dominant and involve waiting for responses from the underlying database, so I think higher concurrency would give better throughput.
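The point above can be sketched like this; the oversubscription factor of 4 is an assumed starting point to be tuned by measurement, not a value from the PR.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of the point above: IO-bound tasks spend most of their time
// waiting on the database, so a pool larger than the CPU core count can
// improve throughput. The multiplier is an assumption for illustration.
public class IoBoundPoolSketch {

  static int ioBoundPoolSize(int numCores, int oversubscriptionFactor) {
    return numCores * oversubscriptionFactor;
  }

  public static void main(String[] args) throws InterruptedException {
    int numCores = Runtime.getRuntime().availableProcessors();
    ExecutorService executor =
        Executors.newFixedThreadPool(ioBoundPoolSize(numCores, 4));
    // Submit IO-like tasks that mostly sleep (simulating DB round-trips);
    // with oversubscription, more of them can wait concurrently.
    for (int i = 0; i < 8; i++) {
      executor.submit(
          () -> {
            try {
              Thread.sleep(10);
            } catch (InterruptedException e) {
              Thread.currentThread().interrupt();
            }
          });
    }
    executor.shutdown();
    executor.awaitTermination(5, TimeUnit.SECONDS);
    System.out.println("done");
  }
}
```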

.importAction(importAction)
.status(ImportTargetResultStatus.DATA_ALREADY_EXISTS)
.errors(
Collections.singletonList(CoreError.DATA_LOADER_DATA_ALREADY_EXISTS.buildMessage()))
Contributor:

How about adding key information in the error message for investigation?

Contributor Author:

@komamitsu san,

The error message is part of the ImportTargetResult object, which is then added to an ImportTaskResult object. The ImportTargetResult has the necessary details regarding the failure; I have added a sample below. It also includes the actual data that we attempted to import into the table.

{
      "namespace" : "testn",
      "tableName" : "all_columns",
      "importAction" : "INSERT",
      "errors" : [ "DB-CORE-10182: Record was not found" ],
      "dataMapped" : false,
      "importedRecord" : {
        "col1" : 9111111111,
        "col2" : 1,
        "col3" : 1,
        "col4" : 1.4E-45,
        "col5" : 4.9E-324,
        "col6" : "VALUE!!s",
        "col7" : "0x626C6F6220746573742076616C7565",
        "col8" : "2000-01-01",
        "col9" : "01:01:01.000000",
        "col10" : "2000-01-01T01:01:00",
        "col11" : "1970-01-21T03:20:41.740Z",
        "id" : 9111111111,
        "name" : "sample1",
        "email" : "[email protected]",
        "tx_id" : "6361bc3a-9c44-4de9-8829-b028fb48a749",
        "tx_state" : 3,
        "tx_version" : 1,
        "tx_prepared_at" : 1732508362517,
        "tx_committed_at" : 1732508364532
      },
      "status" : "DATA_NOT_FOUND"
    }

Could you please let me know whether the error message (in errors) needs to be clearer or needs any more details? The error message was kept simple because we provide all the details of the failure in the failure log.

Contributor:

Thanks for the explanation! Sounds good.

How about adding a comment at https://github.com/scalar-labs/scalardb/pull/2462/files#diff-fd06988f4ced6a0d6d0b1049c69ab07df11c372c3ccb868a78afbe4d24e6f663R827 like // These error messages will be output with key information ?

.importedRecord(mutableSourceRecord)
.importAction(importAction)
.status(ImportTargetResultStatus.DATA_NOT_FOUND)
.errors(Collections.singletonList(CoreError.DATA_LOADER_DATA_NOT_FOUND.buildMessage()))
Contributor:

Same as above

return ImportTargetResult.builder()
.namespace(namespace)
.tableName(table)
.status(ImportTargetResultStatus.VALIDATION_FAILED)
Contributor:

Probably the same as above? Could you take a look at the error messages added by this PR from the viewpoint of users who need to investigate failures?

* @throws ColumnParsingException if an error occurs while base64 parsing the column
*/
private static Column<?> getColumn(
Result scalarDBResult,
Contributor:

Suggested change
Result scalarDBResult,
@Nullable Result scalarDBResult,

import org.slf4j.LoggerFactory;

@RequiredArgsConstructor
public abstract class ImportProcessor {
Contributor:

ImportProcessor and its child classes are a bit complicated. Don't they need unit tests?

@inv-jishnu (Contributor Author):

@komamitsu san,

I have made the changes as suggested, including adding some basic unit test cases, apart from the feedback related to error message clarity (I have added some more details on that as a comment).
Please take another look and let me know if any further changes are required.
Thank you.

@inv-jishnu inv-jishnu requested a review from komamitsu March 25, 2025 09:56
@brfrn169 brfrn169 added the enhancement New feature or request label Mar 25, 2025
@brfrn169 (Collaborator) left a comment:

Left several comments, but I haven’t reviewed all the code yet. I’ll continue reviewing this PR. Thanks.

* @throws InterruptedException if the import process is interrupted
*/
public ConcurrentHashMap<Integer, ImportDataChunkStatus> startImport()
throws ExecutionException, InterruptedException {
Collaborator:

Suggested change
throws ExecutionException, InterruptedException {
{

public ConcurrentHashMap<Integer, ImportDataChunkStatus> process(
int dataChunkSize, int transactionBatchSize, BufferedReader reader) {
int numCores = Runtime.getRuntime().availableProcessors();
ExecutorService dataChunkExecutor = Executors.newFixedThreadPool(numCores);
@brfrn169 (Collaborator), Mar 25, 2025:

I'm not 100% sure, but it looks like we can replace this with Executors.newSingleThreadExecutor()?

Suggested change
ExecutorService dataChunkExecutor = Executors.newFixedThreadPool(numCores);
ExecutorService dataChunkExecutor = Executors.newSingleThreadExecutor();

@NonNull private final ImportOptions importOptions;
private final ImportProcessorFactory importProcessorFactory;
private final List<ImportEventListener> listeners = new ArrayList<>();
private final ScalarDBMode scalarDBMode;
Collaborator:

It looks like I missed this in the previous PR review, but we should rename this to ScalarDbMode. We should also rename ScalarDBDao and ScalarDBDaoException accordingly.

Could you please handle this in a separate PR?

Comment on lines +334 to +335
ExecutorService transactionBatchExecutor =
Executors.newFixedThreadPool(Math.min(transactionBatches.size(), numCores * 2));
Collaborator:

Could you please explain why we use Math.min(transactionBatches.size(), numCores * 2) here?

@@ -45,6 +44,7 @@ public class JsonImportProcessor extends ImportProcessor {

private static final DataLoaderObjectMapper OBJECT_MAPPER = new DataLoaderObjectMapper();
private static final AtomicInteger dataChunkIdCounter = new AtomicInteger(0);
private static final int MAX_QUEUE_SIZE = 10;
@komamitsu (Contributor), Mar 26, 2025:

This queue size might be too small when using a large number of cores, such as 32. I guess this limitation is to prevent OOM; if so, 256 or 512 should be okay. It would also be great if this were configurable.

I think you can decide whether to change this after conducting some performance tests with various numbers of cores.
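Making the bound configurable, as suggested above, could look like the sketch below. The system property name and the default of 256 are assumptions for illustration, not the PR's actual configuration API.

```java
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of making the queue bound configurable, as suggested above. The
// system property name and default value are assumptions for illustration.
public class QueueSizeSketch {

  // Reads the bound from a system property, falling back to 256.
  static int maxQueueSize() {
    return Integer.getInteger("data.loader.max.queue.size", 256);
  }

  public static void main(String[] args) {
    // A bounded queue caps memory use: producers block once it is full,
    // which is what prevents OOM when the reader outpaces the workers.
    LinkedBlockingQueue<Object> queue = new LinkedBlockingQueue<>(maxQueueSize());
    System.out.println(queue.remainingCapacity());
  }
}
```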

import org.mockito.Mock;
import org.mockito.Mockito;

public class CsvImportProcessorTest {
Contributor:

[minor] This can be package-private.

import org.mockito.Mockito;

public class CsvImportProcessorTest {
@Mock ImportProcessorParams params;
Contributor:

[minor] This can be private.

Labels
enhancement New feature or request

4 participants