memory rises all the time #29

Description

@haojiang-scitix

I use nemo_curator to clean data. I have tried many ways to control memory usage, but none of them worked.
I would like to know why memory keeps rising for the whole run and how to fix it.

NVIDIA-NeMo/Curator#1443

2026-01-30 14:11:40.281 | INFO     | cosmos_xenna.pipelines.private.monitoring:_make_stats:349 - Took 0.0024683475494384766 seconds to get node resource info.
2026-01-30 14:11:40.293 | INFO     | cosmos_xenna.pipelines.private.monitoring:_make_stats:354 - Took 0.012156248092651367 seconds to get cluster info.
2026-01-30 14:11:40.300 | INFO     | cosmos_xenna.pipelines.private.monitoring:_make_stats:363 - Took 0.007071256637573242 seconds to get actor info.
2026-01-30 14:11:40.301 | INFO     | cosmos_xenna.pipelines.private.monitoring:update:326 - took 0.0226137638092041 to get stats.
2026-01-30 14:11:40.302 | INFO     | cosmos_xenna.pipelines.private.monitoring:_print_state:394 - Pipeline Stats:
Pipeline duration: 60.02254738410314 minutes
Number of initial input samples: 1
Number of input samples remaining: 0
Streaming pipeline main loop rate: 99.37478621783299

Cluster Resources:
╒══════════════════════════╤═════════╤═════════════╕
│ Resource                 │   Total │   Available │
╞══════════════════════════╪═════════╪═════════════╡
│ CPUs                     │   20    │      19     │
├──────────────────────────┼─────────┼─────────────┤
│ GPUs                     │    0    │       0     │
├──────────────────────────┼─────────┼─────────────┤
│ Memory (GB)              │ 1374.39 │    1374.39  │
├──────────────────────────┼─────────┼─────────────┤
│ Object Store Memory (GB) │  200    │     143.155 │
╘══════════════════════════╧═════════╧═════════════╛

Resource Usage by Stage:
╒══════════════════════════════════╤═════════╤═══════════════╤═══════════════╤════════════════════╤══════════════════════════╕
│ Stage                            │   CPU % │   Memory (GB) │   Actor Count │   CPU % per worker │   Memory (GB) per worker │
╞══════════════════════════════════╪═════════╪═══════════════╪═══════════════╪════════════════════╪══════════════════════════╡
│ Stage 00 - FilePartitioningStage │     0   │          0    │             0 │                0   │                     0    │
├──────────────────────────────────┼─────────┼───────────────┼───────────────┼────────────────────┼──────────────────────────┤
│ Stage 01 - JsonlReaderStage      │   298.8 │        359.22 │             3 │               99.6 │                   119.74 │
├──────────────────────────────────┼─────────┼───────────────┼───────────────┼────────────────────┼──────────────────────────┤
│ Stage 02 - AddId                 │     4.5 │        322.51 │             3 │                1.5 │                   107.5  │
├──────────────────────────────────┼─────────┼───────────────┼───────────────┼────────────────────┼──────────────────────────┤
│ Stage 03 - JsonlWriter           │   202.2 │         99.63 │             2 │              101.1 │                    49.81 │
╘══════════════════════════════════╧═════════╧═══════════════╧═══════════════╧════════════════════╧══════════════════════════╛

Stage state:
╒══════════════════════════════════╤═══════════╤═══════════╤═══════════╤═══════════╤═══════════╤═════════════╤═════════════════╤══════════════╤═══════════════╤════════════╤═════════════╤═════════════════╕
│ Stage                            │   Actors: │   Actors: │   Actors: │   Actors: │   Actors: │      Tasks: │          Tasks: │       Queue: │        Queue: │     Slots: │      Slots: │          Speed: │
│                                  │    Target │   Pending │     Ready │   Running │      Idle │   Completed │   Returned None │   Input Size │   Output Size │   Num Used │   Num Empty │   Tasks/actor/s │
╞══════════════════════════════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╪═════════════╪═════════════════╪══════════════╪═══════════════╪════════════╪═════════════╪═════════════════╡
│ Stage 00 - FilePartitioningStage │         0 │         0 │         0 │         0 │         0 │           1 │               0 │            0 │           126 │          0 │           0 │                 │
├──────────────────────────────────┼───────────┼───────────┼───────────┼───────────┼───────────┼─────────────┼─────────────────┼──────────────┼───────────────┼────────────┼─────────────┼─────────────────┤
│ Stage 01 - JsonlReaderStage      │         0 │         0 │         3 │         3 │         0 │         117 │               0 │            0 │             1 │          5 │           1 │       0.0148177 │
├──────────────────────────────────┼───────────┼───────────┼───────────┼───────────┼───────────┼─────────────┼─────────────────┼──────────────┼───────────────┼────────────┼─────────────┼─────────────────┤
│ Stage 02 - AddId                 │         0 │         0 │         3 │         0 │         3 │         116 │               0 │            0 │             6 │          0 │           6 │       1.28427   │
├──────────────────────────────────┼───────────┼───────────┼───────────┼───────────┼───────────┼─────────────┼─────────────────┼──────────────┼───────────────┼────────────┼─────────────┼─────────────────┤
│ Stage 03 - JsonlWriter           │         0 │         0 │         2 │         2 │         0 │         106 │               0 │            0 │           106 │          4 │           0 │       0.0219004 │
╘══════════════════════════════════╧═══════════╧═══════════╧═══════════╧═══════════╧═══════════╧═════════════╧═════════════════╧══════════════╧═══════════════╧════════════╧═════════════╧═════════════════╛
2026-01-30 14:11:40.302 | INFO     | cosmos_xenna.pipelines.private.streaming:run_pipeline:448 - StreamingExecutor Timing Summary:
  Auto Scaling  : 0.000001 seconds
  Pool Update   : 0.000522 seconds
  Monitor Update: 0.000004 seconds
  Add Tasks     : 0.000014 seconds
  Sleep         : 0.009521 seconds
  Total         : 0.010062 seconds
2026-01-30 14:11:40.302 | INFO     | cosmos_xenna.pipelines.private.streaming:run_pipeline:450 - Worker allocation:
Component    Utilization                              NVDEC    NVENC
Node 0       CPUs: [################----] 8.00/10.00
2026-01-30 14:26:40.361 | INFO     | cosmos_xenna.pipelines.private.monitoring:_make_stats:349 - Took 0.0032367706298828125 seconds to get node resource info.
2026-01-30 14:26:40.373 | INFO     | cosmos_xenna.pipelines.private.monitoring:_make_stats:354 - Took 0.011534690856933594 seconds to get cluster info.
2026-01-30 14:26:40.380 | INFO     | cosmos_xenna.pipelines.private.monitoring:_make_stats:363 - Took 0.007001161575317383 seconds to get actor info.
2026-01-30 14:26:40.381 | INFO     | cosmos_xenna.pipelines.private.monitoring:update:326 - took 0.022833824157714844 to get stats.
2026-01-30 14:26:40.382 | INFO     | cosmos_xenna.pipelines.private.monitoring:_print_state:394 - Pipeline Stats:
Pipeline duration: 75.02388019959132 minutes
Number of initial input samples: 1
Number of input samples remaining: 0
Streaming pipeline main loop rate: 99.36889092192622

Cluster Resources:
╒══════════════════════════╤═════════╤═════════════╕
│ Resource                 │   Total │   Available │
╞══════════════════════════╪═════════╪═════════════╡
│ CPUs                     │   20    │      19     │
├──────────────────────────┼─────────┼─────────────┤
│ GPUs                     │    0    │       0     │
├──────────────────────────┼─────────┼─────────────┤
│ Memory (GB)              │ 1374.39 │    1374.39  │
├──────────────────────────┼─────────┼─────────────┤
│ Object Store Memory (GB) │  200    │     133.957 │
╘══════════════════════════╧═════════╧═════════════╛

Resource Usage by Stage:
╒══════════════════════════════════╤═════════╤═══════════════╤═══════════════╤════════════════════╤══════════════════════════╕
│ Stage                            │   CPU % │   Memory (GB) │   Actor Count │   CPU % per worker │   Memory (GB) per worker │
╞══════════════════════════════════╪═════════╪═══════════════╪═══════════════╪════════════════════╪══════════════════════════╡
│ Stage 00 - FilePartitioningStage │       0 │          0    │             0 │               0    │                     0    │
├──────────────────────────────────┼─────────┼───────────────┼───────────────┼────────────────────┼──────────────────────────┤
│ Stage 01 - JsonlReaderStage      │     304 │        498.44 │             3 │             101.33 │                   166.15 │
├──────────────────────────────────┼─────────┼───────────────┼───────────────┼────────────────────┼──────────────────────────┤
│ Stage 02 - AddId                 │     103 │        463.17 │             3 │              34.33 │                   154.39 │
├──────────────────────────────────┼─────────┼───────────────┼───────────────┼────────────────────┼──────────────────────────┤
│ Stage 03 - JsonlWriter           │     302 │        121.4  │             2 │             151    │                    60.7  │
╘══════════════════════════════════╧═════════╧═══════════════╧═══════════════╧════════════════════╧══════════════════════════╛

Stage state:
╒══════════════════════════════════╤═══════════╤═══════════╤═══════════╤═══════════╤═══════════╤═════════════╤═════════════════╤══════════════╤═══════════════╤════════════╤═════════════╤═════════════════╕
│ Stage                            │   Actors: │   Actors: │   Actors: │   Actors: │   Actors: │      Tasks: │          Tasks: │       Queue: │        Queue: │     Slots: │      Slots: │          Speed: │
│                                  │    Target │   Pending │     Ready │   Running │      Idle │   Completed │   Returned None │   Input Size │   Output Size │   Num Used │   Num Empty │   Tasks/actor/s │
╞══════════════════════════════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╪═════════════╪═════════════════╪══════════════╪═══════════════╪════════════╪═════════════╪═════════════════╡
│ Stage 00 - FilePartitioningStage │         0 │         0 │         0 │         0 │         0 │           1 │               0 │            0 │            86 │          0 │           0 │                 │
├──────────────────────────────────┼───────────┼───────────┼───────────┼───────────┼───────────┼─────────────┼─────────────────┼──────────────┼───────────────┼────────────┼─────────────┼─────────────────┤
│ Stage 01 - JsonlReaderStage      │         0 │         0 │         3 │         3 │         0 │         156 │               0 │            0 │             0 │          6 │           0 │       0.0145159 │
├──────────────────────────────────┼───────────┼───────────┼───────────┼───────────┼───────────┼─────────────┼─────────────────┼──────────────┼───────────────┼────────────┼─────────────┼─────────────────┤
│ Stage 02 - AddId                 │         0 │         0 │         3 │         1 │         2 │         155 │               0 │            0 │             5 │          1 │           5 │       1.95257   │
├──────────────────────────────────┼───────────┼───────────┼───────────┼───────────┼───────────┼─────────────┼─────────────────┼──────────────┼───────────────┼────────────┼─────────────┼─────────────────┤
│ Stage 03 - JsonlWriter           │         0 │         0 │         2 │         2 │         0 │         146 │               0 │            0 │           146 │          4 │           0 │       0.021874  │
╘══════════════════════════════════╧═══════════╧═══════════╧═══════════╧═══════════╧═══════════╧═════════════╧═════════════════╧══════════════╧═══════════════╧════════════╧═════════════╧═════════════════╛
2026-01-30 14:26:40.382 | INFO     | cosmos_xenna.pipelines.private.streaming:run_pipeline:448 - StreamingExecutor Timing Summary:
  Auto Scaling  : 0.000000 seconds
  Pool Update   : 0.000647 seconds
  Monitor Update: 0.000004 seconds
  Add Tasks     : 0.000013 seconds
  Sleep         : 0.009391 seconds
  Total         : 0.010055 seconds
2026-01-30 14:26:40.383 | INFO     | cosmos_xenna.pipelines.private.streaming:run_pipeline:450 - Worker allocation:
Component    Utilization                              NVDEC    NVENC
Node 0       CPUs: [################----] 8.00/10.00
(Stage 03 - JsonlWriter pid=3279) 2026-01-30 14:27:15.063 | DEBUG    | nemo_curator.stages.text.io.writer.base:process:102 - Written 1268202 records to /volume/ai4s-data/hjiang02/workspace/data/cpt_data/base_data-addId-files_per_partition/en-code/66acb184cb6f.jsonl
(Stage 03 - JsonlWriter pid=3277) 2026-01-30 14:27:23.629 | DEBUG    | nemo_curator.stages.text.io.writer.base:process:102 - Written 1332107 records to /volume/ai4s-data/hjiang02/workspace/data/cpt_data/base_data-addId-files_per_partition/en-code/6421ca217cb6.jsonl
(Stage 03 - JsonlWriter pid=3277) 2026-01-30 14:28:19.515 | DEBUG    | nemo_curator.stages.text.io.writer.base:process:102 - Written 1273219 records to /volume/ai4s-data/hjiang02/workspace/data/cpt_data/base_data-addId-files_per_partition/en-code/a6d743b4fc66.jsonl
(Stage 03 - JsonlWriter pid=3277) 2026-01-30 14:29:20.574 | DEBUG    | nemo_curator.stages.text.io.writer.base:process:102 - Written 1324632 records to /volume/ai4s-data/hjiang02/workspace/data/cpt_data/base_data-addId-files_per_partition/en-code/f8a4bd595206.jsonl [repeated 2x across cluster]
(Stage 03 - JsonlWriter pid=3279) 2026-01-30 14:29:28.013 | DEBUG    | nemo_curator.stages.text.io.writer.base:process:102 - Written 1325925 records to /volume/ai4s-data/hjiang02/workspace/data/cpt_data/base_data-addId-files_per_partition/en-code/558affcf223d.jsonl
(Stage 03 - JsonlWriter pid=3277) 2026-01-30 14:30:25.176 | DEBUG    | nemo_curator.stages.text.io.writer.base:process:102 - Written 1484739 records to /volume/ai4s-data/hjiang02/workspace/data/cpt_data/base_data-addId-files_per_partition/en-code/3b0d1c7baba0.jsonl
(Stage 03 - JsonlWriter pid=3279) 2026-01-30 14:30:40.382 | DEBUG    | nemo_curator.stages.text.io.writer.base:process:102 - Written 1337055 records to /volume/ai4s-data/hjiang02/workspace/data/cpt_data/base_data-addId-files_per_partition/en-code/07e4e385cc09.jsonl
(Stage 03 - JsonlWriter pid=3277) 2026-01-30 14:31:27.314 | DEBUG    | nemo_curator.stages.text.io.writer.base:process:102 - Written 1341660 records to /volume/ai4s-data/hjiang02/workspace/data/cpt_data/base_data-addId-files_per_partition/en-code/c971e454565c.jsonl
2026-01-30 14:31:34.604 | ERROR    | nemo_curator.backends.xenna.executor:execute:144 - Pipeline execution failed: Task was killed due to the node running low on memory.
Memory on the node (IP: 172.16.109.22, ID: f78f41633e31122952c27a1f2ed916f52a1d4c5297480930b9ea22fd) where the lease (lease ID: 0c000000926601560b83c03da89e435afec51a55243325741cbebb197748386a, name=StageWorker.__init__, pid=3277, memory used=74.85GB) was running was 1216.42GB / 1280.00GB (0.950331), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: 90d09bb73c4895edf616a22ec03e40df9cae2c8a25c60529d04bc4fb) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 172.16.109.22`. To see the logs of the worker, use `ray logs worker-90d09bb73c4895edf616a22ec03e40df9cae2c8a25c60529d04bc4fb*out -ip 172.16.109.22. Top 10 memory users:
PID	MEM(GB)	COMMAND
3285	326.02	ray::StageWorker.process_data
3278	180.23	ray::StageWorker
3281	162.70	ray::StageWorker
3283	159.71	ray::StageWorker.process_data
3284	130.38	ray::StageWorker
3277	74.85	ray::StageWorker
3279	62.67	ray::StageWorker.process_data
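
To put rough numbers on "rises all the time", here is a small sketch that compares the two "Resource Usage by Stage" snapshots in the logs (at about 60 and 75 minutes) and sums the top memory users from the Ray OOM report. All figures are copied from the output above. The variable and function names are only illustrative, and the per-stage figures are treated as independent, which is an assumption: they may double-count shared object store memory.

```python
# Pipeline duration (minutes) of the two stats snapshots in the logs.
t0_min = 60.02254738410314
t1_min = 75.02388019959132

# "Memory (GB)" column of the two "Resource Usage by Stage" tables.
mem_t0 = {"JsonlReaderStage": 359.22, "AddId": 322.51, "JsonlWriter": 99.63}
mem_t1 = {"JsonlReaderStage": 498.44, "AddId": 463.17, "JsonlWriter": 121.40}


def growth_gb_per_min(before, after, elapsed_min):
    """Average memory growth per stage, in GB per minute."""
    return {stage: (after[stage] - before[stage]) / elapsed_min for stage in before}


rates = growth_gb_per_min(mem_t0, mem_t1, t1_min - t0_min)
for stage, rate in rates.items():
    delta = mem_t1[stage] - mem_t0[stage]
    print(f"{stage}: +{delta:.2f} GB between snapshots, about {rate:.2f} GB/min")

# "Top 10 memory users" from the Ray OOM message (PID -> GB).
top_users = {
    3285: 326.02,
    3278: 180.23,
    3281: 162.70,
    3283: 159.71,
    3284: 130.38,
    3277: 74.85,
    3279: 62.67,
}
node_used_gb = 1216.42  # "1216.42GB / 1280.00GB" in the same message

worker_total_gb = sum(top_users.values())
share = worker_total_gb / node_used_gb
print(f"StageWorker processes: {worker_total_gb:.2f} GB ({share:.0%} of node usage)")
```

The reader and AddId stages account for most of the growth between the two snapshots, and the listed StageWorker processes make up the bulk of the node's memory use at the time of the kill.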
