Orient ETL performance issue when importing edges to plocal on SSD


My goal is to import 25M edges into a graph that has about 50M vertices. Target time:

The current import speed is ~150 edges/sec. Over a remote connection the speed was about 100 edges/sec.

  • extracted 20,694,336 rows (171 rows/sec) - 20,694,336 rows -> loaded 20,691,830 vertices (171 vertices/sec) Total time: 35989762ms [0 warnings, 4 errors]
  • extracted 20,694,558 rows (156 rows/sec) - 20,694,558 rows -> loaded 20,692,053 vertices (156 vertices/sec) Total time: 35991185ms [0 warnings, 4 errors]
  • extracted 20,694,745 rows (147 rows/sec) - 20,694,746 rows -> loaded 20,692,240 vertices (147 vertices/sec) Total time: 35992453ms [0 warnings, 4 errors]
  • extracted 20,694,973 rows (163 rows/sec) - 20,694,973 rows -> loaded 20,692,467 vertices (162 vertices/sec) Total time: 35993851ms [0 warnings, 4 errors]
  • extracted 20,695,179 rows (145 rows/sec) - 20,695,179 rows -> loaded 20,692,673 vertices (145 vertices/sec) Total time: 35995262ms [0 warnings, 4 errors]

I tried enabling parallel in the ETL config, but it looks completely broken in OrientDB 2.2.12 (an inconsistency with the multi-threading changes in 2.1?) and gives me nothing but the 4 errors in the logs above. Dumb parallel mode (running 2+ ETL processes) is also impossible with a plocal connection, since plocal is embedded and only one process can open the database at a time.

My config:

{
    "config": {
        "log": "info",
        "parallel": true
    },
    "source": {
        "input": {}
    },
    "extractor": {
        "row": {
            "multiLine": false
        }
    },
    "transformers": [
        {
            "code": {
                "language": "Javascript",
                "code": "(new com.orientechnologies.orient.core.record.impl.ODocument()).fromJSON(input);"
            }
        },
        {
            "merge": {
                "joinFieldName": "_ref",
                "lookup": "Company._ref"
            }
        },
        {
            "vertex": {
                "class": "Company",
                "skipDuplicates": true
            }
        },
        {
            "edge": {
                "joinFieldName": "with_id",
                "lookup": "Person._ref",
                "direction": "in",
                "class": "Stakeholder",
                "edgeFields": {
                    "_ref": "${input._ref}",
                    "value_of_share": "${input.value_of_share}"
                },
                "skipDuplicates": true,
                "unresolvedLinkAction": "ERROR"
            }
        },
        {
            "field": {
                "fieldNames": [
                    "with_id",
                    "with_to",
                    "_type",
                    "value_of_share"
                ],
                "operation": "remove"
            }
        }
    ],
    "loader": {
        "orientdb": {
            "dbURL": "plocal:/mnt/disks/orientdb/orientdb-2.2.12/databases/df",
            "dbUser": "admin",
            "dbPassword": "admin",
            "dbAutoDropIfExists": false,
            "dbAutoCreate": false,
            "standardElementConstraints": false,
            "tx": false,
            "wal": false,
            "batchCommit": 1000,
            "dbType": "graph",
            "classes": [
                {
                    "name": "Company",
                    "extends": "V"
                },
                {
                    "name": "Person",
                    "extends": "V"
                },
                {
                    "name": "Stakeholder",
                    "extends": "E"
                }
            ]
        }
    }
}
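
One thing I want to rule out: the merge transformer looks up Company._ref and the edge transformer looks up Person._ref on every row, and the loader section above declares classes but no indexes. If _ref is not indexed, each lookup is a full class scan, which would also fit the steadily falling throughput in the UPDATE below. Here is a sketch of creating the indexes up front via the Java graph API; the class name CreateRefIndexes is mine, _ref is assumed to be a STRING (per my data sample), and UNIQUE_HASH_INDEX is just the type I would try first:

import com.orientechnologies.orient.core.sql.OCommandSQL;
import com.tinkerpop.blueprints.impls.orient.OrientGraphFactory;
import com.tinkerpop.blueprints.impls.orient.OrientGraphNoTx;

public class CreateRefIndexes {
    public static void main(String[] args) {
        // same URL and credentials as the loader section above
        OrientGraphFactory factory = new OrientGraphFactory(
                "plocal:/mnt/disks/orientdb/orientdb-2.2.12/databases/df", "admin", "admin");
        OrientGraphNoTx g = factory.getNoTx();
        try {
            // the properties must exist in the schema before they can be indexed
            g.command(new OCommandSQL("CREATE PROPERTY Company._ref STRING")).execute();
            g.command(new OCommandSQL("CREATE PROPERTY Person._ref STRING")).execute();
            // hash indexes make the per-row lookups point queries instead of scans
            g.command(new OCommandSQL(
                    "CREATE INDEX Company._ref ON Company (_ref) UNIQUE_HASH_INDEX")).execute();
            g.command(new OCommandSQL(
                    "CREATE INDEX Person._ref ON Person (_ref) UNIQUE_HASH_INDEX")).execute();
        } finally {
            g.shutdown();
            factory.close();
        }
    }
}

The ETL orientdb loader also accepts an "indexes" section, so the same thing should be expressible directly in the config instead of running a separate program.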

Data sample:

{"_ref":"1072308006473","with_to":"person","with_id":"010703814320","_type":"is.stakeholder","value_of_share":10000.0} {"_ref":"1075837000095","with_to":"person","with_id":"583600656732","_type":"is.stakeholder","value_of_share":15925.0} {"_ref":"1075837000095","with_to":"person","with_id":"583600851010","_type":"is.stakeholder","value_of_share":33150.0}

Server specs: a Google Cloud instance with PD-SSD, 6 CPUs, and 18 GB RAM.

By the way, on the same server I managed to get ~3k vertices/sec when importing vertices over a remote connection (still too slow, but acceptable for my current dataset).

And the question: is there any reliable way to increase the import speed to, say, 10k inserts per second, or at least 5k? I would rather not turn off indexes; it is still millions of records, not billions.
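
One option I am weighing is to bypass ETL entirely and write a small loader against the Blueprints API with the massive-insert intent. A rough, untested sketch follows; the class name StakeholderLoader and the first() helper are mine, the field values come from my data sample, JSON parsing is elided, and the Person -> Company edge direction follows my reading of "direction": "in" in the config:

import java.util.Iterator;

import com.orientechnologies.orient.core.intent.OIntentMassiveInsert;
import com.tinkerpop.blueprints.Edge;
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.orient.OrientGraphFactory;
import com.tinkerpop.blueprints.impls.orient.OrientGraphNoTx;

public class StakeholderLoader {
    public static void main(String[] args) {
        OrientGraphFactory factory = new OrientGraphFactory(
                "plocal:/mnt/disks/orientdb/orientdb-2.2.12/databases/df", "admin", "admin");
        OrientGraphNoTx g = factory.getNoTx();         // non-transactional, like "tx": false above
        g.declareIntent(new OIntentMassiveInsert());   // hint the storage engine for bulk writes
        try {
            // one record per input line; JSON parsing elided, values from my data sample
            String companyRef = "1072308006473";
            String personRef  = "010703814320";
            double share      = 10000.0;

            // "Class.property" lookups use the _ref indexes if they exist
            Vertex company = first(g.getVertices("Company._ref", companyRef));
            Vertex person  = first(g.getVertices("Person._ref", personRef));
            if (company != null && person != null) {
                // "direction": "in" in the ETL config: edge goes from Person to Company
                Edge e = person.addEdge("Stakeholder", company);
                e.setProperty("_ref", companyRef);
                e.setProperty("value_of_share", share);
            }
        } finally {
            g.declareIntent(null);
            g.shutdown();
            factory.close();
        }
    }

    // returns the first matching vertex, or null when the lookup finds nothing
    static Vertex first(Iterable<Vertex> vertices) {
        Iterator<Vertex> it = vertices.iterator();
        return it.hasNext() ? it.next() : null;
    }
}

Since my config already runs with "tx": false and "wal": false, this mostly adds the intent declaration and removes the per-row transformer overhead, so I do not know how much it would actually buy.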

UPDATE

After a few hours, performance continues to deteriorate.

  • extracted 23,146,912 rows (56 rows/sec) - 23,146,912 rows -> loaded 23,144,406 vertices (56 vertices/sec) Total time: 60886967ms [0 warnings, 4 errors]
  • extracted 23,146,981 rows (69 rows/sec) - 23,146,981 rows -> loaded 23,144,475 vertices (69 vertices/sec) Total time: 60887967ms [0 warnings, 4 errors]
  • extracted 23,147,075 rows (39 rows/sec) - 23,147,075 rows -> loaded 23,144,570 vertices (39 vertices/sec) Total time: 60890356ms [0 warnings, 4 errors]
