After copying around 18GB csv file from data lake to DocumentDB, it shows me 100 GB in DocumentDB why?

237 Views Asked by At

I have copied around 18 GB csv file from data lake store to documentDB using copy activity of azure data factory. Its total of of 1 months data. I have copied 5 days data at a time using copy activity of ADF. After loading 25 days data I get error "Storage quota for 'Document' exceeded." I can see that in documentDB it shows size of that collection is 100GB. I am not getting how 18GB data becomes 100GB in DocumentDB. I have partition key in DocumentDB and default indexing policy. I know that because of indexing it will increase the size little bit. But I was not expecting this much. I am not sure whether I am doing anything wrong here. I do not have much experience with documentDB and while searching on this question I do not get any answer so posting this question here.

I tried copying another small data of 1.8 GB from data lake store to document DB in another collection. And it shows me size of around 14 GB in documentDB.

So it means documentdb has more data than actual data. Please help to understand why it shows almost 5 to 7 times more size in documentdb than actual size in data lake store.

2

There are 2 best solutions below

1
On

Based on my experience, index occupy the space but the main reason for this issue is that the data is stored in the form of json in documentdb.

{
    "color": "white",
    "name": "orange",
    "count": 1,
    "id": "fruit1",
    "arr":[1,2,3,4],
    "_rid": "F0APAPzLigUBAAAAAAAAAA==",
    "_self": "dbs/F0APAA==/colls/F0APAPzLigU=/docs/F0APAPzLigUBAAAAAAAAAA==/",
    "_etag": "\"06001f2f-0000-0000-0000-5989c6da0000\"",
    "_attachments": "attachments/",
    "_ts": 1502201562
}

If you observe the json data, you could find that they are all key-values , because json schema-less.These key values are needed to occupy the space (1 byte per letter).

The JSON would also generate characters to be very human readable ,such as [ ] ,{ }, : and so on.These special characters also occupy the space.

Also, documentdb would generate System property occupy space,such as _rid,_self,_etag,_ts. You could refer to the official document.

If it's possible, shorter keys could effectively save space, like use n1 instead of name1.

Hope it helps you.

2
On

This is a common "problem" with hierarchical, self-describing formats such as XML, JSON, YAML etc.

First if you take a "relational format" with a fixed schema or formats that have no metadata such as CSV and represent it in JSON, you now explode the schema information into every single key/value property as Jay explains.

Additionally, if you then store that document, often the so called Document Object Model used to store it is exploding the original textual size by 2 to 10 times (depending on lengths of keys, complexity of documents etc.).

Thus the recommendation is that unless you really need the semistructured format provided by XML, JSON etc, you should consider reverting the storage back into a structured format such as a table.