I have copied around 18 GB of CSV data from Data Lake Store to DocumentDB using the copy activity of Azure Data Factory. It is one month of data in total, and I copied 5 days of data at a time. After loading 25 days of data I got the error "Storage quota for 'Document' exceeded." In DocumentDB I can see that the size of the collection is 100 GB. I don't understand how 18 GB of data becomes 100 GB in DocumentDB. I have a partition key in DocumentDB and the default indexing policy. I know that indexing will increase the size a little, but I was not expecting this much. I am not sure whether I am doing anything wrong here. I do not have much experience with DocumentDB, and while searching I did not find an answer to this question, so I am posting it here.
I also tried copying a smaller data set of 1.8 GB from Data Lake Store to another DocumentDB collection, and it shows a size of around 14 GB in DocumentDB.
So DocumentDB is storing more than the actual data. Please help me understand why it shows almost 5 to 7 times the original size in Data Lake Store.
Based on my experience, the index does occupy space, but the main reason for this issue is that the data is stored as JSON in DocumentDB.
If you look at the JSON data, you will see it is all key-value pairs, because JSON is schema-less. The key names are repeated in every single document and occupy space (1 byte per character).
JSON also contains the characters that make it human readable, such as [ ], { }, : and the quotation marks around strings. These special characters occupy space as well.
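Here is a minimal sketch (Python, with made-up column names) of how much a single CSV row grows once the keys and JSON punctuation are repeated per document:

```python
import json

# Hypothetical CSV columns and one row; the header line is stored
# only once per CSV file, so it is amortized across all rows.
header = ["customer_id", "transaction_date", "amount", "description"]
row = ["12345", "2017-06-01", "99.90", "monthly subscription"]

csv_bytes = len(",".join(row)) + 1  # +1 for the trailing newline

# The same row as a JSON document: every key name, quote, brace and
# colon is repeated for every single row.
doc = dict(zip(header, row))
json_bytes = len(json.dumps(doc, separators=(",", ":")))

print(f"CSV row:       {csv_bytes} bytes")
print(f"JSON document: {json_bytes} bytes")  # roughly 2-3x bigger here
```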
In addition, DocumentDB itself adds system properties to every document, such as _rid, _self, _etag and _ts, which also occupy space. You can refer to the official documentation on resource properties.
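As a rough illustration, the sketch below adds the DocumentDB system properties to a small document; the property names are the real ones, but the values are fabricated placeholders:

```python
import json

user_doc = {"customer_id": "12345", "amount": "99.90"}

# Per-document metadata added by DocumentDB. Only the property names
# are real; these values are invented for illustration.
system_props = {
    "id": "e037a3c7-1e39-4a7a-9fd5-5d0f12f3a7a1",  # auto-generated GUID if you omit it
    "_rid": "kV5oAJ5OWGIBAAAAAAAAAA==",
    "_self": "dbs/kV5oAA==/colls/kV5oAJ5OWGI=/docs/kV5oAJ5OWGIBAAAAAAAAAA==/",
    "_etag": '"00003200-0000-0000-0000-59a1f0a10000"',
    "_attachments": "attachments/",
    "_ts": 1503784200,
}

before = len(json.dumps(user_doc, separators=(",", ":")))
after = len(json.dumps({**user_doc, **system_props}, separators=(",", ":")))
print(f"user data only: {before} bytes, as stored: {after} bytes")
```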
If possible, shorter keys can effectively save space, e.g. n1 instead of name1, because the key name is repeated in every document.
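A quick sketch of that saving, again with hypothetical field names:

```python
import json

long_keys = {"name1": "x", "name2": "y", "description": "z"}
short_keys = {"n1": "x", "n2": "y", "d": "z"}

saved = len(json.dumps(long_keys)) - len(json.dumps(short_keys))
# The saving applies to every document, so it scales with row count.
print(f"{saved} bytes saved per document")
```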
Hope this helps.