I'm trying to understand the concepts of a data lake and a lakehouse. Most links I've found so far only explain the differences from a macro/high-level perspective, for example this IBM page:
- data warehouse -> relational database (SQL)
- data lake -> both relational data and semi-structured/unstructured data.
- lakehouse -> best of both worlds.
I'm having a hard time understanding how exactly one would implement a data lake and a lakehouse.
For a data lake, it doesn't seem enough, at least to me, to have a NoSQL database like MongoDB: if we want to store audio or video, transforming it into something compatible with MongoDB's JSON-like storage format seems very unnatural. So we should probably have MongoDB for semi-structured data, coupled with a blob storage service like S3 or MinIO.
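To make my mental model concrete, here is a rough sketch of what I imagine (not a real implementation). It assumes a local folder standing in for S3/MinIO and a JSON-lines file standing in for MongoDB; the function names (`put_blob`, `add_metadata`, `find`, `ingest`) and the sample keys and fields are all made up for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def put_blob(store_root: Path, key: str, data: bytes) -> str:
    """Write raw bytes under a key, like a PUT into a bucket (assumed stand-in)."""
    path = store_root / key
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return key


def add_metadata(catalog: Path, doc: dict) -> None:
    """Append one JSON document to the catalog (stand-in for a MongoDB insert)."""
    with catalog.open("a", encoding="utf-8") as f:
        f.write(json.dumps(doc) + "\n")


def find(catalog: Path, **filters) -> list:
    """Return catalog documents whose fields equal every given filter value."""
    if not catalog.exists():
        return []
    with catalog.open(encoding="utf-8") as f:
        docs = [json.loads(line) for line in f if line.strip()]
    return [d for d in docs if all(d.get(k) == v for k, v in filters.items())]


def ingest(store_root: Path, catalog: Path, key: str, data: bytes, **attrs) -> dict:
    """Store the raw bytes untouched, then record a small document pointing at them."""
    put_blob(store_root, key, data)
    doc = {
        "key": key,
        "size": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
        **attrs,
    }
    add_metadata(catalog, doc)
    return doc


def demo() -> list:
    """Ingest two fake media files, then look up the audio one via its metadata."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "bucket"
        catalog = Path(tmp) / "catalog.jsonl"
        ingest(root, catalog, "audio/call-001.wav", b"RIFF-fake-audio-bytes", kind="audio")
        ingest(root, catalog, "video/clip-001.mp4", b"fake-video-bytes", kind="video")
        hits = find(catalog, kind="audio")
        # Follow each metadata document back to the raw bytes it points at.
        return [(d["key"], (root / d["key"]).read_bytes()) for d in hits]


if __name__ == "__main__":
    print(demo())
```

If this picture is right, the audio and video bytes never pass through the database at all; only a pointer (the key) and a few descriptive fields do.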
Then, for a lakehouse, we would improve on the semi-structured storage by choosing a format like Parquet, and then use a database that can somehow query Parquet files directly. Beyond that, I have no idea what else could be done.
It's likely I'm completely missing the point of both concepts, so any help would be appreciated.
P.S.: Explanations at the level of a 5-year-old would be most welcome... :D