I'm a C# developer trying to solve a problem with Python, so please bear with my non-Pythonic practices.
I'm trying to get my head around how to implement a bulk insert in a many-to-many relationship scenario, to build a shrunk-down analytical database (using SQLite3 for now) whilst normalizing repeating data. Here are the relevant model classes:
from peewee import *

class BaseModel(Model):
    class Meta:
        database = ServiceLocator().getDatabase()
        legacy_table_names = False

# File, Level and Host models omitted for brevity.

class ContentDetail(BaseModel):
    Id = AutoField()
    PropertyName = TextField(index=True)
    PropertyValue = TextField(null=True, default=None)
    ValueSizeInBytes = BigIntegerField()

class LogEntry(BaseModel):
    Id = AutoField()
    Level = ForeignKeyField(Level)
    File = ForeignKeyField(File, on_delete='CASCADE')
    Host = ForeignKeyField(Host)
    Date = DateTimeField()
    Contents = ManyToManyField(ContentDetail, backref="Entries")
    LineNumber = IntegerField()
and here is the bit of configuration in startup.py that maps the relationship (through) table:
import traceback
# imports of the models, ServiceLocator and LogService omitted for brevity

def main() -> None:
    locator = ServiceLocator()
    logger = locator.getLogger()
    try:
        with locator.getDatabase() as db:
            ContentDetailLogEntry = LogEntry.Contents.get_through_model()  # <-- please notice this
            db.create_tables([File, Level, Host, ContentDetail, LogEntry, ContentDetailLogEntry])
            LogService().FeedDatabase()
    except Exception:
        logger.critical(traceback.format_exc())

if __name__ == "__main__":
    main()
In order to process the big log files more quickly, I devised a dictionary with the following "interface" (a tuple of the normalizable values as key, and the list of related LogEntry models as value):
(propName: str, propVal: str, length: int) : list[LogEntry]
It is persisted automatically, but one row at a time, which makes the whole process run for several minutes on each file (roughly 14 min), and I'm facing around 200 files per batch.
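Concretely, this is roughly how I picture that cache being built while parsing (a simplified sketch; the cache_property helper and the UTF-8 byte count for ValueSizeInBytes are only illustrative, not my real parsing code):

from collections import defaultdict
from typing import DefaultDict, List, Tuple

# (propName, propVal, length) -> every LogEntry that references that value
ContentKey = Tuple[str, str, int]
ContentCache = DefaultDict[ContentKey, List["LogEntry"]]

def cache_property(cache: ContentCache, entry: "LogEntry",
                   prop_name: str, prop_val: str) -> None:
    """Group an already-saved LogEntry under its normalizable (name, value, size) tuple."""
    key = (prop_name, prop_val, len(prop_val.encode("utf-8")))
    cache[key].append(entry)

contentCache: ContentCache = defaultdict(list)

This is the part that then persists the cache, one entry at a time: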
(...)
with self.__resolver.getDatabase().atomic():
    with open(fileEntity.FileOrigin, mode="r", encoding="utf-8") as logFile:
        (...)
        for key, entries in self.__contentCache.items():  # key = (propName, propVal, length)
            for entry in entries:
                entry.Contents.add(ContentDetail.create(
                    PropertyName=key[0],
                    PropertyValue=key[1],
                    ValueSizeInBytes=key[2]))
                self.__logger.debug(f"Created content for {entry.LineNumber}")
(...)
I'm trying to figure out how I might make the ORM understand the LogEntry and ContentDetail relationship in a batched insert, following the docs at http://docs.peewee-orm.com/en/latest/peewee/querying.html#inserting-rows-in-batches, but in all honesty I can't even figure out how peewee is going to resolve the log entry / content relationship without the one-by-one Contents.add. Any ideas?
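For what it's worth, this is the direction I was imagining after reading that page, though I'm only guessing at the field names of the generated through model ("logentry" / "contentdetail") and at the chunk size:

from peewee import chunked

db = self.__resolver.getDatabase()
ThroughModel = LogEntry.Contents.get_through_model()

through_rows = []
with db.atomic():
    for key, entries in self.__contentCache.items():
        # one ContentDetail per distinct (propName, propVal, length) tuple
        detail = ContentDetail.create(
            PropertyName=key[0],
            PropertyValue=key[1],
            ValueSizeInBytes=key[2])
        for entry in entries:
            # guessed through-model field names
            through_rows.append({"logentry": entry.Id, "contentdetail": detail.Id})

    # batch the relationship rows instead of calling Contents.add per entry
    for batch in chunked(through_rows, 100):
        ThroughModel.insert_many(batch).execute()

But I don't know whether that is how the generated through model is meant to be used, or whether there is a better bulk pattern for ManyToManyField.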
Maybe my approach is all wrong. Any suggestions are welcome.