Record duplication in Postgres using Peewee

43 Views Asked by At

I've been handed the creds to a blank postgres instance and given the task of designing and implementing the back end of our small application. The app allows the customer to search through devices and make correlations between them based on scans of those devices.

Summary first:
I have multiple tables referencing the entries in another table in a one-to-many setup; when I ingest data into any of the tables that backreference the one, I am accidentally creating a duplicate of that record.

Details:
I've set up the database like this:

Table Contains
Devices physical + logical address info, name, etc.
Messages traffic sent/received by device
Files installed applications + access info
User Accts user access logs

My goal was to have one entry in the device table for each device, and then a foreign key field in each of the other tables (Messages, Files, User Accounts) referring to the device they belong to.

In peewee I implemented that as:

class Device(BaseModel):
    id = AutoField()
    md5 = FixedCharField(32) # hash of the file the device comes from
    #... other attrs


class Message(BaseModel):
    id = AutoField()
    dev_ref = ForeignKey(Device, backref="messages")
    #... other attrs

Right now all of our device scan data comes to us in the form of excel spreadsheets, with multiple sheets correlating to entries in each table. Ingest looks something like this:

def sheet_to_model(
    source_file: str,
    sheet: str,
    model: peewee.Model):

    df = pd.read_excel(source_file, sheet_name=sheet)
    file_hash = md5(open(source_file,'rb').read()).hexdigest()
    
    # try to get an existing device, otherwise make a new one
    # **probably the crux of the problem**
    try:
        device = Device.select().where(Device.md5 == file_hash).get()
    except:
        device = Device(md5 = file_hash, ...) #
        device.save()

    # iterate over rows, translate them to db col names, etc.
    for row in df.iterrows():
        ...
        attrs = { 'column' : 'data from spreadsheet' } 
    
        # also maybe the problem here?
        entry = model.create(dev_ref = device.id, **attrs)
        entry.save()

Then I ingest into the database like this:

# map of sheet name to the peewee model it corresponds to
sheet_model = {
    "Messages" : Message,
    "Files" : File,
    "User Accounts": UserAccounts
}

# go file by file and ingest sheet by sheet
for file in glob.glob("file/location/whatever"):
    xl_file = openpyxl.load_workbook(file, read_only=True)
    for sheet in filter(lambda k: k in sheet_model, xl_file.sheetnames):
        sheet_to_model(file, sheet, sheet_model[sheet])

All of my entries in Messages, Files, and UserAccounts go into their respective tables just fine, but for every new sheet I ingest from a spreadsheet, it creates a duplicate row in the Devices table. That is, if I ingest an excel file for device 1 that has sheets for messages, files, and user accounts, I will now have 3 rows called device 1 in my Devices table. What am I doing wrong here?

Is the issue in my code, or in the underlying design with the foreign keys?

I've double and triple checked the lookup to see if the device exists, i.e. if I have one device in the Devices table it's completely right:

dev = Device.select().where(Device.md5 == file_hash).get()
dev.id   # 1
dev.md5 # 123456789abcdef123456789abcdef (or whatever)

I've tried all sorts of things relating to peewee objects .save(), and putting that in other locations, but all for naught.

I'm hoping it's something simple! Thank you for any help you can provide!

1

There are 1 best solutions below

1
coleifer On

The problem may be your custom AutoKey or whatever, so peewee is issuing INSERTs. You should log your queries by adding a DEBUG handler to the peewee logging namespace. Or you can see which branch of your code is executing when you are doing the get-or-create. Or you can add a UNIQUE constraint on the file hash. Many things you can do to figure this out so go get more info.