How to delete 200,000 records with DJango?

419 Views Asked by At

Situation: I have Model have a relation 1-1, sample:

class User(models.Model):
    user_namme = models.CharField(max_length=40)
    type = models.CharField(max_length=255)
    created_at = models.DatetimeField()
    ...

class Book(models.Model):
    user = models.OneToOneField(User, on_delete=models.CASCADE)

And I have a around 200,000 records.

  • Languague: Python
  • Framework: Django
  • Database: Postgres

Question:

  • How can I delete 200,000 records above with minimal cost?

Solution I have tried:

user_ids = Users.objects.filter(type='sample', created_date__gte='2022-11-15 08:00', created_date__lt="2022-11-15 08:30").values_list('id',flat=True)[:200000] # Fetch 200,000 user ids. 
for i, _ in enumerate(user_ids[:: 1000]):
    with transaction.atomic():
        batch_start = i * self.batch_size
        batch_end = batch_start + self.batch_size
        _, deleted = Users.objects.filter(id__in=user_ids[batch_start,batch_end]
    

With this solution, my server use arround:

  • 600MB CPU
  • 300MB RAM
  • Take more 15 minutes to finish workload.

I wonder do anyone have a better solution?

3

There are 3 best solutions below

0
On

By first principles, nothing beats raw(Django query) SQL in terms of speed because it operates closest to the database!

cursor.execute(“DELETE FROM DB WHERE Column = %s”)

Or else you can do it by:

Variable = Model.objects.filter(variable=variable)

if Variable.exists():

    Variable.delete()
1
On

Thanks, everyone. I have tried the solution with RawQuery

user_ids = Users.objects.filter(type='sample', created_date__gte='2022-11-15 08:00', created_date__lt="2022-11-15 08:30").values_list('id',flat=True)[:200000] # Fetch 200,000 user ids. 

for i in range(0, 3):
    user_ids_str = ""
    for user_id in user_ids.iterator(chunk_size=5000):
        user_ids_str += f"{user_id},"
    query = f"""
            DELETE FROM "user" WHERE "user"."id" IN ({user_ids_str});
            DELETE FROM "book" WHERE "user"."id" IN ({user_ids_str});
    """
    with transaction.atomic():
        with connection.cursor() as c:
            c.execute("SET statement_timeout = '10min';")
            c.execute(query)

This one can remove 600000 records and take around 10 minutes. And the server used around:

  • CPU: 50MB
  • RAM: 200MB
2
On

If you are using straight sql why not do a join on the user table with the date criteria to delete the books and then delete all the users using the created_date criteria? Let the database do all the work!

Even without writing the join

DELETE FROM "book" WHERE "user"."id" IN (select id from user where created_date >= '2022-11-15 08:00' and...)
DELETE FROM "user" WHERE created_date >= '2022-11-15 08:00' and...

would be better than what you have.