SQL query performance, archive vs status change

Straight to the point: I've tried searching on Google and on SO but can't find what I'm looking for. It could be because I'm not wording my search correctly.

My question is:
I have a couple of tables which will hold anywhere between 1,000 and 100,000 rows per year. I'm trying to figure out whether, and how, I should archive this data. I'm not well experienced with databases, but below are a few methods I've come up with, and I'm unsure which is better practice, taking into account both performance and ease of coding. I'm using Java 1.8, Sql2o and Postgres.

Method 1: Archive the data into a separate database every year.
I don't really like this method, because when we want to search for old data, our application will need to query a different database, and it'll be a hassle to add more code for this.

Method 2: Archive data older than 2-3 years into a separate database, and use a status on the rows to improve performance (see method 3).
This is what I'm leaning towards as an 'optimal' solution: the code is not as complex, and it also keeps my DB relatively clean.

Method 3: Just have a status for each row (e.g. A = active, R = archived) to possibly improve query performance, using a "select * from table where status = 'A'" to reduce the number of rows to look through.
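
To make method 3 concrete, roughly what I have in mind (table and column names are just placeholders):

    -- Each row carries a status flag; the everyday query only touches active rows.
    CREATE TABLE orders (
        id          bigserial PRIMARY KEY,
        created_at  date NOT NULL,
        status      char(1) NOT NULL DEFAULT 'A',  -- 'A' = active, 'R' = archived
        payload     text
    );

    SELECT * FROM orders WHERE status = 'A';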

2 Answers

Accepted Answer

100,000 rows per year is not that much. [1]

There's no need to move that to a separate place. If you already have good indexes in place, you almost certainly won't notice any degraded performance over the years.

However, if you want to be absolutely sure, you could add a year column and create an index for that (or add that to your existing indexes). But really, do that only for the tables where you know you need it. For example, if your table already has a date column which is part of your index(es), you don't need a separate year column.
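
As a rough sketch, assuming your table is called orders and already has a created_at date column:

    -- An index on the date column lets Postgres skip old rows when you filter by date.
    CREATE INDEX idx_orders_created_at ON orders (created_at);

    -- A typical query that benefits from the index:
    SELECT * FROM orders WHERE created_at >= DATE '2018-01-01';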

[1] Unless you have thousands of columns and/or columns that contain large binary blobs - which doesn't seem to be the case here.

Answer

As Vog mentions, 100,000 rows is not very many. Nor is 1,000,000 or 5,000,000 -- sizes that your tables may grow to.

In many databases, you could use a clustered index where the first key is the "active" column. However, Postgres does not really support clustered indexes.

Instead, I would suggest that you look into table partitioning. This is a method where the underlying storage is split among different "files". You can easily specify that a query reads one or more partitions by using the partitioning key in a where clause.
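
As a sketch of what that could look like with declarative partitioning (available since Postgres 10; table and column names are just assumptions for illustration):

    -- Parent table partitioned by the status flag.
    CREATE TABLE orders (
        id          bigserial,
        created_at  date NOT NULL,
        status      char(1) NOT NULL DEFAULT 'A',
        payload     text
    ) PARTITION BY LIST (status);

    CREATE TABLE orders_active   PARTITION OF orders FOR VALUES IN ('A');
    CREATE TABLE orders_archived PARTITION OF orders FOR VALUES IN ('R');

    -- Filtering on the partitioning key means only the matching partition is scanned.
    SELECT * FROM orders WHERE status = 'A';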

For your particular use-case, I would further suggest having a view over only the active data. Queries against it would only read one partition, so the performance should be pretty much the same as reading a table that contains only the most recent data.
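
Building on the partitioning sketch above, the view could be as simple as:

    -- The application queries the view as if it were a plain table,
    -- and Postgres only scans the active partition.
    CREATE VIEW active_orders AS
        SELECT * FROM orders WHERE status = 'A';

    SELECT * FROM active_orders WHERE created_at >= DATE '2018-01-01';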

That said, I'm not sure if it is better to partition by an active flag or by year. That depends on how you are accessing the data, particularly the older data.