How to append an auto-incrementing value to a duplicate value?

1.3k Views Asked by At

I have access to a reporting dataset (that I don't control) that we retrieve daily from a cloud service and store in a mysql db to run advanced reporting and report combining locally with 3rd party data visualization software.

The data often has duplicate values on an id field that create problems when joining with other tables for data analysis.

For example:

 +-------------+----------+------------+----------+
 | workfile_id | zip_code |    date    |   total  |
 +-------------+----------+------------+----------+
 |       78002 |    90210 | 2016-11-11 | 2010.023 |
 |       78002 |    90210 | 2016-12-22 |  427.132 |
 +-------------+----------+------------+----------+

Workfile_id is duplicated because this is the same job, but additional work on the job was performed in a different month than the original work. Instead of the software creating another workfile id for the job, the same is used.

Doing joins with other tables on workfile_id is problematic when more than one of the same id is present, so I was wondering if it is possible to do one of two things:

  1. Make duplicate workfile_id's unique. Have sql append a number to the workfile id when a duplicate is found. The first duplicate (or second occurrence of the same workfile id) would need to get a .01 appended to the end of the workfile id. Then later, if another duplicate is inserted, it would need to auto increment the appended number, say .02, and so on with any subsequent duplicate workfile_id. This method would work best with our data but I'm curious how difficult this would be for the server from a performance perspective. If I could schedule the alteration to take place after the data is inserted to speed up the initial data insert, that would be ideal.
  2. Sum total columns and remove duplicate workfile_id row. Have a task that identifies duplicate workfile_ids and sums the financial columns of the duplicates, replacing the original total with new sum and deleting the 'new row' after the columns have been added together. This is more messy from a data preservation perspective, but is acceptable if the first solution isn't possible.

My assumption is that there will be significant overhead to have the server compare new workfile_id values to all existing worlfile_id values each time data is inserted, but our dataset is small and new data is only inserted once daily, at 1:30am, and it also should be feasible to keep the duplicate workfile_id searching to rows inserted within the last 6 mo.

Is finding duplicates in a column (workfile_id) and appending an auto-incrementing value onto the workfile_id possible?

EDIT: I'm having trouble getting my trigger to work based on sdsc81's answer below. Any ideas?

DELIMITER //

CREATE TRIGGER append_subID_to_workfile_ID_salesjournal
AFTER INSERT
   ON salesjournal FOR EACH ROW

BEGIN
   SET @COUNTER = ( SELECT (COUNT(*)-1) FROM salesjournal WHERE workfile_id = NEW.workfile_id );
   IF @COUNTER > 1 THEN
       UPDATE salesjournal SET workfile_id = CONCAT(workfile_id, @COUNTER) WHERE id = NEW.id;
   END IF;

END;//

DELIMITER ;

It's hard to know if the trigger isn't working at all, or if just the code in the trigger isn't working. I get no errors on insert. Is there any way to debug trigger errors?

1

There are 1 best solutions below

11
On

Well, everything is posible ;)

You dont control the dataset but you can modifify the database, right? Then you could use a trigger after every insert of a new value, and update it, if its duplicate. Something like:

SET @COUNTER = ( SELECT (COUNT(*)-1) FROM *your_table* WHERE workfile_id = NEW.workfile_id );
IF @COUNTER > 1 THEN
    UPDATE *your_table* SET workfile_id = CONCAT(workfile_id, @COUNTER) WHERE some_unique_id = NEW.some_unique_id;
END IF;

If there are only one insert a day, and there is defined an index over the workfile_id value, then it shouldn't be any problem for your server at all.

Also, you could implement the second solution, doing:

DELIMITER //

CREATE TRIGGER append_subID_to_workfile_ID_salesjournal
AFTER INSERT ON salesjournal FOR EACH ROW

BEGIN
SET @COUNTER = ( SELECT (COUNT(*)-1) FROM salesjournal WHERE workfile_id = NEW.workfile_id );
IF @COUNTER > 1 THEN
    UPDATE salesjournal SET total = total + NEW.total WHERE workfile_id = NEW.workfile_id AND id <> NEW.id;
    DELETE FROM salesjournal WHERE id = NEW.id;
END IF;

END;//

DELIMITER ;

Hope this helps.