Processing CSV File using Python with data merge with complicated merging logic


I have data in csv file. Here is the example

Sr# Inv_ID      IdType  IdValue          MIC

1   F000000EU6  CUSIP   29088TEST        null
2   F000000EU6  ISIN    US290886TEST     null
3   F000000EU6  SEDOL   23TEST1          XNAS
4   F000000EU6  TICKER  ETEST            XNAS
5   F000000EU6  SEDOL   23TEST2          XNGS
6   F000000EU6  TICKER  ETEST            XNGS

Investment Vehicle ID + MIC is a unique Key. After processing it should give me below result

Sr# SEDOL    MIC     ISIN            CUSIP      TICKER  Inv_ID
2   23TEST1  XNAS    US290886TEST    29088TEST  ETEST   F000000EU6
3   23TEST2  XNGS    US290886TEST    29088TEST  ETEST   F000000EU6

Since Investment Vehicle ID + MIC is a unique combination, processing should generate the two rows shown above. Sr# 1 and 2 have the same Investment Vehicle ID but a null MIC, so they should not generate separate rows; instead, their ISIN and CUSIP values should be added to the existing records.

I was thinking of reading the file twice: first for IdTypes SEDOL and TICKER, skipping ISINs and CUSIPs, and generating records keyed on Investment Vehicle ID + MIC. In the above case this produces two records with two different keys --> F000000EU6XNAS and F000000EU6XNGS. Then read the file again for ISIN and CUSIP, updating the existing records whose keys start with F000000EU6 with the ISIN and CUSIP values; if there is no match, create separate records with the ISIN and CUSIP. However, I am facing a lot of performance issues.
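If bringing in a third-party dependency is an option, this kind of merge can also be expressed with pandas: pivot the rows with a real MIC into one record per (Inv_ID, MIC), pivot the null-MIC rows into one record per Inv_ID, and join the two. This is a hedged sketch, not a drop-in solution; it assumes the CSV has been cleaned to comma-separated form with columns Sr#, Inv_ID, IdType, IdValue, MIC and the literal string "null" for missing MICs:

```python
import io
import pandas as pd

# Inline sample data standing in for the cleaned CSV file.
csv_text = """Sr#,Inv_ID,IdType,IdValue,MIC
1,F000000EU6,CUSIP,29088TEST,null
2,F000000EU6,ISIN,US290886TEST,null
3,F000000EU6,SEDOL,23TEST1,XNAS
4,F000000EU6,TICKER,ETEST,XNAS
5,F000000EU6,SEDOL,23TEST2,XNGS
6,F000000EU6,TICKER,ETEST,XNGS
"""

# keep_default_na=False stops pandas from converting the literal
# string "null" into NaN on read.
df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

# Split rows with a real MIC from the Inv_ID-level (null-MIC) rows.
with_mic = df[df["MIC"] != "null"]
no_mic = df[df["MIC"] == "null"]

# Pivot each subset so IdType values become columns.
per_mic = with_mic.pivot_table(
    index=["Inv_ID", "MIC"], columns="IdType", values="IdValue", aggfunc="first"
).reset_index()
per_inv = no_mic.pivot_table(
    index="Inv_ID", columns="IdType", values="IdValue", aggfunc="first"
).reset_index()

# Attach the Inv_ID-level ISIN/CUSIP to every (Inv_ID, MIC) row.
result = per_mic.merge(per_inv, on="Inv_ID", how="left")
print(result)
```

On the sample data this yields one row per (Inv_ID, MIC) pair, each carrying the shared ISIN and CUSIP. Whether it beats the pure-Python dict approach depends on data size; it has not been profiled here either.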

1 Answer

Bart Van Loon

You can definitely do it in one run. Your question is not very clear or complete. I had to cut some corners for the following code to work:

  • the vertical alignment and the empty line in the input CSV may or may not be present in the actual input file; I am processing a preprocessed version with the csv module here, but this could be made more efficient
  • I couldn't figure out the logic behind the Sr# column in the output, so I just left it out
  • I haven't profiled this on large datasets which you might have

Anyway, I hope the following is helpful:

#!/usr/bin/env python

import csv
import re
from collections import defaultdict

FILENAME_INPUT = "data.csv"
with open(FILENAME_INPUT, "r", encoding="utf-8") as file:
    lines = file.readlines()

# Collapse whitespace-delimited columns into comma-separated lines,
# dropping blank lines, so csv.DictReader can parse them.
DELIMITER_REGEX = r"\s+"
processed_lines = (
    re.sub(DELIMITER_REGEX, ",", line.strip()) for line in lines if line.strip()
)

# investments: one output record per (Inv_ID, MIC) key.
# inv_info: Inv_ID-level identifiers (ISIN/CUSIP) from rows with a null MIC.
investments: dict[str, dict[str, str]] = defaultdict(dict)
inv_info: dict[str, dict[str, str]] = defaultdict(lambda: defaultdict(str))

csv_reader = csv.DictReader(processed_lines)
for row in csv_reader:
    mic = row["MIC"]
    inv_id = row["Inv_ID"]
    if mic == "null":
        # ISIN/CUSIP rows apply to every MIC of this Inv_ID.
        inv_info[inv_id][row["IdType"]] = row["IdValue"]
        continue

    # Rows with a real MIC define their own output record.
    key = inv_id + mic
    investments[key]["MIC"] = mic
    investments[key]["Inv_ID"] = inv_id
    investments[key][row["IdType"]] = row["IdValue"]

FILENAME_OUTPUT = "output.csv"
with open(FILENAME_OUTPUT, "w", encoding="utf-8") as file:
    csv_writer = csv.DictWriter(
        file, fieldnames=("SEDOL", "MIC", "ISIN", "CUSIP", "TICKER", "Inv_ID")
    )
    csv_writer.writeheader()
    for investment in investments.values():
        # Fill in the Inv_ID-level ISIN and CUSIP just before writing.
        investment["ISIN"] = inv_info[investment["Inv_ID"]]["ISIN"]
        investment["CUSIP"] = inv_info[investment["Inv_ID"]]["CUSIP"]
        csv_writer.writerow(investment)

The key idea is that while scanning the file you store data for an investment either at the specific (Inv_ID, MIC) level (investments) or at the more generic Inv_ID level (inv_info), and just before writing you collect all relevant data for each output line.
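One edge case from the question that the script above does not cover is the "no match" branch: an Inv_ID that only ever appears with a null MIC would end up in inv_info but never produce an output row. A hedged sketch of a fallback pass, using small in-memory dicts that mimic the script's state after its main loop (F000000XX9 is an invented Inv_ID used only for illustration):

```python
from collections import defaultdict

# State mimicking `investments` / `inv_info` after the main loop.
investments = {
    "F000000EU6XNAS": {"Inv_ID": "F000000EU6", "MIC": "XNAS",
                       "SEDOL": "23TEST1", "TICKER": "ETEST"},
}
inv_info = defaultdict(lambda: defaultdict(str))
inv_info["F000000EU6"]["ISIN"] = "US290886TEST"
inv_info["F000000XX9"]["ISIN"] = "US000000TEST"  # null-MIC rows only

# Any Inv_ID with ISIN/CUSIP data but no MIC-keyed record gets its own row.
seen = {inv["Inv_ID"] for inv in investments.values()}
extra_rows = [
    {"Inv_ID": inv_id, "ISIN": ids["ISIN"], "CUSIP": ids["CUSIP"]}
    for inv_id, ids in inv_info.items()
    if inv_id not in seen
]
print(extra_rows)  # only F000000XX9 needs its own row
```

These extra rows could be passed to the same csv_writer before closing the output file; DictWriter fills the missing SEDOL/MIC/TICKER fields with its restval default ("").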

(I only just now realise that I was assuming that 'inv' stands for 'investment', but probably 'inventory' makes more sense? :-D)