Trying to create unique key for a dataframe based on some columns. Used hashlib and zlib , both generating different values for each new python program execution for the same record in dataframe.
Looking for a way to create unique checksum and it should be the same for given data record in dataframe. There are many columns , so don't want to use concatenated column as a key. Any insights would be much appreciated. Sample code tested using hashlib and zlib below
Hashlib
stg_matchdf["Unique travelid"] = pd.DataFrame(stg_matchdf[uniquecols_list].astype(str).values.sum(axis=1))[0].\
str.encode('utf-8').apply(lambda x: (hashlib.sha512(x).hexdigest().upper()))
zlib.adler32
stg_matchdf["Unique travelid"] = pd.DataFrame(stg_matchdf[uniquecols_list].astype(str).values.sum(axis=1))[0].\
str.encode('utf-8').apply(lambda x: (zlib.adler32(x) & 0xffffffff ))
Edited(10/21) Changed code and hitting new problem. Please review. Sorry for any confusion
Above code snippets have problem. For a row , some other row's column values hash was added in 'Unique travelid' column due to pd.DataFrame() altering original df rows order. Below modified code fetches respective column values for a given row but hitting new issue explained below
Modified code
stg_matchdf["Unique travelid_Sum"] = stg_matchdf[uniquecols_list].astype(str).values.sum(axis=1)
stg_matchdf["Unique travelid_Key"] = stg_matchdf["Unique travelid_Sum"].apply(lambda x: (zlib.adler32(str(x).encode('utf-8')) & 0xffffffff))
stg_matchdf[uniquecols_list].astype(str).values.sum(axis=1) is not concatenating columns in one particular order across multiple runs. Please see sample below for two runs. Entire length is same , but order of concatenation is random. So it is causing hashlib or zlib to return different values each time. Is there any way to specify order of columns in above code?
Run1:
AHKGCANADACANADANORTH AMERICA266430RDirect WDAYYZINTERNATIONALMANULIFE - CANADA TRANSIENTFeb-2020HONG KONGASIA/PACIFICPARTIAL REFUND2020-02-15Canada266430.02020-02-02Hong Kong2020-03-01QVKGS6
Run2:
YYZCANADAPARTIAL REFUND2664302020-02-02AMANULIFE - CANADA TRANSIENTHONG KONGNORTH AMERICA2020-03-01Hong KongQVKGS6INTERNATIONALDirect WDRHKGACanadaFeb-2020266430.02020-02-15CANADAASIA/PACIFIC