How to generate a fake name using Faker() passing existing name as the seed_instance


I have a dataframe with customer names which I need to use for test data purposes, but need to obfuscate the names. The name needs to be deterministic: if the same name exists in the table then it should be obfuscated with the same 'fake' name.

For example, both rows for Susan H need to get the same fake name:

FullName   FakeName
Susan H    John F
Eva B      Sarah E
Susan H    John F

I have discovered Faker() for this purpose. How can I adapt the below so that I can pass in the existing name as the 'seed_instance' so that the resulting 'fake' name will be the same for all instances of that name in the dataframe?

from faker import Faker
import pyspark.sql.functions as F

fullname_list = [[1,"Sarah Markwaithe"]
,[2,"John Bellamy"]
,[3,"Jordan Fingleberry"]
,[4,"Susan Merchant"]
,[5,"Bobby Franker"]
,[6,"Sally Smith-Holdern"]
,[7,"Finley Farringdon"]
,[8,"Sarah Markwaithe"]
,[9,"Simone Grath"]
,[10,"Frederick Balchum"]
]
df_schema = ["Id","FullName"]
# create example df
df = spark.createDataFrame(fullname_list, df_schema)

fake = Faker('en_GB')
fake_name = F.udf(fake.name)

df = df.withColumn("FakeFullName", fake_name())

df.display()

I understand that I can use seed_instance, but have no clue as to how to implement this in the code above so that I can pass "FullName" to the udf (apologies, Python newbie and tight delivery deadlines)

fake.seed_instance("Susan H")
fake.name()

Rebecca On BEST ANSWER

Think I have worked out what to do. No idea whether it is the right approach (best practice, etc). Feel free to comment and let me know any other (and more efficient/Pythonic) methods:

from faker import Faker
import pyspark.sql.functions as F
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

fullname_list = [[1,"Sarah Markwaithe"]
,[2,"John Bellamy"]
,[3,"Jordan Fingleberry"]
,[4,"Susan Merchant"]
,[5,"Bobby Franker"]
,[6,"Sally Smith-Holdern"]
,[7,"Finley Farringdon"]
,[8,"Sarah Markwaithe"]
,[9,"Simone Grath"]
,[10,"Frederick Balchum"]
]
df_schema = ["Id","FullName"]
# create example df
df = spark.createDataFrame(fullname_list, df_schema)

fake = Faker('en_GB')

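# create a function that seeds Faker with the real name, so the same
# input name always produces the same fake name
# (parameter renamed from `str` so it no longer shadows the built-in type)
def generate_fake_name(full_name):
    fake.seed_instance(full_name)
    return fake.name()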

# Convert to UDF function
fake_name = udf(generate_fake_name, StringType())

# use UDF over dataframe
df = df.withColumn("FakeFullName", fake_name(col("FullName")))
df.show()
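
To see why seeding with the name makes the result repeatable, here is a minimal sketch that uses only the standard library instead of Faker. It assumes that seed_instance behaves like seeding a private random.Random, which is how I understand Faker to work. The name pools and the pseudonym function are made up for illustration and are not part of Faker.

```python
import random

# Hypothetical name pools, for illustration only (not Faker's data).
FIRST_NAMES = ["John", "Sarah", "Amelia", "Oliver", "Priya", "Marcus"]
LAST_NAMES = ["Fletcher", "Edwards", "Okafor", "Hughes", "Patel", "Lindqvist"]


def pseudonym(full_name: str) -> str:
    """Return a fake name that depends only on the input name."""
    # A fresh generator seeded with the real name: string seeds are
    # hashed deterministically, so the same name gives the same draws.
    rng = random.Random(full_name)
    return f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"


names = ["Susan H", "Eva B", "Susan H"]
fakes = [pseudonym(n) for n in names]

# Both "Susan H" rows map to the same fake name.
print(fakes[0] == fakes[2])
```

Two things to keep in mind with this kind of scheme: different real names can land on the same fake name, and anyone who knows how the seed is built can regenerate the mapping.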


UPDATE: also including this in case it helps someone else trying to achieve the same thing. I only wanted to generate a fake name if the column actually contained a name. To test this, I changed the row [3,"Jordan Fingleberry"] in the dataframe above to [3,""].

# when and lit are needed in addition to the imports above
from pyspark.sql.functions import when, lit

# use the UDF to overwrite the existing column, but only where it holds a value;
# empty strings become null instead of being replaced with a fake name
df = df.withColumn("FullName", when(col("FullName") == "", lit(None)).otherwise(fake_name(col("FullName"))))
df.show()
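
One caveat: the comparison with "" does not catch null values, so a null FullName would still reach the UDF. As far as I can tell, seeding with None falls back to a time-based seed, so those rows would get a different fake name on every run. A possible alternative is to put the check inside the Python function. The sketch below is only a suggestion. fake_if_present is a hypothetical helper name, and the generator is passed in as an argument so the snippet runs without Spark or Faker.

```python
from typing import Callable, Optional


def fake_if_present(
    full_name: Optional[str],
    make_fake: Callable[[str], str],
) -> Optional[str]:
    """Return a fake name only when there is a real name to replace."""
    # Treat None, "" and whitespace-only values as "no name".
    if full_name is None or not full_name.strip():
        return None
    return make_fake(full_name)


# Stand-in generator so the example is self-contained; in the answer's
# code this would be generate_fake_name instead.
def demo_generator(name: str) -> str:
    return name.upper()


print(fake_if_present("Eva B", demo_generator))
print(fake_if_present("", demo_generator))
print(fake_if_present(None, demo_generator))
```

With the code above this could be wrapped as udf(lambda n: fake_if_present(n, generate_fake_name), StringType()), which would make the when/otherwise expression unnecessary.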
