How to de-identify/redact Word files using GCP's DLP API in Python

I am using GCP's DLP API in Python to redact images as follows, and it works fine:

def redact_image_all_text(
    project,
    filename,
    output_filename,
):
    """Uses the Data Loss Prevention API to redact all text in an image.
    Args:
        project: The Google Cloud project id to use as a parent resource.
        filename: The path to the file to inspect.
        output_filename: The path to which the redacted image will be written.
    Returns:
        None; the response from the API is printed to the terminal.
    """
    # Import the client library (the dlp_v2 namespace is the one used below)
    import google.cloud.dlp_v2

    # Instantiate a client.
    dlp = google.cloud.dlp_v2.DlpServiceClient()

    # Construct the image_redaction_configs, indicating to DLP that all text in
    # the input image should be redacted.
    image_redaction_configs = [{"redact_all_text": True}]

    # Construct the byte_item, containing the file's byte data.
    with open(filename, mode="rb") as f:
        byte_item = {"type_": google.cloud.dlp_v2.FileType.IMAGE, "data": f.read()}

    # Convert the project id into a full resource id.
    parent = f"projects/{project}"

    # Call the API.
    response = dlp.redact_image(
        request={
            "parent": parent,
            "image_redaction_configs": image_redaction_configs,
            "byte_item": byte_item,
        }
    )

    # Write out the results.
    with open(output_filename, mode="wb") as f:
        f.write(response.redacted_image)

    print(
        "Wrote {byte_count} to {filename}".format(
            byte_count=len(response.redacted_image), filename=output_filename
        )
    )

Now I want to apply this to Word document files. I have seen a few examples using dlp.deidentify_content, but it seems to accept only text input.

    # Call the API
    response = dlp.deidentify_content(
        request={
            "parent": parent,
            "deidentify_config": deidentify_config,
            "item": contentItem,
        }
    )

So, I want to know whether Cloud DLP natively supports redaction/de-identification of Word documents. If so, how do I do it? If not, is there an elegant way to implement DLP redaction on Word documents?

Answer by Jordanna Chord:

Others are right: although inspect_content does support inspecting .docx files (not .doc), deidentify_content does not.
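
For the inspection side, a minimal sketch might look like the following. It assumes the google-cloud-dlp client library is installed and that the byte item is sent with the type ByteContentItem.BytesType.WORD_DOCUMENT. The function names is_docx and inspect_docx and the info_types parameter are made up for illustration and are not part of the API.

```python
from pathlib import Path


def is_docx(path):
    """Return True if the file name ends in .docx (legacy .doc is rejected)."""
    return Path(path).suffix.lower() == ".docx"


def inspect_docx(project, path, info_types):
    """Inspect a .docx file with Cloud DLP and return the list of findings.

    info_types is a list of infoType names such as "EMAIL_ADDRESS"
    (a hypothetical parameter chosen for this sketch).
    """
    if not is_docx(path):
        raise ValueError(f"Expected a .docx file, got: {path}")

    # Imported here so is_docx() can be used without the client library.
    from google.cloud import dlp_v2

    # Send the raw file bytes, flagged as a Word document.
    with open(path, mode="rb") as f:
        byte_item = {
            "type_": dlp_v2.ByteContentItem.BytesType.WORD_DOCUMENT,
            "data": f.read(),
        }

    dlp = dlp_v2.DlpServiceClient()
    response = dlp.inspect_content(
        request={
            "parent": f"projects/{project}",
            "inspect_config": {"info_types": [{"name": n} for n in info_types]},
            "item": {"byte_item": byte_item},
        }
    )
    return list(response.result.findings)
```

Calling it could look like inspect_docx("my-project", "report.docx", ["EMAIL_ADDRESS", "PHONE_NUMBER"]), where the project id and file name are placeholders. This only reports findings; it does not produce a redacted document.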

If you want to split the document into paragraphs, using the Record object and passing in each paragraph as a row lets you batch them into fewer requests, which reduces your traffic.
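
One way the paragraph-per-row idea could be sketched is shown below. This is an assumption-laden illustration, not an official recipe: it assumes python-docx is used to read and write the document, that the paragraphs are sent as a single-column table item, and that each finding is replaced with its infoType name. The names paragraphs_to_table_item and deidentify_docx, the "paragraph" header, and the info_types parameter are hypothetical.

```python
def paragraphs_to_table_item(paragraphs):
    """Pack paragraph strings into a DLP table item, one row per paragraph."""
    return {
        "table": {
            "headers": [{"name": "paragraph"}],
            "rows": [{"values": [{"string_value": text}]} for text in paragraphs],
        }
    }


def deidentify_docx(project, input_path, output_path, info_types):
    """De-identify the body paragraphs of a .docx file in one API request."""
    # Third-party imports kept local: pip install python-docx google-cloud-dlp
    import docx
    from google.cloud import dlp_v2

    document = docx.Document(input_path)
    # Only non-empty body paragraphs are sent; tables, headers and footers
    # are not covered by this sketch.
    targets = [p for p in document.paragraphs if p.text.strip()]

    if targets:
        dlp = dlp_v2.DlpServiceClient()
        response = dlp.deidentify_content(
            request={
                "parent": f"projects/{project}",
                "inspect_config": {
                    "info_types": [{"name": n} for n in info_types]
                },
                "deidentify_config": {
                    "info_type_transformations": {
                        "transformations": [
                            {
                                "primitive_transformation": {
                                    "replace_with_info_type_config": {}
                                }
                            }
                        ]
                    }
                },
                "item": paragraphs_to_table_item([p.text for p in targets]),
            }
        )
        # Rows come back in the order they were sent, so pair them up again.
        for paragraph, row in zip(targets, response.item.table.rows):
            # Assigning .text replaces the paragraph's runs, so character-level
            # formatting inside the paragraph is lost.
            paragraph.text = row.values[0].string_value

    document.save(output_path)
```

Sending one table instead of one request per paragraph is what cuts the traffic. Very large documents may still need to be split across several requests to stay within the API's request limits.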