Match 2 strings exactly except at places where there is a particular string in Python

I have a master file which contains certain text, let's say:

file contains x
the image is of x type
the user is admin
the address is x

and then there are 200 other text files containing text like:

file contains xyz
the image is of abc type
the user is admin
the address is pqrs

I need to match these files up. The result should be true if a file contains the text exactly as it is in the master file, except that x is different for each file, i.e. 'x' in the master can be anything in the other files and the result is still true. What I have come up with is:

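arr = master.split('\n')
for file in files:
    a = []  # master lines that this file fails to match
    file1 = file.split('\n')
    for i, line in enumerate(arr):
        line_list = line.split()
        # guard against files that have fewer lines than the master
        file1_list = file1[i].split() if i < len(file1) else []
        if 'x' in line_list:
            # drop the word at the placeholder position from both lines
            indx = line_list.index('x')
            line_list = line_list[:indx] + line_list[indx+1:]
            file1_list = file1_list[:indx] + file1_list[indx+1:]
        if line_list != file1_list:
            a.append(line)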

which is highly inefficient. Is there a way that I can sort of map the files with master file and generate the differences in some other file?

There are 3 best solutions below

Is that "universal" key unique on the line? For instance, if the key is, indeed, x, are you guaranteed that x appears nowhere else in the line? Or could the master file have something like this:

excluding x records and x axis values

If you do have a unique key ...

For each line of the master file, split the line on your key x. This gives you two pieces for the line, front and back. Then merely check whether the corresponding line of the other file starts with the front part (startswith) and ends with the back part (endswith). Something like

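for line, other in zip(arr, file1):  # file1: lines of the file being checked
    if x_key not in line:
        matched = (line == other)  # no placeholder: require an exact match
    else:
        front, back = line.split(x_key)  # assumes x_key occurs once per line
        # the length test stops front and back from overlapping in other
        matched = (len(other) >= len(front) + len(back)
                   and other.startswith(front)
                   and other.endswith(back))
    if matched:
        pass  # process matching line
    else:
        pass  # process non-matching line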

See the documentation for str.startswith and str.endswith.
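
A minimal, self-contained sketch of the front/back idea, assuming the key occurs at most once per master line and that lines are compared in order. The names line_matches and diff_against_master are made up for illustration; they do not come from the question.

```python
from itertools import zip_longest

def line_matches(master_line, other_line, key="x"):
    """True if other_line equals master_line with key replaced by some non-empty text."""
    if key not in master_line:
        return master_line == other_line
    front, _, back = master_line.partition(key)
    # text left over for the placeholder once front and back are accounted for
    filler_len = len(other_line) - len(front) - len(back)
    return (filler_len > 0
            and other_line.startswith(front)
            and other_line.endswith(back))

def diff_against_master(master_text, other_text, key="x"):
    """Return (line_number, master_line, other_line) for every mismatching line."""
    mismatches = []
    pairs = zip_longest(master_text.splitlines(), other_text.splitlines())
    for n, (m, o) in enumerate(pairs, start=1):
        # a missing line on either side (None) also counts as a mismatch
        if m is None or o is None or not line_matches(m, o, key):
            mismatches.append((n, m, o))
    return mismatches

master = "file contains x\nthe image is of x type\nthe user is admin\nthe address is x"
good = "file contains xyz\nthe image is of abc type\nthe user is admin\nthe address is pqrs"
bad = "file contains xyz\nthe image is of abc type\nthe user is root\nthe address is pqrs"

print(diff_against_master(master, good))  # []
print(diff_against_master(master, bad))   # [(3, 'the user is admin', 'the user is root')]
```

The returned list could be written to a separate report file per checked file, which would cover the "generate the differences in some other file" part of the question.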


UPDATE PER OP COMMENT

So long as x is unique within the line, you can easily adapt this. As you mention in your comment, you want something like

if len(line) == len(line_list1):
    if all(line[i] == line_list1[i] for i in range(len(line))):
        pass  # Found matching lines
    else:
        pass  # Advance to the next line
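
One possible reading of that check, assuming the comparison is meant word by word: split both lines into words, require the same number of words, and let the key word in the master line match any single word. words_match is a hypothetical helper name.

```python
def words_match(master_line, other_line, key="x"):
    """True if the lines agree word for word, with key acting as a one-word wildcard."""
    master_words = master_line.split()
    other_words = other_line.split()
    if len(master_words) != len(other_words):
        return False
    return all(m == key or m == o for m, o in zip(master_words, other_words))

print(words_match("the image is of x type", "the image is of abc type"))  # True
print(words_match("the user is admin", "the user is root"))               # False
print(words_match("file contains x", "file contains two words"))          # False
```

Unlike the startswith/endswith version, this variant only accepts a replacement that is a single word without spaces.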

Here's one approach that I think satisfies your requirements. It also allows you to specify whether only the same difference should be allowed on each line or not (which would consider your second file example as not matching):

UPDATE: this accounts for lines in the master and other files not necessarily being in the same order

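from itertools import zip_longest

def get_min_diff(master_lines, to_check):
    # find the master line closest to to_check; returns the number of
    # differing words, the differing words themselves, and the line index
    min_diff = None
    best_diff = []
    match_line = None
    for ln, ml in enumerate(master_lines):
        # collect the words of to_check that differ from this master line
        diff = [w for m, w in zip_longest(ml, to_check) if w != m]
        n_diffs = len(diff)
        if min_diff is None or n_diffs < min_diff:
            min_diff = n_diffs
            best_diff = diff  # keep the diff that belongs to the best line
            match_line = ln

    return min_diff, best_diff, match_line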

def check_files(master, files):
    # get lines to compare against
    master_lines = []
    with open(master) as mstr:
        for line in mstr:
            master_lines.append(line.strip().split())      
    matches = []
    for f in files:
        temp_master = list(master_lines)
        diff_sizes = set()
        diff_types = set()
        with open(f) as checkfile:
            for line in checkfile:
                to_check = line.strip().split()
                # find each place in current line where it differs from
                # the corresponding line in the master file
                min_diff, diff, match_index = get_min_diff(temp_master, to_check)
                if min_diff <= 1:  # acceptable number of differences
                    # remove corresponding line from master search space
                    # so we don't match the same master lines to multiple
                    # lines in a given test file
                    del temp_master[match_index]
                    # if it only differs in one place, keep track of what
                    # word was different for optional check later
                    if min_diff == 1:
                        diff_types.add(diff[0])
                diff_sizes.add(min_diff)
            # if you want any file where the max number of differences
            # per line was 1
            if max(diff_sizes) == 1:
                # consider a match if there is only one difference per line
                matches.append(f)
            # if you instead want each file to only
            # be different by the same word on each line
            #if len(diff_types) == 1:
                #matches.append(f)
    return matches
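
To see how the per-line difference count behaves, here is a small standalone sketch of the same zip_longest comparison applied to single lines. count_word_diffs is an illustrative name and is not part of the code above.

```python
from itertools import zip_longest

def count_word_diffs(master_line, other_line):
    """Count the word positions at which the two lines differ."""
    master_words = master_line.split()
    other_words = other_line.split()
    # zip_longest pads the shorter list with None, so extra or missing
    # words are counted as differences too
    return sum(1 for m, w in zip_longest(master_words, other_words) if m != w)

print(count_word_diffs("the image is of m type", "the image is of abc type"))  # 1
print(count_word_diffs("the user is admin", "the user is admin"))              # 0
print(count_word_diffs("the user is admin", "the user is bongo the clown"))    # 3
```

Because words are compared by position, a replacement that contains spaces counts as more than one difference, so such a line would not pass the "at most one difference" test.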

I've made a few test files to check, based on your supplied examples:

::::::::::::::
test1.txt
::::::::::::::
file contains y
the image is of y type
the user is admin
the address is y
::::::::::::::
test2.txt
::::::::::::::
file contains x
the image is of x type
the user is admin
the address is x
::::::::::::::
test3.txt
::::::::::::::
file contains xyz
the image is of abc type
the user is admin
the address is pqrs
::::::::::::::
testmaster.txt
::::::::::::::
file contains m
the image is of m type
the user is admin
the address is m
::::::::::::::
test_nomatch.txt
::::::::::::::
file contains y and some other stuff
the image is of y type unlike the other
the user is bongo the clown
the address is redacted
::::::::::::::
test_scrambled.txt
::::::::::::::
the image is of y type
file contains y
the address is y
the user is admin

When run, the code above returns the correct files:

In: check_files('testmaster.txt', ['test1.txt', 'test2.txt', 'test3.txt', 'test_nomatch.txt', 'test_scrambled.txt'])
Out: ['test1.txt', 'test2.txt', 'test3.txt', 'test_scrambled.txt']

I know this is not really a solution, but you can check whether the file is in the same format by doing something like:

if "the image is of" in var:
    pass  # to do

by checking the rest of the lines

"file contains"

"the user is"

"the address is"

you will be able to somewhat validate whether the file you are checking is valid

You can check this link to read more about this "substring idea"

Does Python have a string contains substring method?
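
A rough sketch of this substring check, assuming the fixed phrases are the four listed above and the file content has already been read into a string. EXPECTED_PHRASES and missing_phrases are names invented for the example.

```python
EXPECTED_PHRASES = ["file contains", "the image is of", "the user is", "the address is"]

def missing_phrases(text, phrases=EXPECTED_PHRASES):
    """Return the phrases that do not occur anywhere in text."""
    return [p for p in phrases if p not in text]

sample = "file contains xyz\nthe image is of abc type\nthe user is admin\nthe address is pqrs"
print(missing_phrases(sample))                 # []
print(missing_phrases("the user is admin\n"))  # ['file contains', 'the image is of', 'the address is']
```

This only tests that each phrase is present somewhere; it does not check line order or the fixed text after the variable part, such as the trailing "type".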