Pulling file names from a list of full paths?


I am trying to pull out file names from a specifically formatted document, and put them into a list. The document contains a large amount of information, but the lines I am concerned about look like the following with "File Name: " always at the start of the line:

File Name: C:\windows\system32\cmd.exe

I tried the following:

xmlfile = open('my_file.xml', 'r')
filetext = xmlfile.read()
file_list = []
file_list.append(re.findall(r'\bFile Name:\s+.*\\.*(?=\n)', filetext))

This makes file_list look like:

[['File Name: c:\\windows\\system32\\file1.exe',
  'File Name: c:\\windows\\system32\\file2.exe',
  'File Name: c:\\windows\\system32\\file3.exe']]

I'm looking for my output to simply be:

(file1.exe, file2.exe, file3.exe)

I also tried using ntpath.basename on my above output, but it looks like it wants a string as input and not a list.

I'm very new to Python and scripting in general, so any suggestions would be appreciated.

There are 4 best solutions below

2
On

You're on the right track. basename wasn't working because re.findall() already returns a list, and you appended that list to another list, so basename was handed a list instead of a string. Here's a fix that iterates over the list findall() returns and builds a new list containing just the base file names:

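import re
import ntpath  # Windows path rules on any OS; os.path.basename only splits on '\' when run on Windows

with open('my_file.xml', 'r') as xmlfile:  # the old 'rU' mode was removed in Python 3.11
    file_text = xmlfile.read()

# findall() returns a flat list of matched lines; keep only the base name of each
file_list = [ntpath.basename(fn)
             for fn in re.findall(r'\bFile Name:\s+.*\\.*(?=\n)', file_text)]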
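
To make the nesting concrete, here is a small sketch that uses an in-memory string in place of my_file.xml. The sample text and the names nested and flat are made up for illustration. It compares appending the findall() result with using it directly, and uses ntpath so the backslash paths split the same way on any operating system.

```python
import ntpath
import re

# Hypothetical sample standing in for the contents of my_file.xml
sample = (
    "Some other line\n"
    "File Name: c:\\windows\\system32\\file1.exe\n"
    "File Name: c:\\windows\\system32\\file2.exe\n"
)
pattern = r'\bFile Name:\s+.*\\.*(?=\n)'

nested = []
nested.append(re.findall(pattern, sample))  # a list wrapped in another list
flat = re.findall(pattern, sample)          # already a flat list of strings

print(len(nested), len(flat))                # 1 2
print([ntpath.basename(fn) for fn in flat])  # ['file1.exe', 'file2.exe']
```

The outer list in nested has a single element (the inner list), which is why passing its items to basename fails.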
0
On

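You can do it in a more declarative style with generator expressions. They process the file lazily, one line at a time, so the whole file never has to sit in memory, and there is less bookkeeping to get wrong.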

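import ntpath  # Windows path rules regardless of the OS running the script
import re

pat = re.compile(r'\bFile Name:\s+.*\\.*(?=\n)')
with open('my_file.xml') as f:
    ms = (pat.match(line) for line in f)
    # match() returns None for other lines; skip those and take the matched text
    ns = (ntpath.basename(m.group()) for m in ms if m)
    # the generator ns yields names such as 'foo.txt'; consume it while the file is still open
    for n in ns:
        print(n)  # do something with each name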

If you change the regex slightly, by putting a capturing group around the part after the last backslash, you don't even need a basename call.
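
As a rough sketch of that idea, the version below adds a capturing group to the pattern and yields only the captured name. The helper name iter_names, the exact pattern, and the sample lines are assumptions for illustration; io.StringIO stands in for an open file.

```python
import io
import re

# Capturing group around the text after the last backslash on the line
pat = re.compile(r'File Name:\s+.*\\([^\\\n]*)')

def iter_names(lines):
    """Lazily yield the captured file name for each matching line."""
    for line in lines:
        m = pat.match(line)
        if m:  # match() returns None for lines that do not fit the pattern
            yield m.group(1)

# Hypothetical sample; io.StringIO behaves like a file opened for reading
sample = io.StringIO(
    "Header line\n"
    "File Name: C:\\windows\\system32\\cmd.exe\n"
    "File Name: C:\\temp\\foo.txt\n"
)
names = list(iter_names(sample))
print(names)  # ['cmd.exe', 'foo.txt']
```

Because iter_names is a generator function, nothing is read until the caller iterates, so it should be consumed before the underlying file is closed.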

0
On

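You can get the expected output with the following regular expression:

file_list = re.findall(r'\bFile Name:\s+.*\\([^\\\n]*)(?=\n)', filetext)

([^\\\n]*) captures everything after the final path separator up to the end of the line. The class excludes the backslash and also the newline; without the newline, the group could run on into the following lines whenever they contain no backslash. When the pattern has exactly one capturing group, findall returns just that group for each match. Since findall already returns a list, there's no need to append the return value to an existing list.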
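
A short sketch of how findall behaves with a capturing group, run on made-up sample text. It also compares a character class that excludes only the backslash with one that excludes the newline as well; the variable names loose and names are illustrative.

```python
import re

# Hypothetical sample standing in for filetext
filetext = (
    "File Name: c:\\windows\\system32\\file1.exe\n"
    "Unrelated line\n"
    "File Name: c:\\windows\\system32\\file2.exe\n"
)

# [^\\] also matches '\n', so the group can spill onto the next line
loose = re.findall(r'\bFile Name:\s+.*\\([^\\]*)(?=\n)', filetext)

# Excluding '\n' as well keeps the capture on one line
names = re.findall(r'\bFile Name:\s+.*\\([^\\\n]*)(?=\n)', filetext)

print(loose)  # ['file1.exe\nUnrelated line', 'file2.exe']
print(names)  # ['file1.exe', 'file2.exe']
```

If a tuple is wanted, as in the question's desired output, tuple(names) converts the list.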

2
On

I would restructure this a little to make it easier to read and to separate the steps. It can clearly be done in one step, but I think your code is going to be tough to maintain later.

import re

with open('my_file.xml', 'r') as xmlfile:
    filetext = xmlfile.read()   # the with block closes the file for you - your version left it open

file_list = []
my_pattern = re.compile(r'\bFile Name:\s+.*\\.*(?=\n)')
for filename in my_pattern.findall(filetext):
    # split on a literal backslash; os.sep is '/' outside Windows and would not split these paths
    cleaned_name = filename.split('\\')[-1]
    file_list.append(cleaned_name)
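
One thing to watch when taking the last path component: the result depends on which separator the code treats as special. The sketch below, on a made-up line, calls ntpath and posixpath directly so the behaviour is the same wherever it runs; os.path resolves to one of these two modules depending on the platform.

```python
import ntpath
import posixpath

line = 'File Name: c:\\windows\\system32\\file1.exe'

# ntpath applies Windows rules everywhere, so the backslash is a separator
print(ntpath.basename(line))     # file1.exe

# posixpath is what os.path is on Linux and macOS: '\' is an ordinary character
print(posixpath.basename(line))  # File Name: c:\windows\system32\file1.exe

# Splitting on a literal backslash is also platform-independent
print(line.split('\\')[-1])      # file1.exe
```

So if the script may run somewhere other than Windows, ntpath.basename or a literal backslash split is the safer choice for these paths.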