Recursively walking a directory with Python and renaming its folders and files by specific rules

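Risk warning

Git users: before running this, make sure your working tree is clean (everything pushed) or at least recoverable with revert/reset (everything committed); otherwise I take no responsibility for problems caused by the changes.
For important files, non-Git users should first back up the directory to a safe location.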

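Update

I recently found that Microsoft has a small tool built for exactly this: 👉 PowerToys PowerRename utility for Windows 10 | Microsoft Docs. I tried it and it works really well! If all you need is to rename files within a single directory, it probably does a better job than my script (and has fewer bugs). 😳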

Background

While building a personal knowledge base with the vuepress-theme-vdoing theme, I needed to rename my document files and document directories.

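Naming conventions

Whether it is a file or a folder, prefix its name with the correct sequence number followed by a dot (.), counting up from 00 or 01, for example 01.folder and 02.file.md. The sequence numbers determine the order of the items in the sidebar.

Even if a directory level contains only one file or folder, that item still needs a sequence number.

No extra dot (.) may appear in the middle of a file or folder name: in 01.my.name.md, for example, the dot in the middle causes a parsing error.

For details, see: Core configuration and conventions for building a structured site | vuepress-theme-vdoing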
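The three rules above can be checked mechanically. Below is a minimal sketch, assuming a hypothetical helper named follows_convention (it is not part of the theme or of my script) and assuming that only the .md extension needs to be allowed for files:

```python
import re

# Hypothetical pattern, written only for this illustration:
# digits, a dot, a name without any further dot, and an optional .md extension.
_CONVENTION = re.compile(r'\d+\.[^.]+(\.md)?')


def follows_convention(name):
    """Return True if name looks like '01.name' or '01.name.md'."""
    return _CONVENTION.fullmatch(name) is not None


for candidate in ['01.folder', '02.file.md', '01.my.name.md', 'file.md']:
    print(candidate, follows_convention(candidate))
```

The first two names pass; the third fails because of the extra dot, and the fourth fails because it has no sequence number.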

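At first I wanted to rename everything by hand, but after a few changes the workload felt too large. I then looked online for an existing tool, but nothing I found really met my needs, so I ended up building one myself.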

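Basic approach

Use os.walk to traverse the given directory
Rename the files
Files carry the md extension, so any dot (.) outside the extension is replaced with a hyphen (-); a leading "digits + dot" prefix, however, is kept as is. For example:
- 01.this.is-test-file-name.md should become 01.this-is-test-file-name.md
- 02-this.is-another-test-file-name.md should become 02.this-is-another-test-file-name.md
- this-is-normal-test-file-name.md should become 03.this-is-normal-test-file-name.md
- this-is007.abnormal-file-name.md should become 04.this-is007-abnormal-file-name.md
Rename the directories
1. Rename the files first, then rename the directories from the innermost level outward
2. Directories that belong to the vuepress theme, directories we choose to exclude, and everything beneath them must be skipped and left unchanged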
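The two directory rules pull in different directions: pruning excluded directories in place only works when os.walk runs top-down, while renaming has to go from the inside out. One way to reconcile them, shown here only as a sketch, is to walk top-down with pruning, record every entry, and then reverse the list. The helper name collect_deepest_first and the demo tree are assumptions made for this illustration, not part of the original script.

```python
import os
import pathlib
import tempfile


def collect_deepest_first(root_path, exclude_dirs):
    """Return (parent, name, is_file) tuples with children listed before parents."""
    entries = []
    for parent, dirs, files in os.walk(root_path, topdown=True):
        # In-place pruning only takes effect in top-down mode.
        dirs[:] = sorted(d for d in dirs if d not in exclude_dirs)
        entries.extend((parent, name, False) for name in dirs)
        entries.extend((parent, name, True) for name in sorted(files))
    # A directory is recorded before anything inside it, so reversing the
    # list puts every entry ahead of the directories that contain it.
    entries.reverse()
    return entries


# Demo on a throwaway tree; the layout is made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    base = pathlib.Path(tmp)
    (base / 'd0' / 'd0_d1').mkdir(parents=True)
    (base / '.vuepress').mkdir()
    (base / 'd0' / 'd0_d1' / 'a.md').touch()
    (base / '.vuepress' / 'config.js').touch()
    names = [name for _, name, _ in collect_deepest_first(tmp, ['.vuepress'])]
print(names)  # ['a.md', 'd0_d1', 'd0']
```

Renaming in this order keeps the stored parent paths valid, because a directory is only renamed after everything inside it has been handled.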

Show me the code

Traversing the directory

import os

# The directory you want to process
ROOT_PATH = 'some/path/you/will/deal'
for root, dirs, files in os.walk(ROOT_PATH):
    pass

While traversing, exclude the directories we do not want to modify, together with the files beneath them

import os

current_path = os.path.dirname(os.path.abspath(__file__))
ROOT_PATH = os.path.join(current_path, 'docs')  # directory to process
EXCLUDE_DIR = ['.vuepress', '@pages', '_posts', 'styles']       # directories to exclude
for root, dirs, files in os.walk(ROOT_PATH, topdown=True):
    dirs[:] = [d for d in dirs if d not in EXCLUDE_DIR]
    print(dirs)

Bottom-up

...
for root, dirs, files in os.walk(ROOT_PATH, topdown=False):
    pass

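Because both the directories and the files beneath them need to be excluded, we put the exclusion logic into a function

"""Exclude the items that match the given filter."""

def _not_in(seq, exclude):
    """
    Use a list comprehension with "not in"
    :param seq: items to filter
    :param exclude: items to drop
    :return: list of the items of seq that are not in exclude
    """
    return [item for item in seq if item not in exclude]

def _filter_sth(seq, exclude):
    """
    Use filter
    :param seq: items to filter
    :param exclude: items to drop
    :return: list of the items of seq that are not in exclude
    """
    return list(filter(lambda x: x not in exclude, seq))

def _subtract_set(seq, exclude):
    """
    Use a set difference (order and duplicates are not preserved)
    :param seq: items to filter
    :param exclude: items to drop
    :return: list of the items of seq that are not in exclude
    """
    return list(set(seq) - set(exclude))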

Of the approaches above, pick the one that performs best:

import timeit

A = list(range(8888))
B = list(range(2000, 6666))
nt = timeit.Timer(lambda: _not_in(A, B))
ft = timeit.Timer(lambda: _filter_sth(A, B))
st = timeit.Timer(lambda: _subtract_set(A, B))
x = nt.timeit(5)
y = st.timeit(5)
z = ft.timeit(5)
print(f'not_in:{x}, subtract_set:{y}, filter_sth:{z}')

# not_in:5.2498173, subtract_set:0.008623699999999346, filter_sth:4.9613408

See python - List comprehension vs. lambda + filter - Stack Overflow.
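The set difference is fast mainly because membership tests on a set are hash lookups, whereas "not in" against a list scans the list every time. Its price is that the result loses the original order and any duplicates. As a possible middle ground, here is a sketch that builds the set once and still filters with a comprehension; the name not_in_set is hypothetical and I have not benchmarked it against the three functions above.

```python
def not_in_set(seq, exclude):
    """Keep the items of seq that are not in exclude, preserving order."""
    exclude_set = set(exclude)  # built once; lookups are O(1) on average
    return [item for item in seq if item not in exclude_set]


dirs = ['guide', '.vuepress', 'notes', '_posts', 'guide']
print(not_in_set(dirs, ['.vuepress', '_posts']))  # ['guide', 'notes', 'guide']
```

For the handful of directory names that os.walk returns per level, any of these approaches is fast enough; the difference only shows up on large inputs such as the benchmark lists.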

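Processing rules

For files

Only process Markdown files

import pathlib
def is_md_file(file_path):
    """
    Check whether the given file is a Markdown (.md) file
    :param file_path: str, path or name of the file
    :return: bool
    """
    return pathlib.PurePath(file_path).suffix[1:].lower() == 'md'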

If the name already starts with a number, handle it by the following rules

import re

REG_EXP = r'\d+'  # one or more leading digits


def reg_startswith(check_str, reg):
    """
    10.dsgfdh.md  >>> re.match.obj
    dsgfdh  >>> None
    :param check_str: str, the string to check
    :param reg: str, regular expression
    :return: match object or None
    """
    return re.match(f'^{reg}', check_str)

if __name__ == '__main__':
    test_list = ['10.dsgfdh.md', 'dsgfdh', '00xxx', '88,yyy']
    for test in test_list:
        print(reg_startswith(test, REG_EXP))

If the rest of the name starts with one of ['.', '-', '_'], drop that separator and then replace every . in what remains

def make_rename(sub_line):
    """
    Drop one leading separator, then replace every dot with a hyphen.
    _xx.yyy:xx-yyy
    xx-yyy:xx-yyy
    xx.yyy:xx-yyy
    -xx.yyy:xx-yyy
    .xx-yyy:xx-yyy
    你好:你好
    💻:💻
    :param sub_line: str, the name without its number prefix and extension
    :return: str, the cleaned name
    """
    if sub_line and sub_line[0] in ['.', '-', '_']:
        sub_line = sub_line[1:]
    return sub_line.replace('.', '-')

Otherwise, prepend a number followed by a dot (.)

def handler_action(_root, path_item, is_file=True):
    # Nested function: count and count_set are defined in the enclosing function
    nonlocal count, count_set
    add_suffix = ''
    if is_file:
        add_suffix = '.md'

    reg_exp = r'\d+'
    reg_match_obj = reg_startswith(path_item, reg_exp)
    if reg_match_obj:
        # The name already starts with a number
        digital = reg_match_obj.group()
        count = int(digital)
        count_set.add(count)
        if is_file:
            deal_line = pathlib.PurePath(path_item).stem
        else:
            deal_line = pathlib.PurePath(path_item).parts[-1]

        # Strip only the leading number; digits later in the name are kept
        sub_line = deal_line[len(digital):]

        if sub_line.startswith('.'):
            sub_line = sub_line[1:]
        sub_name = make_rename(sub_line)
        new_name_with_suffix = f'{digital}.{sub_name}{add_suffix}'

    else:
        if is_file:
            path_str = pathlib.PurePath(path_item).stem
        else:
            path_str = pathlib.PurePath(path_item).parts[-1]

        new_name = make_rename(path_str)
        # Take the largest number seen so far and use it + 1 as the new number
        if count_set:
            count = max(count_set)
        count += 1
        count_set.add(count)

        new_name_with_suffix = f'{count:02}.{new_name}{add_suffix}'

    old = os.path.join(_root, path_item)
    new = os.path.join(_root, new_name_with_suffix)
    return old, new
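Since handler_action is a nested function, it cannot run on its own. The numbering part can be isolated in a small closure, sketched below. The names make_numberer and number_for are hypothetical, and the sketch only mirrors the counting logic, not the renaming.

```python
import re


def make_numberer():
    """Return a function that yields the number prefix for each name in turn."""
    count = 0
    count_set = set()

    def number_for(name):
        nonlocal count  # count is rebound below, so nonlocal is required
        match = re.match(r'\d+', name)
        if match:
            # The name already carries a number: remember it and keep it as is.
            count = int(match.group())
            count_set.add(count)  # mutation only, no nonlocal needed
            return match.group()
        # Otherwise continue after the largest number seen so far.
        if count_set:
            count = max(count_set)
        count += 1
        count_set.add(count)
        return f'{count:02}'

    return number_for


number_for = make_numberer()
prefixes = [number_for(n) for n in ['01.a.md', '02-b.md', 'c.md', 'd.md']]
print(prefixes)  # ['01', '02', '03', '04']
```

The order of the names matters with this scheme: an unnumbered name that is processed before the numbered ones may receive a number that a later name already carries.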

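For directories
Apply rule 2 of the file handling above
Files and directories are renamed by different rules
- A file needs its .md extension appended to the final name; a directory is renamed directly
- For a file, pathlib.PurePath.stem is enough; for a directory, take the last element of pathlib.PurePath.parts
  if is_file:
      deal_line = pathlib.PurePath(path_item).stem
  else:
      deal_line = pathlib.PurePath(path_item).parts[-1]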
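The reason stem cannot be used for directories is that pathlib treats everything after the last dot as a suffix, even when the path is a folder. A short illustration, with made-up paths:

```python
import pathlib

file_path = pathlib.PurePath('docs/02.file.md')
dir_path = pathlib.PurePath('docs/01.folder')

file_stem = file_path.stem          # '02.file': only the .md extension is removed
dir_stem = dir_path.stem            # '01': '.folder' is mistaken for a suffix
dir_last_part = dir_path.parts[-1]  # '01.folder': the full directory name
dir_name = dir_path.name            # '01.folder': same result as parts[-1] here

print(file_stem, dir_stem, dir_last_part, dir_name)
```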

Processing method
Renaming a file or directory path

import pathlib


def rename_path(old, new):
    # Rename the file or directory at old to new
    pathlib.Path(old).rename(new)
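Path.rename behaves differently when the target already exists: on Unix an existing file is silently replaced, while on Windows FileExistsError is raised. A more defensive variant could check first and support a dry run. This is only a sketch; rename_path_safely and its dry_run parameter are assumptions of mine, not part of the original script.

```python
import pathlib
import tempfile


def rename_path_safely(old, new, dry_run=False):
    """Rename old to new unless the target exists; return True if renamed.

    With dry_run=True nothing is touched and the return value tells
    whether the rename would have been performed.
    """
    source = pathlib.Path(old)
    target = pathlib.Path(new)
    if source == target or target.exists():
        return False
    if not dry_run:
        source.rename(target)
    return True


# Demo in a temporary directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    base = pathlib.Path(tmp)
    (base / 'a.md').touch()
    (base / '02.b.md').touch()
    renamed = rename_path_safely(base / 'a.md', base / '01.a.md')
    blocked = rename_path_safely(base / '01.a.md', base / '02.b.md')
    remaining = sorted(p.name for p in base.iterdir())
print(renamed, blocked, remaining)  # True False ['01.a.md', '02.b.md']
```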

Source code download: vdoing_rename

Questions and puzzles

What does [:] do?
python - What is the difference between slice assignment that slices the whole list and direct assignment? - Stack Overflow
What is the difference between list and list[:] in python? - Stack Overflow
How do I exclude specific directories in os.walk?
See python - Excluding directories in os.walk - Stack Overflow
The nonlocal keyword
It lets a nested (closure) function rebind a variable that belongs to the enclosing function
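The first question can be answered with plain lists. In the sketch below, walk_list stands in for the list object that os.walk keeps a reference to; the variable names are made up for the illustration.

```python
walk_list = ['d0', 'd1', '.vuepress']  # stands in for the list os.walk holds

dirs = walk_list
dirs = [d for d in dirs if d != '.vuepress']     # plain assignment: dirs now names a new list
after_rebinding = list(walk_list)

dirs = walk_list
dirs[:] = [d for d in dirs if d != '.vuepress']  # slice assignment: the shared list changes
after_slice_assignment = list(walk_list)

print(after_rebinding)         # ['d0', 'd1', '.vuepress']
print(after_slice_assignment)  # ['d0', 'd1']
```

Only the slice assignment changes the object that os.walk itself looks at, which is why the exclusion code uses dirs[:] rather than dirs.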

Understanding the topdown parameter of os.walk

mkdir root
cd root
mkdir \
  d0 \
  d1 \
  d0/d0_d1
touch \
  f0 \
  d0/d0_f0 \
  d0/d0_f1 \
  d0/d0_d1/d0_d1_f0 \
  d1/d1_f0

View the directory structure:

tree /f

└─root
    │  f0
    │
    ├─d0
    │  │  d0_f0
    │  │  d0_f1
    │  │
    │  └─d0_d1
    │          d0_d1_f0
    │
    └─d1
            d1_f0

Test both values of topdown

import os

current_path = os.path.dirname(os.path.abspath(__file__))
ROOT_PATH = os.path.join(current_path, 'root')
top_down_args = [True, False]
for top_down in top_down_args:
    print(f'Top_down is {top_down} ……')
    for root, dirs, files in os.walk(ROOT_PATH, topdown=top_down):
        for dir_item in dirs:
            print(f'dir is:{dir_item}')
        for f_item in files:
            print(f'file is {f_item}')

Output:

Top_down is True ……
dir is:d0
dir is:d1
file is f0
dir is:d0_d1
file is d0_f0
file is d0_f1
file is d0_d1_f0
file is d1_f0

Top_down is False ……

file is d0_d1_f0
dir is:d0_d1
file is d0_f0
file is d0_f1
file is d1_f0
dir is:d0
dir is:d1
file is f0

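We can see that:

When topdown is True, the results are produced from the outside (the root directory) inward:
the root is visited first, yielding the directories d0 and d1 and the file f0; then d0, yielding the directory d0_d1 and the files d0_f0 and d0_f1; then d0_d1, yielding d0_d1_f0; and finally d1, yielding d1_f0.
When topdown is False, the results are produced from the inside outward (ending at the root directory):
the innermost directory d0_d1 is visited first, yielding d0_d1_f0; then d0, yielding the directory d0_d1 and the files d0_f0 and d0_f1; then d1, yielding d1_f0; and finally the root, yielding d0, d1, and f0.

When topdown is True, the caller can modify the dirnames list in-place (perhaps using del or slice assignment), and walk() will only recurse into the subdirectories whose names remain in dirnames; this can be used to prune the search, impose a specific order of visiting, or even to inform walk() about directories the caller creates or renames before it resumes walk() again. Modifying dirnames when topdown is False has no effect on the behavior of the walk, because in bottom-up mode the directories in dirnames are generated before dirpath itself is generated.

os.walk, docs.python.org/zh-cn/3/library/os.html#os.walk

Let's modify the code: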

for top_down in top_down_args:
    print(f'Top_down is {top_down} ……')

    for root, dirs, files in os.walk(ROOT_PATH, topdown=top_down):
        if dirs:
            print(dirs, '=======ddd=========')
            dirs[:] = [dirs[0]]     # 注意此行
            print(f'==after=slice====ddd====={dirs}====')
            for dir_item in dirs:
                print(f'dir is:{dir_item}')

        for f_item in files:
            print(f'file is {f_item}')

Output:

# when topdown is True
Top_down is True ……
['d0', 'd1'] =======ddd=========
==after=slice====ddd=====['d0']====
dir is:d0
file is f0
['d0_d1'] =======ddd=========
==after=slice====ddd=====['d0_d1']====
dir is:d0_d1
file is d0_f0
file is d0_f1
file is d0_d1_f0

# when topdown is False
Top_down is False ……
file is d0_d1_f0
['d0_d1'] =======ddd=========
==after=slice====ddd=====['d0_d1']====
dir is:d0_d1
file is d0_f0
file is d0_f1
file is d1_f0
['d0', 'd1'] =======ddd=========
==after=slice====ddd=====['d0']====
dir is:d0
file is f0

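Comparing the results, we find that when topdown is True, the d1 directory can be filtered out by modifying dirs in place, for example with slice assignment (see: regex - Python os.walk topdown true with regular expression - Stack Overflow). When topdown is False, the same code still visits d1. This is why topdown=True can be used to prune the search or to impose a specific visiting order.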