Getting the dataset

Getting the tiny-kinetics-400 dataset

Reference: Tramac/tiny-kinetics-400: Tiny Kinetics-400 for test

A copy is also mirrored on Tianyi Cloud: https://cloud.189.cn/web/share?code=iy6NZjERvuIv (access code: 6cbv)

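Server rental & dependency installation

I rented a P40 (24 GB) on funHPC at 0.46 yuan per hour and chose the pytorch1.9/python3.8/cuda11.1 environment.

The official tutorial (Installation - MMAction2 1.2.0 documentation) covers installing torch and related packages. The cloud server I used comes with them preinstalled, so I did not reinstall them.

First, switch pip to the USTC mirror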

pip config set global.index-url https://mirrors.ustc.edu.cn/pypi/simple

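Install MMEngine, MMCV, MMDetection and MMPose as described in the official tutorial.

pip install -U openmim
mim install mmengine
mim install mmcv==2.1.0  # without the version pin, later steps seemed to fail; different CUDA versions may have different requirements
mim install mmdet
mim install mmpose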
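After these installs it can help to confirm what actually ended up in the environment. The following is a minimal sketch using only the standard library (importlib.metadata, available from Python 3.8). The helper name installed_versions is made up for this example, and the package list simply repeats the names used in the commands above.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Distribution names as used in the install commands above.
    wanted = ["torch", "mmengine", "mmcv", "mmdet", "mmpose"]
    for name, version in installed_versions(wanted).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

A None entry means pip does not know the package, which is a quick way to spot a failed mim install before moving on.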

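Here we install mmaction2 directly as a Python package, which avoids a time-consuming build from source. One caveat: the source directory mmaction2/mmaction/models/localizers/drn does not appear to be included in the installed package. If the terminal later reports that drn cannot be found, copy that directory from the source tree into the installed package with cp -r.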

pip install mmaction2

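In most cases the installation is complete at this point.

Data import and preprocessing

After opening the code-server interface of the cloud server, clone the mmaction2 code first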

git clone https://github.com/open-mmlab/mmaction2.git
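With the source checked out, the missing drn directory mentioned during installation can be copied from this clone into the installed package. The sketch below is one hedged way to do what cp -r does, without hard-coding the site-packages path. The helper names copy_if_missing and installed_package_dir are invented for this example, and it assumes the script runs from the directory that contains the mmaction2 clone.

```python
import importlib.util
import shutil
from pathlib import Path


def copy_if_missing(src_dir, dst_dir):
    """Copy src_dir to dst_dir unless dst_dir already exists.

    Returns True if a copy was made, False otherwise.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    if dst.exists():
        return False
    if not src.is_dir():
        raise FileNotFoundError(f"source directory not found: {src}")
    shutil.copytree(src, dst)
    return True


def installed_package_dir(package):
    """Return the directory of an installed package, or None if not found."""
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0])


if __name__ == "__main__":
    sub = Path("models") / "localizers" / "drn"
    source = Path("mmaction2") / "mmaction" / sub  # assumed clone location
    package_dir = installed_package_dir("mmaction")
    if package_dir is None:
        print("mmaction is not installed in this environment")
    elif not source.is_dir():
        print(f"source checkout not found at {source}")
    else:
        copied = copy_if_missing(source, package_dir / sub)
        print("copied drn" if copied else "drn already present")
```

The copy is skipped when the destination already exists, so running the script twice is harmless.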

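tools/data describes how to obtain each dataset, but downloading from there is likely to be slow because of network conditions, so we import tiny-kinetics-400 directly.

Create the folder data/kinetics400 in the project root, put the dataset archive mentioned above into it, and unzip it

mkdir -p data/kinetics400
# TODO: upload the dataset archive
unzip tK-400.zip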

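Rename the extracted directories and move them into place (train_256 becomes videos_train, val_256 becomes videos_val). The final layout is shown below; the top-level data directory sits in the mmaction2 project root.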

data
└── kinetics400
    ├── videos_train
    │   ├── abseiling
    │   │   └── _4YTwq0-73Y_000044_000054.mp4
    │   ├── air_drumming
    │   │   └── _axE99QAhe8_000026_000036.mp4
    │   └── ...
    │
    └── videos_val
        ├── abseiling
        ├── air_drumming
        └── ...
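The rename-and-move step above can also be scripted. This is a sketch only: the folder names train_256 and val_256 come from the text, but where the archive unpacks them is not stated, so the extracted directory is left as a parameter. arrange_videos and count_videos are names made up for this example.

```python
import shutil
from pathlib import Path

# Mapping taken from the text: extracted folder name -> expected name.
RENAMES = {"train_256": "videos_train", "val_256": "videos_val"}


def arrange_videos(extracted_dir, target_dir):
    """Move train_256/val_256 from extracted_dir into target_dir, renamed.

    Returns the list of destination folders that were created.
    """
    extracted, target = Path(extracted_dir), Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    moved = []
    for old_name, new_name in RENAMES.items():
        src, dst = extracted / old_name, target / new_name
        if src.is_dir() and not dst.exists():
            shutil.move(str(src), str(dst))
            moved.append(dst)
    return moved


def count_videos(split_dir):
    """Return {class_name: number of .mp4 files} for one split folder."""
    return {
        class_dir.name: len(list(class_dir.glob("*.mp4")))
        for class_dir in sorted(Path(split_dir).iterdir())
        if class_dir.is_dir()
    }
```

Calling count_videos on data/kinetics400/videos_train afterwards gives a quick per-class count to compare against the tree above.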

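Extract frames with the script that ships with mmaction2 (denseflow is not installed here, so I used the OpenCV variant). The official Kinetics preparation guide assumes the working directory is tools/data/kinetics/, so run the script from there and then return to the project root

cd tools/data/kinetics
bash extract_rgb_frames_opencv.sh kinetics400
cd ../../..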

After frame extraction the directory structure looks roughly like this

data
└── kinetics400
    ├── rawframes_train
    ├── rawframes_val
    ├── videos_train
    └── videos_val
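Frame extraction can fail silently for individual videos, so a quick completeness check is useful before generating the lists. The sketch below assumes the layout used in this post: videos at videos_*/class/video_id.mp4 and frames at rawframes_*/class/video_id/. The function name missing_frame_dirs is hypothetical.

```python
from pathlib import Path


def missing_frame_dirs(videos_dir, rawframes_dir):
    """List 'class/video_id' entries whose frame folder is absent or empty."""
    videos, frames = Path(videos_dir), Path(rawframes_dir)
    missing = []
    for video in sorted(videos.glob("*/*.mp4")):
        frame_dir = frames / video.parent.name / video.stem
        if not frame_dir.is_dir() or not any(frame_dir.iterdir()):
            missing.append(f"{video.parent.name}/{video.stem}")
    return missing


if __name__ == "__main__":
    for split in ("train", "val"):
        root = Path("data/kinetics400")
        bad = missing_frame_dirs(root / f"videos_{split}",
                                 root / f"rawframes_{split}")
        print(f"{split}: {len(bad)} videos without frames")
```

An empty list for both splits means every video has at least one extracted frame.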

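Generate train.txt and val.txt. Each line of these files records a video's frame directory, its frame count, and its label index.

I created a gen_list.py file in the data/kinetics400 directory with the content below, adapted from this blog post: kinetics数据集路径txt生成_wbiqb-CSDN博客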

# gen_list.py
# Writes one line per video: "<frame_dir> <num_frames> <label_index>".
import argparse
import datetime
import os

parser = argparse.ArgumentParser()
parser.add_argument('path', type=str, help='data_path of videos, absolute path')
parser.add_argument('outfile', type=str, help='output.txt file')
args = parser.parse_args()

start_time = datetime.datetime.now()

# Class folders, sorted so that label indices are stable across runs.
labels = sorted(d for d in os.listdir(args.path) if d != '.DS_Store')
dic = {label: idx for idx, label in enumerate(labels)}

# Mode "w" (the first version used "a") so that rerunning the script
# does not append duplicate lines to an existing list.
with open(args.outfile, 'w') as f:
    for label in labels:
        print(dic[label])  # progress: index of the class being processed
        class_dir = os.path.join(args.path, label)
        for video in os.listdir(class_dir):
            if video == '.DS_Store':
                continue
            frame_dir = os.path.join(class_dir, video)
            frames = [n for n in os.listdir(frame_dir) if n != '.DS_Store']
            # As in the referenced script, record one less than the
            # number of files found in the frame directory.
            frames_len = len(frames) - 1
            f.write(f'{frame_dir} {frames_len} {dic[label]}\n')

print("Run time:", datetime.datetime.now() - start_time)

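Run the script from the project root. Pass the frame directories as absolute paths, as the script's help text asks: RawframeDataset joins data_prefix from the config with each listed path, and only an absolute path passes through that join unchanged

python data/kinetics400/gen_list.py "$(pwd)/data/kinetics400/rawframes_train" data/kinetics400/train.txt
python data/kinetics400/gen_list.py "$(pwd)/data/kinetics400/rawframes_val" data/kinetics400/val.txt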
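Before training it is worth checking that the generated lists have the expected shape. The sketch below parses the three-field format described above (frame directory, frame count, label) and reports a few summary numbers. parse_rawframe_list and summarize are names invented for this example; this is not the parser that mmaction2 itself uses.

```python
def parse_rawframe_list(text):
    """Parse lines of the form '<frame_dir> <num_frames> <label>'.

    Splits from the right so a frame_dir containing spaces still parses.
    """
    entries = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        parts = line.rsplit(" ", 2)
        if len(parts) != 3:
            raise ValueError(f"line {line_no}: expected 3 fields, got {line!r}")
        frame_dir, num_frames, label = parts
        entries.append((frame_dir, int(num_frames), int(label)))
    return entries


def summarize(entries):
    """Return (number of videos, number of distinct labels, min frame count)."""
    labels = {label for _, _, label in entries}
    return len(entries), len(labels), min(n for _, n, _ in entries)


if __name__ == "__main__":
    import os

    path = "data/kinetics400/train.txt"
    if os.path.exists(path):
        with open(path) as f:
            print(summarize(parse_rawframe_list(f.read())))
```

A minimum frame count of zero or below points at a video whose extraction failed and that should be removed from the list.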

The final directory structure is as follows

data
└── kinetics400
    ├── gen_list.py
    ├── rawframes_train
    ├── rawframes_val
    ├── train.txt
    ├── val.txt
    ├── videos_train
    └── videos_val

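Writing the config file for TimeSformer training

The config files are in configs/recognition/timesformer/. Because we train on extracted frames rather than decoding videos directly, we start from timesformer_spaceOnly_8xb8-8x32x1-15e_kinetics400-rgb.py and modify it. The custom config is named timesformer_tiny_kinetics_400.py for now; its content after modification is as follows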

_base_ = ['../../_base_/default_runtime.py']

# model settings
model = dict(
    type='Recognizer3D',
    backbone=dict(
        type='TimeSformer',
        pretrained=  # noqa: E251
        'https://download.openmmlab.com/mmaction/recognition/timesformer/vit_base_patch16_224.pth',  # noqa: E501
        num_frames=8,
        img_size=224,
        patch_size=16,
        embed_dims=768,
        in_channels=3,
        dropout_ratio=0.,
        transformer_layers=None,
        attention_type='divided_space_time',
        norm_cfg=dict(type='LN', eps=1e-6)),
    cls_head=dict(
        type='TimeSformerHead',
        num_classes=400,
        in_channels=768,
        average_clips='prob'),
    data_preprocessor=dict(
        type='ActionDataPreprocessor',
        mean=[127.5, 127.5, 127.5],
        std=[127.5, 127.5, 127.5],
        format_shape='NCTHW'))

# dataset settings
dataset_type = 'RawframeDataset'
data_root = 'data/kinetics400/rawframes_train'
data_root_val = 'data/kinetics400/rawframes_val'
ann_file_train = 'data/kinetics400/train.txt'
ann_file_val = 'data/kinetics400/val.txt'
ann_file_test = 'data/kinetics400/val.txt'

file_client_args = dict(io_backend='disk')

train_pipeline = [
    # dict(type='DecordInit', **file_client_args),
    dict(type='SampleFrames', clip_len=8, frame_interval=32, num_clips=1),
    dict(type='RawFrameDecode'),
    dict(type='RandomRescale', scale_range=(256, 320)),
    dict(type='RandomCrop', size=224),
    dict(type='Flip', flip_ratio=0.5),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='PackActionInputs')
]
val_pipeline = [
    # dict(type='DecordInit', **file_client_args),
    dict(
        type='SampleFrames',
        clip_len=8,
        frame_interval=32,
        num_clips=1,
        test_mode=True),
    dict(type='RawFrameDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(type='CenterCrop', crop_size=224),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='PackActionInputs')
]
test_pipeline = [
    # dict(type='DecordInit', **file_client_args),
    dict(
        type='SampleFrames',
        clip_len=8,
        frame_interval=32,
        num_clips=1,
        test_mode=True),
    dict(type='RawFrameDecode'),
    dict(type='Resize', scale=(-1, 224)),
    dict(type='ThreeCrop', crop_size=224),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='PackActionInputs')
]
train_dataloader = dict(
    batch_size=8,
    num_workers=8,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=True),
    dataset=dict(
        type=dataset_type,
        ann_file=ann_file_train,
        data_prefix=dict(img=data_root),
        pipeline=train_pipeline))
val_dataloader = dict(
    batch_size=8,
    num_workers=8,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=dict(
        type=dataset_type,
        ann_file=ann_file_val,
        data_prefix=dict(img=data_root_val),
        pipeline=val_pipeline,
        test_mode=True))
test_dataloader = dict(
    batch_size=1,
    num_workers=8,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=dict(
        type=dataset_type,
        ann_file=ann_file_test,
        data_prefix=dict(img=data_root_val),
        pipeline=test_pipeline,
        test_mode=True))

val_evaluator = dict(type='AccMetric')
test_evaluator = val_evaluator

train_cfg = dict(
    type='EpochBasedTrainLoop', max_epochs=15, val_begin=1, val_interval=1)
val_cfg = dict(type='ValLoop')
test_cfg = dict(type='TestLoop')

optim_wrapper = dict(
    optimizer=dict(
        type='SGD', lr=0.005, momentum=0.9, weight_decay=1e-4, nesterov=True),
    paramwise_cfg=dict(
        custom_keys={
            '.backbone.cls_token': dict(decay_mult=0.0),
            '.backbone.pos_embed': dict(decay_mult=0.0),
            '.backbone.time_embed': dict(decay_mult=0.0)
        }),
    clip_grad=dict(max_norm=40, norm_type=2))

param_scheduler = [
    dict(
        type='MultiStepLR',
        begin=0,
        end=15,
        by_epoch=True,
        milestones=[5, 10],
        gamma=0.1)
]

default_hooks = dict(checkpoint=dict(interval=5))

# Default setting for scaling LR automatically
#   - `enable` means enable scaling LR automatically
#       or not by default.
#   - `base_batch_size` = (8 GPUs) x (8 samples per GPU).
auto_scale_lr = dict(enable=False, base_batch_size=64)
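Two numbers in this config are easy to misread: the learning rate that results from the MultiStepLR entry, and what auto_scale_lr would do if it were enabled. The sketch below works both out in plain Python under stated assumptions: a step schedule that multiplies the rate by gamma once each milestone epoch is reached, and the linear scaling rule (rate proportional to total batch size). Both function names are made up for this example and neither calls into mmengine.

```python
def multistep_lr(epoch, base_lr=0.005, milestones=(5, 10), gamma=0.1):
    """Learning rate in effect during `epoch` (0-based) for a step schedule."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


def linearly_scaled_lr(base_lr, base_batch_size, num_gpus, samples_per_gpu):
    """Linear scaling rule: LR proportional to the total batch size."""
    return base_lr * (num_gpus * samples_per_gpu) / base_batch_size


if __name__ == "__main__":
    for epoch in (0, 5, 10):
        print(f"epoch {epoch}: lr = {multistep_lr(epoch):.6f}")
    # One GPU with batch_size=8 against base_batch_size=64.
    print(f"scaled lr = {linearly_scaled_lr(0.005, 64, 1, 8):.6f}")
```

Because the config sets enable=False, training here keeps lr=0.005 even though the single-GPU batch of 8 is one eighth of base_batch_size; under the linear rule the scaled value would be 0.000625.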

Training the model

Run the following command from the project root to start the training script

python tools/train.py configs/recognition/timesformer/timesformer_tiny_kinetics_400.py

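During training, .pth checkpoint files are written under the work_dirs directory in the project root

Testing the model

Run the following command from the project root to start the test script

Note: work_dirs/timesformer_tiny_kinetics_400/epoch_15.pth is the checkpoint from the final epoch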
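With checkpoint=dict(interval=5) and 15 epochs, several checkpoint files accumulate in the work directory. The sketch below picks the one with the highest epoch number, assuming the epoch_N.pth naming seen in this post. The function name latest_epoch_checkpoint is hypothetical.

```python
import re
from pathlib import Path


def latest_epoch_checkpoint(work_dir):
    """Return the epoch_N.pth path with the largest N, or None if there is none."""
    best_epoch, best_path = -1, None
    for path in Path(work_dir).glob("epoch_*.pth"):
        match = re.fullmatch(r"epoch_(\d+)\.pth", path.name)
        if match and int(match.group(1)) > best_epoch:
            best_epoch, best_path = int(match.group(1)), path
    return best_path


if __name__ == "__main__":
    print(latest_epoch_checkpoint("work_dirs/timesformer_tiny_kinetics_400"))
```

The comparison is numeric on purpose: a plain string sort would rank epoch_5.pth after epoch_15.pth.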

python tools/test.py configs/recognition/timesformer/timesformer_tiny_kinetics_400.py \
    work_dirs/timesformer_tiny_kinetics_400/epoch_15.pth --dump result.pkl

The test results are in the log under work_dirs, in this case work_dirs/timesformer_tiny_kinetics_400/20241104_105423/20241104_105423.log (this was a first attempt, so the accuracy is modest)

2024/11/04 10:54:43 - mmengine - INFO - Epoch(test) [ 20/400]    eta: 0:01:38  time: 0.2584  data_time: 0.0413  memory: 689  
2024/11/04 10:54:47 - mmengine - INFO - Epoch(test) [ 40/400]    eta: 0:01:25  time: 0.2138  data_time: 0.0026  memory: 689  
2024/11/04 10:54:51 - mmengine - INFO - Epoch(test) [ 60/400]    eta: 0:01:17  time: 0.2144  data_time: 0.0027  memory: 689  
2024/11/04 10:54:55 - mmengine - INFO - Epoch(test) [ 80/400]    eta: 0:01:12  time: 0.2138  data_time: 0.0027  memory: 689  
2024/11/04 10:55:00 - mmengine - INFO - Epoch(test) [100/400]    eta: 0:01:06  time: 0.2139  data_time: 0.0028  memory: 689  
2024/11/04 10:55:04 - mmengine - INFO - Epoch(test) [120/400]    eta: 0:01:01  time: 0.2136  data_time: 0.0025  memory: 689  
2024/11/04 10:55:08 - mmengine - INFO - Epoch(test) [140/400]    eta: 0:00:57  time: 0.2139  data_time: 0.0025  memory: 689  
2024/11/04 10:55:12 - mmengine - INFO - Epoch(test) [160/400]    eta: 0:00:52  time: 0.2140  data_time: 0.0026  memory: 689  
2024/11/04 10:55:17 - mmengine - INFO - Epoch(test) [180/400]    eta: 0:00:48  time: 0.2142  data_time: 0.0027  memory: 689  
2024/11/04 10:55:21 - mmengine - INFO - Epoch(test) [200/400]    eta: 0:00:43  time: 0.2150  data_time: 0.0028  memory: 689  
2024/11/04 10:55:25 - mmengine - INFO - Epoch(test) [220/400]    eta: 0:00:39  time: 0.2146  data_time: 0.0028  memory: 689  
2024/11/04 10:55:30 - mmengine - INFO - Epoch(test) [240/400]    eta: 0:00:34  time: 0.2136  data_time: 0.0025  memory: 689  
2024/11/04 10:55:34 - mmengine - INFO - Epoch(test) [260/400]    eta: 0:00:30  time: 0.2138  data_time: 0.0026  memory: 689  
2024/11/04 10:55:38 - mmengine - INFO - Epoch(test) [280/400]    eta: 0:00:26  time: 0.2141  data_time: 0.0027  memory: 689  
2024/11/04 10:55:43 - mmengine - INFO - Epoch(test) [300/400]    eta: 0:00:21  time: 0.2146  data_time: 0.0029  memory: 689  
2024/11/04 10:55:47 - mmengine - INFO - Epoch(test) [320/400]    eta: 0:00:17  time: 0.2138  data_time: 0.0026  memory: 689  
2024/11/04 10:55:51 - mmengine - INFO - Epoch(test) [340/400]    eta: 0:00:13  time: 0.2142  data_time: 0.0027  memory: 689  
2024/11/04 10:55:55 - mmengine - INFO - Epoch(test) [360/400]    eta: 0:00:08  time: 0.2137  data_time: 0.0024  memory: 689  
2024/11/04 10:56:00 - mmengine - INFO - Epoch(test) [380/400]    eta: 0:00:04  time: 0.2139  data_time: 0.0028  memory: 689  
2024/11/04 10:56:04 - mmengine - INFO - Epoch(test) [400/400]    eta: 0:00:00  time: 0.2136  data_time: 0.0025  memory: 689  
2024/11/04 10:56:04 - mmengine - INFO - Results has been saved to result.pkl.
2024/11/04 10:56:04 - mmengine - INFO - Epoch(test) [400/400]    acc/top1: 0.2000  acc/top5: 0.4200  acc/mean1: 0.2000  data_time: 0.0046  time: 0.2162
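To compare runs without reading the log by eye, the accuracy values on the final line can be pulled out with a regular expression. This is a small sketch that assumes the "acc/name: value" layout shown in the log above; parse_metrics is a name invented for this example.

```python
import re


def parse_metrics(log_line):
    """Extract 'acc/...' metrics from a log line as a {name: float} dict."""
    return {
        name: float(value)
        for name, value in re.findall(r"(acc/\w+):\s*([0-9.]+)", log_line)
    }


if __name__ == "__main__":
    line = ("2024/11/04 10:56:04 - mmengine - INFO - Epoch(test) [400/400]"
            "    acc/top1: 0.2000  acc/top5: 0.4200  acc/mean1: 0.2000"
            "  data_time: 0.0046  time: 0.2162")
    print(parse_metrics(line))
```

For scale, uniform random guessing over 400 classes would give a top-1 accuracy of 1/400 = 0.0025, so the 0.2000 reported here is well above chance even though it is far from a fully trained model.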