
Commit dcbd456

Merge pull request #2 from RaivoKoot/frame_ranges
Changed NUM_FRAMES in annotations.txt to START and END frame for usin…
2 parents caecde6 + 3e6c74d commit dcbd456

4 files changed: +70 −42 lines changed

README.md

Lines changed: 31 additions & 17 deletions
@@ -28,7 +28,7 @@ For a demo, visit `demo.py`.
 ### QuickDemo (demo.py)
 ```python
 root = os.path.join(os.getcwd(), 'demo_dataset')  # Folder in which all videos lie in a specific structure
-annotation_file = os.path.join(root, 'annotations.txt')  # A row for each video sample as: (VIDEO_PATH NUM_FRAMES CLASS_INDEX)
+annotation_file = os.path.join(root, 'annotations.txt')  # A row for each video sample as: (VIDEO_PATH START_FRAME END_FRAME CLASS_INDEX)
 
 """ DEMO 1 WITHOUT IMAGE TRANSFORMS """
 dataset = VideoFrameDataset(
@@ -73,12 +73,13 @@ python >= 3.6
 ### 2. Custom Dataset
 To use any dataset, two conditions must be met.
 1) The video data must be supplied as RGB frames, each frame saved as an image file. Each video must have its own folder, in which the frames of
-that video lie. The frames of a video inside its folder must be named uniformly as `img_00001.jpg` ... `img_00120.jpg`, if there are 120 frames. The filename template
-for frames is then "img_{:05d}.jpg" (python string formatting, specifying 5 digits after the underscore), and must be supplied to the
-constructor of VideoFrameDataset as a parameter. Each video folder lies inside a `root` folder of this dataset.
+that video lie. The frames of a video inside its folder must be named uniformly with consecutive indices, such as `img_00001.jpg` ... `img_00120.jpg` if there are 120 frames.
+Indices can start at zero or any other number, and the exact filename template can be chosen freely. The filename template
+for frames in this example is "img_{:05d}.jpg" (python string formatting, specifying 5 digits after the underscore), and must be supplied to the
+constructor of VideoFrameDataset as a parameter. Each video folder must lie inside some `root` folder.
 2) To enumerate all video samples in the dataset and their required metadata, a `.txt` annotation file must be manually created that contains a row for each
-video sample in the dataset. The training, validation, and testing datasets must have separate annotation files. Each row must be a space-separated list that contains
-`VIDEO_PATH NUM_FRAMES CLASS_INDEX`. The `VIDEO_PATH` of a video sample should be provided without the `root` prefix of this dataset.
+video sample or video clip (e.g., multiple clips per video for action recognition) in the dataset. The training, validation, and testing datasets must have separate annotation files. Each row must be a space-separated list that contains
+`VIDEO_PATH START_FRAME END_FRAME CLASS_INDEX`. The `VIDEO_PATH` of a video sample should be provided without the `root` prefix of this dataset.
 
 This example project demonstrates this using a dummy dataset inside of `demo_dataset/`, which is the `root` dataset folder of this example. The folder
 structure looks as follows:
@@ -108,19 +109,30 @@ demo_dataset
 ```
-The accompanying annotation `.txt` file contains the following rows
+The accompanying annotation `.txt` file contains the following rows (PATH, START_FRAME, END_FRAME, LABEL_ID)
 ```
-jumping/0001 17 0
-jumping/0002 18 0
-running/0001 15 1
-running/0002 15 1
+jumping/0001 1 17 0
+jumping/0002 1 18 0
+running/0001 1 15 1
+running/0002 1 15 1
 ```
+Another annotation file that uses multiple clips from each video could be
+```
+jumping/0001 1 8 0
+jumping/0001 5 17 0
+jumping/0002 1 18 0
+running/0001 10 15 1
+running/0001 5 10 1
+running/0002 1 15 1
+```
+(END_FRAME is inclusive)
+
 Instantiating a VideoFrameDataset with the `root_path` parameter pointing to `demo_dataset`, the `annotationsfile_path` parameter pointing to the annotation file, and
 the `imagefile_template` parameter as "img_{:05d}.jpg", is all that it takes to start using the VideoFrameDataset class.
 
 ### 3. Video Frame Sampling Method
 When loading a video, only a number of its frames are loaded. They are chosen in the following way:
-1. The frame indices [1,N] are divided into NUM_SEGMENTS even segments. From each segment, a random start-index is sampled from which FRAMES_PER_SEGMENT consecutive indices are loaded.
+1. The frame index range [START_FRAME, END_FRAME] is divided into NUM_SEGMENTS even segments. From each segment, a random start index is sampled, from which FRAMES_PER_SEGMENT consecutive indices are loaded.
 This results in NUM_SEGMENTS*FRAMES_PER_SEGMENT chosen indices, whose frames are loaded as PIL images, put into a list, and returned when calling
 `dataset[i]`.
 ![alt text](https://github.com/RaivoKoot/images/blob/main/Sparse_Temporal_Sampling.jpg "Sparse-Temporal-Sampling-Strategy")
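To make the new four-column row format concrete, here is a minimal parsing sketch. It is not part of this commit, and the helper name `parse_annotations` is invented for illustration; it only assumes rows of `VIDEO_PATH START_FRAME END_FRAME CLASS_INDEX` with an inclusive end frame.

```python
import os

def parse_annotations(annotation_path, root):
    """Parse rows of 'VIDEO_PATH START_FRAME END_FRAME CLASS_INDEX' (END_FRAME inclusive)."""
    records = []
    with open(annotation_path) as f:
        for line in f:
            path, start, end, label = line.split()
            start, end, label = int(start), int(end), int(label)
            records.append({
                'frames_dir': os.path.join(root, path),  # VIDEO_PATH is given without the root prefix
                'start_frame': start,
                'end_frame': end,                        # inclusive
                'num_frames': end - start + 1,           # +1 because END_FRAME is inclusive
                'label': label,
            })
    return records

# e.g. parse_annotations('demo_dataset/annotations.txt', 'demo_dataset')
```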
@@ -129,27 +141,29 @@ This results in NUM_SEGMENTS*FRAMES_PER_SEGMENT chosen indices, whose frames are
 If you do not want to use sparse temporal sampling and instead want to sample a single N-frame continuous
 clip from a video, this is possible. Set `NUM_SEGMENTS=1` and `FRAMES_PER_SEGMENT=N`. Because VideoFrameDataset
 will choose a random start index per segment and take `FRAMES_PER_SEGMENT` continuous frames from each sampled start
-index, this will result in a single N-frame continuous clip per video. An example of this is in `demo.py`.
+index, this will result in a single N-frame continuous clip per video that starts at a random index.
+An example of this is in `demo.py`.
 
 ### 5. Using VideoFrameDataset for training
 As demonstrated in `demo.py`, we can use PyTorch's `torch.utils.data.DataLoader` class with VideoFrameDataset to take care of shuffling, batching, and more.
 To turn the lists of PIL images returned by VideoFrameDataset into tensors, the transform `video_dataset.ImglistToTensor()` can be supplied
 as the `transform` parameter to VideoFrameDataset. This turns a list of N PIL images into a batch of images/frames of shape `N x CHANNELS x HEIGHT x WIDTH`.
-We can further chain preprocessing and augmentation functions that act on batches of images onto the end of `ImglistToTensor()`.
+We can further chain preprocessing and augmentation functions that act on batches of images onto the end of `ImglistToTensor()`, as seen in `demo.py`.
 
 As of `torchvision 0.8.0`, all torchvision transforms can now also operate on batches of images, and they apply deterministic or random transformations
-on the batch identically on all images of the batch. Therefore, any torchvision transform can be used here to apply video-uniform preprocessing and augmentation.
+on the batch identically on all images of the batch. Because a single video tensor (FRAMES x CHANNELS x HEIGHT x WIDTH)
+has the same shape as an image batch tensor (BATCH x CHANNELS x HEIGHT x WIDTH), any torchvision transform can be used here to apply video-uniform preprocessing and augmentation.
 
 REMEMBER:
-Pytorch transforms are applied to individual dataset samples (in this case a video frame PIL list, or a frame tensor after `ImglistToTensor()`) before
+Pytorch transforms are applied to individual dataset samples (in this case a list of PIL images of a video, or a video-frame tensor after `ImglistToTensor()`) before
 batching. So, any transforms used here must expect their input to be a frame tensor of shape `FRAMES x CHANNELS x HEIGHT x WIDTH` or a list of PIL images if `ImglistToTensor()` is not used.
 ### 6. Conclusion
 A proper code-based explanation on how to use VideoFrameDataset for training is provided in `demo.py`.
 
 ### 7. Upcoming Features
 - [x] Add demo for sampling a single continuous-frame clip from videos.
 - [ ] Add support for arbitrary labels that are more than just a single integer.
-- [ ] Add support for specifying START_FRAME and END_FRAME for a video instead of NUM_FRAMES.
+- [x] Add support for specifying START_FRAME and END_FRAME for a video instead of NUM_FRAMES.
 
 ### 8. Acknowledgements
 We thank the authors of TSN for their [codebase](https://github.com/yjxiong/tsn-pytorch), from which we took VideoFrameDataset and adapted it.
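To tie section 5 together, here is a rough end-to-end sketch of the training-input pipeline the README describes. It is illustrative only: the constructor keyword names follow the README text above (`root_path`, `annotationsfile_path`, `imagefile_template`), the resize/crop sizes are arbitrary, and `demo.py` remains the authoritative example.

```python
import os
import torch
from torchvision import transforms
from video_dataset import VideoFrameDataset, ImglistToTensor

root = os.path.join(os.getcwd(), 'demo_dataset')
annotation_file = os.path.join(root, 'annotations.txt')

# ImglistToTensor converts the list of PIL frames into a FRAMES x CHANNELS x HEIGHT x WIDTH
# tensor; because that matches the BATCH x CHANNELS x HEIGHT x WIDTH shape that batch-capable
# torchvision transforms expect, the chained transforms apply identically to every frame.
preprocess = transforms.Compose([
    ImglistToTensor(),           # list of PIL images -> FRAMES x C x H x W tensor
    transforms.Resize(128),      # video-uniform resize (torchvision >= 0.8.0)
    transforms.CenterCrop(112),  # video-uniform crop
])

dataset = VideoFrameDataset(
    root_path=root,
    annotationsfile_path=annotation_file,  # keyword name as written in the README above
    num_segments=1,                        # NUM_SEGMENTS=1 with FRAMES_PER_SEGMENT=9
    frames_per_segment=9,                  # gives one continuous 9-frame clip per sample
    imagefile_template='img_{:05d}.jpg',
    transform=preprocess,
)

# DataLoader handles shuffling and batching; batches come out as
# BATCH x FRAMES x CHANNELS x HEIGHT x WIDTH plus a label tensor.
dataloader = torch.utils.data.DataLoader(dataset, batch_size=2, shuffle=True,
                                         num_workers=4, pin_memory=True)
```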

demo.py

Lines changed: 1 addition & 1 deletion
@@ -136,7 +136,7 @@ def denormalize(video_tensor):
     dataset=dataset,
     batch_size=2,
     shuffle=True,
-    num_workers=8,
+    num_workers=4,
     pin_memory=True
 )
 
demo_dataset/annotations.txt

Lines changed: 4 additions & 4 deletions
@@ -1,4 +1,4 @@
-jumping/0001 17 0
-jumping/0002 18 0
-running/0001 15 1
-running/0002 15 1
+jumping/0001 1 17 0
+jumping/0002 1 18 0
+running/0001 1 15 1
+running/0002 1 15 1
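Since END_FRAME is now inclusive, a small sanity check can confirm that every row's frame range matches the files on disk. This is my own sketch, not part of the commit, and it assumes the demo's `img_{:05d}.jpg` filename template.

```python
import os

def check_annotations(root, annotation_path, template='img_{:05d}.jpg'):
    """Verify that each frame in the inclusive [START_FRAME, END_FRAME] range exists."""
    with open(annotation_path) as f:
        for line in f:
            path, start, end, label = line.split()
            num_frames = int(end) - int(start) + 1  # +1: END_FRAME is inclusive
            for i in range(int(start), int(end) + 1):
                frame = os.path.join(root, path, template.format(i))
                assert os.path.exists(frame), f'missing frame: {frame}'
            print(f'{path}: {num_frames} frames OK (label {label})')

check_annotations('demo_dataset', os.path.join('demo_dataset', 'annotations.txt'))
```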

video_dataset.py

Lines changed: 34 additions & 20 deletions
@@ -4,7 +4,6 @@
 from PIL import Image
 from torchvision import transforms
 import torch
-from collections.abc import Callable
 
 class VideoRecord(object):
     """
@@ -14,10 +13,11 @@ class VideoRecord(object):
     Args:
         root_datapath: the system path to the root folder
                        of the videos.
-        row: A list with three elements where 1) The first
+        row: A list with four elements where 1) The first
             element is the path to the video sample's frames excluding
-            the root_datapath prefix 2) The second element is the number
-            of frames in the video 3) The third element is the label index.
+            the root_datapath prefix 2) The second element is the starting frame id of the video
+            3) The third element is the inclusive ending frame id of the video
+            4) The fourth element is the label index.
     """
     def __init__(self, row, root_datapath):
         self._data = row
@@ -30,11 +30,17 @@ def path(self):
 
     @property
     def num_frames(self):
+        return self.end_frame() - self.start_frame() + 1  # +1 because end frame is inclusive
+
+    def start_frame(self):
         return int(self._data[1])
 
+    def end_frame(self):
+        return int(self._data[2])
+
     @property
     def label(self):
-        return int(self._data[2])
+        return int(self._data[3])
 
 class VideoFrameDataset(torch.utils.data.Dataset):
     r"""
@@ -46,8 +52,8 @@ class VideoFrameDataset(torch.utils.data.Dataset):
         tensors where FRAMES=x if the ``ImglistToTensor()``
         transform is used.
 
-        More specifically, the frame range [0,N] is divided into NUM_SEGMENTS
-        segments and FRAMES_PER_SEGMENT frames are taken from each segment.
+        More specifically, the frame range [START_FRAME, END_FRAME] is divided into NUM_SEGMENTS
+        segments and FRAMES_PER_SEGMENT consecutive frames are taken from each segment.
 
         Note:
             A demonstration of using this class can be seen
@@ -65,11 +71,11 @@ class VideoFrameDataset(torch.utils.data.Dataset):
         inside a ``ROOT_DATA`` folder, each video lies in its own folder,
         where each video folder contains the frames of the video as
         individual files with a naming convention such as
-        img_001.jpg ... img_059.jpg. Numbering must start at 1.
+        img_001.jpg ... img_059.jpg.
         For enumeration and annotations, this class expects to receive
-        the path to a .txt file where each video sample has a row with three
+        the path to a .txt file where each video sample has a row with four
         space separated values:
-        ``VIDEO_FOLDER_PATH NUM_FRAMES LABEL_INDEX``.
+        ``VIDEO_FOLDER_PATH START_FRAME END_FRAME LABEL_INDEX``.
         ``VIDEO_FOLDER_PATH`` is expected to be the path of a video folder
         excluding the ``ROOT_DATA`` prefix. For example, ``ROOT_DATA`` might
         be ``home\data\datasetxyz\videos\``, inside of which a ``VIDEO_FOLDER_PATH``
@@ -138,16 +144,16 @@ def _sample_indices(self, record):
         segment are to be loaded from.
         """
 
-        average_duration = (record.num_frames - self.frames_per_segment + 1) // self.num_segments
-        if average_duration > 0:
-            offsets = np.multiply(list(range(self.num_segments)), average_duration) + np.random.randint(average_duration, size=self.num_segments)
+        segment_duration = (record.num_frames - self.frames_per_segment + 1) // self.num_segments
+        if segment_duration > 0:
+            offsets = np.multiply(list(range(self.num_segments)), segment_duration) + np.random.randint(segment_duration, size=self.num_segments)
 
-        # edge cases for when a video only has a tiny number of frames.
-        elif record.num_frames > self.num_segments:
-            offsets = np.sort(np.random.randint(record.num_frames - self.frames_per_segment + 1, size=self.num_segments))
+        # edge cases for when a video has approximately fewer than (num_segments*frames_per_segment) frames.
+        # random sampling in that case, which will lead to repeated frames.
         else:
-            offsets = np.zeros((self.num_segments,))
-        return offsets + 1
+            offsets = np.sort(np.random.randint(record.num_frames, size=self.num_segments))
+
+        return offsets
 
     def _get_val_indices(self, record):
         """
@@ -163,7 +169,8 @@ def _get_val_indices(self, record):
 
         # edge case for when a video does not have enough frames
         else:
-            offsets = np.zeros((self.num_segments,)) + 1
+            offsets = np.sort(np.random.randint(record.num_frames, size=self.num_segments))
+
         return offsets
 
     def _get_test_indices(self, record):
@@ -180,7 +187,7 @@ def _get_test_indices(self, record):
 
         offsets = np.array([int(tick / 2.0 + tick * x) for x in range(self.num_segments)])
 
-        return offsets + 1
+        return offsets
 
     def __getitem__(self, index):
         """
@@ -218,15 +225,22 @@ def _get(self, record, indices):
             2) An integer denoting the video label.
         """
 
+        indices = indices + record.start_frame()
         images = list()
+        image_indices = list()
         for seg_ind in indices:
             frame_index = int(seg_ind)
             for i in range(self.frames_per_segment):
                 seg_img = self._load_image(record.path, frame_index)
                 images.extend(seg_img)
+                image_indices.append(frame_index)
                 if frame_index < record.num_frames:
                     frame_index += 1
 
+        # sort images by index in case of edge cases where segments overlap each other because the overall
+        # video is too short for num_segments*frames_per_segment indices.
+        _, images = (list(sorted_list) for sorted_list in zip(*sorted(zip(image_indices, images))))
+
         if self.transform is not None:
             images = self.transform(images)
 
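The new zip/sort/unzip line in `_get()` is dense; this tiny standalone demo (my own sketch, with stand-in values) shows what it does:

```python
# Frames can be collected out of order when segments overlap in very short videos.
image_indices = [5, 1, 3]
images = ['frame_5', 'frame_1', 'frame_3']  # stand-ins for loaded PIL images

# Pair each image with its frame index, sort the pairs by index, then unzip.
_, images = (list(sorted_list) for sorted_list in zip(*sorted(zip(image_indices, images))))
print(images)  # ['frame_1', 'frame_3', 'frame_5']
```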