For a demo, visit `demo.py`.

### QuickDemo (demo.py)
```python
import os
from video_dataset import VideoFrameDataset

root = os.path.join(os.getcwd(), 'demo_dataset')  # Folder in which all videos lie in a specific structure
annotation_file = os.path.join(root, 'annotations.txt')  # A row for each video sample as: (VIDEO_PATH START_FRAME END_FRAME CLASS_INDEX)

""" DEMO 1 WITHOUT IMAGE TRANSFORMS """
dataset = VideoFrameDataset(
    root_path=root,
    annotationsfile_path=annotation_file,
    imagefile_template='img_{:05d}.jpg',
    transform=None,
)  # keyword arguments as described in sections 2-5 below; see demo.py for the full demo
```

### 1. Requirements

`python >= 3.6`

### 2. Custom Dataset

To use any dataset, two conditions must be met.

1) The video data must be supplied as RGB frames, each frame saved as an image file. Each video must have its own folder, in which the frames of
that video lie. The frames of a video inside its folder must be named uniformly with consecutive indices, such as `img_00001.jpg` ... `img_00120.jpg` if there are 120 frames.
Indices can start at zero or any other number, and the exact filename template can be chosen freely. The filename template
in this example is "img_{:05d}.jpg" (Python string formatting, specifying 5 digits after the underscore), and it must be supplied to the
constructor of VideoFrameDataset as a parameter. Each video folder must lie inside some `root` folder. (A short snippet after this list illustrates the template.)

2) To enumerate all video samples in the dataset and their required metadata, a `.txt` annotation file must be manually created that contains a row for each
video sample or video clip (for example, when multiple clips are taken from each video for action recognition) in the dataset. The training, validation, and testing
datasets must have separate annotation files. Each row must be a space-separated list that contains
`VIDEO_PATH START_FRAME END_FRAME CLASS_INDEX`. The `VIDEO_PATH` of a video sample should be provided without the `root` prefix of this dataset.
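
For illustration, Python's `str.format` maps a frame index to the filename produced by this example's template (a tiny sketch using only the standard library):

```python
template = "img_{:05d}.jpg"  # 5-digit zero-padded frame index
print(template.format(1))    # prints: img_00001.jpg
print(template.format(120))  # prints: img_00120.jpg
```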

This example project demonstrates this using a dummy dataset inside of `demo_dataset/`, which is the `root` dataset folder of this example. The folder
structure looks as follows:

```
demo_dataset
├── annotations.txt
├── jumping
│   ├── 0001
│   │   ├── img_00001.jpg
│   │   ├── ...
│   │   └── img_00017.jpg
│   └── 0002
│       ├── img_00001.jpg
│       ├── ...
│       └── img_00018.jpg
└── running
    ├── 0001
    │   ├── img_00001.jpg
    │   ├── ...
    │   └── img_00015.jpg
    └── 0002
        ├── img_00001.jpg
        ├── ...
        └── img_00015.jpg
```

The accompanying annotation `.txt` file contains the following rows, each of the form `VIDEO_PATH START_FRAME END_FRAME CLASS_INDEX`:
```
jumping/0001 1 17 0
jumping/0002 1 18 0
running/0001 1 15 1
running/0002 1 15 1
```

Another annotation file that uses multiple clips from each video could look as follows:
```
jumping/0001 1 8 0
jumping/0001 5 17 0
jumping/0002 1 18 0
running/0001 10 15 1
running/0001 5 10 1
running/0002 1 15 1
```

(`END_FRAME` is inclusive.)
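
An annotation file like the above could be generated with a small helper such as the following (a hypothetical sketch, not part of this library; the only requirement from this README is that each row is a space-separated `VIDEO_PATH START_FRAME END_FRAME CLASS_INDEX` list):

```python
import os

def write_annotation_file(root, samples, filename='annotations.txt'):
    """samples: list of (video_path, start_frame, end_frame, class_index) tuples."""
    with open(os.path.join(root, filename), 'w') as f:
        for video_path, start, end, label in samples:
            # VIDEO_PATH is given relative to root; END_FRAME is inclusive
            f.write(f'{video_path} {start} {end} {label}\n')

write_annotation_file('demo_dataset', [
    ('jumping/0001', 1, 17, 0),
    ('jumping/0002', 1, 18, 0),
    ('running/0001', 1, 15, 1),
    ('running/0002', 1, 15, 1),
])
```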

Instantiating a VideoFrameDataset with the `root_path` parameter pointing to `demo_dataset`, the `annotationsfile_path` parameter pointing to the annotation file, and
the `imagefile_template` parameter set to "img_{:05d}.jpg" is all it takes to start using the VideoFrameDataset class.

### 3. Video Frame Sampling Method

When loading a video, only a number of its frames are loaded. They are chosen in the following way:

1. The frame index range [START_FRAME, END_FRAME] is divided into NUM_SEGMENTS even segments. From each segment, a random start index is sampled, from which FRAMES_PER_SEGMENT consecutive indices are loaded.

This results in NUM_SEGMENTS*FRAMES_PER_SEGMENT chosen indices, whose frames are loaded as PIL images, put into a list, and returned when calling `dataset[i]`. A sketch of this index arithmetic is shown below.
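
The index arithmetic can be sketched as follows (an illustration of the sampling scheme described above, not the library's actual implementation; segment boundaries are floored here when the range does not divide evenly):

```python
import random

def sample_frame_indices(start_frame, end_frame, num_segments, frames_per_segment):
    """Sample FRAMES_PER_SEGMENT consecutive indices from each of NUM_SEGMENTS even segments."""
    segment_length = (end_frame - start_frame + 1) // num_segments  # END_FRAME is inclusive
    indices = []
    for segment in range(num_segments):
        segment_start = start_frame + segment * segment_length
        # latest start index that keeps the consecutive block inside this segment
        last_start = segment_start + segment_length - frames_per_segment
        begin = random.randint(segment_start, max(segment_start, last_start))
        indices.extend(range(begin, begin + frames_per_segment))
    return indices

print(sample_frame_indices(1, 17, 3, 2))  # e.g. [2, 3, 8, 9, 13, 14]
```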

### 4. Alternate Video Frame Sampling Methods

If you do not want to use sparse temporal sampling and instead want to sample a single N-frame continuous
clip from a video, this is possible. Set `NUM_SEGMENTS=1` and `FRAMES_PER_SEGMENT=N`. Because VideoFrameDataset
will choose a random start index per segment and take `FRAMES_PER_SEGMENT` consecutive frames from each sampled start
index, this will result in a single N-frame continuous clip per video that starts at a random index.
An example of this is in `demo.py`; a short constructor sketch is also shown below.
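
For example, sampling a single continuous 9-frame clip per video could look like this, assuming the constructor exposes NUM_SEGMENTS and FRAMES_PER_SEGMENT as the keyword arguments `num_segments` and `frames_per_segment`, and reusing `root` and `annotation_file` from the QuickDemo:

```python
dataset = VideoFrameDataset(
    root_path=root,
    annotationsfile_path=annotation_file,
    num_segments=1,        # one segment spanning the whole [START_FRAME, END_FRAME] range
    frames_per_segment=9,  # N = 9 consecutive frames from a random start index
    imagefile_template='img_{:05d}.jpg',
)
```
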
### 5. Using VideoFrameDataset for training

As demonstrated in `demo.py`, we can use PyTorch's `torch.utils.data.DataLoader` class with VideoFrameDataset to take care of shuffling, batching, and more.
To turn the lists of PIL images returned by VideoFrameDataset into tensors, the transform `video_dataset.ImglistToTensor()` can be supplied
as the `transform` parameter to VideoFrameDataset. This turns a list of N PIL images into a batch of images/frames of shape `N x CHANNELS x HEIGHT x WIDTH`.
We can further chain preprocessing and augmentation functions that act on batches of images onto the end of `ImglistToTensor()`, as seen in `demo.py`.

As of `torchvision 0.8.0`, all torchvision transforms can also operate on batches of images, applying deterministic or random transformations
identically to all images of a batch. Because a single video tensor (`FRAMES x CHANNELS x HEIGHT x WIDTH`)
has the same shape as an image batch tensor (`BATCH x CHANNELS x HEIGHT x WIDTH`), any torchvision transform can be used here to apply video-uniform preprocessing and augmentation.

REMEMBER:
PyTorch transforms are applied to individual dataset samples (in this case, a list of PIL images of a video, or a video-frame tensor after `ImglistToTensor()`) before
batching. So any transforms used here must expect their input to be a frame tensor of shape `FRAMES x CHANNELS x HEIGHT x WIDTH`, or a list of PIL images if `ImglistToTensor()` is not used. A minimal end-to-end sketch is shown below.
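
Putting the pieces together, a minimal end-to-end sketch might look as follows (the transform choices, crop size, and batch size are illustrative assumptions, not prescribed by this README):

```python
import os
from torch.utils.data import DataLoader
from torchvision import transforms
from video_dataset import VideoFrameDataset, ImglistToTensor

root = os.path.join(os.getcwd(), 'demo_dataset')
annotation_file = os.path.join(root, 'annotations.txt')

preprocess = transforms.Compose([
    ImglistToTensor(),       # list of PIL images -> tensor of shape FRAMES x CHANNELS x HEIGHT x WIDTH
    transforms.Resize(299),  # applied identically to every frame, i.e. video-uniform
    transforms.CenterCrop(299),
])

dataset = VideoFrameDataset(
    root_path=root,
    annotationsfile_path=annotation_file,
    imagefile_template='img_{:05d}.jpg',
    transform=preprocess,
)

dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

for video_batch, labels in dataloader:
    # video_batch has shape BATCH x FRAMES x CHANNELS x HEIGHT x WIDTH
    print(video_batch.size(), labels)
    break
```
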
### 6. Conclusion

A proper code-based explanation of how to use VideoFrameDataset for training is provided in `demo.py`.

### 7. Upcoming Features

- [x] Add demo for sampling a single continuous-frame clip from videos.
- [ ] Add support for arbitrary labels that are more than just a single integer.
- [x] Add support for specifying START_FRAME and END_FRAME for a video instead of NUM_FRAMES.

### 8. Acknowledgements

We thank the authors of TSN for their [codebase](https://github.com/yjxiong/tsn-pytorch), from which we took VideoFrameDataset and adapted it.