Performance experiment on discretizing real features into chunks (tokenization)
This experiment compares the performance of some basic LBJava algorithms on a dataset of real-valued features. We want to find out whether discretizing the real features helps classification performance or reduces time consumption.
The dataset contains 5000 examples. Each example has 100 real-valued features ranging from -100 to 100 and a label of either positive or negative. The set was randomly generated by https://github.com/Slash0BZ/Cogcomp-Utils/blob/master/discretization-experiment/data/genData.py. Around 10% of the 5000 examples were randomly marked as noise and carry the incorrect label. 4500 examples were randomly drawn for training, while the rest were used for testing.
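The generation procedure above can be sketched as follows. This is a minimal illustration, not the linked genData.py script; the labeling rule (sign of the feature sum) is a hypothetical stand-in for whatever rule the real script uses, while the sizes and the ~10% label-flip noise match the description.

```python
import random

def gen_example(n_features=100, noise_rate=0.1):
    # 100 real-valued features drawn uniformly from [-100, 100]
    features = [random.uniform(-100, 100) for _ in range(n_features)]
    # hypothetical labeling rule: sign of the feature sum
    label = "positive" if sum(features) > 0 else "negative"
    # with ~10% probability, flip the label to simulate noise
    if random.random() < noise_rate:
        label = "negative" if label == "positive" else "positive"
    return features, label

examples = [gen_example() for _ in range(5000)]
random.shuffle(examples)
# 4500 examples for training, the remaining 500 for testing
train, test = examples[:4500], examples[4500:]
```

Because the flips land independently in both splits, roughly 10% of the training and 10% of the test examples end up mislabeled.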
| rounds \ features | real features: accuracy % (time) | chunks of 5: accuracy % (time) | chunks of 10: accuracy % (time) | chunks of 20: accuracy % (time) |
|---|---|---|---|---|
| 10 | 84 (8.35 s) | 73.8 (5.21 s) | 73.6 (5.38 s) | 73.4 (4.84 s) |
| 50 | 83.4 (9.98 s) | 73.8 (6.96 s) | 73.8 (6.9 s) | 73.6 (7.76 s) |
| 100 | 83.6 (12.63 s) | 73.8 (8.38 s) | 74.0 (8.38 s) | 73.8 (8.16 s) |
| rounds \ features | real features: accuracy % (time) | chunks of 5: accuracy % (time) | chunks of 10: accuracy % (time) | chunks of 20: accuracy % (time) |
|---|---|---|---|---|
| 10 | 84.4 (7.48 s) | 74.8 (6.2 s) | 73.2 (6.2 s) | 73.6 (5.6 s) |
| 50 | 84.4 (17.87 s) | 74.8 (13.41 s) | 73.2 (10.5 s) | 73.6 (8.9 s) |
| 100 | 84.4 (31.12 s) | 74.8 (20.46 s) | 73.2 (15.4 s) | 73.8 (13.31 s) |
The code and data I used can be found at https://github.com/Slash0BZ/Cogcomp-Utils/tree/master/discretization-experiment.
First, the discretized features perform very similarly across the different chunk sizes, and all of them perform significantly worse than the real features. On the plus side, they all take less time to train, which I attribute to the lower number of distinct features.
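The discretization itself can be sketched as a simple bucketing of each real value into a chunk index. This is a hypothetical illustration of the idea, assuming a fixed feature range of [-100, 100); the actual implementation is in the linked repository.

```python
def discretize(value, chunk_size, lo=-100.0, hi=100.0):
    # clamp the value into [lo, hi) and map it to a chunk index,
    # so each real feature becomes one of (hi - lo) / chunk_size tokens
    clamped = min(max(value, lo), hi - 1e-9)
    return int((clamped - lo) // chunk_size)

# chunk size 5 yields 40 buckets over [-100, 100); size 20 yields 10
print(discretize(-100.0, 5))  # bucket 0
print(discretize(3.7, 20))    # bucket 5
```

A larger chunk size means fewer distinct feature values, which is consistent with the slight decrease in training time from chunks of 5 to chunks of 20 in the tables above.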
Comparing the two tables, SVMs generally take more time and more memory to train.
The timing also depends on the specific conditions of the machine. There is a noticeable slowdown when the scripts run for a long time on the same machine, so the time measurements may not be fully accurate or representative.
Also, the noise is generated randomly, so even though the theoretical accuracy limit for any classifier on this dataset is 90%, it is reasonable that no classifier actually reaches 90%, since there is noise in both the training set and the testing set.
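The 90% ceiling follows directly from the noise rate: even a classifier that recovers the true decision rule is scored wrong on exactly the test examples whose labels were flipped. A small simulation, assuming the same 500-example test split and 10% flip rate described above:

```python
import random

random.seed(0)
n_test = 500
noise_rate = 0.10
# a perfect classifier predicts every true label; it is marked wrong
# precisely on the test examples whose observed label was flipped
flipped = sum(random.random() < noise_rate for _ in range(n_test))
observed_accuracy = 1.0 - flipped / n_test
```

On any particular draw the observed ceiling fluctuates around 90%, which is why even an ideal classifier would not hit exactly 90% on this test set.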