Performance experiment on discretizing real features into chunks (tokenization)
This experiment compares the performance of some basic LBJava learning algorithms on a dataset of real-valued features. We want to find out whether discretizing the real features improves accuracy or reduces time consumption.
The dataset contains 5000 examples. Each example has 100 real-valued features ranging from -100 to 100 and a tag of either positive or negative. The set was randomly generated by https://github.com/Slash0BZ/Cogcomp-Utils/blob/master/discretization-experiment/data/genData.py. Around 10% of the 5000 examples were randomly selected as noise and given the incorrect tag. 4500 examples were randomly drawn for training, while the remaining 500 were used for testing.
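For concreteness, here is a minimal Python sketch of this setup. It is not the actual genData.py; in particular, `label_fn` below is a hypothetical placeholder for the labeling rule defined in that script, and only the dimensions, noise rate, and split described above are taken from the experiment.

```python
import random

NUM_EXAMPLES, NUM_FEATURES = 5000, 100
LO, HI = -100.0, 100.0
NOISE_RATE = 0.10   # roughly 10% of examples receive the wrong tag
TRAIN_SIZE = 4500   # the remaining 500 examples are used for testing

def label_fn(features):
    # Placeholder labeling rule; the real one lives in genData.py.
    return "positive" if sum(features) > 0 else "negative"

def flip(tag):
    return "negative" if tag == "positive" else "positive"

examples = []
for _ in range(NUM_EXAMPLES):
    features = [random.uniform(LO, HI) for _ in range(NUM_FEATURES)]
    tag = label_fn(features)
    if random.random() < NOISE_RATE:   # inject label noise
        tag = flip(tag)
    examples.append((features, tag))

random.shuffle(examples)
train, test = examples[:TRAIN_SIZE], examples[TRAIN_SIZE:]
```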
In this experiment we used two discretization methods.
The first method assigns a single string value to each real-valued feature. The string is simply the index of the feature combined with the index of the chunk the value falls into. For example, if the original real values range from 0 to 1 and the chunk size is 0.1, a real value of 0.32 at original index 5 produces the feature "Index5_3".
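As an illustration, here is a minimal Python sketch of this simple discretization. The function name and parameters are mine, not from the experiment code (the actual feature extraction is done in the LBJava classifiers); it assumes the value range and chunk size are known in advance.

```python
def discretize_simple(index, value, lo, hi, chunk_size):
    """Map one real-valued feature to a single string feature 'Index<i>_<chunk>'."""
    last_chunk = int((hi - lo) / chunk_size) - 1
    # Clamp so that value == hi still falls into the last chunk.
    chunk = min(int((value - lo) / chunk_size), last_chunk)
    return f"Index{index}_{chunk}"

# Example from the text: range [0, 1], chunk size 0.1, value 0.32 at index 5
print(discretize_simple(5, 0.32, 0.0, 1.0, 0.1))  # Index5_3
```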
The second method, unlike the simple extraction above, extracts the discrete feature of the chunk the current value falls into along with the discrete features of every chunk below it. For example, if the original real values range from 0 to 1 and the chunk size is 0.1, a real value of 0.32 at original index 5 produces the discrete features "Index5_0", "Index5_1", "Index5_2" and "Index5_3".
In this case the discrete representation retains a sense of where the original real value lies relative to other values; that is, an ordering relation such as "5 > 4" can now be represented.
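The cumulative version can be sketched the same way (again with hypothetical names, reusing the same chunking as above):

```python
def discretize_cumulative(index, value, lo, hi, chunk_size):
    """Map one real-valued feature to all string features 'Index<i>_0' .. 'Index<i>_<chunk>'."""
    last_chunk = int((hi - lo) / chunk_size) - 1
    chunk = min(int((value - lo) / chunk_size), last_chunk)
    return [f"Index{index}_{c}" for c in range(chunk + 1)]

# Example from the text: range [0, 1], chunk size 0.1, value 0.32 at index 5
print(discretize_cumulative(5, 0.32, 0.0, 1.0, 0.1))
# ['Index5_0', 'Index5_1', 'Index5_2', 'Index5_3']
```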
In the two tables below, each cell reports accuracy (in %), with training time in parentheses, for the given number of training rounds and feature representation.

| rounds \ features | real features (time) | discrete features with chunk of 5 (time) | discrete features with chunk of 20 (time) | discrete features with chunk of 50 (time) |
|---|---|---|---|---|
| 10 | 84 (8.35 s) | 66.2 (5.61 s) | 74.4 (5.43 s) | 78.6 (6.27 s) |
| 50 | 83.4 (9.98 s) | 65.8 (8.24 s) | 74.0 (7.84 s) | 77.8 (7.67 s) |
| 100 | 83.6 (12.63 s) | 66.0 (8.38 s) | 74.2 (9.62 s) | 78.8 (8.61 s) |

| rounds \ features | real features (time) | discrete features with chunk of 5 (time) | discrete features with chunk of 10 (time) | discrete features with chunk of 20 (time) |
|---|---|---|---|---|
| 10 | 84.4 (7.48 s) | 74.8 (6.2 s) | 73.2 (6.2 s) | 73.6 (5.6 s) |
| 50 | 84.4 (17.87 s) | 74.8 (13.41 s) | 73.2 (10.5 s) | 73.6 (8.9 s) |
| 100 | 84.4 (31.12 s) | 74.8 (20.46 s) | 73.2 (15.4 s) | 73.8 (13.31 s) |
The code and data I used can be found at https://github.com/Slash0BZ/Cogcomp-Utils/tree/master/discretization-experiment.
The timings depend on the specific condition of the machine. There was a noticeable slowdown as I kept running the scripts on the same machine for a long time, so the time measurements may not be accurate or representative.
Also, the noise is generated randomly, so even though the theoretical accuracy limit for any classifier on this dataset is 90%, it is expected that no classifier actually reaches 90%, since there is noise in both the training set and the test set.