Performance experiment on discretizing real features into chunks (tokenization)
This experiment compares the performance of some basic LBJava algorithms on a dataset of real-valued features. We want to find out whether discretizing the real features helps classification performance or reduces time consumption.
The dataset contains 5000 examples. Each example has 100 real-valued features ranging from -100 to 100 and a label of either positive or negative. The set was randomly generated by https://github.com/Slash0BZ/Cogcomp-Utils/blob/master/discretization-experiment/data/genData.py. Around 10% of the 5000 examples were randomly marked as noise and carry the incorrect label. 4500 examples were randomly drawn for training, while the rest were used for testing.
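The generation procedure above can be sketched as follows. This is a minimal illustration, not the linked genData.py script; the labeling rule (sign of the feature sum) is a hypothetical stand-in for whatever rule the real script uses, while the sizes and the ~10% label-flip noise match the description.

```python
import random

def gen_example(n_features=100, noise_rate=0.1):
    # 100 real-valued features drawn uniformly from [-100, 100]
    features = [random.uniform(-100, 100) for _ in range(n_features)]
    # hypothetical labeling rule: sign of the feature sum
    label = "positive" if sum(features) > 0 else "negative"
    # with ~10% probability, flip the label to simulate noise
    if random.random() < noise_rate:
        label = "negative" if label == "positive" else "positive"
    return features, label

examples = [gen_example() for _ in range(5000)]
random.shuffle(examples)
# 4500 examples for training, the remaining 500 for testing
train, test = examples[:4500], examples[4500:]
```

Because the flips land independently in both splits, roughly 10% of the training and 10% of the test examples end up mislabeled.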
| rounds \ features | real features: accuracy % (time) | chunks of 5: accuracy % (time) | chunks of 10: accuracy % (time) | chunks of 20: accuracy % (time) |
|---|---|---|---|---|
| 10 | 84 (8.35 s) | 73.8 (5.21 s) | 73.6 (5.38 s) | 73.4 (4.84 s) |
| 50 | 83.4 (9.98 s) | 73.8 (6.96 s) | 73.8 (6.9 s) | 73.6 (7.76 s) |
| 100 | 83.6 (12.63 s) | 73.8 (8.38 s) | 74.0 (8.38 s) | 73.8 (8.16 s) |
| rounds \ features | real features: accuracy % (time) | chunks of 5: accuracy % (time) | chunks of 10: accuracy % (time) | chunks of 20: accuracy % (time) |
|---|---|---|---|---|
| 10 | 84.4 (7.48 s) | 74.8 (6.2 s) | 73.2 (6.2 s) | 73.6 (5.6 s) |
| 50 | 84.4 (17.87 s) | 74.8 (13.41 s) | 73.2 (10.5 s) | 73.6 (8.9 s) |
| 100 | 84.4 (31.12 s) | 74.8 (20.46 s) | 73.2 (15.4 s) | 73.8 (13.31 s) |
The code and data I used can be found at https://github.com/Slash0BZ/Cogcomp-Utils/tree/master/discretization-experiment.
First, the discretized features perform very similarly across the different chunk sizes, and all of them perform significantly worse than the real features. On the plus side, they all take less time to train, which I attribute to the lower number of distinct features.
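The discretization itself can be sketched as a simple bucketing of each real value into a chunk index. This is a hypothetical illustration of the idea, assuming a fixed feature range of [-100, 100); the actual implementation is in the linked repository.

```python
def discretize(value, chunk_size, lo=-100.0, hi=100.0):
    # clamp the value into [lo, hi) and map it to a chunk index,
    # so each real feature becomes one of (hi - lo) / chunk_size tokens
    clamped = min(max(value, lo), hi - 1e-9)
    return int((clamped - lo) // chunk_size)

# chunk size 5 yields 40 buckets over [-100, 100); size 20 yields 10
print(discretize(-100.0, 5))  # bucket 0
print(discretize(3.7, 20))    # bucket 5
```

A larger chunk size means fewer distinct feature values, which is consistent with the slight decrease in training time from chunks of 5 to chunks of 20 in the tables above.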
Comparing the two tables, SVMs generally take more time and more memory to train.
The timing also depends on the specific conditions of the machine. There is a noticeable slowdown when the scripts run for a long time on the same machine, so the time measurements may not be fully accurate or representative.
Also, the noise is generated randomly, so even though the theoretical accuracy limit for any classifier on this dataset is 90%, it is reasonable that no classifier actually reaches 90%, since there is noise in both the training set and the testing set.
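The 90% ceiling follows directly from the noise rate: even a classifier that recovers the true decision rule is scored wrong on exactly the test examples whose labels were flipped. A small simulation, assuming the same 500-example test split and 10% flip rate described above:

```python
import random

random.seed(0)
n_test = 500
noise_rate = 0.10
# a perfect classifier predicts every true label; it is marked wrong
# precisely on the test examples whose observed label was flipped
flipped = sum(random.random() < noise_rate for _ in range(n_test))
observed_accuracy = 1.0 - flipped / n_test
```

On any particular draw the observed ceiling fluctuates around 90%, which is why even an ideal classifier would not hit exactly 90% on this test set.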