Performance experiment on discretizing real features into chunks (tokenization)
This experiment compares the performance of some basic LBJava learning algorithms on a dataset of real-valued features. We want to find out whether discretizing the real features improves accuracy or reduces time consumption.
The dataset contains 5000 examples. Each example has 100 real-valued features ranging from -100 to 100 and a tag of either positive or negative. The set was randomly generated by https://github.com/Slash0BZ/Cogcomp-Utils/blob/master/discretization-experiment/data/genData.py. Around 10% of the 5000 examples were randomly selected as noise and given the incorrect tag. 4500 examples were randomly drawn for training, while the remaining 500 were used for testing.
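For concreteness, here is a minimal Python sketch of this setup. It is not the actual genData.py; in particular, `label_fn` below is a hypothetical placeholder for the labeling rule defined in that script, and only the dimensions, noise rate, and split described above are taken from the experiment.

```python
import random

NUM_EXAMPLES, NUM_FEATURES = 5000, 100
LO, HI = -100.0, 100.0
NOISE_RATE = 0.10   # roughly 10% of examples receive the wrong tag
TRAIN_SIZE = 4500   # the remaining 500 examples are used for testing

def label_fn(features):
    # Placeholder labeling rule; the real one lives in genData.py.
    return "positive" if sum(features) > 0 else "negative"

def flip(tag):
    return "negative" if tag == "positive" else "positive"

examples = []
for _ in range(NUM_EXAMPLES):
    features = [random.uniform(LO, HI) for _ in range(NUM_FEATURES)]
    tag = label_fn(features)
    if random.random() < NOISE_RATE:   # inject label noise
        tag = flip(tag)
    examples.append((features, tag))

random.shuffle(examples)
train, test = examples[:TRAIN_SIZE], examples[TRAIN_SIZE:]
```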
In this experiment we used two discretization methods.
The first method assigns a single string value to each real-valued feature. The string is simply the index of the feature combined with the index of the chunk the value falls into. For example, if the original real values range from 0 to 1 and the chunk size is 0.1, a real value of 0.32 at original index 5 produces the feature "Index5_3".
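As an illustration, here is a minimal Python sketch of this simple discretization. The function name and parameters are mine, not from the experiment code (the actual feature extraction is done in the LBJava classifiers); it assumes the value range and chunk size are known in advance.

```python
def discretize_simple(index, value, lo, hi, chunk_size):
    """Map one real-valued feature to a single string feature 'Index<i>_<chunk>'."""
    last_chunk = int((hi - lo) / chunk_size) - 1
    # Clamp so that value == hi still falls into the last chunk.
    chunk = min(int((value - lo) / chunk_size), last_chunk)
    return f"Index{index}_{chunk}"

# Example from the text: range [0, 1], chunk size 0.1, value 0.32 at index 5
print(discretize_simple(5, 0.32, 0.0, 1.0, 0.1))  # Index5_3
```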
The second method, unlike the simple extraction above, extracts the discrete feature of the chunk the current value falls into along with the discrete features of every chunk below it. For example, if the original real values range from 0 to 1 and the chunk size is 0.1, a real value of 0.32 at original index 5 produces the discrete features "Index5_0", "Index5_1", "Index5_2" and "Index5_3".
In this case the discrete representation retains a sense of where the original real value lies relative to other values; that is, an ordering relation such as "5 > 4" can now be represented.
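The cumulative version can be sketched the same way (again with hypothetical names, reusing the same chunking as above):

```python
def discretize_cumulative(index, value, lo, hi, chunk_size):
    """Map one real-valued feature to all string features 'Index<i>_0' .. 'Index<i>_<chunk>'."""
    last_chunk = int((hi - lo) / chunk_size) - 1
    chunk = min(int((value - lo) / chunk_size), last_chunk)
    return [f"Index{index}_{c}" for c in range(chunk + 1)]

# Example from the text: range [0, 1], chunk size 0.1, value 0.32 at index 5
print(discretize_cumulative(5, 0.32, 0.0, 1.0, 0.1))
# ['Index5_0', 'Index5_1', 'Index5_2', 'Index5_3']
```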
In the two tables below, each cell reports accuracy (in %), with training time in parentheses, for the given number of training rounds and feature representation.

| rounds \ features | real features (time) | discrete features with chunk of 5 (time) | discrete features with chunk of 20 (time) | discrete features with chunk of 50 (time) |
|---|---|---|---|---|
| 10 | 84 (8.35 s) | 66.2 (5.61 s) | 74.4 (5.43 s) | 78.6 (6.27 s) |
| 50 | 83.4 (9.98 s) | 65.8 (8.24 s) | 74.0 (7.84 s) | 77.8 (7.67 s) |
| 100 | 83.6 (12.63 s) | 66.0 (8.38 s) | 74.2 (9.62 s) | 78.8 (8.61 s) |

| rounds \ features | real features (time) | discrete features with chunk of 5 (time) | discrete features with chunk of 10 (time) | discrete features with chunk of 20 (time) |
|---|---|---|---|---|
| 10 | 84.4 (7.48 s) | 74.8 (6.2 s) | 73.2 (6.2 s) | 73.6 (5.6 s) |
| 50 | 84.4 (17.87 s) | 74.8 (13.41 s) | 73.2 (10.5 s) | 73.6 (8.9 s) |
| 100 | 84.4 (31.12 s) | 74.8 (20.46 s) | 73.2 (15.4 s) | 73.8 (13.31 s) |
The code and data I used can be found at https://github.com/Slash0BZ/Cogcomp-Utils/tree/master/discretization-experiment.
The timings depend on the specific condition of the machine. There was a noticeable slowdown as I kept running the scripts on the same machine for a long time, so the time measurements may not be accurate or representative.
Also, the noise is generated randomly, so even though the theoretical accuracy limit for any classifier on this dataset is 90%, it is expected that no classifier actually reaches 90%, since there is noise in both the training set and the test set.