
Performance experiment on discretizing real features into chunks (tokenization)


Introduction

This experiment compares the performance of some basic LBJava algorithms on a dataset of real-valued features. We want to find out whether discretizing the real features improves accuracy or reduces time consumption.

Methodology

The dataset consists of 5000 examples. Each example has 100 real-valued features, each ranging from -100 to 100, and a tag of either positive or negative. The set was randomly generated by https://github.com/Slash0BZ/Cogcomp-Utils/blob/master/discretization-experiment/data/genData.py. Around 10% of the 5000 examples were randomly marked as noise and given the incorrect tag. 4500 examples were randomly drawn for training, while the rest were used for testing.
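In outline, the generation looks like the sketch below. The ranges, noise rate, and split follow the description above; the labeling rule shown is a hypothetical placeholder, and the actual rule is in genData.py.

```python
import random

NUM_EXAMPLES, NUM_FEATURES, NOISE_RATE = 5000, 100, 0.1

examples = []
for _ in range(NUM_EXAMPLES):
    # 100 real-valued features, each drawn uniformly from [-100, 100]
    features = [random.uniform(-100, 100) for _ in range(NUM_FEATURES)]
    # hypothetical labeling rule; the real one lives in genData.py
    label = "positive" if sum(features) > 0 else "negative"
    # roughly 10% of examples receive the incorrect tag as noise
    if random.random() < NOISE_RATE:
        label = "negative" if label == "positive" else "positive"
    examples.append((features, label))

random.shuffle(examples)
train, test = examples[:4500], examples[4500:]  # 4500 train / 500 test
```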

In this experiment we used two discretization methods.

Normal discrete feature extraction

The first method assigns a string value to each real number among the original features. This string is simply the feature's index combined with the chunk index. For example, if the original real values range from 0 to 1 and the chunk size is 0.1, a real number of 0.32 at original index 5 yields the feature "Index5_3".
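A minimal sketch of this mapping (the function name and chunk-boundary arithmetic are assumptions for illustration; the actual extraction code is in the repository linked below):

```python
def normal_features(example, lo=-100.0, chunk_size=5.0):
    """One string feature per real number: feature index plus chunk index."""
    features = []
    for index, value in enumerate(example):
        chunk = int((value - lo) // chunk_size)  # which chunk the value falls in
        features.append("Index%d_%d" % (index, chunk))
    return features

# With values in [0, 1] and chunk size 0.1, the value 0.32 falls in chunk 3:
print(normal_features([0.32], lo=0.0, chunk_size=0.1))  # ['Index0_3']
```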

Continuous discrete feature extraction

Unlike the previous simple extraction, this version extracts the discrete feature for the current real number along with the discrete features of every chunk below it. For example, if the original real values range from 0 to 1 and the chunk size is 0.1, a real number of 0.32 at original index 5 yields the discrete features "Index5_0", "Index5_1", "Index5_2", and "Index5_3".

In this case the discrete representation of each number carries a sense of the original real number's position relative to other real numbers; i.e., the meaning of "5 > 4" can now be represented.
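A sketch of the continuous variant under the same assumptions; the only change is emitting a feature for every chunk up to and including the value's own:

```python
def continuous_features(example, lo=-100.0, chunk_size=5.0):
    """Emit the value's own chunk feature plus one for every lower chunk."""
    features = []
    for index, value in enumerate(example):
        chunk = int((value - lo) // chunk_size)
        for c in range(chunk + 1):  # all chunks below the value, plus its own
            features.append("Index%d_%d" % (index, c))
    return features

# 0.32 in [0, 1] with chunk size 0.1 activates chunks 0 through 3:
print(continuous_features([0.32], lo=0.0, chunk_size=0.1))
# ['Index0_0', 'Index0_1', 'Index0_2', 'Index0_3']
```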

Result tables

In each table below, a cell reports accuracy (%) with training time in parentheses.

Sparse Perceptron

Normal extraction

| rounds \ features | real features | discrete, chunk size 5 | discrete, chunk size 20 | discrete, chunk size 50 |
| --- | --- | --- | --- | --- |
| 10 | 84 (8.35 s) | 66.2 (5.61 s) | 74.4 (5.43 s) | 78.6 (6.27 s) |
| 50 | 83.4 (9.98 s) | 65.8 (8.24 s) | 74.0 (7.84 s) | 77.8 (7.67 s) |
| 100 | 83.6 (12.63 s) | 66.0 (8.38 s) | 74.2 (9.62 s) | 78.8 (8.61 s) |

Continuous extraction

| rounds \ features | real features | discrete, chunk size 5 | discrete, chunk size 20 | discrete, chunk size 50 |
| --- | --- | --- | --- | --- |
| 10 | 84 (8.35 s) | 75.8 (27.02 s) | 76.6 (7.8 s) | 75.8 (6.19 s) |
| 50 | 83.4 (9.98 s) | 74.4 (40.71 s) | 75.2 (14.91 s) | 76.0 (8.95 s) |
| 100 | 83.6 (12.63 s) | 73.0 (66 s) | 72.8 (20.29 s) | 75.6 (12.47 s) |

Support Vector Machine

Normal extraction

| rounds \ features | real features | discrete, chunk size 5 | discrete, chunk size 20 | discrete, chunk size 50 |
| --- | --- | --- | --- | --- |
| 10 | 84.4 (7.48 s) | 64.0 (8.3 s) | 75.2 (7.83 s) | 78.6 (7.60 s) |
| 50 | 84.4 (17.87 s) | 64.0 (20.06 s) | 75.2 (19.88 s) | 78.6 (16.51 s) |
| 100 | 84.4 (31.12 s) | 64.2 (43.5 s) | 75.2 (33.75 s) | 78.6 (39.1 s) |

Continuous extraction

| rounds \ features | real features | discrete, chunk size 5 | discrete, chunk size 20 | discrete, chunk size 50 |
| --- | --- | --- | --- | --- |
| 10 | 84.4 (7.48 s) | FAILED | 75.4 (27.32 s) | 79.0 (12.51 s) |
| 50 | 84.4 (17.87 s) | FAILED | FAILED | 79.0 (37.74 s) |
| 100 | 84.4 (31.12 s) | FAILED | FAILED | FAILED |

FAILED entries are due to timeouts.

Code and observations

The code and data I used can be found at https://github.com/Slash0BZ/Cogcomp-Utils/tree/master/discretization-experiment.

From these observations, we can see that with the normal extraction the learner generally performs better as the chunk size increases.

My thinking is that with the normal extraction, the features are too sparse when the chunk size is small. There is a large probability that a feature extracted from the evaluation set was never seen in the training set; such a feature then carries no information, even though the original real value it came from had a quantitative meaning.

With the continuous extraction, every real number also activates all of the chunk features below it, so the overlap between evaluation-set features and training-set features is larger, and the effect described above is less significant.
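This sparsity-and-overlap argument is easy to check empirically. The sketch below reuses the hypothetical normal_features, continuous_features, train, and test from the earlier sketches, and measures how often a test-set feature never appears in the training set:

```python
def unseen_rate(extract, train, test):
    """Fraction of extracted test-set features never seen in training."""
    seen = set()
    for features, _ in train:
        seen.update(extract(features))
    total = missing = 0
    for features, _ in test:
        for f in extract(features):
            total += 1
            missing += f not in seen
    return missing / float(total)

for chunk in (5.0, 20.0, 50.0):
    normal = unseen_rate(lambda ex: normal_features(ex, chunk_size=chunk), train, test)
    cont = unseen_rate(lambda ex: continuous_features(ex, chunk_size=chunk), train, test)
    print("chunk size %.0f: normal unseen %.4f, continuous unseen %.4f"
          % (chunk, normal, cont))
```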

Another observation is that the continuous extraction generally takes more time, which makes sense since this extraction methodology produces many more features.

In conclusion, while the discretized features can produce results fairly close to the real features (79.0 compared to 84.4 at the closest point), it does not make sense to use discretized features in a task like this.

With Sparse Averaged Perceptron and the normal extraction, the learner takes roughly the same amount of time, since it assigns one feature to each original real number, but it produces lower accuracy.

Sparse Averaged Perceptron with the continuous extraction does not perform nearly as well as the real features either, while taking significantly longer due to the multiple features generated for each original real feature.

As for the closest result, SVM with continuous features consistently fails due to timeouts at higher round counts: with chunk size 5, the original 100 features expand into up to 4000 distinct features (100 features × 40 chunks over the [-100, 100] range), making training extremely time-consuming and not accurate enough to be worth it.

Another reason not to discretize is that different chunk sizes give different results, so it takes extra effort to find the optimal chunk size, making the original real feature set the more practical choice.

Improvements

The reported times depend on the specific condition of the machine. There is a noticeable slowdown as the scripts run for a long time on the same machine, so the time measurements may not be accurate or representative.

Also, the noise generation is random, so even though the theoretical limit for any classifier on this dataset is 90%, it is reasonable that none actually reaches 90%, since there is noise in both the training set and the testing set.