USER GUIDE

Nil satis nisi optimum (nothing but the best is good enough)

Classifium algorithms give very accurate results compared with other machine learning algorithms, and they are easy to use. They typically exhibit exceptionally little overfitting, which means that they work very well on new and previously unseen data. In general, overfitting is the greatest impediment to getting good results with machine learning, and it can take many forms. Even experienced statisticians can sometimes be fooled by overfitting, since it can be subtle and occur even for very big datasets and with thorough cross validation repeated many times.

The input must be a CSV file with one training example per line. One of the columns is designated as the output, i.e., the response, whereas a selection of the others are inputs, also known as predictors or attributes.

For example, consider the small hypothyroid dataset donated to the UCI Machine Learning Repository by Ross Quinlan. Two of the lines in this dataset are as follows.

hypothyroid,40,F,f,f,f,f,f,f,f,f,f,f,f,y,70,y,0.40,y,3.90,y,0.83,y,5,y,28
negative,27,M,f,f,f,f,f,f,f,f,f,f,f,y,0,y,2.10,y,148,y,1.08,y,137,n,?

On each line in this specific dataset, the correct output is given in the first field and is either hypothyroid or negative, where the latter means that the patient does not have the disease. The other fields are the inputs. Each input is either nominal or ordinal: a nominal input is a categorical value such as false or true, whereas an ordinal input is an ordered value, typically a number. A ? in the data file indicates an unknown input value.

After the output, the hypothyroid dataset has an ordinal input followed by 13 nominal inputs. Classifium needs to be informed about this, as well as about the types of the remaining inputs. Thus, the hypothyroid dataset is specified as follows for Classifium.

output 
ordinal 
13 nominal
ordinal
nominal
ordinal
nominal
ordinal
nominal
ordinal
nominal
ordinal
nominal
ordinal

Since the data is stored in a file called hypothyroid.data, you should store the above description of the data in a file called hypothyroid.names. Classifium recognizes nom as an abbreviation of nominal, ord as an abbreviation of ordinal and out as an abbreviation of output. It does not care whether you separate the tokens in your names file with spaces, newlines or tabs. Thus, the names file could contain only the following line.

out ord 13 nom ord nom ord nom ord nom ord nom ord nom ord

You can use the keyword ignore to specify that a column should not be used for anything.
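
To make the names file format concrete, here is a minimal Python sketch that expands such a description into one type per column. It is not Classifium's own parser, merely an illustration; it assumes that a count such as 13 applies to the type token that follows it, and it passes tokens such as ignore through unchanged.

# Illustrative only: expand a names file into one type per column.
ABBREV = {"out": "output", "ord": "ordinal", "nom": "nominal"}

def parse_names(text):
    types, repeat = [], 1
    for token in text.split():
        if token.isdigit():
            # A count such as "13" repeats the type token that follows it.
            repeat = int(token)
        else:
            kind = ABBREV.get(token, token)  # accept both long and short forms
            types.extend([kind] * repeat)
            repeat = 1
    return types

types = parse_names("out ord 13 nom ord nom ord nom ord nom ord nom ord nom ord")
print(len(types))  # 26, one entry per column in hypothyroid.data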

In contrast to many other machine learning algorithms, Classifium has built-in mechanisms for handling a nominal input with three or more possible values. Therefore, use your nominal data as it is and avoid one-hot encoding.

When you have carefully prepared the data and names files, you can just upload them to classifium.com and start a run. When the run is finished, Classifium will show you the error rate from n-fold stratified cross validation and the confusion matrix, for example as follows for 10-fold cross validation repeated 30 times.

Error rate = 0.84% using 10-fold cross validation repeated 30 times.

Confusion matrix

                        Actual:      Actual:
                        hypothyroid  negative
Predicted: hypothyroid  139.1        14.8
Predicted: negative     11.9         2997.2

You will also be able to download a file called hypothyroid.forest which contains the machine learning model that you can run on your own computers. This is useful for two purposes: verifying the accuracy of the model on an independent test set, and using the model in your own application.

In either case, download the source code to a Linux machine and compile it using the compile script. The resulting executable is called predict and is used as follows.

./predict hypothyroid

The only command line argument to predict is the file stem, in this case hypothyroid. If there is a file called hypothyroid.test, predict will compute the error rate and the confusion matrix for this file. If, on the other hand, you wish to compute outputs, you just place your data in a file called hypothyroid.cases, in which the output column may contain ? values. No matter what the output column contains, it will be replaced by predicted values. The source code of predict was written to be easy to understand and to modify as you see fit.
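
For example, to obtain a prediction for the negative patient shown earlier, hypothyroid.cases could contain the following line, where the output field has been replaced by a ?.

?,27,M,f,f,f,f,f,f,f,f,f,f,f,y,0,y,2.10,y,148,y,1.08,y,137,n,?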

For more advanced machine learning, you can add a so-called weight column to your dataset. This column is specified by the keyword weight in the names file and assigns a weight to each individual training example, where a higher weight means that an example is more important. For example, this may be employed to alter the confusion matrix or to handle imbalanced datasets.

Let us reconsider the hypothyroid example, where the error rate was 0.84%, which is not so bad. However, among the 151 sick patients, there were 11.9 false negatives on average, which is more critical than the error rate and which would kill any attempt to use this model for clinical screening. To reduce the number of false negatives, we give a higher weight to the sick patients. Assume that the weight is added directly after the output, which means that the following names file can be used.

out weight ord 13 nom ord nom ord nom ord nom ord nom ord nom ord

If we give weight 10, say, to each sick patient and weight 1 to each healthy patient, the two example lines in the dataset become as follows.

hypothyroid,10,40,F,f,f,f,f,f,f,f,f,f,f,f,y,70,y,0.40,y,3.90,y,0.83,y,5,y,28
negative,1,27,M,f,f,f,f,f,f,f,f,f,f,f,y,0,y,2.10,y,148,y,1.08,y,137,n,?
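
For a dataset of any size, you will probably want to add the weight column programmatically rather than by hand. Here is a minimal Python sketch, assuming that the original data is in hypothyroid.data and that the output is the first field; the name of the output file is just an example, and the class weights are the ones chosen above.

import csv

# Hypothetical class weights, chosen above; tune them for your own data.
WEIGHTS = {"hypothyroid": "10", "negative": "1"}

with open("hypothyroid.data", newline="") as src, \
     open("weighted.data", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        # Insert the weight directly after the output (first) field.
        writer.writerow([row[0], WEIGHTS[row[0]]] + row[1:])
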
As seen below, the number of false negatives is now reduced from 11.9 to 4.5 on average.

Error rate = 2.63% using 10-fold cross validation repeated 30 times.

Confusion matrix

                        Actual:      Actual:
                        hypothyroid  negative
Predicted: hypothyroid  146.5        78.5
Predicted: negative     4.5          2933.4

Finding suitable weights can be a finicky business and may require some experimentation.

Please note that Classifium does not have any built-in feature engineering methods such as PCA or autoencoders. If you have previously used XGBoost or Random Forest, you can employ the same feature engineering for Classifium as you did for these algorithms and often get better results.

In general, it may be suitable to use 2-fold cross validation for rough feature engineering to save computation time, and then switch to 5-fold or 10-fold cross validation for fine tuning and for generating the final model. Assume that the dataset contains l lines. For small datasets, we recommend choosing the number of repetitions, say r, so that r ⋅ l ≥ 50000 for fine tuning, whereas smaller values may be used during initial experimentation. The chosen number of repetitions is used for internal parameter optimization, but the final cross validation is actually done with 4 times as many repetitions.
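
As a worked example, the hypothyroid dataset contains l = 3163 lines (the sum of the entries in the confusion matrix above), so r ⋅ 3163 ≥ 50000 gives r ≥ 15.8, and r = 16 repetitions would suffice for fine tuning.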

If you have a dataset with images, sound or natural language, you may get better results with CNNs or LSTMs than with Classifium, but for tabular data, Classifium is typically superior to any type of neural net.