Feature Subset Pipelines

In PHOTONAI, you can easily create individual data streams. If, for example, you would like to apply different preprocessing steps to distinct subsets of your features, you can create multiple branches within your ML pipeline, each holding its own preprocessing. Similarly, you can train different classifiers on different feature subsets.

To add a branch to your pipeline, create a PHOTONAI Branch and add any number of elements to it. If you add only transformer elements, the transformed data is passed to the element that follows the branch (or stacked, if the branch sits inside a PHOTONAI Stack). If the branch ends in an estimator, the predictions of that estimator are passed on instead. You can add a single branch directly to a Hyperpipe, but branches are mainly useful when you create several of them and combine them in a Stack or a Switch.
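As a rough mental model of what "stacked" means for branches that end in an estimator: each branch yields one prediction per sample, and these predictions are placed side by side to form the input of the next element. The sketch below is plain Python with hypothetical names (`stack_predictions` and the toy prediction lists); it illustrates the idea only and is not PHOTONAI's implementation.

```python
def stack_predictions(*branch_predictions):
    """Place per-branch predictions side by side, one row per sample.

    Hypothetical helper for illustration; not part of PHOTONAI.
    """
    n_samples = len(branch_predictions[0])
    if any(len(pred) != n_samples for pred in branch_predictions):
        raise ValueError("all branches must predict the same number of samples")
    # zip(*...) turns "one list per branch" into "one tuple per sample"
    return [list(row) for row in zip(*branch_predictions)]


# Toy predictions of three branches for four samples
branch_a_pred = [1, 0, 1, 1]
branch_b_pred = [1, 0, 0, 1]
branch_c_pred = [1, 1, 1, 1]

stacked = stack_predictions(branch_a_pred, branch_b_pred, branch_c_pred)
# stacked has 4 rows (samples) and 3 columns (one per branch)
```

The element after the stack therefore sees one column per branch rather than the original features.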

Importantly, a branch receives all of your features unless you add a PHOTONAI DataFilter. Add a DataFilter as the first element of a branch to make sure that only a specific subset of the features reaches the remaining elements of the branch. It takes a single parameter, indices, which specifies the data columns that are passed on to the next element.
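To make the effect of indices concrete, here is a minimal plain-Python sketch of column selection. The helper `select_columns` is hypothetical and not part of PHOTONAI; it only mimics the behavior described above on a toy list-of-lists matrix.

```python
def select_columns(rows, indices):
    """Return a copy of rows that keeps only the columns listed in indices.

    Hypothetical helper for illustration; not part of PHOTONAI.
    """
    indices = list(indices)
    return [[row[i] for i in indices] for row in rows]


# Toy data: 2 samples with 6 features each
X_toy = [[0, 1, 2, 3, 4, 5],
         [10, 11, 12, 13, 14, 15]]

# Two disjoint feature subsets, analogous to two filtered branches
first_half = select_columns(X_toy, range(0, 3))
second_half = select_columns(X_toy, range(3, 6))
```

Each subset keeps all samples but only its own columns, which is why every branch in a stack can work on a different part of the feature space.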

In this example, we create three branches to process three feature subsets of the breast cancer dataset separately. Each branch ends in an SVC that predicts the classification label, so PHOTONAI can optimize the SVC hyperparameters for each feature subset independently. The predictions of all three branches are then stacked and passed to a final Switch that chooses between a Random Forest and another SVC.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold

from photonai.base import Hyperpipe, PipelineElement, Stack, Branch, Switch, DataFilter
from photonai.optimization import FloatRange, IntegerRange

# LOAD DATA FROM SKLEARN
X, y = load_breast_cancer(return_X_y=True)

my_pipe = Hyperpipe('data_integration',
                    optimizer='random_grid_search',
                    optimizer_params={'n_configurations': 20},
                    metrics=['accuracy', 'precision', 'recall', 'f1_score'],
                    best_config_metric='f1_score',
                    outer_cv=KFold(n_splits=3),
                    inner_cv=KFold(n_splits=3),
                    verbosity=0,
                    project_folder='./tmp/')

my_pipe += PipelineElement('SimpleImputer')
my_pipe += PipelineElement('StandardScaler', {}, with_mean=True)

# Use only "mean" features: [mean_radius, mean_texture, mean_perimeter, mean_area, mean_smoothness, mean_compactness,
# mean_concavity, mean_concave_points, mean_symmetry, mean_fractal_dimension]
mean_branch = Branch('MeanFeature')
mean_branch += DataFilter(indices=range(0, 10))
mean_branch += PipelineElement('SVC', {'C': FloatRange(0.1, 200)}, kernel='linear')

# Use only "error" features: [radius_error, texture_error, ..., fractal_dimension_error]
error_branch = Branch('ErrorFeature')
error_branch += DataFilter(indices=range(10, 20))
error_branch += PipelineElement('SVC', {'C': FloatRange(0.1, 200)}, kernel='linear')

# Use only "worst" features: [worst_radius, worst_texture, ..., worst_fractal_dimension]
worst_branch = Branch('WorstFeature')
worst_branch += DataFilter(indices=range(20, 30))
worst_branch += PipelineElement('SVC', {'C': FloatRange(0.1, 200)}, kernel='linear')

my_pipe += Stack('SourceStack', [mean_branch, error_branch, worst_branch])

my_pipe += Switch('EstimatorSwitch', [PipelineElement('RandomForestClassifier', {'n_estimators': IntegerRange(2, 5)}),
                                      PipelineElement('SVC')])

my_pipe.fit(X, y)