Sub-population-based feature selection (SPBFS) | Computational Cardiovascular Research Group

In health informatics, feature selection refers to the process of analyzing a large collection of data, typically drawn from the electronic medical record, to identify a relatively small set of variables (or features) that are most important for a given predictive task. Although a number of feature selection methods are available, they are not always well suited to real-world tasks, especially when the goal is to identify features that are important for prognostication within a particular subgroup. In collaboration with Uri Kartoun and Kenney Ng from IBM Research, we developed a new type of feature selection method that incorporates propensity matching and is applied iteratively to subpopulations. Our method offers significant advantages. First and foremost, it achieves performance comparable to existing methods while identifying a smaller number of prognostic features. This can save data collection effort and time, because the features it flags as non-informative need not be acquired. Moreover, the prognostic features are identified without relying on domain-specific knowledge or manual review. Our method is publicly available as an R package.

The R package: https://github.com/IBM/spbfs