$\begingroup$

I have a complex dataset in which the number of features is much larger than the number of samples. The question is: which features are important for classifying the samples into two groups?

I think that (after some feature engineering to account for possible interactions) ctree is a good tool for this. However, I need to present the results in a paper.

Do I need to cross-validate ctree in order to present some measure of "significance", e.g. "feature X appears 10 times out of 12 as the root split, so maybe it is important"? I would otherwise go with random forest feature importance (shuffling the labels to obtain p-values), but as far as I know RF is parametric and ctree is non-parametric, which is preferable...
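
To make the label-shuffling idea concrete, here is a minimal sketch of what I mean, in plain Python. The importance function `mean_diff_importance` is only a hypothetical stand-in (absolute difference in class means), not random forest importance; any function returning one score per feature could be plugged in instead. The toy data and all names are made up for illustration.

```python
import random
from statistics import mean


def mean_diff_importance(X, y):
    """Stand-in importance (an assumption, not RF importance):
    absolute difference between the two class means, per feature."""
    scores = []
    for j in range(len(X[0])):
        g0 = [row[j] for row, label in zip(X, y) if label == 0]
        g1 = [row[j] for row, label in zip(X, y) if label == 1]
        scores.append(abs(mean(g1) - mean(g0)))
    return scores


def permutation_p_values(X, y, importance_fn, n_perm=500, seed=0):
    """Shuffle the labels n_perm times and count, per feature, how often
    the shuffled importance is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = importance_fn(X, y)
    exceed = [0] * len(observed)
    y_perm = list(y)
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # breaks any real feature/label association
        null_scores = importance_fn(X, y_perm)
        for j, (obs, null) in enumerate(zip(observed, null_scores)):
            if null >= obs:
                exceed[j] += 1
    # Add-one correction so that a p-value is never exactly zero.
    return [(count + 1) / (n_perm + 1) for count in exceed]


# Toy data: 12 samples, 3 features; only feature 0 differs between groups.
X = [
    [0.1, 1.2, 0.5], [-0.2, 0.4, 1.1], [0.3, 0.9, 0.2],
    [0.0, 0.1, 0.8], [-0.1, 1.0, 0.6], [0.2, 0.5, 0.9],
    [3.1, 0.8, 0.7], [2.8, 0.3, 0.4], [3.2, 1.1, 1.0],
    [3.0, 0.6, 0.3], [2.9, 0.2, 0.9], [3.3, 0.7, 0.5],
]
y = [0] * 6 + [1] * 6

p_values = permutation_p_values(X, y, mean_diff_importance)
print(p_values)
```

With many more features than samples, per-feature p-values obtained this way would still need a multiple-testing correction.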

$\endgroup$
  • 1
    $\begingroup$ To my knowledge random forest is non-parametric, why would you assume otherwise? $\endgroup$ Commented Feb 1, 2019 at 10:40
  • $\begingroup$ But aren't the decision trees that a random forest builds based on parametric assumptions? I am sure that regression trees are: each split is chosen according to the distribution of residuals. I also know that, in theory, a random forest can be built from any type of tree, including ctree, but I do not know where that has been implemented... $\endgroup$ Commented Feb 1, 2019 at 11:25
  • $\begingroup$ What assumptions? $\endgroup$ Commented Feb 1, 2019 at 13:32
  • 1
    $\begingroup$ Perhaps you should step back and look at the bigger picture: parametric vs non-parametric statistics: projecteuclid.org/download/pdf_1/euclid.ss/1009213726 $\endgroup$ Commented Feb 2, 2019 at 3:29
  • 1
    $\begingroup$ stats.stackexchange.com/questions/147587/… $\endgroup$ Commented Feb 2, 2019 at 3:30
