Given three training sets: $(X_1, y_1)$, $(X_2, y_2)$, and $(X_3, y_3)$.
These three datasets are separated according to a label that is manually tagged during preprocessing.
Based on these datasets, three classifiers can be trained:
$P(\text{class of } x = 1) = f_1(x)$
$P(\text{class of } x = 1) = f_2(x)$
$P(\text{class of } x = 1) = f_3(x)$
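(For concreteness, here is a minimal sketch of this setup, assuming scikit-learn and binary labels; the three $(X_i, y_i)$ arrays below are random placeholders, not my real data.)

```python
# Minimal sketch of the setup: one classifier per manually tagged subset.
# The datasets here are random placeholders (assumption, not real data).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
datasets = [
    (rng.normal(size=(100, 5)), rng.integers(0, 2, size=100))
    for _ in range(3)
]

# f_i(x) = P(class of x = 1) estimated from dataset i.
classifiers = [LogisticRegression().fit(X, y) for X, y in datasets]

def f(i, x):
    """Probability that x has class 1 according to classifier i."""
    return classifiers[i].predict_proba(x.reshape(1, -1))[0, 1]
```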
When a new data point comes in, the manually tagged label is missing, so we don't know which classifier we should use; however, we know the distribution of the manually tagged labels:
$P(x \text{ belongs to tag } 1)$, $P(x \text{ belongs to tag } 2)$, and $P(x \text{ belongs to tag } 3)$.
Does it make sense to say that the optimal model is
$$P(\text{class of } x = 1) = P(x \text{ belongs to tag } 1) \cdot f_1(x) + P(x \text{ belongs to tag } 2) \cdot f_2(x) + P(x \text{ belongs to tag } 3) \cdot f_3(x)?$$
If yes, are there any references related to this approach?
If no, how should I embed this information into the model?
Note: Due to technical constraints, the sizes of the three training sets do not match $P(x \text{ belongs to tag } 1)$, $P(x \text{ belongs to tag } 2)$, and $P(x \text{ belongs to tag } 3)$.
For instance, the three datasets may have the same size, but
$P(x \text{ belongs to tag } 1) = 0.5$
$P(x \text{ belongs to tag } 2) = 0.3$
$P(x \text{ belongs to tag } 3) = 0.2$
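(Continuing the sketch above, the proposed combination with these example tag probabilities would look like this; `x_new` is a placeholder input.)

```python
# Proposed weighted mixture, using the example tag probabilities above.
tag_probs = [0.5, 0.3, 0.2]  # P(x belongs to tag i)

def mixture_prob(x):
    """P(class of x = 1) as the tag-probability-weighted average of f_i(x)."""
    return sum(p * f(i, x) for i, p in enumerate(tag_probs))

x_new = rng.normal(size=5)   # placeholder input (assumption)
print(mixture_prob(x_new))   # i.e. 0.5*f_1(x) + 0.3*f_2(x) + 0.2*f_3(x)
```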
Update: Maybe I can formulate the question in a more mathematically precise way:
Given $k$ datasets $D_1, \dots, D_k$, each dataset consisting of a collection of features $X_i$ and the corresponding labels $y_i$.
The separation of the datasets is based on a preprocessing of the data.
Suppose that $k$ classifiers are trained, one on each dataset; the probability that an input $x$ belongs to class $c$ according to classifier $i$ is written as
$P(x \text{ belongs to } c \mid D_i)$
Due to technical limitations, for a new input $x$ we do not know which classifier we should use, but we know the probabilities
$P(D_i)$
Does it make sense to say that the overall probability can be written as
$$P(x \text{ belongs to } c \mid D) = \sum_i P(x \text{ belongs to } c \mid D_i)\, P(D_i)?$$
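If the events $D_i$ partition the sample space (i.e., every input carries exactly one tag), this looks to me like the law of total probability:
$$P(x \text{ belongs to } c) = \sum_i P(x \text{ belongs to } c \mid D_i)\, P(D_i),$$
where $P(D_i)$ would have to mean the probability that the new input comes from population $i$, not the relative sizes of the training sets (as noted above). Is this interpretation correct?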