### INTRODUCTION

### MATERIALS AND METHODS

### 1) Data source

#### (1) Tissue sample preparation

#### (2) Microarray

### 2) Data normalization

### 3) Data transformation

*AN* and *AT* are the normal and tumor groups in data set A.

*AN'* and *AT'* are the transformed expression ratios of the normal and tumor groups in data set A.

$n_{AN}$ and $n_{AT}$ are the numbers of experiments in the normal and tumor groups of data set A.

$n_{BN}$ and $n_{BT}$ are the numbers of experiments in the normal and tumor groups of data set B.

### 4) Evaluation of data set integration

*mixture score*.

#### (1) Boxplot

#### (2) Dendrogram

#### (3) Density plot for the gene expression distribution

#### (4) Plots for the two principal components (PC)

#### (5) Correlation coefficient

*x* and *y* are two experiments. $x_{i}$ is the $i^{\text{th}}$ gene in experiment *x*, and *n* is the number of genes in experiment *x*.
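The definitions above belong to a correlation coefficient computed between two experiments. As a minimal sketch, assuming the Pearson correlation coefficient (the measure named in the Results) is what is meant, the computation could look like the following. The function name `pearson_correlation` and the representation of an experiment as a plain list of per-gene expression values are illustrative assumptions, not part of the original method description.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation between two experiments.

    x, y: equal-length sequences of expression values, one per gene
    (hypothetical representation chosen for this sketch).
    """
    if len(x) != len(y):
        raise ValueError("both experiments must cover the same genes")
    n = len(x)  # number of genes
    if n < 2:
        raise ValueError("at least two genes are required")

    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Sum of cross-products and sums of squares around the means.
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)

    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant experiment")
    return cov / math.sqrt(ss_x * ss_y)
```

For example, `pearson_correlation([1, 2, 3], [2, 4, 6])` evaluates to 1.0, because the second experiment is an exact positive multiple of the first.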

#### (6) Mixture score

The *mixture score* was defined to evaluate the efficiency of the proposed integration method. The principle of this metric is to measure the proportion of experiments from data set A among the *k*-nearest neighbors (*k*NNs) of each experiment in data set B. The metric was calculated as follows, where *k* is the number of nearest neighbors (NNs):

$$\text{Mixture score} = \frac{\#\{x \mid x \in k\text{NNs}(\text{data set B}) \cap (\text{data set A})\}}{k}$$

Here, *x* is any experiment that belongs both to the *k*NNs of data set B and to data set A.
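The metric can be illustrated with a short sketch. It assumes that, for each experiment in data set B, the *k* nearest neighbors are searched among all other experiments of both data sets, and that the per-experiment ratios are averaged into a single score. The averaging step, the function names, and the list-of-lists data layout are assumptions made for illustration; the text does not specify them.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two experiments (lists of expression values)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mixture_score(data_a, data_b, k, distance=euclidean_distance):
    """Mean fraction of data set A experiments among the k nearest
    neighbors of each data set B experiment.

    Averaging over data set B is an assumption of this sketch.
    """
    if not data_b:
        raise ValueError("data set B must contain at least one experiment")
    # Pool both data sets, remembering where each experiment came from.
    pool = [(exp, "A") for exp in data_a] + [(exp, "B") for exp in data_b]
    if k < 1 or k > len(pool) - 1:
        raise ValueError("k must be between 1 and the number of other experiments")

    offset = len(data_a)
    ratios = []
    for j, query in enumerate(data_b):
        self_index = offset + j
        # Distances from the query to every other experiment in the pool.
        candidates = [
            (distance(query, exp), label)
            for idx, (exp, label) in enumerate(pool)
            if idx != self_index
        ]
        candidates.sort(key=lambda item: item[0])
        neighbors = candidates[:k]
        from_a = sum(1 for _, label in neighbors if label == "A")
        ratios.append(from_a / k)
    return sum(ratios) / len(ratios)
```

A correlation-based dissimilarity (for instance, one minus a correlation coefficient) could be passed as the `distance` argument in place of the Euclidean distance. With this sketch, two data sets that remain completely separated give a score of 0, and the score grows as experiments from data set A appear among the neighbors of data set B experiments.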

### RESULTS

*mixture scores* increased by as much as 24.2% as the number of NNs was increased, suggesting that the two different data sets were well intermixed. In addition, the values were similar whether the Euclidean distance or the Pearson correlation coefficient was used as the similarity measure.

### DISCUSSION

*mixture score*), can be interpreted to mean that the experimental bias was removed when the value was large. The two data sets were mixed well by the proposed method, but the *mixture score* was less than 25%, which is lower than the value expected for an ideal, perfect mixture, suggesting that the two data sets were not yet perfectly intermingled. This may have been caused by the characteristics of the experiments included in the two data sets. The proposed method was more effective in the tumor group, which is biologically a more heterogeneous population, than in the normal group (Table 1). Therefore, the current method might be more effective for experiments with larger variation among them, as in the tumor group. In addition, in a comparison of the average correlation coefficients, the tumor groups had lower correlation coefficients than the normal groups, suggesting that the tumor groups were more heterogeneous; this may have been due to the various tumor stages within the group. Consequently, the tumor groups were intermixed better by the proposed integration method than the normal groups.