Document retrieved from a search engine cache. Original document address: http://mccmb.belozersky.msu.ru/2015/proceedings/abstracts/51.pdf
Last modified: Mon Jun 15 15:40:06 2015
Indexed: Sat Apr 9 23:24:53 2016
Method to predict the percentage of cell types in human blood

Anna Igolkina, Maria Samsonova
St.Petersburg Polytechnical University, Polytechnicheskaya 29, igolkinaanna11@gmail.com

Motivation and Aim: Blood is the most investigated heterogeneous tissue. It contains a variety of cell types, of which the major types are lymphocytes, monocytes, granulocytes, erythrocytes and megakaryocytes (lymphocytes and granulocytes are in turn complex cell groups). Gene expression data from blood genomics studies are widely used in medical diagnosis. Most of these studies are based on the analysis of total peripheral blood mononuclear cells (PBMCs). PBMCs are composed of over a dozen cell types, the proportions of which vary between blood samples from different individuals. This variability significantly influences genome-wide gene expression data. The heterogeneity of blood distorts the data; however, it is often ignored because the composition of the samples is unknown. Applying experimental methods to separate or quantify the constituents of each sample is time-consuming and does not solve the problem. An attractive alternative is therefore to deconvolve the gene expression data accurately. Here we develop a method to predict the percentage of cell types in a blood sample from whole-genome gene expression data.

Materials: We checked our approach on two independent studies that we arranged by combining the available data from databases. The first study contained mouse gene expression samples obtained as mixtures of liver, brain and kidney with known proportions, together with pure cell-type samples. The mixture samples were split into training and test sets; the pure cell-type samples were used as the validation set. In the second study we worked with 4 human blood datasets. The largest of them contains 2000 patients with known gene expression levels and percentages of 5 cell types in blood samples. These samples were divided into a training set (300 samples), a test set and a set for prediction. Validation in the second study was applied to the remaining 3 datasets, which contain pure cell-type samples and whole blood samples (mixtures of 5 blood cell types).

Methods and Algorithms: We built and tested various predictive models based on PCA, linear regression (with and without prior knowledge of cell-type-specific signatures obtained from pure cell types [1-2]), SVM with different kernel types, and a two-level linear regression approach. The last method showed the best predictive ability. To select a gene subset that provides the best prediction of cell proportions, we constructed a heuristic feature selection algorithm consisting of censoring, filtration by an objective function, and consistent subsampling. Prediction on genes obtained by this feature selection procedure showed better results than prediction on specific marker genes for blood cell types [3]. It is noteworthy that both the feature selection and the predictive models were constructed for each cell type independently. To estimate the performance of the different approaches, the Pearson correlation coefficient between the estimated and true cell type proportions was calculated.

Results: We achieved the best estimation of cell type proportions using our heuristic feature selection procedure and the two-level linear regression approach. This approach significantly improved the Pearson correlation between true and estimated cell type proportions, to approximately 0.8-0.95 in both studies (mouse and human samples). This correlation is high enough to support further studies.

Conclusion: We have developed a method that can accurately predict the percentage of cell types from whole-genome gene expression data in artificial mouse samples and human blood samples. Our approach can be used to predict the percentage of cell types in other tissues.

Availability: The MATLAB script is available on request from the author.

Acknowledgements: We thank Lude Franke (UMCG, Groningen) for providing the main human blood dataset of our work, Alexandra Zhernakova (UMCG, Groningen) for comments that improved our biological understanding of the problem, and Natalia Kadyrova (St.Petersburg Polytechnical University) for valuable consultation on SVM theory.

1. Alexander R Abbas et al. (2009) Deconvolution of blood microarray data identifies cellular activation patterns in systemic lupus erythematosus, PLoS ONE 4.7.
2. Ting Gong et al. (2011) Optimal deconvolution of transcriptional profiling data using quadratic programming with application to complex clinical blood samples, PLoS ONE 6.11.
3. Renaud Gaujoux (2013) An introduction to gene expression deconvolution and the CellMix package
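
Illustrative sketch: The abstract lists linear regression with prior knowledge of cell-type-specific signatures among the tested models and uses the Pearson correlation as the performance measure. The Python sketch below shows one simple way such a pipeline could look; it is not the authors' MATLAB script and it does not reproduce the two-level regression or the feature selection procedure, whose details are not given here. The signature values, the proportions, the function names and the clipping and rescaling step are all assumptions made for illustration, and the mixtures are noise-free, so the correlations should come out essentially at 1.

```python
# Illustrative sketch only: signature-based deconvolution by ordinary least
# squares, scored with the Pearson correlation. All numbers are made up.
import math


def solve_linear_system(a, b):
    """Solve a x = b for a small square system (Gauss-Jordan, partial pivoting)."""
    n = len(b)
    # Augmented matrix [a | b]; copying keeps the inputs untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def mix(signatures, proportions):
    """Expression of a mixture: per gene, the proportion-weighted sum of signatures."""
    return [sum(s * p for s, p in zip(row, proportions)) for row in signatures]


def estimate_proportions(signatures, mixture):
    """Least-squares fit of mixture ~ signatures * p via the normal equations.

    signatures[g][k] is the expression of gene g in pure cell type k.
    """
    n_types = len(signatures[0])
    sts = [[sum(row[i] * row[j] for row in signatures) for j in range(n_types)]
           for i in range(n_types)]
    sty = [sum(row[i] * y for row, y in zip(signatures, mixture))
           for i in range(n_types)]
    p = solve_linear_system(sts, sty)
    # Assumption (not from the abstract): clip negatives and rescale to sum to 1.
    p = [max(v, 0.0) for v in p]
    total = sum(p)
    return [v / total for v in p]


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical signatures: 4 genes (rows) x 3 tissues (liver, brain, kidney).
SIGNATURES = [
    [10.0, 1.0, 2.0],
    [1.0, 8.0, 1.0],
    [2.0, 1.0, 9.0],
    [5.0, 4.0, 3.0],
]
# Hypothetical true proportions for three artificial mixture samples.
TRUE_PROPORTIONS = [[0.7, 0.2, 0.1], [0.2, 0.5, 0.3], [0.1, 0.3, 0.6]]

estimated = [estimate_proportions(SIGNATURES, mix(SIGNATURES, p))
             for p in TRUE_PROPORTIONS]

# Score each tissue separately, across samples, as the abstract does per cell type.
for k, name in enumerate(["liver", "brain", "kidney"]):
    true_k = [p[k] for p in TRUE_PROPORTIONS]
    est_k = [p[k] for p in estimated]
    print(name, round(pearson(true_k, est_k), 4))
```

With real data the mixtures are noisy and the signature matrix is itself estimated, so a constrained solver (for example the quadratic programming formulation of [2]) would normally replace the crude clipping step used here.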