Abstract
Background: Protein structure classification plays a central role in understanding the function of a protein molecule with respect to all known proteins in a structure database. With the rapid increase in the number of new protein structures, the need for automated and accurate methods for protein classification is increasingly important.
Results: In this paper we present a unified framework for protein structure classification and identification of novel protein structures. The framework consists of a set of components for comparing, classifying, and clustering protein structures. These components allow us to accurately classify proteins into known folds, to detect new protein folds, and to provide a way of clustering the new folds. In our evaluation with SCOP 1.69, our method correctly classifies 86.0%, 87.7%, and 90.5% of new domains at the family, superfamily, and fold levels. Furthermore, for protein domains that belong to new domain families, our method is able to produce clusters that closely correspond to the new families in SCOP 1.69. As a result, our method can also be used to suggest new classification groupings that contain novel folds.
Conclusion: We have developed a method called proCC for automatically classifying and clustering domains. The method is effective in classifying new domains and suggesting new domain families, and it is also efficient.
/www.eecs.umich.edu/periscope/procc
Background
Classifying protein domains based on their tertiary structure provides a valuable resource that can be used to understand protein function and evolutionary relationships [1]. As a result, several classification databases [1-3] have been developed, of which SCOP [1] and CATH [2] are the most widely used. Both databases are hierarchically organized and use protein domains as the basic unit of classification. While SCOP and CATH are valuable resources for biologists, these databases are updated only intermittently; for example, over the past three years SCOP has been updated roughly every six months and CATH roughly once a year. Updating these databases requires varying degrees of semi-automatic methods and manual interpretation. As a result, a newly deposited protein structure appears in the classification hierarchy only at the next release cycle of these databases. Meanwhile, the number of newly determined protein structures is growing rapidly. For example, more than 5000 structures were deposited in the PDB over the last year. Also, the number of structures in the PDB today is double the number of structures in 2000 [4]. The increasing number of new protein structures makes the need for an automated classification tool even more critical.
It is unlikely that automated methods can produce classification hierarchies of the same quality as the semi-automated methods used in SCOP and CATH, or match the subtle judgments that an experienced biologist brings to the classification task. However, given the growing number of protein structures, automated methods can play (and currently do play) an important role as a preprocessing step for producing manually tuned classification hierarchies.
Recognizing this need, several automatic domain classification methods [5-9] have been developed recently. Superfamily [5] is based only on sequence comparison criteria. It is efficient, but it often fails to classify remote homologs of structurally similar proteins. The F2CS [6] and SGM [7] methods are based only on structure comparison. They are computationally very efficient and accurate for classification at the fold level, but not necessarily at the superfamily and family levels. More recent methods [8,9] combine sequence and structure information for classification and make the classification decision based on a consensus of multiple sequence and structure comparisons. In general, these methods are more accurate than the earlier methods, but they are computationally more expensive.
An important issue in automated protein classification is the ability of the tool to detect new classes (i.e., to detect novel folds). Detecting such new classes is important because new domain structures are continually found in newly determined protein structures. Information about new classes can be used to better understand the new structures, and it can also help the human curators place the new structures in the next version of the classification hierarchy. While many existing protein classification methods are very good at classifying new domains into existing classes, their effectiveness in detecting new classes is quite limited. Of the existing classification tools, SGM [7] and SCOPmap [8] can be used to detect new classes, and as we show in this paper, our method is significantly more effective at detecting new classes than these two methods.
In this paper, we present proCC, an automatic, accurate, and efficient classification system that consists of three components. Given a query domain, the comparison component first finds the structures in the database that are similar to the query. Then, based on these results, the classification component either assigns the query to an existing class label or marks the query as unclassified, indicating that the query domain is potentially a novel fold. Finally, the clustering component takes all the domains that were marked as unclassified and runs a clustering method to identify novel folds.
Together, these components provide a unified and automated protein domain classification tool. To demonstrate the capabilities of our method, we tested it by predicting the classification of the new domains in SCOP 1.69 based on prior knowledge of the previous version of SCOP (version 1.67). Our experimental results show that the accuracy of our method is 86.0%, 87.7%, and 90.5% at the family, superfamily, and fold levels. We also compare our method with SGM and SCOPmap, and show that it is about 15-19% more accurate than SGM and comparable in accuracy to SCOPmap. However, SCOPmap classifies only at the superfamily and fold levels, whereas our tool also provides classification at the family level. For detecting novel folds, the predictions made by proCC are 20% better than those of SCOPmap. Our experimental evaluation also shows that our method produces clusters that closely correspond to the new families in SCOP 1.69.
Results
Experimental setup and datasets
In this section, we present results that measure the effectiveness of our classification method. For the experimental evaluation, we used the experimental strategy employed in previous studies [8,9]: the domains in an older version of SCOP are used as the database with known class labels, and the domains newly added in a newer version of SCOP are used as the query set. Classification accuracy is measured by comparing the predicted labels with the (known) labels in the newer version of SCOP. In our experiments, SCOP 1.67 and SCOP 1.69 are used as the database and the source of queries, respectively.
SCOP 1.67 and 1.69 contain 65122 and 70859 domains, which are grouped into 2630 and 2845 families, respectively. However, in our evaluation, theoretical domains and domains with fewer than 3 SSEs were excluded. After these exclusions, we have 58456 and 63745 domains in SCOP 1.67 and 1.69, respectively. Our database is the set of 58456 domains in SCOP 1.67, and our queries are the 5289 newly added domains in SCOP 1.69. We used the ASTRAL Compendium [10] to obtain PDB-style coordinate files for these SCOP domains. In addition, we used the STRIDE program [11] to generate secondary structure assignments for each domain.
Our implementation is written in C++ and uses the LEDA 3.2R package for maximum bipartite graph matching, and the SVMlight [12] package. The SVM model was trained using SCOP 1.65 and SCOP 1.67 (see the structure classification section in Methods). We used a radial basis function kernel, with its weight set according to the ratio of the number of positive examples to the number of negative examples. All experiments were run on a 2.2 GHz Opteron machine with 4 GB of RAM running the Linux 2.6.9 kernel. Throughout this section, we use the term class to refer to a class in the classification scheme.
Experimental evaluation
Accuracy and computational cost
To measure the effectiveness of the classification method, we compare the predicted classification label (at the fold, superfamily, and family levels) with the actual label in SCOP 1.69 using the following metrics:
\[
\text{Overall accuracy} = \frac{CC + UN}{TE + TN}, \qquad
\text{Classification error ratio} = \frac{CI}{CC + CI}, \qquad
\text{New class detection ratio} = \frac{UN}{TN}
\]
In the equations above, CC is the number of correctly classified domains and CI is the number of incorrectly classified domains. UN is the number of domains with novel structures that are not in any existing class and are therefore correctly labeled as unclassified. UE is the number of domains that should have been classified into existing classes but were labeled as unclassified by our method. (Note that CC + CI is the total number of domains that are assigned some label by our method, and UN + UE is the total number of domains labeled as unclassified by our method.) TE is the total number of query domains in classes that are common to SCOP 1.67 and SCOP 1.69, and TN is the total number of query domains in classes that are new in SCOP 1.69.
Overall accuracy measures how many proteins are either correctly classified or correctly labeled as unclassified. The classification error ratio measures, among the query domains that were assigned a label, the fraction whose assigned label differs from the actual label. The new class detection ratio measures how effectively the method identifies domains that are in new classification classes.
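The three metrics can be worked through for one row of Table 1. The following is a minimal sketch, not part of the proCC implementation; the `Counts` class and function names are hypothetical, and the mapping of the six family-level counts to CC, CI, UN, UE, TN, and TE is an assumption about how the table columns are ordered.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Per-level tallies, named after the symbols used in the text."""
    cc: int  # correctly classified into an existing class
    ci: int  # assigned a label that turned out to be wrong
    un: int  # novel domains correctly left unclassified
    ue: int  # domains of existing classes left unclassified
    tn: int  # all query domains in classes new to SCOP 1.69
    te: int  # all query domains in classes shared by SCOP 1.67 and 1.69


def overall_accuracy(c: Counts) -> float:
    return (c.cc + c.un) / (c.te + c.tn)


def classification_error_ratio(c: Counts) -> float:
    return c.ci / (c.cc + c.ci)


def new_class_detection_ratio(c: Counts) -> float:
    return c.un / c.tn


# Family-level counts as read from Table 1 (column order is an assumption).
family = Counts(cc=4008, ci=347, un=555, ue=379, tn=726, te=4563)

print(f"overall accuracy:           {overall_accuracy(family):.3f}")
print(f"classification error ratio: {classification_error_ratio(family):.3f}")
print(f"new class detection ratio:  {new_class_detection_ratio(family):.3f}")
```

Under this reading the four outcome counts sum to the 5289 query domains, as do TN and TE, which is a useful sanity check on the column assignment.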
The results of this experiment are shown in Table 1. As can be seen from this table, our classification method is quite accurate, both in classifying domains and in identifying domains that are in new classification classes. As for the computation time for classification, the computational cost is linearly proportional to the number of SSE triplets in the query. The average number of SSEs in a domain is about 77, and for queries of this size our method takes about 30 seconds of execution time. Of this computation time, the index matching component takes about 38% (this lookup is 8 times faster than a full scan of a file containing all the SSE triplets). About 56% of the computation time is spent in the overall structure matching component (the bipartite graph matching method), and the remainder is spent on program setup, output processing, and SVM classification (see the Methods section for a description of these components).
Comparison with other methods
Several methods for automatic classification have been proposed previously [5,7-9]. In evaluating effectiveness, we considered comparing our method with each of these methods. However, some of these methods are not suitable for comparison, for the following reasons. A fair comparison with Superfamily [5] is currently not possible, since the comparison requires a SCOP 1.67 Hidden Markov Model, and this model is not currently available (personal communication, Derek Wilson). A comparison with [9] is not possible because neither its implementation nor its result dataset is available.
Therefore, in this section we compare our method with SGM [7] and SCOPmap [8]. SGM is a classification method based on 30-dimensional Gauss integrals of protein structures and nearest-neighbor classification. The SGM method has been shown to be very effective for classifying CATH. SCOPmap is a consensus-based method that uses seven different sequence and structure comparison methods. SCOPmap has been compared extensively with Superfamily and has been shown to be much more accurate than Superfamily [8].

Table 1: Classification results for the new domains in SCOP 1.69

| Level | Classified: correct (CC) | Classified: incorrect (CI) | Unclassified: new classes (UN) | Unclassified: existing classes (UE) | Total: new classes (TN) | Total: existing classes (TE) | Overall accuracy, (CC+UN)/(TN+TE) | Classification error, CI/(CC+CI) | New class detection ratio, UN/TN |
|---|---|---|---|---|---|---|---|---|---|
| Family | 4008 | 347 | 555 | 379 | 726 | 4563 | 86.3% | 8.0% | 76.5% |
| Superfamily | 4321 | 154 | 292 | 522 | 353 | 4936 | 87.2% | 3.4% | 82.7% |
| Fold | 4597 | 159 | 153 | 380 | 209 | 5080 | 90.1% | 3.3% | 75.0% |

This table shows the result of classifying the 5289 new domains in SCOP 1.69 using proCC.
Comparison with SGM
Before presenting the results with SGM, we note that the performance of SGM can vary depending on tunable parameters, such as the distance ratio threshold. We experimented with various parameter settings for SGM and found that a setting that increases the new class detection ratio (or decreases the classification error ratio) also decreases the overall accuracy. To pick a reasonable baseline for comparison, we chose the SGM parameter values that yield a new class detection ratio similar to that of our method. Using this approach, we arrive at distance ratio cutoff values of 1.22, 1.23, and 1.23 at the family, superfamily, and fold levels, respectively.
The results comparing SGM and proCC on the 5289 new SCOP 1.69 domains are shown in Table 2. Although SGM is very effective for classifying CATH, the method is not as successful with SCOP. The results show that our method is 15-19% more accurate than SGM at the family, superfamily, and fold levels, and that it makes fewer misclassification errors.
We also evaluated proCC and compared it with SGM using CATH (SGM was originally only tested against CATH). We used CATH 2.0 and CATH 2.4 as the database and the source of query domains, respectively. In classifying CATH, the overall accuracy of the SGM method is 93.9%, 94.5%, 94.7%, and 97.1% at the H, T, A, and C levels, whereas the overall accuracy of our method is 94.1%, 95.6%, 95.2%, and 97.6% at the same levels. Compared with SCOP, both methods produce more accurate results with CATH. However, higher accuracy with CATH is expected, since CATH uses a broader definition of folds, i.e., there are fewer folds in the CATH classification than in SCOP [13].
In addition, we compared the sensitivity and specificity of proCC with those of SGM, and plotted standard ROC curves.
Comparison with SCOPmap
In this section, we present the results of comparing SCOPmap and our proCC method. In comparing with SCOPmap, we note that SCOPmap takes a query protein chain as input, identifies the domains by aligning the query protein chain against the sequences and structures in its database, and assigns classification labels to the identified domains. Thus, for the comparison with SCOPmap, we first ran a domain prediction method on the query chains to identify the domain boundaries. We then applied our classification method to the identified domains. To predict the domain boundaries, we used the SSEP-Domain method [14], which performed well in the CAFASP4-DP competition [15].
We compared SCOPmap and proCC using the 2773 new single-domain chains in SCOP 1.69. Multi-domain chains are excluded from this experiment because they make it difficult to measure effectiveness objectively: with multi-domain chains, the number of predicted domains, the predicted domain boundaries, and the number of correct domain classification assignments all need to be considered, and there is no systematic way to separate these effects from the actual classification effectiveness that we aim to evaluate.
Initially, we tried to run SCOPmap on the 2773 chains. However, running SCOPmap on these 2773 chains requires very large computational resources, taking about 2-3 hours to process each individual chain (personal communication, Sara Cheek, 2006). Because of this high computational cost, new proteins are typically classified using a large cluster, and the classification results are published at ftp://iole.swmed.edu/pub/scopmap. Therefore, we compared our method with SCOPmap based on the latest results on the SCOPmap ftp site.
Finally, while our method can predict family, superfamily, and fold labels, SCOPmap primarily predicts the superfamily label, and predicts the fold label only when it cannot assign a superfamily label. SCOPmap never predicts family membership. Since the main classification prediction made by SCOPmap is at the superfamily level, for this evaluation we compared classification effectiveness only at this level. The results of this evaluation are shown in Table 3. From this table, we can make the following observations:

Table 2: Comparison between SGM and proCC

| Level | Overall accuracy (SGM) | Overall accuracy (proCC) | Classification error ratio (SGM) | Classification error ratio (proCC) | New class detection ratio (SGM) | New class detection ratio (proCC) |
|---|---|---|---|---|---|---|
| Family | 71.3% | 86.3% | 19.7% | 8.0% | 77.4% | 76.5% |
| Superfamily | 69.6% | 87.2% | 17.0% | 3.4% | 82.2% | 82.7% |
| Fold | 71.3% | 90.1% | 15.7% | 3.3% | 76.6% | 75.0% |

This table shows the results of comparing SGM and proCC in classifying the 5289 new domains in SCOP 1.69.
Overall accuracy
Examining the overall accuracy column of Table 3, we observe that the overall accuracy of SCOPmap is slightly lower than that of proCC combined with the SSEP-Domain prediction method. From the incorrect domain boundary (ID) column, we also observe that the SSEP-Domain prediction method identifies single-domain chains better than SCOPmap does. To separate the effect of domain prediction from the classification accuracy, we compute an adjusted overall accuracy as (CC+UN)/(2773-ID), which counts only the chains whose domain boundary was identified correctly. This adjusted overall accuracy is 89.1% and 86.7% for SCOPmap and proCC, respectively. Thus, once the effect of domain prediction is factored out, the overall accuracy of proCC is slightly lower than, but comparable to, that of SCOPmap. An added advantage of our approach is that it can be combined with any domain prediction method, which allows it to easily take advantage of improvements in domain prediction methods.
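The overall and adjusted accuracies can be recomputed from the counts in Table 3. This is a minimal sketch with hypothetical function names. It assumes the adjusted accuracy is (CC + UN) / (2773 - ID), the form that reproduces the reported SCOPmap value from the table counts.

```python
TOTAL_CHAINS = 2773  # single-domain query chains used in the comparison


def overall_accuracy(cc: int, un: int, total: int = TOTAL_CHAINS) -> float:
    """Fraction of all chains that were correctly classified or correctly left unclassified."""
    return (cc + un) / total


def adjusted_accuracy(cc: int, un: int, misidentified: int,
                      total: int = TOTAL_CHAINS) -> float:
    """Same numerator, but chains with a wrong domain boundary (ID) are left out of the denominator."""
    return (cc + un) / (total - misidentified)


# Counts as read from Table 3: correct (CC), new classes (UN), incorrect boundary (ID).
table3 = {
    "SCOPmap": {"cc": 2069, "un": 190, "id": 237},
    "proCC": {"cc": 2025, "un": 246, "id": 152},
}

for method, row in table3.items():
    overall = overall_accuracy(row["cc"], row["un"])
    adjusted = adjusted_accuracy(row["cc"], row["un"], row["id"])
    print(f"{method}: overall={overall:.3f} adjusted={adjusted:.3f}")
```

The adjustment raises both figures, and it raises SCOPmap's more, because SCOPmap loses more chains to incorrect domain boundaries.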
Novel structure detection
From the new class detection ratio column of Table 3, we observe that our method is 20% more accurate than SCOPmap in detecting novel structures. The reason for this difference is that SCOPmap aggressively classifies a query into a known classification class whenever any of its seven sequence and structure comparison methods finds a significant match to the query. This approach can be effective when the query belongs to a known class, but it is vulnerable to making incorrect predictions for queries with novel structures, especially when the classification boundaries for these structures are ambiguous. On the other hand, our method bases its classification decision on an explicit decision model that separates novel protein structures from known ones, and that is learned from the prior classification database of known protein structures.
Computational cost
As for the computation time (see the last column of Table 3), our method is significantly faster than SCOPmap. While SCOPmap takes 2-3 hours per query on average, our method can classify a query in 9 minutes on average. Of these 9 minutes, 8 minutes on average are spent in the SSEP-Domain prediction web service, and 1 minute on average is spent in our classification method. We recognize that the high computational cost of SCOPmap can be met by using a large cluster. While this solution may be practical in some cases (though very expensive), as the rate at which new structures are produced increases, it may be more practical to use a much cheaper solution such as proCC, which has comparable accuracy and provides more flexibility, since it can be coupled with any domain prediction tool. In addition, unlike SCOPmap, proCC also predicts family-level labels. Such predictions are useful, since it is known that domains within the same superfamily can be functionally diverse, and a finer classification at the family level is more useful for predicting domain function [16].
Detection and clustering of novel families, superfamilies, and folds
From the query set of 5289 domains, our classification method labels 934 domains as unclassified. As a way of identifying and describing novel families, superfamilies, and folds among these unclassified domains, we ran the MCL clustering algorithm on a graph constructed using these unclassified domains. To construct a graph for clustering, a threshold value for structure similarity is required (see the identification and clustering of novel structures section in Methods). In addition, for the clustering at the different SCOP levels, different threshold values are required. For this experiment, we set the threshold value to 0.4, 0.32, and 0.3 for the family, superfamily, and fold levels, respectively, based on the observation that more than 90% of the correctly classified proteins have a similarity score higher than these values with their nearest structure at the SCOP family, superfamily, and fold levels.

Table 3: Comparison between SCOPmap and proCC using predicted SCOP superfamily labels

| Method | Classified with correct domain boundary: correct (CC) | Classified with correct domain boundary: incorrect (CI) | Unclassified with correct domain boundary: new classes (UN) | Unclassified with correct domain boundary: existing classes (UE) | Incorrect domain boundary (ID) | Overall accuracy, (CC+UN)/2773 | New class detection ratio, UN/307 | Estimated average running time |
|---|---|---|---|---|---|---|---|---|
| SCOPmap | 2069 | 65 | 190 | 212 | 237 | 81.5% | 61.9% | 2-3 hours/query |
| proCC | 2025 | 75 | 246 | 275 | 152 | 81.9% | 80.1% | 9 minutes/query |

This table shows the result of classifying the 2773 single-domain chains in SCOP 1.69. All the numbers reported in columns 2-6 are numbers of chains (or, equivalently, domains, since we use single-domain chains). Columns 4-5 show the number of single-domain chains that were correctly identified as single-domain chains and labeled as unclassified. Column 6 shows the number of single-domain chains that were incorrectly identified as multi-domain chains.
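The graph construction step that precedes MCL can be sketched as follows. This is only an illustration of thresholding pairwise similarity scores into a weighted graph; it does not run MCL itself. The function name, the input format, and the choice to keep an edge when the score is greater than or equal to the threshold are assumptions, and the toy scores are invented for the example.

```python
from collections import defaultdict

# Threshold values quoted in the text for each SCOP level.
THRESHOLDS = {"family": 0.4, "superfamily": 0.32, "fold": 0.3}


def build_similarity_graph(scores, level):
    """Return an undirected weighted graph as an adjacency dict.

    scores maps a pair (domain_a, domain_b) to a structure similarity score.
    Only pairs scoring at or above the level's threshold become edges, so
    domains without any such neighbour do not appear in the result.
    """
    cutoff = THRESHOLDS[level]
    graph = defaultdict(dict)
    for (a, b), score in scores.items():
        if a != b and score >= cutoff:
            graph[a][b] = score
            graph[b][a] = score
    return dict(graph)


# Hypothetical pairwise scores between four unclassified domains.
toy_scores = {
    ("d1", "d2"): 0.55,
    ("d1", "d3"): 0.35,
    ("d2", "d3"): 0.31,
    ("d3", "d4"): 0.10,
}

family_graph = build_similarity_graph(toy_scores, "family")
fold_graph = build_similarity_graph(toy_scores, "fold")
print(sorted(family_graph))  # domains that keep at least one edge at the family level
print(sorted(fold_graph))    # the looser fold threshold keeps more edges
```

A lower threshold admits weaker similarities as edges, which is why the fold-level graph is denser than the family-level one and tends to produce fewer, larger clusters.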
To measure the ability of our automated method to detect new SCOP families, we compared the automatically generated clusters with the family-level classes in SCOP 1.69. The 934 unclassified domains are spread across 320 families in SCOP 1.69. For these domains, the automated method produced 358 clusters. To examine the agreement between SCOP and the automatically generated clusters, we created class labels for the clusters based on the most common family label in each cluster. Based on this class label assignment, each SCOP family is associated with zero or one cluster that has the same class label. If more than one cluster maps to the same SCOP family, we select only one of the automatically generated clusters: the cluster in which the number of domains that correctly match the SCOP family label is the highest among the set of clusters with the same SCOP family label. We counted the total number of "correctly" mapped common clusters/families, and found that there are 301 common clusters between the two clusterings. Then, for each correctly mapped cluster, we counted the number of actual domains in the cluster that have the same SCOP family label. This total is 822, which is 88% of the total number of unclassified domains.
Using the same method, we also computed the clustering effectiveness at the superfamily and fold levels. These results are shown in Table 4.
In Table 4, of the 358 clusters detected at the family level, 159 clusters correspond to 159 new families in SCOP 1.69, which is 74% of the 215 new families introduced in SCOP 1.69. Of the 327 clusters detected at the superfamily level, 62 clusters correspond to 62 new superfamilies in SCOP 1.69, which is 65% of the 95 new superfamilies introduced in SCOP 1.69. Of the 318 clusters detected at the fold level, 46 clusters correspond to 46 new folds in SCOP 1.69, which is 75% of the 61 new folds introduced in SCOP 1.69.
In addition, we measure the quality of the automatically generated clusters by measuring their degree of homogeneity. Cluster purity is highest when all the domains in the same cluster agree on their class labels, and it is defined as:
\[
\text{ClusterPurity}(C, S) = \frac{1}{N} \sum_{c \in C} \max_{s \in S} |c \cap s|
\]
In the equation above, C is the set of MCL clusters, c is a cluster in C, S is the set of SCOP families, s is a family in S, and N is the total number of domains.
Using this measure, the cluster purity of the MCL clustering is 0.96, 0.95, and 0.96 at the SCOP family, superfamily, and fold levels, respectively. These high cluster purity values indicate that our clustering method produces clusters that have a high degree of agreement with the SCOP classes. An example of automatically clustered new SCOP families is shown in Figure 1.
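The cluster purity measure and the majority-label assignment can be illustrated on a toy clustering. This is a minimal sketch with hypothetical function names and invented labels; each cluster is given simply as the list of SCOP labels of its member domains.

```python
from collections import Counter


def majority_label(labels):
    """Most common SCOP label in a cluster, used as the cluster's class label."""
    return Counter(labels).most_common(1)[0][0]


def cluster_purity(clusters):
    """(1/N) * sum over clusters of the size of the largest same-label group.

    clusters maps a cluster id to the list of SCOP labels of its members,
    and N is the total number of domains across all clusters.
    """
    total = sum(len(labels) for labels in clusters.values())
    if total == 0:
        raise ValueError("no domains to evaluate")
    agreeing = sum(
        max(Counter(labels).values())
        for labels in clusters.values()
        if labels
    )
    return agreeing / total


# Hypothetical clustering of five domains into two clusters.
toy_clusters = {
    "c1": ["famA", "famA", "famB"],  # two of three members agree
    "c2": ["famC", "famC"],          # fully homogeneous
}

purity = cluster_purity(toy_clusters)
print(majority_label(toy_clusters["c1"]), purity)
```

In the toy case four of the five domains carry their cluster's majority label. Purity rewards homogeneous clusters but does not penalize splitting one family across several clusters, which is why the text also reports the count of correctly mapped common clusters.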
Discussion
Applications for efficient structure comparison
In general, existing protein classification methods focus on classifying new domains into an existing classification hierarchy. However, it has been observed that domains previously classified in SCOP are often rearranged in subsequent releases, since new structures sometimes reveal additional relationships between new and existing domains [18]. Therefore, in addition to classifying new structures, such potential rearrangements also need to be detected automatically.
One way to approach this problem is to run an all-against-all comparison of the existing and new domains and then generate clusters using a clustering method. For example, if the introduction of a new domain provides evidence that links previously unrelated domains, a cluster consisting of these domains may be found, suggesting some potential rearrangements involving these domains.

Table 4: Clustering effectiveness at the SCOP family, superfamily, and fold levels

| Level | SCOP classes (A) | MCL clusters (B) | # common clusters/classes (C) | # of correctly labeled domains (D) |
|---|---|---|---|---|
| Family | 320 | 358 | 301 | 822 (88%) |
| Superfamily | 260 | 327 | 234 | 731 (78%) |
| Fold | 200 | 318 | 191 | 670 (72%) |

This table shows the result of clustering 934 unclassified domains at the SCOP family, superfamily, and fold levels. Column 2 shows the number of SCOP families, superfamilies, and folds that these 934 domains are spread across. Column 3 shows the number of automatically generated clusters at each SCOP level. Column 4 shows the number of common clusters/SCOP classes that were correctly mapped. Column 5 shows the number of actual domains in the cluster that had the same label as the corresponding SCOP class.

Figure 1
Assessing the quality of the automatically generated clusters. This figure shows the automatically generated family-level clusters for the unclassified domains in the SCOP 1.69 "d" class (i.e. the alpha and beta proteins (a+b)). This figure also shows the representative domain structures for each cluster. A connected graph corresponds to an automatically detected MCL cluster. The ellipses indicate the novel families in SCOP 1.69. The MCL clusters are assigned a family-level label based on the most common family-level label in the cluster. Within a cluster, the nodes with the same color indicate that all these nodes have the same family-level label. To keep this figure simple, only clusters with more than four domains are shown. There are an additional 79 clusters that matched the SCOP family label, and of these 30 clusters correspond to new families in SCOP 1.69. This figure was generated using BioLayout [31] and PyMol [32].
In performing this task, along with the clustering method, an efficient and accurate structure comparison method is essential, since every pair of structures needs to be compared (O(n^2) comparisons). Our structure comparison method (see the structure comparison section in Methods) is very efficient and may be a suitable choice for this task.
Incorporating this functionality into our classification system, so that potential rearrangements of the classification hierarchy are detected on an ongoing basis, will be part of our future work.
Integration with domain prediction methods
To make domain classification truly automatic, given a protein structure, one must first identify the domain boundaries. The domain boundary prediction problem has been recognized as a crucial component of functional classification and structure prediction [14], and there are a number of competing domain prediction methods [19]. The proCC method provides a framework that allows our classification method to be coupled with any domain prediction tool. Although we used the SSEP-Domain method in our current study, other domain prediction methods, such as Rosetta-Ginzu [20], which is more accurate but slower, could potentially be used to obtain even better classification results. In addition, the loose coupling between the classification and domain prediction components allows us to easily take advantage of future advances in domain boundary prediction methods.
Conclusion
In this paper, we have described a method called proCC for automatically classifying proteins. Using an extensive experimental evaluation, we have shown that our method often has higher accuracy than existing automated methods. Our method is also very effective at predicting novel folds, and it is very efficient. While our method cannot completely eliminate the manual intervention that is always needed to produce high-quality classification hierarchies such as SCOP and CATH, it can provide a valuable complementary method for classifying new domains that are not yet included in the latest releases of these databases. In addition, our method may help the curators of these databases reorganize the existing classification hierarchy as new protein structures are deposited.
Methods
Our proCC method consists of the following three modules: structure comparison, structure classification, and clustering. Given a new protein query domain, the comparison module finds the top structures that are similar to the query. Next, based on these results, the query domain is classified using the class label information of its k nearest structural neighbors and a support vector machine (SVM). If the query cannot be labeled with sufficient confidence, it is marked as unclassified. Finally, for the domains that were marked as unclassified in the previous step, the clustering module identifies cluster boundaries, thereby suggesting groups of domains that are potentially in novel folds.
Strukturani taqqoslash
Proteinning asosiy tuzilmasini taqqoslash moduli va indeks tuzilmasi tezkor topish tuzilmasi so'rovga o'xshash. Tuzilish o'xshashligini taqqoslashning asosiy birligi ikkilamchi tuzilma elementlarining (SSE) uchligidir. Protein tuzilmalarini SSE yordamida solishtirish, avval ishlatilgan [3,22,23]azitlar, tuzilmalarning o'xshashligi bilan solishtirganda hisoblash uchun samaraliroq.
DALI [24] va Idoralar [25] katatomlarining aktual atomik koordinatalari, asidonein. Bundan tashqari, domenlar tarkibi va fazoviyligiga ko'ra tasniflanadi
SCOP va CATH kabi umumiy tasniflash ma'lumotlar bazalarida SSE'larni joylashtirish, SSE'larga asoslangan tuzilmalarni taqqoslash strukturani tasniflash uchun tabiiyroqdir. Berilgan domenga o'xshash eng yaxshi k domenni topish uchun quyidagi bosqichlarni bajaring.
Ma'lumotlar bazasidagi har bir domen SSE uchliklari to'plamiga bo'linadi. SSE tripletini yuborish uchun 10 oʻlchovli vektordan foydalaniladi va indeks maʼlumotlar bazasidagi barcha SSEtripletler ustida tuziladi. Protein domeni soʻrovi SSE tripletlariga ham parchalanadi.
So'rovdagi har bir SSE tripleti uchun mos keladigan SSE tripletlari indeks yordamida olinadi. Ushbu indeks tekshiruvidan olingan xitlarga asoslanib, mos uchlik o'rtasidagi o'xshashlik balli hisoblangan.
Ma'lumotlar bazasidagi har bir maqsadli domen uchun SSE uchlik moslamasi natijalari asosida vaznli ikki qismli grafik yaratiladi. For each target graph, a maximumweighted bipartite graph matching algorithm is run tocompute an overall similarity score between the queryand the target. Finally, the top kscoring targets arereturnedastheresultofthesearch.
Each of these three steps is described in detail in the following three subsections.
We note that our method finds all the domains in the database that have at least one SSE triplet match to the query, and k is the number of such domains in the database. Therefore, the value of k varies depending on the query structure and is automatically determined by our method.
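As a rough illustration of step 3, the following Python sketch computes the best one-to-one assignment between query triplets and target triplets from a table of triplet scores. The function name, the dictionary layout and the toy scores are assumptions made for illustration only; the memoized exhaustive search is practical only for tiny inputs and merely stands in for a proper maximum weighted bipartite matching algorithm.

```python
from functools import lru_cache


def max_weight_matching_score(edges, num_query):
    """Best total weight of a one-to-one (injective) matching.

    edges maps (query_triplet_index, target_triplet_index) -> score.
    A query triplet may stay unmatched. The search is exponential in the
    number of target triplets, so this is for illustration only.
    """
    targets = sorted({t for (_, t) in edges})
    bit = {t: 1 << n for n, t in enumerate(targets)}

    @lru_cache(maxsize=None)
    def best(q, used):
        if q == num_query:
            return 0.0
        # Option 1: leave query triplet q unmatched.
        score = best(q + 1, used)
        # Option 2: match q to any still-free target triplet it has an edge to.
        for (qi, t), w in edges.items():
            if qi == q and not used & bit[t]:
                score = max(score, w + best(q + 1, used | bit[t]))
        return score

    return best(0, 0)


# Toy example with made-up scores: two query triplets, two target triplets.
toy_edges = {(0, 0): 3.0, (0, 1): 2.0, (1, 0): 2.5}
raw_score = max_weight_matching_score(toy_edges, num_query=2)
```

In this toy case the greedy choice (query triplet 0 to target triplet 0) is not optimal, which is why a matching algorithm rather than a per-triplet maximum is needed.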
Structure representation and indexing
We model each protein domain as a set of SSEs, and represent each SSE using its associated type, length, and a direction vector. Given an SSE $S_i$, its type, denoted as $T_i$, is either an α helix or a β strand. For a concise representation, loops and turns are excluded. The length of the SSE, denoted as $L_i$, is the number of residues contributing to the formation of that SSE. The direction vector, denoted as $V_i$, is a unit vector, $V_i = (X_s - X_e)/\lVert X_s - X_e \rVert$, where $X_s$ and $X_e$ represent the two endpoints of the SSE. $X_s$ and $X_e$ are calculated using the following equations defined in [26].
For an α helix, $X_s$ and $X_e$ are calculated as:
$$X_s = (0.74X_i + X_{i+1} + X_{i+2} + 0.74X_{i+3})/3.48$$
$$X_e = (0.74X_j + X_{j-1} + X_{j-2} + 0.74X_{j-3})/3.48$$
where $X_i$ and $X_j$ represent the beginning and ending residues of the SSE.
For a β strand, $X_s$ and $X_e$ are calculated as:
$$X_s = (X_i + X_{i+1})/2$$
$$X_e = (X_j + X_{j-1})/2$$
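The endpoint and direction-vector formulas above can be sketched in Python as follows. The function names and the input layout are hypothetical, and the sketch assumes that each residue is represented by a single 3-D coordinate (for example its Cα atom), which the text does not state explicitly.

```python
import math


def _weighted_point(points, weights, total):
    """Weighted combination of 3-D points, divided by `total`."""
    return tuple(
        sum(w * p[axis] for w, p in zip(weights, points)) / total
        for axis in range(3)
    )


def sse_endpoints(coords, sse_type):
    """Return (Xs, Xe) for one SSE.

    coords: residue positions (x, y, z), ordered from the first to the
    last residue of the SSE (assumed here to be one point per residue).
    sse_type: "helix" or "strand".
    """
    if sse_type == "helix":
        if len(coords) < 4:
            raise ValueError("helix endpoints need at least 4 residues")
        weights = (0.74, 1.0, 1.0, 0.74)
        xs = _weighted_point(coords[:4], weights, 3.48)
        xe = _weighted_point(coords[-1:-5:-1], weights, 3.48)
    else:
        if len(coords) < 2:
            raise ValueError("strand endpoints need at least 2 residues")
        weights = (1.0, 1.0)
        xs = _weighted_point(coords[:2], weights, 2.0)
        xe = _weighted_point(coords[-1:-3:-1], weights, 2.0)
    return xs, xe


def direction_vector(xs, xe):
    """Unit vector V = (Xs - Xe) / ||Xs - Xe||."""
    norm = math.dist(xs, xe)
    return tuple((a - b) / norm for a, b in zip(xs, xe))


# Toy strand lying on the x axis.
strand = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
xs, xe = sse_endpoints(strand, "strand")
v = direction_vector(xs, xe)
```

Note that, following the formula as written, the vector points from the end of the SSE towards its start.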
Since we are interested in indexing SSE triplets, we use the above representation of a single SSE to develop a representation for an SSE triplet. Given three SSEs, $S_i$, $S_j$ and $S_k$, the triplet containing these three SSEs contains the following information:
SSE types: $T_i$, $T_j$, $T_k$.
SSE lengths: $L_i$, $L_j$, $L_k$.
Angles between each pair of SSEs: $\theta_{ij}$, $\theta_{ik}$, $\theta_{jk}$, where $\theta_{ij}$ is the angle formed by $S_i$ and $S_j$, calculated as $\theta_{ij} = \cos^{-1}(V_i \cdot V_j) \bmod 180$. (Note that the mod 180 component of the equation is used to allow for similarity matching under coordinate inversion.)
Distances between each pair of SSEs: $D_{ij}$, $D_{ik}$, $D_{jk}$, where $D_{ij}$ is the average of the minimum distances between residues in $S_i$ and $S_j$. To calculate $D_{ij}$, the smaller SSE (between $S_i$ and $S_j$) is selected. Taking $S_i$ to be the smaller one ($S_i \le S_j$), the minimum distance from each residue in $S_i$ to the residues in $S_j$ is calculated, and the average of these minimum distances is used as the SSE distance $D_{ij}$. Intuitively, this measure aims to concisely capture the distance between two SSEs. The index search (described below) will use these distances to remove pairs of SSEs that have very different inter-SSE distances.
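A minimal Python sketch of the pairwise angle and distance measures described above is given below. The function names are hypothetical, angles are taken in degrees, and each residue is again assumed to be represented by one 3-D point.

```python
import math


def sse_angle(v_i, v_j):
    """Angle in degrees between two unit direction vectors, taken mod 180."""
    dot = sum(a * b for a, b in zip(v_i, v_j))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(dot)) % 180.0


def sse_distance(residues_a, residues_b):
    """Average, over the residues of the smaller SSE, of the minimum
    distance to any residue of the other SSE."""
    if len(residues_a) <= len(residues_b):
        small, large = residues_a, residues_b
    else:
        small, large = residues_b, residues_a
    minima = [min(math.dist(p, q) for q in large) for p in small]
    return sum(minima) / len(minima)


# Toy example: two parallel segments 3 units apart.
d = sse_distance([(0, 0, 0), (1, 0, 0)], [(0, 3, 0), (1, 3, 0), (2, 3, 0)])
theta = sse_angle((1, 0, 0), (0, 1, 0))
```

Because of the mod 180 step, two antiparallel direction vectors yield an angle of 0 rather than 180.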
The information describing an SSE triplet is encoded into a compact 10-dimensional vector, which serves as the actual representation of the SSE triplet in an index. This 10-dimensional vector is:
$$(TC, X_i, Y_i, X_j, Y_j, X_k, Y_k, D_{jk}, D_{ik}, D_{ij})$$
In this vector representation, amongst the three SSEs, the ith SSE is closest to the N-terminal, and the kth SSE is closest to the C-terminal. TC is a three-bit value that encodes the types of the three SSEs. The next six values, $X_i$, $Y_i$, $X_j$, $Y_j$, $X_k$ and $Y_k$, represent the lengths and angles of $S_i$, $S_j$ and $S_k$. Each SSE, for instance $S_i$, is mapped to a point $(X_i, Y_i)$ in a 2-D Euclidean space where $X_i = L_i \cos\theta_{jk}$ and $Y_i = L_i \sin\theta_{jk}$. This transformation to a 2-D Euclidean coordinate allows us to use a conventional spatial index for efficiently locating close neighbors. The last three values, $D_{jk}$, $D_{ik}$, and $D_{ij}$, are the pairwise SSE distance values as defined before.
The 10-dimensional vector representation serves as the key for indexing the SSE triplets. For a given protein domain, rather than inserting an index entry for every SSE triplet, we only insert SSE triplets that have all inter-SSE distances less than 20 Å. This cutoff value is based on a similar cutoff that is used in DALI [24]. For the indexing structure, we use the popular R*-tree [27].
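The encoding above can be illustrated with a short Python sketch. Several details are assumptions rather than statements from the text: the bit values chosen for helix and strand, the bit order inside TC, and the pairing of $S_j$ with $\theta_{ik}$ and $S_k$ with $\theta_{ij}$ (the text gives the mapping only for $S_i$, and the other two are filled in by symmetry). The function and argument names are hypothetical.

```python
import math

HELIX, STRAND = 1, 0  # assumed bit values; the text only says TC has three bits


def encode_triplet(types, lengths, angles, dists, cutoff=20.0):
    """Build the 10-dimensional key for one SSE triplet.

    types:   (Ti, Tj, Tk), each HELIX or STRAND
    lengths: (Li, Lj, Lk) in residues
    angles:  dict with keys "ij", "ik", "jk", in degrees
    dists:   dict with keys "ij", "ik", "jk", in angstroms
    Returns None when any inter-SSE distance is not below the cutoff,
    i.e. when the triplet would not be inserted into the index.
    """
    if any(d >= cutoff for d in dists.values()):
        return None
    tc = (types[0] << 2) | (types[1] << 1) | types[2]
    # Each SSE is paired with the angle between the other two SSEs
    # (stated for Si in the text; assumed by symmetry for Sj and Sk).
    opposite = ("jk", "ik", "ij")
    xy = []
    for length, key in zip(lengths, opposite):
        theta = math.radians(angles[key])
        xy += [length * math.cos(theta), length * math.sin(theta)]
    return (tc, *xy, dists["jk"], dists["ik"], dists["ij"])


key = encode_triplet(
    types=(HELIX, STRAND, HELIX),
    lengths=(10, 5, 8),
    angles={"ij": 90.0, "ik": 0.0, "jk": 0.0},
    dists={"ij": 6.0, "ik": 12.0, "jk": 9.0},
)
```

Keys produced this way could then be inserted into any spatial index; the R*-tree itself is not sketched here.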
In our method, an SSE triplet is used as a basic search unit. While different cardinality for SSE can also be considered (for example a quadruplet or a pair instead of a triplet), the SSE cardinality directly affects the efficiency and the sensitivity of the searches. Using an SSE pair increases the sensitivity of searches, but degrades the search efficiency, especially when searching a large database of domains. Using an SSE quadruplet is more efficient, but may be too conservative in detecting distantly related structures, such as domains in the superfamily or fold levels. We use an SSE triplet to strike a balance between sensitivity and search efficiency. We also note that SSE triplets have been used for similar reasons in previous work [28].
SSE triplet matching and index probing
To match a query against a database of proteins, we first decompose the query into all SSE triplets with inter-SSE distances less than 20 Å. We then probe the index with each query triplet and retrieve target triplets in the database that are "similar" to the query triplet. Similarity between a query triplet and a target triplet is defined using the scoring model described below.
SSE triplet index search
When matching a query triplet, rather than scanning all the SSE triplets in the database (which can be slow), we use an index search to find all database triplets that are similar to the query triplet. For each database triplet found, we then compute the similarity score with the query triplet using the $SC_{triplet}$ equation described below.
The index probe retrieves all matching entries using the following criteria. Given a query triplet $T^q$ and a target triplet $T^t$, which are defined as below,
$$T^q = (TC^q, X_i^q, Y_i^q, X_j^q, Y_j^q, X_k^q, Y_k^q, D_{jk}^q, D_{ik}^q, D_{ij}^q)$$
$$T^t = (TC^t, X_i^t, Y_i^t, X_j^t, Y_j^t, X_k^t, Y_k^t, D_{jk}^t, D_{ik}^t, D_{ij}^t)$$
$T^t$ is a match for $T^q$ when the following conditions are met.
1. $TC^q = TC^t$
2. $\forall r \in \{i, j, k\}: \sqrt{(X_r^q - X_r^t)^2 + (Y_r^q - Y_r^t)^2} \le \sin(\theta_\varepsilon) \times \sqrt{(X_r^q)^2 + (Y_r^q)^2}$, and
3. $|D_{jk}^q - D_{jk}^t| \le d_\varepsilon \wedge |D_{ik}^q - D_{ik}^t| \le d_\varepsilon \wedge |D_{ij}^q - D_{ij}^t| \le d_\varepsilon$
The first condition checks to ensure that the two SSE triplets have the same SSE types and the same order for the SSEs. The second condition checks to see if the three SSEs in $T^t$ are within a small distance ($\sin(\theta_\varepsilon) \times \sqrt{(X_r^q)^2 + (Y_r^q)^2}$) of the corresponding SSEs in $T^q$. In our implementation, $\theta_\varepsilon$ is set to 30°. The final condition checks if the distance between each matching SSE pair is within a small threshold, $d_\varepsilon$. As in the DALI scoring model [24], the exact value for this threshold depends on the types of the SSEs being compared. The distance cutoff is set to 3 Å for a β-strand pair, 4 Å for an α-helix and β-strand pair, and 5 Å for an α-helix pair. We note that these cutoff values are higher than the ones used in the DALI model, as we are matching SSEs in a triplet rather than just an individual pair without considering a triplet configuration (as is done in DALI). The original DALI cutoffs would be too strict for matching SSE triplets.
SSE triplet similarity scoring
Given two matching triplets, $T^q$ in a query and $T^t$ in a target (database), the SSE triplet similarity, denoted by $SC_{triplet}(T^q, T^t)$, is computed using the following equation.
$$SC_{triplet}(T^q, T^t) = SC_{pair}(S_{ij}^q, S_{ij}^t) + SC_{pair}(S_{jk}^q, S_{jk}^t) + SC_{pair}(S_{ik}^q, S_{ik}^t)$$
where $S_{ij}^q$ and $S_{ij}^t$ denote the equivalent SSE pairs of $S_i$ and $S_j$. The score, $SC_{pair}(S_{ij}^q, S_{ij}^t)$, is the SSE pair similarity score, and is computed as:
$$SC_{pair}(S_{ij}^q, S_{ij}^t) = \left\{ \theta_E - \frac{|D_{ij}^q - D_{ij}^t|}{D_{ij}^*} \right\} \times w(D_{ij}^*) \times (l_i \times l_j)^p$$
where $D_{ij}^*$ is the average of $D_{ij}^q$ and $D_{ij}^t$, $w(r) = \exp(-(r/20)^2)$, $l_i = \min(L_i^q, L_i^t)$, $l_j = \min(L_j^q, L_j^t)$, $p = 0.6$, and $\theta_E = 0.2$.
In the above equation, the first term measures the distance deviation between two SSE pairs. The second term de-emphasizes the significance of matches between two distant SSE pairs, since "distant SSE pairs are abundant and less discriminate" [24]. The last term scales the score by the maximum aligned portion between the SSE pairs and a parameter p. The parameter p is set to 0.6, which was empirically determined by randomly choosing 400 domains from ASTRAL and computing our $SC_{pair}$ score and DALI score for each SSE pair. The value of p = 0.6 produced the maximum correlation between the two scores (correlation coefficient of 0.6).
We note that our scoring equation has a strong similarity to the DALI scoring model. In the DALI model, a similarity score is calculated using all pairwise residue distances, whereas in our model the basic unit of comparison is an SSE rather than individual residues. The scoring using the SSE uses only the information in the index, and is computationally much faster than the scoring function used in DALI (which costs $O(N^2)$ where N is the number of residues).
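The match conditions and the scoring model of this part of the method can be sketched in Python as follows. The sketch makes a few assumptions that should be read as such: condition 2 is interpreted as the Euclidean distance between the two (X, Y) points of corresponding SSEs; SSE types are passed separately as "H" or "E" codes instead of being decoded from TC; and all function and argument names are hypothetical. A real implementation would restrict candidates through the R*-tree instead of testing triplets one by one.

```python
import math

THETA_EPS_DEG = 30.0  # angular tolerance used in condition 2
THETA_E = 0.2         # constant in the pair score
P = 0.6               # exponent on the aligned length term


def distance_cutoff(type_a, type_b):
    """d_eps in angstroms for an SSE pair; 'H' = helix, 'E' = strand."""
    kinds = {type_a, type_b}
    if kinds == {"E"}:
        return 3.0
    if kinds == {"H"}:
        return 5.0
    return 4.0


def is_match(tq, tt, types):
    """Check the three match conditions for two 10-dimensional keys
    (TC, Xi, Yi, Xj, Yj, Xk, Yk, Djk, Dik, Dij); types = (Ti, Tj, Tk)."""
    if tq[0] != tt[0]:                                   # condition 1
        return False
    s = math.sin(math.radians(THETA_EPS_DEG))
    for r in range(3):                                   # condition 2
        xq, yq = tq[1 + 2 * r], tq[2 + 2 * r]
        xt, yt = tt[1 + 2 * r], tt[2 + 2 * r]
        if math.hypot(xq - xt, yq - yt) > s * math.hypot(xq, yq):
            return False
    ti, tj, tk = types                                   # condition 3
    pairs = ((7, tj, tk), (8, ti, tk), (9, ti, tj))
    return all(abs(tq[n] - tt[n]) <= distance_cutoff(a, b) for n, a, b in pairs)


def sc_pair(d_q, d_t, len_i_q, len_i_t, len_j_q, len_j_t):
    """SSE pair similarity score."""
    d_star = (d_q + d_t) / 2.0
    weight = math.exp(-((d_star / 20.0) ** 2))
    l_i = min(len_i_q, len_i_t)
    l_j = min(len_j_q, len_j_t)
    return (THETA_E - abs(d_q - d_t) / d_star) * weight * (l_i * l_j) ** P


def sc_triplet(dists_q, dists_t, lens_q, lens_t):
    """Sum of the three pair scores. dists_* use keys 'ij', 'jk', 'ik';
    lens_* use keys 'i', 'j', 'k'."""
    return sum(
        sc_pair(dists_q[a + b], dists_t[a + b],
                lens_q[a], lens_t[a], lens_q[b], lens_t[b])
        for a, b in ("ij", "jk", "ik")
    )
```

With these constants, a pair whose distances differ by more than 20% of their average contributes a negative score, so a triplet can pass the index probe and still score poorly.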
Protein structure matching
The previous step produces matching target triplets in the database for every triplet in the query, and the associated matching score ($SC_{triplet}$). Next, we need to assemble these triplet hits into matches for the entire protein domain. For this step, we construct a weighted bipartite graph for every target protein domain that has some triplet matches. In each graph, nodes on one side of the bipartite graph represent triplets in the query and nodes on the other side represent triplets in the database entry. An edge between two nodes indicates that the two triplets were matched by the previous step, and the weight of the edge represents the $SC_{triplet}$ score. A maximum weighted bipartite graph matching algorithm is run on this graph to produce an injective (one-to-one) mapping from the query SSE triplets to the triplets in the target. Then, using this mapping, an overall structure similarity score is computed as:
$$SC_{raw}(q, t) = \sum^{M} SC_{triplet}(T^q, T^t)$$
where $T^q$ and $T^t$ are equivalent SSE triplets in the one-to-one mapping, M is the total number of equivalent SSE triplet pairs, and $SC_{triplet}(T^q, T^t)$ is the triplet similarity score between $T^q$ and $T^t$. This raw similarity score depends on the sizes of the query and target protein domains, and is normalized as follows:
$$SC_{norm}(q, t) = \frac{SC_{raw}(q, t) \times R^*}{SC_{raw}(q, q)}, \text{ where}$$
$$R^* = \frac{R_{Total}(q, t) + R_{SSE}(q, t)}{2},$$
$$R_{Total}(q, t) = 1 + \max\left(-1, \log_{10}\frac{N_q}{\max(N_q, N_t)}\right), \text{ and}$$
$$R_{SSE}(q, t) = 1 + \max\left(-1, \log_{10}\frac{SN_t}{\max(SN_q, SN_t)}\right)$$
In the equations above, $N_q$ and $N_t$ are the total number of residues in the query and the target respectively. $SN_q$ and $SN_t$ are the number of residues contributing to the formation of α-helices and β-strands in the query and the target respectively. The ratios, $R_{Total}$ and $R_{SSE}$, scale down the raw score inversely proportional to the size difference between the two proteins, and produce a score that is less sensitive to the size differences between the query and the target. The division by the self-similarity score, $SC_{raw}(q, q)$, produces a normalized similarity score between 0 and 1, which represents how similar the query protein is to a target, compared to itself. The normalized score is reported as the final structure similarity score.
Structure classification
Existing automatic classification methods [5,7-9] employ a nearest neighbor classification strategy. Given a query protein domain, they find the structurally closest neighbor that has a known classification label. Then the query is assigned the same label as its nearest neighbor. Although the nearest neighbor classification strategy is effective in many cases, it has a significant limitation, as proteins with novel folds are guaranteed to be misclassified.
To resolve this problem, the SGM method [7], which employs a modified nearest-neighbor approach, reports a label of unknown and/or possibly new when it cannot classify proteins with high confidence. To detect the boundary between classification and non-classification, it uses an inter- to intra-cluster distance ratio, based on the observation that "chains that are equidistant to several clusters are hard to classify and chains that are far away from any known clusters are probably new folds" [7]. The distance ratio can be effectively used to detect whether a protein is relatively closer to a specific cluster than to the remaining clusters. However, it cannot be used to detect whether a protein is absolutely close to a specific cluster.
Our classification method is also based on the nearest neighbor classification, and adopts the same observation as is used in SGM to detect unknown and/or possibly new folds. However, our method improves classification accuracy by using additional measures and a more sophisticated class boundary detection method, namely an SVM [21]. Furthermore, instead of reporting protein labels as "unknown and/or possibly new", our method also identifies clusters among unclassified proteins to further automate the classification process.
Classification using an SVM
Our method for assigning a class label uses three pieces of information, namely: an absolute similarity ratio (F1), a relative similarity ratio (F2), and the nearest cluster classification label (C1). This information is collected using the following procedure:
First, given a protein domain q, the structure comparison method, described in the structure comparison section, is used to find the top k structure neighbors in the database. From this top k list, we remove any hits to the query itself.
Then, we pick the top structure, $n_1$, as the nearest neighbor. Let C1 denote $n_1$'s classification label. We then go down the list and find the next structure that has a different label from C1. Let us call this entry $n_2$, and let C2 denote the label for $n_2$. Next, we compute the scores, $SC_{norm}(q, n_1)$ and $SC_{norm}(q, n_2)$.
Then we use these scores to compute F1 and F2 as: F1 = $SC_{norm}(q, n_1)$ and F2 = $SC_{norm}(q, n_1)/SC_{norm}(q, n_2)$. Finally, we return the values F1, F2 and C1.
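The procedure above can be sketched in Python as follows, assuming the normalized similarity scores of the top k neighbors have already been computed. The function name, the tuple layout and the example identifiers and labels are hypothetical. The text does not say what happens when no neighbor with a different label exists; returning None for F2 in that case is an assumption of this sketch.

```python
def classification_features(query_id, neighbors):
    """Return (F1, F2, C1) for a query, or None if it has no neighbors.

    neighbors: list of (domain_id, label, sc_norm) for the top-k hits,
    where sc_norm is the normalized similarity to the query.
    """
    # Remove hits to the query itself and rank by similarity.
    hits = sorted(
        (n for n in neighbors if n[0] != query_id),
        key=lambda n: n[2],
        reverse=True,
    )
    if not hits:
        return None
    _, c1, score_n1 = hits[0]
    # n2 is the next structure down the list whose label differs from C1.
    score_n2 = next((s for _, label, s in hits[1:] if label != c1), None)
    f1 = score_n1
    f2 = score_n1 / score_n2 if score_n2 else None  # assumption: undefined without n2
    return f1, f2, c1


# Toy example with made-up identifiers, labels and scores.
toy_hits = [
    ("query", "a.1", 1.0),
    ("dom1", "a.1", 0.8),
    ("dom2", "a.1", 0.7),
    ("dom3", "b.2", 0.4),
]
features = classification_features("query", toy_hits)
```

In the toy example the nearest neighbor is dom1 and the first differently labeled neighbor is dom3, so F2 is the ratio of their two scores.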
Intuitively, high F1 and F2 values indicate that the query is structurally similar to its nearest neighbor, and is also relatively closer to its nearest neighbor than to any other domain, which in turn suggests a high confidence in the assignment of the classification label. On the other hand, low F1 and F2 values imply that the domain is not particularly similar to any existing domains, which suggests that the domain is potentially a new fold.
To automate the classification process, we need a classification decision model which defines clear boundaries between classification and non-classification. Using SCOP as the gold standard, we generated a classification decision model that reflects the rules used for creating new folds in SCOP. In order to create such a classification decision model, we used a support vector machine (SVM) to capture nonlinear classification decision boundaries in SCOP. As a training set for the decision model, we picked SCOP version 1.65 and version 1.67 and trained the model as follows: Using domains in SCOP 1.67 as the queries, and domains in SCOP 1.65 as the database, we perform structure comparison using the method described in the structure comparison section. For each query, we calculate the F1 and F2 scores. If the SCOP label of a query protein domain is the same as its nearest neighbor's SCOP label, the query with its F1 and F2 is used as a positive example; otherwise, it is used as a negative example in the training set for the SVM. The resulting training dataset is shown in Figure 2.
The classification label assignment step simply uses the trained SVM to determine if a query should be labeled as unclassified. For queries that the SVM determines can be classified, the label of the nearest neighbor in the database is used as the predicted class label.
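The construction of the training set and the final label assignment can be illustrated with the Python sketch below. It does not train an SVM: the `is_classifiable` argument is a hypothetical stand-in for the decision of a trained SVM, and the threshold rule used in the example is a made-up placeholder, not the decision boundary learned in this work. All names and values are illustrative assumptions.

```python
def training_examples(queries):
    """Turn scored queries into (features, targets) for an SVM.

    queries: iterable of (f1, f2, true_label, nearest_neighbor_label).
    A query is a positive example (+1) when its own label equals the
    label of its nearest neighbor, and a negative example (-1) otherwise.
    """
    features, targets = [], []
    for f1, f2, true_label, nearest_label in queries:
        features.append((f1, f2))
        targets.append(1 if true_label == nearest_label else -1)
    return features, targets


def assign_label(f1, f2, c1, is_classifiable):
    """Return C1 when the decision model accepts the query,
    and "unclassified" otherwise."""
    return c1 if is_classifiable(f1, f2) else "unclassified"


# Made-up data in the spirit of the training setup described above.
X, y = training_examples([
    (0.85, 2.4, "a.1", "a.1"),   # nearest neighbor has the right label
    (0.20, 1.1, "c.7", "b.2"),   # nearest neighbor has a different label
])

# Placeholder rule standing in for a trained SVM (not the learned boundary).
def toy_rule(f1, f2):
    return f1 >= 0.5 and f2 >= 1.5

label_a = assign_label(0.85, 2.4, "a.1", toy_rule)
label_b = assign_label(0.20, 1.1, "b.2", toy_rule)
```

The feature pairs and targets returned by `training_examples` are in the shape most SVM packages expect, so the placeholder rule could be replaced by the prediction function of a trained model.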
Identification and clustering of novel structures
Figure 2. Visualization of the classification decision boundary (horizontal axis: F1). This figure shows the classification boundary created for entries in SCOP 1.67 using SCOP 1.65 as the database. The SVM is used to detect the boundary between "Classified" and "Unclassified" entries. This trained SVM will then be used to predict class labels for SCOP 1.69.
Our classification method takes the approach of assigning an "unclassified" label to protein domains that have novel folds or have subjective and fuzzy classification boundaries. Assigning an actual class label to such domains often requires additional biological information and manual interpretation [1,29]. Since such manual intervention is likely to continue to be unavoidable even in the foreseeable future, it is useful if additional information is provided to make a more informed (and potentially faster) manual assignment. In this section, we outline our method for aiding this manual assignment process by employing a clustering method for grouping the protein domains that are labeled as "unclassified" by our classification method. The basic intuition behind using clustering is that protein domains that are in the same cluster are likely to have stronger similarities to each other, sharing similar protein structures, compared to domains in different clusters. In addition, it is often likely that well-segregated clusters correspond to novel folds. To detect these novel folds, we first perform an all-to-all comparison using all the protein domains that are labeled as unclassified by the previous structure classification step. Then, we construct a graph that has a node for every unclassified domain. In this graph two nodes are connected by an edge if the similarity score between the protein domains corresponding to the nodes is above a certain threshold. Each edge has a weight, which is equal to the similarity score. Once this graph is constructed, the MCL [30] algorithm is run on the graph to detect clusters. (MCL is a clustering algorithm that is specifically designed to work with graphs.) The computed clusters are then reported as groups that potentially correspond to novel folds. In addition, for each computed cluster we also produce a representative structure, which is simply the graph center for that cluster (if there is more than one center, we randomly select one of the centers as the representative structure).
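The graph construction and the choice of a representative structure described in this section can be sketched in Python as follows. The sketch does not implement MCL; it assumes the clusters have already been computed by some clustering step. Measuring distances by hop count when locating the graph center is also an assumption, since the text does not say how the center is computed. All names, scores and the threshold are hypothetical.

```python
from collections import deque


def build_graph(domains, score, threshold):
    """Connect two domains when their similarity exceeds the threshold.

    score: dict keyed by frozenset({a, b}) -> similarity score.
    Returns an adjacency dict {domain: {neighbor: weight}}.
    """
    graph = {d: {} for d in domains}
    for n, a in enumerate(domains):
        for b in domains[n + 1:]:
            s = score.get(frozenset((a, b)), 0.0)
            if s > threshold:
                graph[a][b] = s
                graph[b][a] = s
    return graph


def graph_centers(graph, cluster):
    """Members of `cluster` with the smallest eccentricity, using hop
    counts along edges that stay inside the cluster."""
    members = set(cluster)
    eccentricity = {}
    for start in cluster:
        dist = {start: 0}
        queue = deque([start])
        while queue:  # breadth-first search restricted to the cluster
            u = queue.popleft()
            for v in graph[u]:
                if v in members and v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        reached_all = len(dist) == len(members)
        eccentricity[start] = max(dist.values()) if reached_all else float("inf")
    best = min(eccentricity.values())
    return sorted(n for n, e in eccentricity.items() if e == best)


# Toy example with made-up similarity scores.
toy_domains = ["a", "b", "c", "d"]
toy_scores = {
    frozenset(("a", "b")): 0.9,
    frozenset(("b", "c")): 0.8,
    frozenset(("c", "d")): 0.1,
}
toy_graph = build_graph(toy_domains, toy_scores, threshold=0.5)
centers = graph_centers(toy_graph, ["a", "b", "c"])
```

When `graph_centers` returns several nodes, one of them can be picked at random as the representative, as the text describes.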
Availability
A website offering access for protein classification is freely available at http://www.eecs.umich.edu/periscope/procc.
Authors' contributions
YJ carried out the design, implementation, and evaluation of proCC, and drafted the manuscript. JP conceived of the study, participated in its design and evaluation, and helped to draft the manuscript. All authors read and approved the final manuscript.
Additional material
Acknowledgements
This research was supported in part by the National Institutes of Health under grant 1U54-DA-021519, by Michigan Technology Tri-Corridor under grant MTTC GR687, and by a research gift donation from Microsoft. We would like to thank Peter Røgen and Boris Fain for sharing a copy of their SGM software. We thank Sara Cheek for various discussions on SCOPmap, and Derek Wilson for various discussions on Superfamily. We also thank the reviewers for their productive comments.
References
1. Murzin AG: SCOP: A Structural Classification of Proteins Database for the Investigation of Sequences and Structures. J Mol Biol 1995, 247:536-540.
2. Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM: CATH – a hierarchic classification of protein domain structures. Structure 1997, 5:1093-1108.
3. Holm L, Sander C: Touring protein fold space with Dali/FSSP. Nucleic Acids Res 1998, 26:316-319.
4. Protein Data Bank [http://www.rcsb.org/pdb/]
5. Gough J, Karplus K, Hughey R, Chothia C: Assignment of Homology to Genome Sequences using a Library of Hidden Markov Models that Represent all Proteins of Known Structure. J Mol Biol 2001, 313(4):903-919.
6. Getz G, Vendruscolo M, Sachs D, Domany E: Automated assignment of SCOP and CATH protein structure classifications from FSSP scores. Proteins 2002, 46:405-415.
7. Røgen P, Fain B: Automatic classification of protein structure by using Gauss integrals. Proc Natl Acad Sci 2003, 100(1):119-124.
8. Cheek S, Qi Y, Krishna SS, Kinch LN, Grishin NV: SCOPmap: Automated assignment of protein structures to evolutionary superfamilies. BMC Bioinformatics 2004, 5(1):197.
9. Çamoglu O, Can T, Singh AK, Wang YF: Decision tree based information integration for automated protein classification. J Bioinform Comput Biol 2005, 3(3):717-742.
10. Chandonia JM, Hon G, Walker NS, Lo Conte L, Koehl P, Levitt M, Brenner SE: The ASTRAL Compendium in 2004. Nucleic Acids Res 2004, 32:D189-92.
11. Frishman D, Argos P: Knowledge-based protein secondary structure assignment. Proteins 1995, 23(4):566-579.
12. SVMlight Support Vector Machine [http://svmlight.joachims.org]
13. Day R, Beck DA, Armen RS, Daggett V: A consensus view of fold space: Combining SCOP, CATH, and the Dali Domain Dictionary. Protein Sci 2003, 12:2150-2160.
14. Gewehr JE, Zimmer R: SSEP-Domain: protein domain prediction by alignment of secondary structure elements and profiles. Bioinformatics 2006, 22(2):181-187.
15. Critical Assessment of Fully Automated Structure Prediction [http://cafasp4.cse.buffalo.edu/dp/update.html]
16. Madera M, Vogel C, Kummerfeld SK, Chothia C, Gough J: The SUPERFAMILY database in 2004: additions and improvements. Nucleic Acids Res 2004, 32:D235-239.
17. Grossman RL, Kamath C, Kegelmeyer P, Kumar V, Namburu RR, Eds: Data Mining for Scientific and Engineering Applications. Kluwer Academic Publishers; 2001.
18. Lo Conte L, Brenner SE, Hubbard TJ, Chothia C, Murzin AG: SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Res 2002, 30(1):264-267.
19. Saini HK, Fischer D: Meta-DP: domain prediction meta-server. Bioinformatics 2005, 21(12):2917-2920.
20. Chivian D, Kim DE, Malmstrom L, Bradley P, Robertson T, Murphy P, Strauss CE, Bonneau R, Rohl CA, Baker D: Automated prediction of CASP-5 structures using the Robetta server. Proteins 2003, 53(6):524-533.
21. Cortes C, Vapnik V: Support vector networks. Machine Learning 1995, 20:273-297.
22. Madej T, Gibrat JF, Bryant SH: Threading a database of protein cores. Proteins 1995, 23(3):356-369.
23. Martin AC: The ups and downs of protein topology: rapid comparison of protein structure. Protein Eng 2000, 13:829-837.
24. Holm L, Sander C: Protein structure comparison by alignment of distance matrices. J Mol Biol 1993, 233:123-138.
25. Shindyalov IN, Bourne PE: Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng 1998, 11(9):739-747.
26. Singh AP, Brutlag DL: Hierarchical protein structure superposition using both secondary structure and atomic representation. Proc Int Conf Intell Syst Mol Biol 1997, 5:284-293.
27. Beckmann N: The R*-tree: An efficient and robust access method for points and rectangles. Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data 1990:322-331.
28. Çamoglu O, Kahveci T, Singh AK: Index-based Similarity Search for Protein Structure Databases. J Bioinform Comput Biol 2004, 2(1):99-126.
29. Hou J, Sims GE, Zhang C, Kim SH: A global representation of the protein fold space. Proc Natl Acad Sci 2003, 100:2386-2390.
30. Van Dongen S: Graph clustering by flow simulation. In PhD thesis. University of Utrecht; 2000.
31. Enright AJ, Ouzounis CA: BioLayout -- an automatic graph layout algorithm for similarity visualization. Bioinformatics 2001, 17(9):853-854.
32. PyMol [http://www.pymol.org]