Presentation: HMMs, gene finding, and profiles



Slides

Slide 2 Gene finding

Slide 3 Jan 23, 2003
Computational Gene Finding
Gene Structure

Slide 4 What is it about genes that we can measure (and model)?
Most of our knowledge is biased towards protein-coding characteristics
ORF (Open Reading Frame): a sequence defined by an in-frame AUG and stop codon, which in turn defines a putative amino acid sequence.
Codon usage: most frequently measured by CAI (Codon Adaptation Index)
Other phenomena
Nucleotide frequencies and correlations: value and structure
Functional sites: splice sites, promoters, UTRs, polyadenylation sites

Slide 5 Coding-sequence statistics
Unequal codon usage in coding regions is a universal feature of genomes:
Unequal usage of amino acids in existing proteins
Unequal usage of synonymous codons (correlates with the abundance of the corresponding tRNAs)

These features can be used to discriminate between coding and non-coding regions of the genome.

A coding statistic is a function that, for a given DNA sequence, computes the likelihood (conditional probability) that the sequence codes for a protein.

Slide 6 An Example of Coding Statistics

Slide 7 Codon Adaptation Index (CAI)
The geometric mean of the weights associated with each codon over the length of the gene sequence (measured in codons).

This is not perfect:
Genes sometimes have unusual codons for a reason
The predictive power depends on the length of the sequence
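
The geometric mean described on this slide can be sketched in a few lines. This is only an illustration: the weight table `WEIGHTS` is made up for the example (in practice a codon's weight is commonly its frequency relative to the most used synonymous codon), and the helper names `split_codons` and `cai` are not from the slides.

```python
import math

# Hypothetical relative-adaptiveness weights (NOT taken from the slides).
WEIGHTS = {"GCT": 1.0, "GCC": 0.5, "AAA": 1.0, "AAG": 0.25}

def split_codons(seq):
    """Split a coding sequence into consecutive codons, dropping any incomplete tail."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]

def cai(seq, weights):
    """Geometric mean of the codon weights over the sequence, computed via logs."""
    codons = split_codons(seq)
    return math.exp(sum(math.log(weights[c]) for c in codons) / len(codons))

# Geometric mean of the weights 1.0, 0.5 and 0.25.
example = cai("GCTGCCAAG", WEIGHTS)
```

Summing logarithms rather than multiplying the weights directly avoids underflow for long genes, which is why the mean is taken in log space.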

Slide 8 CAI Example: Counts per 1000 codons

Slide 9 Splice signals (mice): GT, AG

Slide 10 HMMs and Prokaryotic Gene Structure
Nucleotides {A,C,G,T} are the observables

Different states generate nucleotides at different frequencies

A simple HMM for unspliced genes:

AAAGC ATG CAT TTA ACG AGA GCA CAA GGG CTC TAA TGCCG
The sequence of states is an annotation of the generated string: each nucleotide is generated in an intergenic, start/stop, or coding state

This HMM has 4 states: x (non-coding), c (coding), start and stop

Slide 11 Parse
For a given sequence, a parse is an assignment of gene structure to that sequence.
In a parse, every base is labeled according to the content it is predicted to belong to.
In our simple model, the parse contains only “I” (intergenic) and “G” (gene).
A more complete model would contain, e.g., “-” for intergenic, “E” for exon and “I” for intron.

S = ACTGACTACTACGACTACGATCTACTACGGGCGCGACCTATGCG
P = IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGGGGG

    TATGTTTTGAACTGACTATGCGATCTACGACTCGACTAGCTAC
    GGGGGGGGGGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

Slide 12 The HMM Matrices: Φ and H
x_m(i) = probability of being in state m at position i;
H(m, y_i) = probability of emitting character y_i in state m;
Φ_mk = probability of transition from state k to m.
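
One common way to combine these three quantities is the forward recursion x_m(i) = H(m, y_i) · Σ_k Φ_mk · x_k(i-1). The sketch below assumes that reading of x_m(i) (joint probability of the observations so far and of being in state m). The two-state model and all of its numbers are invented for illustration and are not the model from the slides.

```python
def forward(obs, states, init, phi, h):
    """Return a list x where x[i][m] follows x_m(i) = H(m, y_i) * sum_k phi[m][k] * x_k(i-1).

    phi[m][k] is the probability of moving from state k to state m,
    matching the index order used on the slide.
    """
    x = [{m: init[m] * h[m][obs[0]] for m in states}]
    for y in obs[1:]:
        prev = x[-1]
        x.append({m: h[m][y] * sum(phi[m][k] * prev[k] for k in states)
                  for m in states})
    return x

# Hypothetical toy model: "x" = non-coding, "c" = coding.
STATES = ["x", "c"]
INIT = {"x": 0.5, "c": 0.5}
PHI = {"x": {"x": 0.9, "c": 0.2},   # transitions INTO x from x and from c
       "c": {"x": 0.1, "c": 0.8}}   # transitions INTO c from x and from c
H = {"x": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
     "c": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}}

table = forward("AC", STATES, INIT, PHI, H)
total = sum(table[-1].values())  # probability of observing "AC" under the toy model
```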

Slide 13 A eukaryotic gene
This is the human p53 tumor suppressor gene on chromosome 17.
Genscan is one of the most popular gene prediction algorithms.

Slide 14 A eukaryotic gene
3’ untranslated region
Final exon
Initial exon
Introns
Internal exons
This particular gene lies on the reverse strand.

Slide 15 An Intron
3’ splice site
5’ splice site
revcomp(CT)=AG
revcomp(AC)=GT
GT: signals start of intron
AG: signals end of intron
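
For a gene on the reverse strand, the GT and AG signals show up on the forward strand as AC and CT, which is what the revcomp identities above express. A minimal sketch follows; the function names and the test sequence are illustrative and not taken from the slides.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def revcomp(seq):
    """Reverse complement of a DNA string over the alphabet A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def looks_like_intron(seq):
    """True if seq starts with GT and ends with AG, the usual splice signals."""
    return seq.startswith("GT") and seq.endswith("AG")

# A made-up reverse-strand intron as it would appear on the forward strand.
forward_strand_view = "CTAAAAAC"
is_intron = looks_like_intron(revcomp(forward_strand_view))
```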

Slide 16 Signals vs contents
In gene finding, a small pattern within the genomic DNA is referred to as a signal, whereas a region of genomic DNA is a content.
Examples of signals: splice sites, starts and ends of transcription or translation, branch points, transcription factor binding sites
Examples of contents: exons, introns, UTRs, promoter regions

Slide 17 Prior knowledge
We want to build a probabilistic model of a gene that incorporates our prior knowledge.
E.g., the translated region must have a length that is a multiple of 3.

Slide 18 Prior knowledge
The translated region must have a length that is a multiple of 3.
Some codons are more common than others.
Exons are usually shorter than introns.
The translated region begins with a start signal and ends with a stop codon.
5’ splice sites (exon to intron) are usually GT;
3’ splice sites (intron to exon) are usually AG.
The distribution of nucleotides and dinucleotides is usually different in introns and exons.

Slide 19 Higher-order Markov chains
A kth-order Markov model bases the probability of an event on the preceding k events.
Example: with a 3rd-order model, the probability of a sequence is the product of the probabilities of each target base given its 3 preceding bases.

[Figure: example sequence with the target base marked, and its probability formula]

Slide 20 Higher-order Markov chains
Advantages:
Easy to train. Count frequencies of (k+1)-mers in training data.
Easy to compute probability of sequence.
Disadvantages:
Many (k+1)-mers may be undersampled in training data.
Models data as fixed-length chunks.

[Figure: target base preceded by a fixed-length context]
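
Training by counting (k+1)-mers and then scoring a sequence can be sketched as follows. The add-one smoothing is an assumption added here to cope with undersampled (k+1)-mers; the slides do not say how unseen contexts are handled, and the function names and training string are illustrative.

```python
import math
from collections import Counter

def train_markov(seqs, k):
    """Count (k+1)-mers and their k-mer contexts in the training sequences."""
    kmer_counts, context_counts = Counter(), Counter()
    for s in seqs:
        for i in range(k, len(s)):
            kmer_counts[s[i - k:i + 1]] += 1
            context_counts[s[i - k:i]] += 1
    return kmer_counts, context_counts

def log_prob(seq, k, kmer_counts, context_counts, alphabet="ACGT"):
    """Log-probability of seq[k:] given its first k bases (add-one smoothing, an assumption)."""
    total = 0.0
    for i in range(k, len(seq)):
        num = kmer_counts[seq[i - k:i + 1]] + 1
        den = context_counts[seq[i - k:i]] + len(alphabet)
        total += math.log(num / den)
    return total

# Toy training data: the 4-mer ACGT occurs twice, and so does its context ACG.
kmers, contexts = train_markov(["ACGTACGT"], k=3)
score = log_prob("ACGT", 3, kmers, contexts)  # log((2 + 1) / (2 + 4))
```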

Slide 21 Genscan Example
Uses an explicit state duration HMM to model gene structure (different length distributions for exons)
Different model parameters for regions with different GC content

Slide 22 [Figure: Genscan state diagram for the forward and backward strands. E: exons (Einit, E0, E1, E2, Eterm; Esngl: single exon); I: introns (I0, I1, I2); 5’ UTRs and 3’ UTRs; P: promoter region; polyA: polyA site; N: intergenic region]

Slide 23 http://nar.oxfordjournals.org/content/26/4/1107

Slide 24 GeneMark
Borodovsky & McIninch, Comp. Chem 17, 1993.
Uses a 5th-order Markov model.
Model is 3-periodic, i.e., a separate model for each nucleotide position in the codon.
A DNA region gets 7 scores: 6 reading frames and non-coding; the highest score wins.
Lukashin & Borodovsky, Nucl. Acids Res. 26, 1998 is the HMM version.

Slide 25 Interpolated Markov Models (IMM)
Introduced in Glimmer 1.0: Salzberg, Delcher, Kasif & White, NAR 26, 1998.
Probability of the target position depends on a variable number of previous positions (sometimes 2 bases, sometimes 3, 4, etc.)
How many is determined by the specific context.
E.g., for context ggtta the next position might depend on the previous 3 bases, tta.
But for context catta all 5 bases might be used.

Slide 26 Real IMMs
Model has additional probabilities, λ, that determine which parts of the context to use.
E.g., the probability of g occurring after context atca is:

[Formula shown as a figure on the slide]

Slide 27 Real IMMs
Result is a linear combination of different Markov orders.

[Formula shown as a figure on the slide]

Can view this as interpolating the results of different-order models.
The probability of a sequence is still the probability of the bases in the sequence.
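
Since the slide's formula did not survive, here is one way such a linear combination can be written. This is a sketch under stated assumptions, not a reconstruction of the slide. It assumes the recursive form P_IMM,j = λ_j · P_j + (1 - λ_j) · P_IMM,j-1, where λ_j depends on the length-j context suffix. The probability tables and λ values are invented, and the function name is illustrative.

```python
def imm_probability(base, context, probs, lambdas):
    """Interpolated probability of `base` following `context`.

    probs[j] maps (suffix of length j, base) to a fixed-order estimate P_j.
    lambdas maps a context suffix to the weight in [0, 1] given to that estimate.
    Both tables are hypothetical inputs for this sketch.
    """
    p = probs[0][("", base)]  # order 0: plain base frequency
    for j in range(1, len(context) + 1):
        suffix = context[-j:]
        lam = lambdas.get(suffix, 0.0)  # unseen context: keep the lower-order value
        p_j = probs.get(j, {}).get((suffix, base), 0.0)
        p = lam * p_j + (1.0 - lam) * p
    return p

# Made-up tables for the slide's example: g after the context atca.
PROBS = {0: {("", "g"): 0.25},
         1: {("a", "g"): 0.4},
         2: {("ca", "g"): 0.6}}
LAMBDAS = {"a": 0.5, "ca": 0.2}

p_g = imm_probability("g", "atca", PROBS, LAMBDAS)
```

With these numbers only the order-1 and order-2 contexts carry weight, so the longer suffixes tca and atca leave the estimate unchanged.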

Slide 28 IMMs vs Fixed-Order Models
Performance
An IMM should generally do at least as well as a fixed-order model.
Some risk of overtraining.
IMM result can be stored and used like a fixed-order model.
IMM will be somewhat slower to train and will use more memory.

[Figure: target base preceded by a variable-length context]

Slide 29 GLIMMER-HMM
Nth-order interpolated Markov models (IMM) (N=8)

Slide 30 General Things to Remember about (Protein-coding) Gene Prediction Software
It is, in general, organism-specific

It works best on genes that are reasonably similar to something seen previously

It finds protein coding regions far better than non-coding regions

In the absence of external (direct) information, alternative forms will not be identified

It is imperfect! (It’s biology, after all…)

Slide 31 Profile HMMs
Take a multiple alignment and turn it into a statistical model.

Slide 32

Slide 33 Profile HMMs
Model a family of sequences
Are computed from a multiple alignment of the family
Transition and emission probabilities depend on the alignment position (position-specific)
The model parameters should be set so that the total probability is maximized for members of the family.
Sequences can be tested for family membership by using the Viterbi algorithm to score their match to the profile

Slide 34 Building the model: match states
If we need an ungapped alignment, we can use a simple, unbranched HMM in which each match state leads only to the next match state

Each state has emission probabilities for the amino acids, and these depend on the match state

In essence this is a PSSM (Position Specific Scoring Matrix): the scores in each PSSM column can be scaled to between 0 and 1 to correspond to the emission probabilities.

All transition probabilities are set to 1: there is only one choice, to move to the next match state.

Slide 35 Insertion states
Multiple alignments often contain columns that are gaps in most sequences but contain amino acids in a few.
Such columns are best treated as insert states.
As we move through the model and generate the query sequence, insert states generate the extra amino acids found in these columns.
Insert states have emission probabilities that are usually the same as the overall proportion of each amino acid in the database.
Insert states loop back to themselves, which means that several positions can be emitted from the same insert state.
An insert state is entered from one match state and exits to the next one: the insertion occurs between adjacent amino acids.

Slide 36 Deletion states
Deletions in a multiple alignment are positions at which most sequences have amino acids and only a few have gaps.
Delete states are used to skip over match states.
Several match states can be skipped by moving from one delete state to the next.
Delete states act like affine gap penalties: the probability of moving from a match state to a delete state is equivalent to a gap-opening penalty, and the probability of moving from one delete state to the next is equivalent to a gap-extension penalty.
Unlike match and insert states, delete states are silent: they emit nothing.

Slide 37 Profile HMMs
There are also transitions from insert states to delete states, but such transitions are considered improbable, and their existence helps in building the model

Slide 38 Profile HMMs: Example
Note: These sequences could lead to other paths.

Slide 39 Pfam
“A comprehensive collection of protein domains and families, with a range of well-established uses including genome annotation.”
Each family is represented by two multiple sequence alignments and two profile-Hidden Markov Models (profile-HMMs).
A. Bateman et al. Nucleic Acids Research (2004) Database Issue 32:D138-D141

Slide 40

Slide 41 A Profile HMM Example
This is a section of a repeated sequence in Bacillus megaterium.
There are 15 sequences, and the alignment is 16 bases long.
First we parameterize the model, i.e., estimate the transition and emission probabilities.
The model can then be used to score different sequences.

GG-GGAAAAACGTATT
TG-GGACAAAAGTATT
TG-GAACAAAAGTATG
TACGGACAAAATTATT
T--GAAGAAAAGTATG
TA-GAACAAAAGTAGG
TG-GAACAAACGCATT
CGGGACAAA-AGTATT
TGGGGTAAA-AGTATT
TGAGACAAA-AGTAGT
TGAGACAAA-AGTATA
TGGGACAAAGAGTATT
TG-AAACAAAGATATT
CG-GAACAAAAGTATT
TA-GGACAAAAGTGTT

Slide 42 Creating the model
Which columns should be called insertions and which deletions?
>50% gaps -> insertion
<50% gaps -> deletion
9 sequences have a gap in column 3, and one sequence has a gap in column 2.
By this rule column 3 should be an insertion and column 2 a deletion, but that would give a direct transition from a deletion to an insertion, which should be avoided.
So we treat both columns 2 and 3 as deletions.
Four sequences have gaps in column 10. This should be a deletion, but we will make it an insertion so that the model has at least one insertion.
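
The gap counts quoted on this slide can be checked directly against the alignment from slide 41. The sketch below counts gaps per column and applies the 50% rule of thumb; the function names are illustrative, and the manual overrides described above (column 3 as a deletion, column 10 as an insertion) are deliberately not encoded.

```python
ALIGNMENT = [
    "GG-GGAAAAACGTATT", "TG-GGACAAAAGTATT", "TG-GAACAAAAGTATG",
    "TACGGACAAAATTATT", "T--GAAGAAAAGTATG", "TA-GAACAAAAGTAGG",
    "TG-GAACAAACGCATT", "CGGGACAAA-AGTATT", "TGGGGTAAA-AGTATT",
    "TGAGACAAA-AGTAGT", "TGAGACAAA-AGTATA", "TGGGACAAAGAGTATT",
    "TG-AAACAAAGATATT", "CG-GAACAAAAGTATT", "TA-GGACAAAAGTGTT",
]

def gap_counts(rows):
    """Number of '-' characters in each alignment column (1-based column keys)."""
    return {col + 1: sum(r[col] == "-" for r in rows) for col in range(len(rows[0]))}

def default_label(n_gaps, n_rows):
    """Rule of thumb from the slide: more than 50% gaps -> insertion, otherwise deletion."""
    if n_gaps == 0:
        return "match"
    return "insertion" if n_gaps > n_rows / 2 else "deletion"

counts = gap_counts(ALIGNMENT)
gapped = {col: n for col, n in counts.items() if n}  # only columns containing gaps
labels = {col: default_label(n, len(ALIGNMENT)) for col, n in gapped.items()}
```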

Slide 43 More Set Up
Columns 2 and 3 are delete states for the sequences with gaps there, and match states for the other sequences.
Column 10 is an insert state: the bases in the other sequences are emitted from the insert state, so there is no match state for this column.
The final model has 15 match states, with corresponding insert and delete states.
Most of the insert and delete states are not used by our sequences, so they will have low probabilities. Nevertheless, they must be included in the model.

Slide 44 Parameterization
Which parameters do we need?
Emission:
Each match state needs emission probabilities for all 4 bases.
The insert state also needs emission probabilities for all 4 bases.
Usually the background probabilities from the whole genome or database are used.
Transition:
For columns 2 and 3 we need the probabilities of the match -> delete (M->D) and delete -> delete (D->D) transitions.
For column 10 we need the probabilities of M->I and I->I (for which we have no data).
We also need the general M->M, M->D, and M->I probabilities for the other columns.
The remaining probabilities follow from the condition that all transition probabilities out of a given state must sum to 1.

Slide 45 Emission probabilities
Background level (the probabilities of the bases if they were chosen at random)
Used for insert states.
We can take the frequencies from the whole B. megaterium genome. GC = 38%.
G = C = 0.19 and A = T = 0.31.
Specific emission probabilities for each match state.
Count the frequency of each base (ignoring gaps) in each column
But pseudocounts are also needed.

Slide 46 Emission pseudocounts
The simplest way to do pseudocounts is the Laplace method: adding 1 to the numerator and 4 (i.e. the number of base types) to the denominator:
Freq(C in column 1) = (count of C’s + 1) / (total number of bases + 4)
= (2 + 1) / (15 + 4) = 0.158
As compared to the actual frequency = 2/15 = 0.133
There are no A’s in column 1, so the probability of A from column 1 = 1/19 = 0.053
A somewhat more sophisticated method is to use the overall base frequency of each base as its pseudocount.
Freq(C in column 1) = (count of C’s + 0.19) / (total number of bases + 1) = 2.19/16 = 0.137
Freq(A in column 1) = 0.31/16 = 0.019
The base frequency method could be altered by multiplying the pseudocounts by some constant, as an estimate of our uncertainty of how likely we are to find a sequence with an A first.
For example, to be more equivalent to the Laplace method, multiply by 4:

Freq(C in column 1) = (count of C’s + (4 * 0.19)) / (total number of bases + 4) = 2.76/19 = 0.145
Freq(A in column 1) = (4 * 0.31)/19 = 0.065
Note how different the probabilities are for A.
We will just say that how to apply pseudocounts is an area of heuristics and active research.
We will use the overall base frequency method.
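
The three pseudocount variants above differ only in the pseudocount added and its total weight, which a short sketch makes explicit. The function names and the `weight` parameter are illustrative; the counts (2 C's, 0 A's, 15 bases in column 1) and the background frequencies are the ones quoted on the slides.

```python
BACKGROUND = {"A": 0.31, "C": 0.19, "G": 0.19, "T": 0.31}

def laplace(count, total, n_types=4):
    """Laplace estimate: add 1 to the count and the number of base types to the total."""
    return (count + 1) / (total + n_types)

def background_pseudocount(count, total, base, weight=1.0):
    """Add weight * background frequency to the count and weight to the total."""
    return (count + weight * BACKGROUND[base]) / (total + weight)

# Column 1 of the alignment: 2 C's and no A's among 15 bases.
c_laplace = laplace(2, 15)
a_laplace = laplace(0, 15)
c_background = background_pseudocount(2, 15, "C")
a_background = background_pseudocount(0, 15, "A")
c_weighted = background_pseudocount(2, 15, "C", weight=4)
a_weighted = background_pseudocount(0, 15, "A", weight=4)
```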

Slide 47 Transition frequencies
There are 225 transitions in total, and only 9 of them are M->D.
P(M->D) = 9/225 = 0.040.
For D->D, there is 1 case out of the 9 deletions in which the sequence continues in a delete state, so P(D->D) = 1/9 = 0.111. Then
P(D->M) = 1 - P(D->D) = 0.888

There are 11 M->I transitions in total (column 10).
P(M->I) = 0.044 (strictly, 11/225 = 0.049; the value 0.044 is the one used in all the calculations that follow).
There are no I->I cases, so we arbitrarily set this probability equal to D->D (0.111), since our choice of which columns to treat as insertions and which as deletions was itself arbitrary.
P(I->M) = 0.888

Then the background P(M->M) = 1 - (P(M->I) + P(M->D)) = 1 - (0.040 + 0.044) = 0.916.

We also need low probabilities for the I->D and D->I transitions, which should not occur, so we set them to 0.00001

Slide 48 Specific transitions
Insertion and deletion columns.
Column 2 contains 1 M->D and 14 M->M.
We need to add in pseudocounts from the overall data, so:
P(M->D | column 2) = (M->D count + 0.04) / (total transitions in column 2 + 1) = 1.04/16 = 0.065.
M->I in column 2 is the background level, 0.044
M->M for column 2 is 1 - 0.065 - 0.044 = 0.891
Column 3 contains 8 M->D and 6 M->M (there is also a D->D, but we have already counted it).
Prob(M->D in column 3) = 8.04/15 = 0.536
Prob(M->M in column 3) = 1 - 0.536 - 0.044 = 0.420
Column 10 contains the M->I insertion and 5 M->M transitions
Prob(M->I in column 10) = 10.044/16 = 0.628
Prob(M->D in column 10) = 0.04 (background)
Prob(M->M in column 10) = 1 - 0.628 - 0.04 = 0.332
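
The column-specific values above all follow one pattern: add the background rate as a pseudocount of weight 1, then let M->M take whatever probability is left. The sketch below reproduces them. The helper name `smoothed` is illustrative, and the counts for column 10 (10 M->I out of 15 transitions) are read off the slide's arithmetic rather than recounted from the alignment.

```python
P_MD_BACKGROUND = 0.040  # background M->D, as used on the slides
P_MI_BACKGROUND = 0.044  # background M->I, as used on the slides

def smoothed(count, total, background):
    """Column-specific estimate with the background rate as a weight-1 pseudocount."""
    return (count + background) / (total + 1)

# Column 2: 1 M->D among 15 transitions.
col2_md = smoothed(1, 15, P_MD_BACKGROUND)
col2_mm = 1 - col2_md - P_MI_BACKGROUND

# Column 3: 8 M->D among 14 transitions (8 M->D + 6 M->M).
col3_md = smoothed(8, 14, P_MD_BACKGROUND)
col3_mm = 1 - col3_md - P_MI_BACKGROUND

# Column 10: counts taken from the slide's own arithmetic (10.044 / 16).
col10_mi = smoothed(10, 15, P_MI_BACKGROUND)
col10_mm = 1 - col10_mi - P_MD_BACKGROUND
```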

Slide 49 Emission Probability Tables

Slide 50 Transitions

Slide 51 Scoring a Sequence
Whew! We have now estimated parameters for all transitions and emissions.
Scoring a sequence. We are going to use both the Viterbi algorithm and the forward algorithm to determine the most likely path through the model and the overall probability of emitting that sequence.
Note that we really should convert everything to logarithms.
Also, it is standard practice to express emission probabilities as odds ratios, which means dividing them by the overall base frequencies.
We are not going to do either of these things here, in the interest of simplification and clarity.

Let’s just score the first sequence in the list:
GG-GGAAAAACGTATT
Remove the gap, since a sequence derived from real data is not going to come with a gap (the gap came from a multiple alignment program)
GGGGAAAAACGTATT

Slide 52 Scoring
GGGGAAAAACGTATT
Base 1 is G. To start the global model off, we are going to require that this be a match state.
The emission probability for G in M1 is 0.078, so this is the initial overall probability and Viterbi probability.
Base 2 is also G. There are 3 possibilities for this base: it might be emitted by a match state (M2), it might be the result of an insert state, or it might be the result of entering a delete state (and thus be matched later). We choose the most likely:
M1->M2 has a 0.891 probability, and the probability of emitting a G in column 2 is 0.750. So, this probability is 0.891 * 0.750 = 0.668
M1->D = 0.065
M1->I, then emitting a G from the insert state = 0.044 * 0.19 = 0.008
M1->M2 is most likely.
So, Viterbi probability = previous prob * this prob = 0.078 * 0.668 = 0.052.
Overall prob = 0.078 * (0.668 + 0.065 + 0.008) = 0.078 * 0.741 = 0.058

Slide 53 More Scoring
Base 3 is also a G.
M2->M3 has 0.420 probability and 0.464 chance of emitting a G. 0.420 * 0.464 = 0.195
M2->D has 0.536 probability
M2->I, then emitting a G from the insert state = 0.044 * 0.19 = 0.008
Choose M2->D. Viterbi = 0.052 * 0.536 = 0.028.
Overall = 0.058 * (0.195 + 0.536 + 0.008) = 0.058 * 0.739 = 0.043.
We are now in a delete state between M2 and M4; we skipped the M3 state. Since delete states are silent, the G in position 3 hasn’t been emitted yet.
From the delete state we can either move to another delete state (skipping the M4 state in addition to M3) or we can move to M4 and emit the G.
D->M4 = 0.888 and M4 emitting a G = 0.890, so prob = 0.888 * 0.890 = 0.790
D->D = 0.111
Move to M4. Viterbi = 0.028 * 0.790 = 0.022.
Overall = 0.043 * (0.790 + 0.111) = 0.043 * 0.901 = 0.039.
We can now move on to base 4 (another G)
Our path so far: M1->M2->D->M4. We have emitted the first 3 bases.
GGGGAAAAACGTATT

Slide 54 Still More Scoring
GGG GAAAA ACGTATT
The next several bases are easy. Since the probability of moving to a delete or insert state is low, we just have to be sure that the M->M probability times the emission probability stays above 0.044.
M4->M5 : G prob = 0.916 * 0.328 = 0.300
Viterbi prob = 0.022 * 0.300 = 0.0066
Overall prob = 0.039 * (0.300 + 0.040 + (0.044 * 0.19)) = 0.039 * 0.3484 = 0.0136
M5->M6 : A prob = 0.916 * 0.653 = 0.598
Viterbi prob = 0.0066 * 0.598 = 0.00395
Overall prob = 0.0136 * (0.598 + 0.040 + (0.044 * 0.31)) = 0.0136 * 0.6516 = 0.0089
M6->M7 : A prob = 0.916 * 0.403 = 0.369
Viterbi prob = 0.00395 * 0.369 = 0.00146
Overall prob = 0.0089 * (0.369 + 0.040 + (0.044 * 0.31)) = 0.0089 * 0.423 = 0.00376
M7->M8 : A prob = 0.916 * 0.965 = 0.884
Viterbi prob = 0.00146 * 0.884 = 0.00129
Overall prob = 0.00376 * (0.884 + 0.040 + (0.044 * 0.31)) = 0.00376 * 0.938 = 0.00353
M8->M9 : A prob = 0.916 * 0.965 = 0.884
Viterbi prob = 0.00129 * 0.884 = 0.00114
Overall prob = 0.00353 * (0.884 + 0.040 + (0.044 * 0.31)) = 0.00353 * 0.938 = 0.00331
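
These five steps repeat the same update, so they can be replayed in a loop. The sketch below uses the emission values quoted on the slide and carries full floating-point precision between steps, so its results match the slide's rounded figures only to about three significant digits. The variable names are illustrative.

```python
BACKGROUND = {"A": 0.31, "C": 0.19, "G": 0.19, "T": 0.31}
P_MM, P_MD, P_MI = 0.916, 0.040, 0.044  # background transitions from the slides

# (target state, emitted base, emission probability) as quoted on the slide.
STEPS = [("M5", "G", 0.328), ("M6", "A", 0.653), ("M7", "A", 0.403),
         ("M8", "A", 0.965), ("M9", "A", 0.965)]

viterbi, overall = 0.022, 0.039  # values reached at M4 on the previous slide
for state, base, emission in STEPS:
    match = P_MM * emission
    # Viterbi follows only the best option, which is the match at every step here.
    viterbi *= match
    # The overall value sums match, delete and insert, as the slide does.
    overall *= match + P_MD + P_MI * BACKGROUND[base]
```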

Slide 55 Yet More
At this point we have emitted positions 1-8, and the most probable path is M1->M2->D->M4->M5->M6->M7->M8->M9
GGG GAAAA ACGTATT
Since the transition out of M9 is not the standard one, we need to pause and think it through.
M9->M10 = 0.332. Emission prob for A from M10 is 0.778. 0.332 * 0.778 = 0.258
M9->I = 0.628. Emission prob for A from an insert state (i.e. background probability) is 0.31. 0.628 * 0.31 = 0.195.
Thus our best choice, the most probable path, is M9->M10. However, looking at the aligned sequences we can see that this is the wrong choice.
Don’t despair: correction occurs in the next step.
Viterbi prob = 0.00114 * 0.258 = 0.000294
Overall prob = 0.00331 * (0.258 + 0.195 + 0.040) = 0.00331 * 0.493 = 0.00163

Slide 56 Yet Still More
At this point we have emitted positions 1-9, and the most probable path is M1->M2->D->M4->M5->M6->M7->M8->M9->M10
GGG GAAAAA CGTATT
At M10, we can:
Move to M11 and emit a C. Prob = 0.916 * 0.005 = 0.0046
Move to an insert state and emit a C. Prob = 0.044 * 0.19 = 0.0083.
Move to a delete state. Prob = 0.04. This would be the best choice, but it leads to a mess: deleting all the remaining match states, then inserting all the remaining bases of the query sequence at the end. It clearly shows the need for dynamic programming.
And while we are at it, switching to logarithms at the beginning would greatly ease calculations.
So, to continue our example, we move from M10 to an insert state and emit a C.
Viterbi prob = 0.000294 * 0.0083 = 2.44 x 10^-6
Overall prob = 0.00163 * (0.0046 + 0.0083) = 2.10 x 10^-5

Slide 57 To the End…
Our path so far:
M1->M2->D->M4->M5->M6->M7->M8->M9->M10->I
GGG GAAAAAC GTATT
From the insert state we can:
I->I and emit a G, with probability 0.111 * 0.19 = 0.0211
I->M11, with prob 0.888 * 0.828 = 0.735
Viterbi prob = 2.44 x 10^-6 * 0.735 = 1.79 x 10^-6
Overall prob = 2.10 x 10^-5 * (0.0211 + 0.735) = 1.58 x 10^-5
The remaining steps are all match states, so we skip the calculations:
Final Viterbi probability = 4.46 x 10^-7
Final overall prob = 6.79 x 10^-6

Slide 58 Final probability
We need to know what the probability would be for the random model, with every base inserted according to its overall frequency in the genome.
GGGGAAAAACGTATT has 6 G/C and 9 A/T, so the random probability is:
(0.19)^6 * (0.31)^9 = 1.24 x 10^-9
We compare this to the overall probability of 6.79 x 10^-6 by dividing, which gives 5459. In other words, the sequence is 5459 times more likely under the profile HMM than under the random model.
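
The comparison with the random model can be reproduced with a short sketch. The background frequencies and the overall probability are taken from the slides; expressing the ratio in bits (log base 2) is a common convention added here as an assumption, not something the slides state.

```python
import math

seq = "GGGGAAAAACGTATT"

# Background frequencies from the slides: 0.19 for G or C, 0.31 for A or T.
background = {"G": 0.19, "C": 0.19, "A": 0.31, "T": 0.31}

# Random model: each base is drawn independently from the background.
random_prob = math.prod(background[base] for base in seq)

# Overall probability of the sequence under the profile HMM (from the slides).
model_prob = 6.79e-6

odds = model_prob / random_prob
log_odds_bits = math.log2(odds)  # assumption: report the score in bits

print(f"random probability = {random_prob:.3g}")
print(f"odds ratio         = {odds:.0f}")
print(f"log-odds (bits)    = {log_odds_bits:.2f}")
```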

Slide 59 Profile Hidden Markov Models
Scoring a sequence with a profile HMM

Given a profile HMM, any path through the model "emits" a sequence with some probability.

The probability of a path is the product of all the transition and emission probabilities along the path.
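
As a minimal sketch of this definition, the function below multiplies the transition and emission probabilities along a path. The three steps use values quoted earlier in the worked example (M9->M10 emitting A, M10->I emitting C, I->M11 emitting G); the function name and the data layout are hypothetical.

```python
import math

def path_probability(steps):
    """Multiply transition and emission probabilities along a path.

    `steps` is a list of (label, transition_prob, emission_prob) tuples.
    A silent (delete) state would use an emission probability of 1.0.
    """
    return math.prod(t * e for _, t, e in steps)

# Fragment of the worked example; values are taken from the earlier slides.
fragment = [
    ("M9->M10 emits A", 0.332, 0.778),
    ("M10->I emits C", 0.044, 0.19),
    ("I->M11 emits G", 0.888, 0.828),
]

p = path_probability(fragment)
print(f"probability of this path fragment = {p:.3g}")
```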

Slide 60 Profile Hidden Markov Models
Scoring a sequence with a profile HMM
The Viterbi algorithm:
Given a query sequence, we can compute the most probable path that generates ("emits") that sequence.
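
The idea can be sketched on a toy model. The code below is a generic log-space Viterbi for a plain HMM with two made-up states ("GC" and "AT"); all parameters are invented for illustration and are not the profile HMM built in these slides. A real profile HMM also needs special handling for silent delete states, which this sketch omits.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (log-probability, state path) of the most probable path."""
    # best[s] is the log-probability of the best path ending in state s.
    best = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    back = []  # one dict of back-pointers per position after the first
    for symbol in obs[1:]:
        pointers, nxt = {}, {}
        for s in states:
            # Best predecessor of s at this position.
            prev = max(states, key=lambda r: best[r] + math.log(trans_p[r][s]))
            pointers[s] = prev
            nxt[s] = (best[prev] + math.log(trans_p[prev][s])
                      + math.log(emit_p[s][symbol]))
        back.append(pointers)
        best = nxt
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[last], path

# Toy parameters (assumptions, not from the slides).
states = ["GC", "AT"]
start_p = {"GC": 0.5, "AT": 0.5}
trans_p = {"GC": {"GC": 0.9, "AT": 0.1}, "AT": {"GC": 0.1, "AT": 0.9}}
emit_p = {
    "GC": {"G": 0.4, "C": 0.4, "A": 0.1, "T": 0.1},
    "AT": {"G": 0.1, "C": 0.1, "A": 0.4, "T": 0.4},
}

log_p, path = viterbi("GGGGAAAA", states, start_p, trans_p, emit_p)
print(path, f"{math.exp(log_p):.3g}")
```

Unlike the step-by-step choice made in the worked example, dynamic programming keeps the best score for every state at every position, so a locally attractive move cannot lock in a poor overall path.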


