Predicting the quality (entropy score) of the best expected alignment from unaligned protein sequences:
Calculate the quality (entropy score) of a given alignment:
Notes:
The entropy score of an alignment (a set of aligned sequences) measures the variability at each position in the alignment, that is, the uncertainty or disorder in the sequence data. In multiple sequence alignments (MSA), entropy is often used to evaluate how conserved each aligned position is. A low entropy value indicates a highly conserved position where most sequences share the same character, which suggests functional or structural importance. Conversely, a high entropy value indicates greater variability, meaning the position is less conserved. This metric helps assess the reliability and significance of an alignment, particularly in evolutionary and structural biology studies. The score is computed with the entropy function from the Python pymsa library.
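To make the idea concrete, here is a minimal sketch of a column-wise Shannon entropy summed over the whole alignment. It is an illustration only and is not the pymsa implementation: the function names `column_entropy` and `alignment_entropy`, the use of the natural logarithm, and the choice to count the gap character as an ordinary symbol are all assumptions, so the values may differ from those reported by pymsa.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (natural log) of the characters in one alignment column."""
    counts = Counter(column)
    total = len(column)
    # Assumption: gaps ('-') are counted like any other character.
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def alignment_entropy(sequences):
    """Sum of the column entropies over all positions of an alignment."""
    if len({len(seq) for seq in sequences}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    # zip(*sequences) yields the alignment column by column.
    return sum(column_entropy(column) for column in zip(*sequences))


if __name__ == "__main__":
    conserved = ["ACDE", "ACDE", "ACDE"]
    variable = ["ACDE", "GCDE", "HCDE"]
    print(alignment_entropy(conserved))  # fully conserved columns contribute nothing
    print(alignment_entropy(variable))   # only the first column contributes
```

With this definition, a fully conserved column contributes zero and a column in which every sequence has a different character contributes the most, which matches the low versus high entropy interpretation described above.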
In the progress bars, we consider entropy values between 0 and 7000, a range that covers most cases (up to 102 sequences per file and sequence lengths up to 1200).
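As a rough sketch of how a score could be mapped onto such a bar, the hypothetical helper below converts an entropy value into a fraction of the 0 to 7000 range. Clamping values that fall outside the range is an assumption; the notes do not say how such values are displayed.

```python
def progress_fraction(entropy, low=0.0, high=7000.0):
    """Map an entropy score to a value in [0, 1] for display in a progress bar."""
    # Assumption: scores outside [low, high] are clamped to the nearest bound.
    clamped = min(max(entropy, low), high)
    return (clamped - low) / (high - low)
```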
Examples of unaligned sequences and their reference alignments from the BAliBASE R10 benchmark (218 pairs of files) are available at this link: Click here
FASTA File Criteria:
Unaligned Sequences
Valid FASTA format: Each sequence starts with > followed by an identifier.
No gaps (- or .).
Only standard amino acids.
Number of sequences: 3 to 102.
Maximum sequence length: 1200.
Aligned Sequences
Valid FASTA format: Each sequence starts with > followed by an identifier.
Aligned sequences: All sequences must have the same length.
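The criteria above could be checked before submitting a file with something like the following sketch. It is not the validation code used by the tool: the names `parse_fasta`, `check_unaligned`, and `check_aligned` are hypothetical, "standard amino acids" is assumed to mean the 20 one-letter codes, and lowercase letters are assumed to be acceptable.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")  # assumed meaning of "standard amino acids"
GAP_CHARS = {"-", "."}


def parse_fasta(text):
    """Return a list of (identifier, sequence) pairs from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_unaligned(records, min_seqs=3, max_seqs=102, max_len=1200):
    """Return a list of problems found in a set of unaligned sequences."""
    errors = []
    if not min_seqs <= len(records) <= max_seqs:
        errors.append(f"expected {min_seqs} to {max_seqs} sequences, got {len(records)}")
    for name, seq in records:
        if not name:
            errors.append("a sequence has an empty identifier")
        if GAP_CHARS & set(seq):
            errors.append(f"{name}: contains gap characters")
        # Assumption: lowercase residues are accepted and compared in uppercase.
        unknown = set(seq.upper()) - STANDARD_AA - GAP_CHARS
        if unknown:
            errors.append(f"{name}: non-standard residues {sorted(unknown)}")
        if len(seq) > max_len:
            errors.append(f"{name}: length {len(seq)} exceeds {max_len}")
    return errors


def check_aligned(records):
    """Return a list of problems found in a set of aligned sequences."""
    if len({len(seq) for _, seq in records}) > 1:
        return ["aligned sequences do not all have the same length"]
    return []
```

Both check functions return an empty list when the input meets the criteria, so a caller can report every problem at once instead of stopping at the first one.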