EFFICIENT CODING OF CONTACT MATRICES
The representation and coding of genomic annotations are currently being standardised by ISO/IEC JTC 1/SC 29/WG 8 (MPEG-G). The present invention was introduced into the standardisation process and relates to a method for the efficient coding of contact matrices.
The invention is based on the coding of contact matrices. Chromosome coordinates describe the start and end positions of a gene (or genetic element) on a chromosome. A pair of chromosome coordinates that are close to each other is called a contact. A contact matrix can be represented in the form of a sparse matrix. The matrix contains not only the number of contacts within a given genomic region, but also the corresponding normalised value. The invention is an extension of the invention “Method for the Coding of Contact Matrix” (German patent application pending; technology description: https://www.ezn.de/ezn-patent/method-for-the-coding-of-contact-matrix/).
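The sparse representation mentioned above can be illustrated with a minimal Python sketch. It is not the coding method of the invention; the function name, the bin interval of 5,000 base pairs, the example coordinates, and the choice to store only the upper triangle (assuming a symmetric, single-chromosome matrix) are all assumptions made for illustration.

```python
from collections import Counter


def build_sparse_contact_matrix(contacts, bin_interval):
    """Count contacts per (row_bin, col_bin) pair; only non-zero cells are stored.

    `contacts` is an iterable of (position_a, position_b) coordinate pairs on
    one chromosome. Hypothetical helper, not part of any standard.
    """
    matrix = Counter()
    for pos_a, pos_b in contacts:
        i = pos_a // bin_interval
        j = pos_b // bin_interval
        # Assumption: the matrix is symmetric, so (i, j) and (j, i) are the
        # same cell and only the upper triangle is kept.
        if i > j:
            i, j = j, i
        matrix[(i, j)] += 1
    return dict(matrix)


# Made-up example coordinates.
contacts = [(1_200, 8_500), (1_900, 8_100), (8_700, 1_500), (15_000, 15_400)]
sparse = build_sparse_contact_matrix(contacts, bin_interval=5_000)
print(sparse)  # {(0, 1): 3, (3, 3): 1}
```

Only two of the sixteen possible cells of this 4 x 4 matrix are stored, which is why a sparse form suits contact data.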
According to the invention, a highly efficient coding of contact matrices and new functionalities are proposed. The extended structure contains new elements such as the contact matrix header and the bin payload. The header contains information such as the bin interval of the contact matrix tiles, the list of chromosomes with their corresponding names and lengths, the sample names, and the names of the normalization methods applied to the contact matrix tiles. A normalization method may be computed on the fly or be precomputed. The bin payload contains the interval multiplier, which is needed in the multi-interval case, where the weights correspond to the higher interval. The weights for each on-the-fly normalization are also stored in the bin payload. Because the weights do not require much space, no compression is necessary. Additionally, the term interval is used instead of resolution to avoid confusion: a higher resolution usually means finer detail, whereas a contact matrix becomes less detailed as the value commonly called resolution (the bin size) increases.
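The header and bin payload described above can be pictured with the following Python sketch. It is an illustration under stated assumptions, not the MPEG-G syntax: the class and field names are hypothetical, and the rule that a normalized value equals the raw count multiplied by the weights of its row bin and column bin is an assumed convention (as used in matrix balancing), which the description itself does not specify.

```python
from dataclasses import dataclass, field


@dataclass
class ContactMatrixHeader:
    """Hypothetical container for the header fields named in the text."""
    bin_interval: int            # base bin interval of the tiles, in base pairs
    chromosomes: list            # (name, length) pairs
    sample_names: list
    normalization_methods: list  # names of on-the-fly and precomputed methods


@dataclass
class BinPayload:
    """Hypothetical container for the bin payload fields named in the text."""
    interval_multiplier: int     # effective interval = bin_interval * multiplier
    # Method name -> per-bin weights, stored uncompressed because they are small.
    weights: dict = field(default_factory=dict)


def normalize_on_the_fly(counts, payload, method):
    """Derive normalized values from raw counts instead of storing them.

    Assumption: normalized = count * weight[row_bin] * weight[col_bin].
    """
    w = payload.weights[method]
    return {(i, j): c * w[i] * w[j] for (i, j), c in counts.items()}


header = ContactMatrixHeader(
    bin_interval=5_000,
    chromosomes=[("chr1", 20_000)],
    sample_names=["sample_A"],
    normalization_methods=["example_balancing"],
)
payload = BinPayload(
    interval_multiplier=1,
    weights={"example_balancing": [0.5, 2.0, 1.0, 1.0]},
)
counts = {(0, 1): 3, (3, 3): 1}
normalized = normalize_on_the_fly(counts, payload, "example_balancing")
print(normalized)  # {(0, 1): 3.0, (3, 3): 1.0}
```

Under this assumption, one weight per bin is stored instead of one normalized value per matrix cell, which is the storage saving the on-the-fly approach aims at.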
- Better structure to support multiple samples
- Support for on-the-fly normalized value computation for efficient storage of normalized contact matrix
- Technique for multi-interval
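One way to picture the multi-interval technique is to derive a coarser matrix from the base-interval matrix using the interval multiplier. The Python sketch below assumes that a coarser cell is the sum of the base cells it covers; this aggregation rule, the function name, and the example counts are assumptions, since the description only states that an interval multiplier is stored in the bin payload.

```python
from collections import defaultdict


def coarsen(counts, interval_multiplier):
    """Aggregate base-interval counts into bins that are `interval_multiplier`
    times wider. Hypothetical helper; summing is an assumed aggregation rule.
    """
    coarse = defaultdict(int)
    for (i, j), c in counts.items():
        coarse[(i // interval_multiplier, j // interval_multiplier)] += c
    return dict(coarse)


# Made-up counts at the base interval (for example 5,000 base pairs per bin).
base_counts = {(0, 1): 3, (0, 2): 2, (3, 3): 1}

# A multiplier of 2 gives an interval of 10,000 base pairs per bin.
coarse_counts = coarsen(base_counts, 2)
print(coarse_counts)  # {(0, 0): 3, (0, 1): 2, (1, 1): 1}
```

With a multiplier of 1 the matrix is returned unchanged, so the base interval is simply the special case of the same scheme.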
Coding of genomic data
License for commercial exploitation / Research & Development cooperation
- US pending
Keywords: Coding, Contact Matrix, Data Transmission, genomic data, MPEG-G