piRNAdb is a storage and search system for piRNAs and related information, such as alignments, tissue expression, clusters, datasets, associated genes and ontology terms. We store information on piRNA datasets taken from the papers that describe these RNAs and from other databases such as GEO and SRA. The interface is simple and user-friendly, so that information can be reached quickly and in few steps.
The system was designed to store a very large quantity of data and is constantly updated with new functions, resources and data about piRNAs.
piRNA Selection Criteria
Initially, the piRNAs stored in the piRNAdb database are those collected from the following sources. Preference was given to experiments that use immunoprecipitation of short RNAs bound to PIWI proteins, which are consequently classified as piRNAs. However, some small RNA-Seq studies are also included in piRNAdb because they used further steps to filter the sequencing results, such as piRNA prediction or an additional step in the early stages of sequencing to remove other short RNAs.
I) National Center for Biotechnology Information (NCBI): - Database: Nucleotide. Filtered to display only ncRNAs (non-coding RNAs) with a sequence length higher than 1 base and lower than 60 bases; this filter was applied only to ensure short sequences. Collected sequences are those classified as piRNAs by the authors of the submission. - Organisms found: H. sapiens, R. norvegicus, M. musculus, C. elegans, C. griseus
II) European Nucleotide Archive (ENA): - Database: Nucleotide. Filtered to display only ncRNAs (non-coding RNAs) with a sequence length higher than 1 base and lower than 60 bases; this filter was applied only to ensure short sequences. Collected sequences are those classified as piRNAs by the authors of the submission. - Organism found: D. melanogaster
III) Supplementary Files: - piRNAs found exclusively in the supplementary materials provided with the authors' papers. Sequences not present in the databases cited above are collected to compose the piRNAdb database. - Organism found: H. sapiens
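The length filter described for the NCBI and ENA searches can be sketched in Python as follows. This is only an illustration, not the code used to build piRNAdb: the function names and the (accession, sequence) record layout are assumptions made for the example, and the bounds are read as exclusive ("higher than 1" and "lower than 60").

```python
def is_short_sequence(sequence, lower=1, upper=60):
    """Return True when lower < len(sequence) < upper (exclusive bounds)."""
    return lower < len(sequence) < upper


def filter_short_records(records):
    """Keep (accession, sequence) pairs whose sequence passes the length filter.

    The record layout is hypothetical; real NCBI/ENA exports carry more fields.
    """
    return [(acc, seq) for acc, seq in records if is_short_sequence(seq)]


# Hypothetical records, used only to exercise the filter.
example_records = [
    ("example_1", "T" * 30),  # kept: 30 bases
    ("example_2", "T" * 60),  # dropped: not lower than 60 bases
    ("example_3", "T"),       # dropped: not higher than 1 base
]
kept = filter_short_records(example_records)
```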
Some short RNAs go by other names but have the same or similar characteristics, and are therefore considered piRNAs by the scientific community; one example is the 21U-RNAs present in C. elegans. We also collected this kind of sequence for inclusion in the piRNAdb database.
Cross Codes Feature
This feature was developed by the author of piRNAdb to solve a personal problem, and it is now available to all piRNAdb users. You have probably read a paper reporting that one piRNA (DQ571500.1, NCBI code) is overexpressed in testis, and another paper reporting that a piRNA (piR-36378) is overexpressed in brain tissues. However, because there is no official code, you cannot tell whether those papers are talking about the same piRNA.
To address this issue, the "Cross Codes" tool receives a list of piRNA access codes from several databases and, at the click of a button, returns the codes that we use in the piRNAdb database. This makes it easier to verify the different access codes that exist today in the scientific community. In the example above, we can now see that the first piRNA corresponds to hsa-piR-1588 in our database and the second corresponds to hsa-piR-28400. So they are different piRNAs.
Beyond this feature, each piRNA-specific information page lists all aliases of that piRNA in the other databases that were mined for this kind of data. This makes it easier for researchers and users to migrate to our database or to keep a history for a specific piRNA. We are also evaluating making this feature a universal tool that, besides translating to our access codes, would translate the supplied access codes into the accession codes of other databases, mainly because we want users to have access to as much of the information available on the internet as possible for their research.
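Conceptually, the translation amounts to a lookup from external access codes to piRNAdb codes. The Python sketch below is an assumption about how such a lookup could work, not the actual piRNAdb implementation: the in-memory table and the function name `cross_codes` are hypothetical, and the only mappings included are the two given in the example above.

```python
# Hypothetical alias table; the two entries are the ones quoted in the text.
ALIAS_TO_PIRNADB = {
    "DQ571500.1": "hsa-piR-1588",
    "piR-36378": "hsa-piR-28400",
}


def cross_codes(codes, alias_table=ALIAS_TO_PIRNADB):
    """Map each submitted access code to its piRNAdb code, or None if unknown."""
    result = {}
    for code in codes:
        cleaned = code.strip()  # tolerate stray whitespace in pasted lists
        result[cleaned] = alias_table.get(cleaned)
    return result


translated = cross_codes(["DQ571500.1", " piR-36378 ", "not-a-known-code"])
# Two submitted codes refer to the same piRNA only if they map to the same
# piRNAdb code.
same_pirna = translated["DQ571500.1"] == translated["piR-36378"]
```

Returning None for unknown codes, rather than raising an error, lets a user paste a mixed list and see at a glance which entries were not recognised.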
Mapping piRNA
To align the collected piRNA sequences, we followed these steps:
I) Group all collected piRNA sequences from the same organism and remove duplicate sequences found in two or more datasets.
II) Align to the corresponding genome build of the organism.
III) We used the BWA aligner, which is preferred for aligning short sequences. We do not allow mismatches or gaps.
Initially, we removed piRNAs that have no alignment to the genome, to avoid problems in downstream analysis. However, if they are found in later versions of the genome, or found expressed in tissues during the expression analysis step, they will be returned to the piRNAdb database. These cases will receive an information box explaining why they were added only at that time.
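Step I of the procedure above (grouping by organism and removing sequences duplicated across datasets) can be sketched in Python as follows. This is an illustration under stated assumptions, not the pipeline used by piRNAdb: the (organism, dataset, sequence) record layout, the function name and the upper-casing of sequences before comparison are all choices made for the example.

```python
from collections import defaultdict


def group_unique_sequences(records):
    """Group sequences by organism and collapse duplicates across datasets.

    `records` is an iterable of (organism, dataset, sequence) tuples
    (hypothetical layout). Returns {organism: {sequence: [datasets]}}, so each
    sequence appears once per organism while its source datasets are kept.
    """
    grouped = defaultdict(dict)
    for organism, dataset, sequence in records:
        normalized = sequence.upper()  # assumption: case differences are not meaningful
        grouped[organism].setdefault(normalized, []).append(dataset)
    return {organism: seqs for organism, seqs in grouped.items()}


# Hypothetical input: the same sequence reported by two datasets.
example = [
    ("H. sapiens", "dataset_A", "tgcatgcatgcatgcatgcatgcatg"),
    ("H. sapiens", "dataset_B", "TGCATGCATGCATGCATGCATGCATG"),
    ("M. musculus", "dataset_C", "TGCATGCATGCATGCATGCATGCATG"),
]
unique = group_unique_sequences(example)
```

Keeping the list of source datasets for each sequence preserves provenance after de-duplication; the unique sequences per organism are what would then be passed to the aligner.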
The genome builds used for the alignment:
I) H. sapiens: hg38
II) M. musculus: mm10
III) R. norvegicus: rn6
IV) C. griseus: criGri1
V) C. elegans: ce11
VI) D. melanogaster: dm5
GFF3 And GTF
The GTF (General Transfer Format) and GFF3 (General Feature Format) files provide information on the alignments and the associated piRNAs. They work like a common gene feature coordinate file, containing the feature identifiers and their genome coordinates. These files are compatible with many software tools, such as intersectBed from the BEDTools package (Quinlan & Hall, 2010).
Quinlan AR & Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, Volume 26, Issue 6, 15 March 2010, Pages 841–842
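For readers who want to load such a file themselves, the Python sketch below parses one GTF record into its nine standard tab-separated columns. It is a minimal illustration: the sample line, including the source name, feature type and attribute key, is hypothetical and may not match the exact contents of the piRNAdb files.

```python
def parse_gtf_line(line):
    """Parse a single GTF record into a dictionary.

    GTF records have nine tab-separated columns; the ninth holds
    semicolon-separated attributes of the form: key "value";
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, attributes = fields

    parsed_attributes = {}
    for item in attributes.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        parsed_attributes[key] = value.strip().strip('"')

    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),  # GTF coordinates are 1-based and inclusive
        "end": int(end),
        "strand": strand,
        "attributes": parsed_attributes,
    }


# Hypothetical record, for illustration only.
sample = 'chr1\texample_source\tpiRNA\t100\t130\t.\t+\t.\tgene_id "hsa-piR-1588";'
record = parse_gtf_line(sample)
```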
Tissue Expression
This feature provides information on the presence and quantity of piRNAs in specific tissues, evaluated in several sequencing projects from research institutes around the world. Data were downloaded from publicly available databases, such as the Sequence Read Archive (SRA) and Gene Expression Omnibus (GEO), and analysed following the expression methodology below.
Unlike the case where we are looking for new piRNAs, the sequencing data can come from small RNA-Seq or miRNA-Seq techniques with no major problems. This is because the piRNAs we count are already well established.
Steps for the expression evaluation:
I) Sequence adapter removal, when adapters are present.
II) Alignment to the organism genome build, allowing no mismatches or gaps.
III) Use of the GTF coordinate file from piRNAdb to count the piRNAs found in the sample.
IV) Grouping of dataset samples following the separation used by the author.
V) Extraction of raw counts and Trimmed Mean of M-values (TMM) values for each piRNA.
VI) Display to the user.
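The counting in step III can be illustrated with the short Python sketch below. It rests on assumptions that the text does not specify: a read is counted for a piRNA when it overlaps the piRNA coordinates on the same chromosome and strand, and the feature and alignment layouts, identifiers and function name are hypothetical. TMM normalization (step V) is not reproduced here; the sketch yields raw counts only.

```python
def count_pirnas(features, alignments):
    """Count aligned reads per piRNA feature (raw counts).

    `features` is a list of dicts with keys id, chrom, start, end, strand,
    e.g. built from a GTF file. `alignments` is a list of
    (chrom, start, end, strand) tuples, using the same 1-based inclusive
    coordinates. A read is counted when it overlaps a feature on the same
    chromosome and strand (an assumption made for this sketch).
    """
    counts = {feature["id"]: 0 for feature in features}
    for chrom, start, end, strand in alignments:
        for feature in features:
            same_place = feature["chrom"] == chrom and feature["strand"] == strand
            overlaps = start <= feature["end"] and end >= feature["start"]
            if same_place and overlaps:
                counts[feature["id"]] += 1
    return counts


# Hypothetical features and alignments, for illustration only.
features = [
    {"id": "piRNA_A", "chrom": "chr1", "start": 100, "end": 130, "strand": "+"},
    {"id": "piRNA_B", "chrom": "chr2", "start": 500, "end": 528, "strand": "-"},
]
alignments = [
    ("chr1", 100, 130, "+"),  # overlaps piRNA_A
    ("chr1", 105, 130, "+"),  # overlaps piRNA_A
    ("chr1", 100, 130, "-"),  # wrong strand, not counted
    ("chr2", 600, 628, "-"),  # outside piRNA_B, not counted
]
raw_counts = count_pirnas(features, alignments)
```

The nested loop keeps the example short; with real data an interval index or a dedicated tool such as intersectBed would be far more efficient.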
We are constantly evaluating new expression datasets for inclusion in the piRNAdb database, to increase the reliability and amount of useful information available to users and researchers.