LyRic

LyRic is a versatile automated transcriptome annotation and analysis workflow written in the Snakemake language. Its core functionality is the production of:

  1. a set of high-quality RNA Transcript Models (TMs) mapped onto a genome sequence, based on Long-Read (LR) RNA sequencing data;
  2. various summary statistics plots and analysis results that describe the input and output data in detail;
  3. an interactive HTML table reporting statistics for each input sample, enabling easy and intuitive sample-to-sample comparison;
  4. a UCSC Track Hub to display output TMs, as well as various other tracks produced by LyRic.

(Note that features 2, 3 and 4 can be easily switched on and off).

LyRic is platform-agnostic, i.e. it can deal with FASTQ data coming from both the ONT and PacBio platforms.

Installation & Dependencies

LyRic depends on the following software:

Please install those as a prerequisite. Once this is completed:

  1. cd to the directory where you intend to run the LyRic workflow (referred to as the working directory below).

  2. Clone LyRic Snakefiles:

    git clone https://github.com/julienlag/LyRic.git ./LyRic

  3. Customize the configuration variables and cluster_config.json according to your needs.

All paths mentioned below are relative to the working directory.

Execution

Note that running LyRic under a Snakemake-compatible HPC environment such as SGE/UGE is highly recommended.

An example bash script launching LyRic in cluster/DRMAA mode is provided (run_snakemake_EXAMPLE.sh), together with its accompanying workflow configuration (config_EXAMPLE.json) and cluster configuration (cluster_config.json) files. Note that you will need to customize these config files manually before your first LyRic run. Once this is done, make sure you’re in your working directory and issue the following command to run LyRic:

./LyRic/run_snakemake_EXAMPLE.sh

Please refer to Snakemake’s documentation for more advanced usage.

Input

Mandatory input

LR FASTQ files

Sample annotation file

NOTE: Never, EVER use Excel, LibreOffice or the like to edit this file! Here’s a good TSV editor, also available as a VSCode extension.

This tab-separated file contains all metadata associated with each sample/input LR FASTQ file, as well as some customizable, sample-specific LyRic run parameters. Its path is controlled by config variable SAMPLE_ANNOT. A mock sample annotation file, named sample_annotations_EXAMPLE.tsv, is included in this repository.

All columns are optional, unless (1) stated otherwise below, or (2) the column is listed in config variable sampleRepGroupBy. The contents of each column should be mostly self-explanatory, except:

Genome sequences

The genome sequences to map RNA sequencing reads against. One genome assembly per file, in (multi-)FASTA format (i.e. all chromosomes in separate records of a single FASTA file). Each FASTA file should be named ‘<genome_id>.fa’, where <genome_id> is the genome assembly identifier, and should match a value in the capDesignToGenome{} config variable object (e.g. ‘hg38’ or ‘mm10’).

The path to the directory containing genome sequences is controlled by config variable GENOMESDIR (see below).
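The naming convention above can be checked before launching a run. The following is a minimal sketch, assuming capDesignToGenome maps each {capDesign} to a genome assembly identifier; the helper names and the example values are hypothetical and not part of LyRic.

```python
from pathlib import Path


def expected_genome_fastas(genomes_dir, cap_design_to_genome):
    """Map each distinct genome ID to the FASTA path expected for it."""
    genome_ids = sorted(set(cap_design_to_genome.values()))
    return {gid: Path(genomes_dir) / f"{gid}.fa" for gid in genome_ids}


def missing_genome_fastas(genomes_dir, cap_design_to_genome):
    """Return the genome IDs whose '<genome_id>.fa' file is absent."""
    expected = expected_genome_fastas(genomes_dir, cap_design_to_genome)
    return [gid for gid, path in expected.items() if not path.is_file()]


# Hypothetical config values, for illustration only.
example_mapping = {"capDesignA": "hg38", "capDesignB": "mm10", "capDesignC": "hg38"}
example_paths = expected_genome_fastas("genomes", example_mapping)
for genome_id, fasta_path in example_paths.items():
    print(genome_id, fasta_path)
```

Calling missing_genome_fastas() with the real GENOMESDIR value and capDesignToGenome object would list any assembly that still needs a FASTA file.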

Optional input

Short-read Illumina FASTQ files

If present, paired-end short reads contained in these files will be used to confirm splice junctions present in the LR FASTQ files. There should be one pair of Illumina FASTQ files per {capDesign} inside the fastqs/hiSeq/ subdirectory.

Only needed if config variable USE_MATCHED_ILLUMINA is True.

Reference annotation GTF

Reference gene annotation file to compare LyRic’s output annotation against. Controlled by config variable genomeToAnnotGtf (see below). This file should also contain spike-in gene annotations (e.g. SIRV/ERCC) if your samples include those.

Only needed if any of config variables produceStatPlots, produceTrackHub and produceHtmlStatsTable are True.

SIRV information file

TSV file containing SIRV information. Format should be:

<transcript_id>\t<length>\t<concentration>

Controlled by config variable SIRVinfo (see below). Only needed if any of config variables produceStatPlots and produceHtmlStatsTable are True.
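A small parser can validate a file in this three-column layout before a run. This is a sketch under stated assumptions: the file has no header line, the length is an integer and the concentration is a decimal number. The record type, the function name and the example values are made up for illustration.

```python
import csv
import io
from typing import NamedTuple


class SirvRecord(NamedTuple):
    transcript_id: str
    length: int
    concentration: float


def parse_sirv_info(handle):
    """Parse '<transcript_id>\\t<length>\\t<concentration>' lines into records."""
    records = []
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row:
            continue  # tolerate blank lines
        if len(row) != 3:
            raise ValueError(
                f"line {line_no}: expected 3 tab-separated fields, got {len(row)}"
            )
        transcript_id, length, concentration = row
        records.append(SirvRecord(transcript_id, int(length), float(concentration)))
    return records


# Made-up content, for illustration only.
example_text = "SIRV101\t1591\t0.5\nSIRV102\t1375\t2.0\n"
sirv_records = parse_sirv_info(io.StringIO(example_text))
print(sirv_records)
```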

Repeat annotation file

RepeatMasker BED file, containing the coordinates of repeat regions to compare TMs against. Can be easily downloaded from the UCSC Table Browser. BED files should be named ‘<genome_id>.repeatMasker.bed’, where <genome_id> is the genome assembly identifier, and should match a value in the capDesignToGenome{} config variable object (e.g. ‘hg38’ or ‘mm10’). The path to the directory containing repeat files is controlled by config variable REPEATMASKER_DIR (see below).

Only needed if any of config variables produceStatPlots and produceHtmlStatsTable are True.

Capture-targeted regions

GTF file of non-overlapping capture-targeted regions for each (post-capture) {capDesign}. Only for RNA capture samples. Each region should be identified by its transcript_id and labelled using the gene_type GFF attribute to group features into target types (e.g. by gene biotype).

Filepath is controlled by config variable capDesignToTargetsGff (see below).

Only needed if any of config variables produceStatPlots, produceTrackHub and produceHtmlStatsTable are True, and CAPTURE is True.
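To preview how targets will be grouped into target types, the transcript_id and gene_type attributes can be read from the ninth GTF column. The sketch below assumes standard GTF attribute syntax (key "value"; pairs); the function names and the example lines are hypothetical.

```python
import re
from collections import defaultdict

# Matches GTF attribute pairs of the form: key "value"
ATTRIBUTE_RE = re.compile(r'(\S+)\s+"([^"]*)"')


def parse_gtf_attributes(field):
    """Turn the 9th GTF column into a dict of attribute name -> value."""
    return dict(ATTRIBUTE_RE.findall(field))


def group_targets_by_type(gtf_lines):
    """Group target transcript_ids by their gene_type attribute."""
    groups = defaultdict(set)
    for line_no, line in enumerate(gtf_lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            raise ValueError(f"line {line_no}: expected 9 columns, got {len(columns)}")
        attrs = parse_gtf_attributes(columns[8])
        groups[attrs["gene_type"]].add(attrs["transcript_id"])
    return {target_type: sorted(ids) for target_type, ids in groups.items()}


# Made-up target regions, for illustration only.
example_gtf = [
    'chr1\tcapture\texon\t100\t200\t.\t+\t.\ttranscript_id "target1"; gene_type "lncRNA";\n',
    'chr1\tcapture\texon\t500\t900\t.\t-\t.\ttranscript_id "target2"; gene_type "lncRNA";\n',
    'chr2\tcapture\texon\t50\t150\t.\t+\t.\ttranscript_id "target3"; gene_type "protein_coding";\n',
]
target_groups = group_targets_by_type(example_gtf)
print(target_groups)
```

A region lacking either attribute raises a KeyError here, which is a convenient way to spot unlabelled targets early.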

Workflow configuration variables

The following config variables are user-customizable. These can be set either via a config file (Snakemake’s --configfile FILE option) or directly via command line options (--config [KEY=VALUE [KEY=VALUE ...]]). See Snakemake’s CLI documentation for more details.
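As an illustration of the two routes, the sketch below builds a config object, serializes it as JSON for --configfile, and renders its scalar entries as KEY=VALUE strings for --config. The variable names come from this README, but every value and value type shown is a placeholder assumption; config_EXAMPLE.json remains the reference for the real layout.

```python
import json

# Placeholder values only; adapt them to your own setup.
example_config = {
    "SAMPLE_ANNOT": "sample_annotations.tsv",
    "GENOMESDIR": "genomes/",
    "capDesignToGenome": {"myCapDesign": "hg38"},
    "produceStatPlots": False,
    "produceHtmlStatsTable": False,
    "produceTrackHub": False,
}


def scalar_overrides(config):
    """Render scalar entries as KEY=VALUE strings suitable for --config."""
    return [
        f"{key}={value}"
        for key, value in config.items()
        if isinstance(value, (str, int, float, bool))
    ]


# Text that could be saved to a file and passed with --configfile.
config_text = json.dumps(example_config, indent=2)
print(config_text)
print(" ".join(scalar_overrides(example_config)))
```

Nested objects such as capDesignToGenome are left out of the KEY=VALUE rendering, since they are more conveniently kept in the config file.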

Mandatory config variables

Optional config variables

Output

All files output by LyRic are written under ‘./output/’ in the working directory. LyRic will generate various types of output files, listed below.

The production of each output type can be turned on and off using config variables, namely ‘produceStatPlots’, ‘produceHtmlStatsTable’ and ‘produceTrackHub’. If all these are set to False, the workflow will only produce transcriptome GTF files (which cannot be switched off).

Transcriptome GTF files

All reads and transcripts are collapsed/merged using tmerge.

Sample-specific TMs

(Output directory: output/mappings/mergedReads/)

One GTF file per input FASTQ.

(missing section)

TMs merged across replicates

(Output directory: output/mappings/mergedReads/groupedSampleReps/)

Sample-specific TMs can be further merged across samples according to config variable sampleRepGroupBy. The output transcriptome files will be named based on the column values used to group by. For example, given config_EXAMPLE.json and sample_annotations_EXAMPLE.tsv:

Summary statistics plots

(Output directory: ./output/plots/)

Controlled by config variable ‘produceStatPlots’: boolean. If True, multiple statistics plots in PNG format will be output inside the output directory.

(incomplete section)

Interactive HTML summary stats table

(Output directory: ./output/html/)

Production of the interactive HTML summary table is controlled by config variable ‘produceHtmlStatsTable’. If set to True, detailed reports containing various per-sample statistics will be produced in the output directory.

LyRic will produce one table per distinct subProject value in the input sample annotation file, as long as that value corresponds to actual input FASTQ files (./output/html/summary_table_{subProject}.html), plus a global one containing info for all samples (./output/html/summary_table_ALL.html).

For each interactive HTML summary stats table, an accompanying TSV file with the same basename and the .tsv extension will also be produced. It contains the same data as the HTML table, in an easily parsable tab-separated format.
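The TSV companions can be loaded with Python's standard csv module. This sketch derives the expected TSV paths from the naming scheme described above and reads rows into dictionaries; the column names and values in the mock table are invented, since the real columns are not listed here.

```python
import csv
import io
from pathlib import Path


def summary_table_tsv_paths(sub_projects, html_dir="output/html"):
    """Expected TSV paths: one per subProject, plus the global 'ALL' table."""
    names = ["ALL", *sorted(set(sub_projects))]
    return {name: Path(html_dir) / f"summary_table_{name}.tsv" for name in names}


def read_summary_rows(handle):
    """Read a tab-separated summary table into a list of dicts keyed by header."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Mock table with invented columns, for illustration only.
mock_tsv = "sample\treads\nsampleA\t1000\nsampleB\t2500\n"
summary_rows = read_summary_rows(io.StringIO(mock_tsv))
tsv_paths = summary_table_tsv_paths(["projectX"])
print(summary_rows)
print(tsv_paths)
```

With a real run, the handle would come from open(tsv_paths["ALL"], newline="") rather than from an in-memory string.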

(incomplete section)

UCSC Track Hub

(Output directory: ./output/trackHub/)

Controlled by config variable ‘produceTrackHub’ (boolean). If True, LyRic will generate a UCSC Track Hub structure in the corresponding output directory. The URL of the resulting Track Hub directory (i.e. where the Genome Browser should fetch the hub.txt file) is controlled by config variable TRACK_HUB_BASE_URL.

(missing section)

Glossary / Abbreviations

HCGM

High-Confidence Genome Mappings.

Filtered read-to-genome mappings characterized by:

Produced by Snakemake rule highConfidenceReads.

HiSS

Hi-Seq-Supported read mappings.

Those correspond to HCGMs that have all their splice junctions supported by at least one split read in the corresponding {capDesign}-matched HiSeq sample, if use_matched_HiSeq is set to true for the corresponding sample in the sample annotation file (config[SAMPLE_ANNOT]). If use_matched_HiSeq is false, HiSS reads are exactly equivalent to HCGMs.
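The definition above amounts to a simple predicate over a read's splice junctions. The following sketch only illustrates that logic and is not the code of rule getHiSeqSupportedHCGMs: the junction representation, the function name and the treatment of unspliced reads (accepted, because they have no junction to check) are assumptions.

```python
def is_hiss(read_junctions, split_read_counts, use_matched_hiseq):
    """Return True if an HCGM qualifies as HiSS.

    read_junctions: junctions of the long read, e.g. (chrom, start, end, strand).
    split_read_counts: junction -> number of supporting short split reads.
    use_matched_hiseq: per-sample flag from the sample annotation file.
    """
    if not use_matched_hiseq:
        # Without matched short reads, every HCGM counts as HiSS.
        return True
    # Every junction needs at least one supporting split read.
    return all(split_read_counts.get(junction, 0) >= 1 for junction in read_junctions)


# Made-up junctions and counts, for illustration only.
junction_1 = ("chr1", 201, 299, "+")
junction_2 = ("chr1", 401, 499, "+")
counts = {junction_1: 3, junction_2: 0}

fully_supported = is_hiss([junction_1], counts, use_matched_hiseq=True)
partly_supported = is_hiss([junction_1, junction_2], counts, use_matched_hiseq=True)
unfiltered = is_hiss([junction_1, junction_2], counts, use_matched_hiseq=False)
print(fully_supported, partly_supported, unfiltered)
```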

Produced by Snakemake rule getHiSeqSupportedHCGMs.

LR

Long sequencing Read (typically produced by the PacBio and ONT platforms).

TM

Transcript Model.

The evidence-based model of an RNA transcript represented as the genomic coordinates of its intron-exon structure. A gene model contains a set of exon-overlapping TMs.
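As a data structure, a TM can be sketched as an ordered list of exon intervals from which introns are derived. The class below is a hypothetical illustration, not LyRic's internal representation; it assumes 1-based inclusive coordinates as in GTF, and it assumes that exon overlap is only evaluated between TMs on the same chromosome and strand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranscriptModel:
    transcript_id: str
    chrom: str
    strand: str
    exons: tuple  # (start, end) pairs sorted by start, 1-based inclusive

    def introns(self):
        """Derive intron intervals from the gaps between consecutive exons."""
        return [
            (prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(self.exons, self.exons[1:])
        ]


def share_exon_overlap(tm_a, tm_b):
    """True if any exon of tm_a overlaps any exon of tm_b (same chrom and strand)."""
    if tm_a.chrom != tm_b.chrom or tm_a.strand != tm_b.strand:
        return False
    return any(
        start_a <= end_b and start_b <= end_a
        for start_a, end_a in tm_a.exons
        for start_b, end_b in tm_b.exons
    )


# Made-up coordinates, for illustration only.
tm1 = TranscriptModel("tm1", "chr1", "+", ((100, 200), (300, 400)))
tm2 = TranscriptModel("tm2", "chr1", "+", ((350, 500),))
tm3 = TranscriptModel("tm3", "chr1", "+", ((210, 290),))
print(tm1.introns())
print(share_exon_overlap(tm1, tm2), share_exon_overlap(tm1, tm3))
```

Here tm3 lies entirely within the intron of tm1, so the two share no exonic overlap, whereas tm1 and tm2 do overlap and would belong to the same gene model under the definition above.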