
# NLRome assemblies of Cucumis melo derived from Nanopore adaptive sampling sequencing

## Description

Repository housing the scripts used for data analysis, targeted assembly of NLR clusters, and plot generation in the following projects:

**PROJECT_1.** "NLRome assemblies of Cucumis melo derived from Nanopore adaptive sampling sequencing"

**PROJECT_2.** "Nuclear and organelle genome assemblies of five Cucumis melo L. accessions, Ananas, Canton, PI 414723, Vedrantais and Zhimali, belonging to diverse botanical groups"

**PROJECT_3.** "NLRome assemblies across diverse melon accessions: A resource for association studies and breeding for resistance"

### Associated publications

Belinchon-Moreno, J., Berard, A., Canaguier, A., Chovelon, V., Cruaud, C., Engelen, S., ... & Faivre Rampant, P. (2023). Nanopore adaptive sampling to identify the NLR-gene family in melon (Cucumis melo L.). bioRxiv, 2023-12.

Belinchon-Moreno, J., Berard, A., Canaguier, A., Le-Clainche, I., Rittener-Ruff, V., Lagnel, J., ... & Faivre Rampant, P. (2024). Nuclear and organelle genome assemblies of five Cucumis melo L. accessions, Ananas, Canton, PI 414723, Vedrantais and Zhimali, belonging to diverse botanical groups. bioRxiv, 2024-10.

### Data included

#### PROJECT_1

- The structural and functional annotations of the _de novo_ whole genome assemblies of **Anso77** and **Doublon**: Directory **PROJECT_1/annotations_Anso77_Doublon**

- The R scripts used to generate some of the graphs illustrating the publication "Nanopore adaptive sampling to identify the NLR-gene family in melon (Cucumis melo L.)": Directory **PROJECT_1/graphs.R**

- The script **coverage_per_time.sh**, used in the publication "Nanopore adaptive sampling to identify the NLR-gene family in melon (Cucumis melo L.)" to compute the sequencing depth generated by NAS and WGS at each time period of the sequencing run: Directory **PROJECT_1/coverage_per_time.sh**
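
To illustrate what "depth per time period" means, here is a minimal sketch of one way to compute cumulative depth per hour. It is not the contents of **coverage_per_time.sh**. The input format (a tab-separated file with the read start time in seconds and the number of aligned bases), the function name `depth_per_hour` and the one-hour window are all assumptions made for this example.

```shell
#!/bin/sh
# Sketch only; this is not the contents of coverage_per_time.sh.
# Assumed (hypothetical) input: tab-separated lines of
#   <read start time in seconds since run start> <bases aligned to the region>
# Output: one line per hour with the cumulative mean depth over the region.
depth_per_hour() {
    reads_tsv=$1
    region_size=$2   # size of the target region (or genome) in bases
    LC_ALL=C awk -F'\t' -v size="$region_size" '
        {
            hour = int($1 / 3600)          # 0-based hour of the run
            bases[hour] += $2
            if (hour > last) last = hour
        }
        END {
            if (NR == 0) exit              # no reads, nothing to report
            for (h = 0; h <= last; h++) {
                total += bases[h]          # cumulative bases up to this hour
                printf "%d\t%.2f\n", h + 1, total / size
            }
        }' "$reads_tsv"
}
```

Calling `depth_per_hour reads.tsv 1000000` would print the hour number and the cumulative depth reached by the end of that hour. Running it once on the NAS reads and once on the WGS reads gives two curves that can be compared.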

#### PROJECT_2

- The scripts used for WGS assembly and annotation of five melon accessions: **Ananas, Canton, PI 414723, Vedrantais** and **Zhimali**: Directory **PROJECT_2/scripts_assemblies_5genomes** 

#### Scripts for targeted sequencing processing and assembly of NLR clusters (PROJECT_1, PROJECT_2 and PROJECT_3)

- The scripts for data analysis, from raw sequences to target assemblies, are located in the directories **scripts_half_NAS_WGS_processing** (for accessions sequenced with half of the flowcell in NAS mode and half in WGS mode) and **scripts_NAS_processing** (for accessions sequenced with NAS on the whole flowcell).

## Abstract

Understanding and characterizing the defense mechanisms of plants against pathogens is a major challenge for sustainable agriculture and food production. This knowledge facilitates the creation of varieties with a very broad spectrum of resistance by maximizing the use of known defense mechanisms. In this context, the structural and functional characterization of the complete set of Nucleotide-binding-site-Leucine-rich-Repeat (NLR) disease resistance genes in a species (its NLRome) is of particular interest.

NLR genes encode the most diverse family of plant resistance proteins. They play a central role in the so-called effector-triggered immunity of plants by recognizing specific pathogen effectors. In melon (Cucumis melo L.), a highly important and widely cultivated vegetable crop belonging to the Cucurbitaceae family, the specific role of each NLR gene remains largely unknown, especially in relation to quantitative resistances. This lack of information is a consequence of the poor sequencing, assembly and annotation of these genes with short-read sequencing, due to their complex structure and organization. NLR genes are often arranged in clusters that include a large number of repetitive elements. This, together with the repetitive internal structure of NLR genes (the leucine-rich repeat domain), makes them prone to major evolutionary structural changes such as duplication or transposition. For these reasons, NLR clusters commonly show a high level of presence-absence variation (PAV). Short reads are usually ineffective for characterizing them, whereas long-read sequencing methods may provide valuable information. However, long-read methods can still be expensive in terms of wet-lab work, bioinformatics and data storage when a large number of samples need to be evaluated. Nanopore Adaptive Sampling (NAS) is presumed to be a good approach here, since it can reduce the quantity of data to manage while increasing the coverage of the target regions compared to whole genome sequencing (WGS). In addition, it is a cost- and labor-effective solution compared to other complex and time-consuming targeted sequencing methods. NAS offers a promising approach for assessing genetic diversity in targeted genomic regions.

We designed and validated an experiment to enrich a set of resistance genes in several melon cultivars as a proof of concept. We showed that each of the 15 regions we identified in two newly assembled melon genomes, ANSO-77 and DOUBLON (subspecies melo), was successfully and accurately reconstructed, both in these accessions and in a third cultivar from the agrestis subspecies (CHANGBOUGI). We obtained a fourfold enrichment, independent of the sample but with some variation between enriched regions. In the agrestis cultivar, we further confirmed our assembly by PCR. By extending the use of NAS to other melon varieties, we demonstrated that it is a simple and efficient approach to explore complex genomic regions. This approach unlocks the characterization of resistance genes in a large number of individuals, as required for breeding new cultivars suited to the agroecological transition.

## Usage of scripts for target sequencing processing and assembly

To perform a complete analysis of the samples, it is only necessary to run the bash script **launcher_analysis_NAS_WGS.sh** (for accessions sequenced with half of the flowcell in NAS mode and half in WGS mode) or **launcher_analysis_NAS.sh** (for accessions sequenced with NAS on the whole flowcell).

All other scripts are launched by these launchers, except the R scripts for plot creation and the assemblers other than SMARTdenovo.

R scripts for plot creation are not launched by the pipeline. They are provided and they can be run with the data generated by the launchers.

The same applies to assemblers other than SMARTdenovo. They are provided in the directory **scripts_NAS_processing**, and they can be run with the data generated by the launcher. We used them when manual inspection of the reads mapped to the assembly revealed inconsistencies, meaning that there were assembly problems caused by collapsed segmental duplications, unassembled clusters similar to an assembled one, or clusters with a certain degree of heterozygosity where both haplotypes were collapsed. In those cases, we ran Flye, Canu, Shasta or NextDeNovo (to resolve segmental duplications or unassembled contigs), or hifiasm (to separate both haplotypes). Prior to hifiasm assembly, ONT reads were corrected using the provided herro_* scripts. When an alternative assembler solved the mapping problem, we manually recovered those contigs and used them to replace the ones erroneously assembled with SMARTdenovo.

The .csv file **list_runs** with the information of each sample must be available in the working directory.
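
Before starting a launcher, it can help to confirm that the run list is present and mentions the sample. The following is a minimal sketch of such a pre-flight check and is not part of the repository's scripts. The function name `sample_in_list` is hypothetical, and the column holding the sample name (first column by default) is an assumption to adapt to the real layout of **list_runs**.

```shell
#!/bin/sh
# Sketch of a pre-flight check; not part of the repository's scripts.
# Assumptions: the file is comma-separated and the sample name sits in the
# first column; pass another column number as the third argument if needed.
sample_in_list() {
    list_file=$1
    sample=$2
    column=${3:-1}
    if [ ! -f "$list_file" ]; then
        echo "Missing run list: $list_file" >&2
        return 2
    fi
    # Exit status 0 only if some row has the sample name in the chosen column.
    awk -F',' -v s="$sample" -v c="$column" \
        '$c == s { found = 1 } END { exit !found }' "$list_file"
}
```

The function returns 0 when the sample is listed, 1 when it is not, and 2 when the file is missing, so it can guard a launcher call with `sample_in_list PATH_TO_LIST SAMPLE1 || exit 1`.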

Multiple samples can be run in a loop by adding their names after the script name (sh launcher_analysis_NAS.sh SAMPLE1 SAMPLE2 ...)
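
Passing several sample names implies a loop over the positional arguments. The following minimal sketch shows that pattern and is not the launchers' actual code. `process_sample` is a hypothetical placeholder for the per-sample pipeline steps, and `run_samples` is a name chosen for this example.

```shell
#!/bin/sh
# Sketch of a per-sample loop; not the launchers' actual code.
# process_sample is a hypothetical placeholder for the per-sample steps.
process_sample() {
    echo "Processing sample: $1"
}

# Loop over every sample name given as an argument.
run_samples() {
    if [ "$#" -eq 0 ]; then
        echo "usage: run_samples SAMPLE1 [SAMPLE2 ...]" >&2
        return 1
    fi
    for sample in "$@"; do
        # Report a failed sample and carry on with the next one.
        process_sample "$sample" || echo "Sample failed: $sample" >&2
    done
}

# Run only when sample names were supplied on the command line.
if [ "$#" -gt 0 ]; then
    run_samples "$@"
fi
```

Quoting `"$@"` keeps each sample name as a single argument, and reporting a failure without stopping lets the remaining samples run.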

## Support
javier.belinchon-moreno@inrae.fr

## Authors and acknowledgment
Javier Belinchon-Moreno, Aurelie Berard, Aurelie Canaguier, Véronique Chovelon, Corinne Cruaud, Stéfan Engelen, Rafael Feriche-Linares, Isabelle Le-Clainche, William Marande, Vincent Rittener-Ruff, Jacques Lagnel, Damien Hinsinger, Nathalie Boissot, Patricia Faivre Rampant

## License
NLRome project