Coronavirus Pandemic Emergence of Novel SARS-CoV-2 variants in India: second wave

Introduction: Globally South-East Asia reported 40% of SARS-CoV-2 infected cases in the fourth week of April 2021. It continued to show an increase with India accounting for 50% of cases worldwide and 30% of global deaths. Genomic surveillance should continue at a rapid pace because of the continuously evolving nature of the virus. The time period of sample collection from the Global Initiative on Sharing All Influenza Data database was concurrent with the surge in new cases seen in the Indian subcontinent. Methodology: 7,415 sequences were downloaded from Global Initiative on Sharing All Influenza Data between January and April 2021; out of which 4,411 were high coverage genome sequences and were considered for analysis. Phylogenetic analysis were carried out using Nextstrain. Results: 21A or B.1.617 or delta was the most prevalent lineage in India accounting for 67.7% of the genomes. Next important clades were 20A, 20B and 20I accounting for 23.6%, 11.8% and 12.1% respectively collected between January 2021 and April 2021. The remaining sequences were assigned to clade 20H, 20J, 20D, 20C, 20G,20E,19A and 19B.The spike mutation frequencies of L452R, E484Q and P681R in Indian state of Maharashtra were 62.4%, 66.5% and 61.5% respectively. Two unique N-terminal domain deletion of spike protein were found at position 67 and 68. Conclusions: The phylogenomics of the delta variant or 21A emerged in neighboring Asian countries of Thailand, Bangladesh, Indonesia and Japan. We analyzed the SARS-CoV-2 genomes from India for mutation characterization of the spike glycoprotein and the nucleocapsid protein.


Introduction
The extremely contagious severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has spread all over the globe since its emergence in China in 2019. Globally the pandemic is showing no signs of ebbing with increasing death rate since February 2021 [1,2]. A sharp rise in cases were observed in South-East Asia, West Pacific and Eastern Mediterranean regions with accelerated death rate. Globally the mortality which took nine months to reach 1 million last year, reached 3 million deaths in just three months in 2021. The number of weekly new cases globally showed an increasing trend with 5.2 million reported during the third week of April 2021 [2]. The largest increase was reported by the South-East Asian region mainly India.
The number of new cases reported from India, however, which was 1.5 million cases in the third week of April rose to 2.7 million new cases in the first week of May [2]. The mortality was also at an all-time high of 90,000 deaths in the week.
Comparative genomic studies have made possible the study of SARS-Cov-2 genomic sequences through the open access repositories such as Global Initiative on Sharing All Influenza Data (GISAID) [3].
The nomenclature system of GISAID classified the SARS-CoV-2 genome into seven major clades. The clade belonging to SARS-CoV-2 strains include L, S, V, G, GH, GR and GV. The GR was most prevalent worldwide followed by GV and GH [4].
The pattern of SARS-CoV2 global spread can be understood by the Phylogenetic Assignment of Named Global Outbreak LINeages (PANGOLIN) tool [5] explaining detailed lineages. Nextclade assignment supports the PANGOLIN lineage system characterizing the mutations into nine clades [6]. New major clades are named once the frequency of a clade exceeds 20% in a representative global sample and that clade differs in at least two positions from its parent clade. The Nextstrain clades 19A, 19B, 20A, 20B, 20C and 20I are used for annotation of strains. Variants belonging to clade 20I, 20H and 20J originated in U.K., South Africa and Brazil respectively were synonymous with SARS-CoV-2 virus lineage B1.1.7 [5], B.1.351 [7] and P1.
Due to low sequencing carried out and the slow rate of genomic data availability in databases, tracking of SARS-CoV-2 variants circulating in Indian states has been limited. The current study highlights the major lineages and the important mutations and their distribution in the Indian subcontinent.

Methodology
The Indian SARS-CoV-2 genomes deposited in GISAID [3] during the period 1 st January -25 th May 2021 were selected and their metadata were retrieved for the analysis. A total of 7415 sequences were available from which 4,411 genome sequences with high coverage were considered for the analysis. We considered the sequences with high coverage for amino acid substitution analysis and phylogenetic clustering, Furthermore, for molecular analysis of sequences from non-Indian origin, high quality genome sequences with full coverage were retrieved from GISAID.
Multiple Alignment using Fast Fourier Transform (MAFFT) was used to perform the multiple sequence alignment [8] implemented via an Augur phylodynamic analysis pipeline. Afterwards, the open-source interactive tool Auspice for visualizing genomic data implemented via Nextstrain was used.
Nextclade beta [6] was used to check the quality of sequence and to generate the nextclade designated tree. The joint global effort by Nextstrain [6,9] which is an open-source project was initiated in 2018 and integrates the genomic data of SARS-CoV-2. Amino acid mutation analysis was done using CoVsurver mutation analysis with reference to hcov19/Wuhan strain.
The sequences available in GISAID from eight states in India had variable number of genome sequences deposited. The proportion of sequences available in the database are depicted ( Figure 1) which were analysed for the geographical distribution of signature mutations in spike protein L452R, E484Q, P681R from West to East of India ( Figure 1, Figure  2).The nucleoprotein substitution R203K showed a prevalence of only 0.18% (n = 202) in the western state of Maharashtra as compared to the eastern state of West-Bengal which has a prevalence of 42.3% (n = 357) ( Figure 2). The 203K substitution were predominantly found in 20I and 20B.
The mutations of spike protein H69del located in the N-terminal domain (NTD) were found with a  frequency of 80%, 40%, 25% and 16% from Europe, Oceania, U.S.A. and South America respectively. In India it was present at a frequency of 7% (n = 216) of genomic sequences between January and April 2021 ( Figure 3, Table 1). The H69del was found in higher abundance in 20I within the twelve clades and found to be emerging in 20A, 20B, 20C,20D, 19B and 21A. The other significant mutations observed in N-terminal domain (NTD) of spike protein were Y144del/Y145del (n = 129), V70del (n = 215), A67del (n = 5) and I68del (n = 3). The last two substitutions were not reported from any other country. The NTD deletion region are depicted in sequence alignment ( Table 2).
The Nextstrain clade 21A or B.1.617.2 or delta was the most prevalent lineage among the sequences reported from India. The other countries where the 21A lineage has been identified were Bangladesh, Japan, Malaysia, Thailand, Singapore.

Discussion
Although mortality is variable in geographic regions, the sudden surge in cases in South-east Asia largely contributed by India is alarming. Scientist around the globe have realized the necessity of rapid large scale genome analysis of the virus SARS-CoV-2. The classification the genomes of the second wave is essential for surveillance. With the recent nomenclature of 21A or B.  The signature mutations in spike protein receptor binding domain involved in increased ACE2 binding have been studied [10]. Substitutions at position 452, 484 and 681 of receptor binding domain have been reported in other globally circulating lineages recently. The distribution of the substitutions at the abovementioned positions from India were studied ( Figure 3). The western state of Maharashtra showed high prevalence of L452R, E484Q and P681R substitution at 62.4% (n = 559), 66.5% (n = 774), 61.5% (n = 527) as compared to eastern states of West Bengal which had frequency of 14.9% (n = 134),14.9% (n = 116) or 16.2% (n = 139) respectively (Table 3).
Of particular mention is the nucleocapsid protein (N) substitution R203K which showed lower frequency of 23% (n = 202) in Maharashtra as compared to West Bengal which had a frequency of 40.7% (n = 357). The N protein is involved in virus assembly and replication. It plays pivotal role in targeting innate host immunity and IFN inhibition [11]. The viral strategy is to shield the RNA from being detected by host immune RNA sensors [12]. Zhao et al, showed that mutation at 203 had a greater effect on suppression of IFN bringing about immune evasion. It may be hypothesized that the substitution of arginine with lysine may also be involved in immune escape. Overall, the mutation warrants further investigation to understand the effect of alteration. The other mutations found in our analysis were D63G, D377Y and Q384H, R191H and S180I in the N protein. The N protein has also been a target for drug discovery [12] and the mutation studies will help in bringing many antivirals to the market.
Antibodies have been developed with neutralizing activity towards NTD which may be of therapeutic potential [13][14][15]. Deletions in NTD of spike have been reported to bring about reduced sensitivity to neutralizing antibodies [16]. NTD deletions abolish the binding of antibody and have been termed as 'escape variants' and resist neutralization by antibody [17][18][19].
The deletion at 69-70 and 144/145 found in B.1.1.7 lineage have been studied (Figure 3). The deletion region identified by us ( Table 2) at position A67X and I68X in a total of eight sequences comprised of the N2 loop of NTD epitope recognition region [10]. Any deletion in the region may bring about loss of conformational stability. Surveillance of deletion/insertion region is important in epidemiology for monitoring vaccine efficacy and primer-based detection methods. The analysis of variant of concern 20H or B.1.351 first identified in South Africa and 20I or B.1.1.7 lineage which first emerged in UK have been studied with respect to important mutations. The appearance of clade and naming of 21A by WHO as delta and kappa variant has highlighted the need for further studies.
The frequency of the variant subsequently labelled 21A [10] or delta variant or B.1.617.2 in India increased from 0% at the beginning of March 2021 to > 33% of   The changing nomenclature of clade by Nextstrain was fruitful in tracking variants as compared to previous analysis conducted a month ago as more sequences were deposited in GISAID.

Conclusions
Focusing on the phylogenetic classification and the clade assignment it was clear that sequences from India classified as 21A or B.1.617.1/B.1.617.2 may probably be the cause of high transmission. Other variants identified were twelve in number, out of which the variant of concern were 20H, 20I and 20J. B.1.617 was named as a global Variant of Concern on 11 May 2021 and the sublineage 1 and 2 as kappa and delta respectively. Both sub lineages are associated with increased transmissibility. The tracking of mutation is also of significance in development of antigen detection kit and drug discovery.

Data Availability Statement
The data used in the study were downloaded and can be found on the official GISAID repository.The Accession ID of the sequences used in this study can be found in Supplementary Table 1.