This study design was approved by the Hirosaki University Hospital Institutional Review Board (Protocol Number Num. 2017-026). Written informed consent was obtained from all the patients in this study. All experiments were performed in accordance with relevant guidelines and regulations. No animal experiments were conducted in this study.
Sample collection and preparation
We analyzed the data obtained from the Hirosaki Center of Innovation (COI) health promotion project, which has collected cohort data from the year 2005 in Hirosaki (the population is approximately 10,000) in northern Japan. Each year, approximately 1000 people have their health status checked through a free participation program. The dataset actually has numerous missing data because the participants do not necessarily undergo a checkup each year. For example, some participants have participated in all years, while others have participated only once in any given year. Thus, the number of analyzed participants was displayed in each experiment. Questionnaires and blood tests were collected over the entire study period, and single nucleotide polymorphisms (SNPs) were identified for participants in 2014, 2015, 2016, and 2017; WGS analysis was performed for participants in 2014 and 2015; 16S rRNA sequencing of the oral and gut microbiomes was performed for participants in 2015, 2016, 2017, and 2018, and shotgun metagenomic sequencing of the oral and gut microbiomes was performed for participants in 2015 and 2016.
Participant selection and covariate assessment
The main exclusion criteria were as follows: (1) participants were younger than 18 years of age or were 55 older; (2) those with an estimated glomerular filtration rate (eGFR) below 60 [mL/min/1.73 m2] as calculated from serum creatinine levels and age at admission; (3) those with serum total bilirubin levels above 2 [mg/dL](4) those with serum albumin levels above 35 [g/dL], (5) those with a BMI above 35 or below 18, (6) those who were prescribed medicine for diabetes mellitus as per their questionnaires, (7) those who were prescribed antidiabetic or antibiotic medication as per their questionnaires, for the analysis of oral and gut microbiome; and (8) those with missing information regarding any of their covariates. For medication, we consider the presence of medication at the time of checkup. For the determination of diabetic and obese participants, we adopted the criteria with HbA1c larger than 6.5 and BMI larger than 30, respectively. The flowchart of participant selection is illustrated in the supplementary material.
The participant covariate information was obtained from the blood value tests and questionnaires. The questionnaires were administered upon induction into the study. (a) Age and (b) sex information were obtained from questionnaires (sex was also checked using genetic data), (c) smoking status was classified as never (0) and current and former (1); (d) drinking status was classified as non-drinker (0), and current and former drinker (1). Participants who used medication for (e) high blood pressure, (f) dyslipidemia, (g) steroids, and (h) antibiotics were flagged as non-use (0) and use (1). Finally, we performed principal component analysis (PCA) for genotyping using the algorithm implemented in smartpca24and used the top five principal components (PCs) as covariates to represent the population structure based on the genetic correlations among individuals.
In previous studies8,9,10, the expected effect size |r| of the correlation between AMY1-CN and obesity (BMI) has been estimated to be 0.416–0.512, although several studies have reported negative results for that correlation. Also, to our knowledge, no results have been reported on the association between AMY-CNs and diabetes. Therefore, we assume that the target effect is moderate and that Cohen’s effect size f is greater than 0.20. In this case, a regression model with a significance level of 0.05, a statistical power of 8.0E–1, and 14 coefficient would require a sample size of 471. In this study, there are 578 diabetic and 676 obese patients, of which 14 and 25 are cases, respectively. The analysis will also be conducted using BMI and HbA1c values as continuous values as the objective variables for obesity and diabetes, respectively.
SNPs and whole genome sequencing
DNA was purified from peripheral whole blood using a DNA extraction kit (QIAamp 96 DNA Blood Kit (QIAGEN, Hilden, Germany)) and was extracted from plasma pellets for SNP arrays and whole genome sequencing, respectively. SNP information was obtained from Japonica Array (Toshiba, Tokyo, Japan)25 and comprised population-specific SNP markers designed from the 1070 whole genome reference panel26and whole genome sequencing was performed by Takara Bio Corporation (Shiga, Japan).
Analysis of AMY1, AMY2A, and AMY2B copy number variations
To detect AMY1, AMY2A, and AMY2B CNVs from the WGS data, we used AMYCNE21, which was developed to estimate the AMY CNV from sequencing data. The input data for this software was prepared using TIDIT20 which was also developed to clarify the structural variants.
SNP analysis for AMY1, AMY2A, and AMY2B regions
For SNP detection, we applied the Genomon pipeline (https://github.com/Genomon-Project), which has been developed to investigate mutations from exome/whole genome sequencing data. The filter conditions were as follows: minor allele frequency of 0.02 or higher, minimum read depth higher than 8, and 10% posterior percentile of the minor allele count based on a binomial distribution greater than 0.02, and a minimum base quality of 15 or higher .
Microbiome samples were obtained from participants’ tongue plaques and stool samples for oral and gut microbiomes, respectively, and were stored at − 80 °C until use. The detailed library preparation method, including PCR conditions, has been described in a previous paper27. Briefly, the samples were mixed with zirconia beads and lysed using a FastPrep FP100A instrument (MP Biomedicals, Santa Ana, CA, USA). DNA was extracted from the bead-treated suspensions using a Magtration System 12GC and GC series MagDEA DNA 200 (Precision System Science, Chiba, Japan) and amplified. The 16S rRNA gene amplicons covering the V3–V4 region were sequenced using Pro341F and Pro805R primers. Sequencing was performed using a paired-end, 2 x 250-base pair cycle run on an Illumina MiSeq sequencing system. Shotgun metagenomic sequencing was performed on the HiSeq2500 instrument (Illumina, San Diego, CA, USA) with 101 bp paired-end reads.
Calculation of genus level microbiome abundance profiles from 16S rRNA sequences
Adapter sequences and low-quality bases at the 3′-end (threshold = 30) were trimmed using Cutadapt (version: 1.13)28 from paired-end reads obtained using the Illumina MiSeq sequencing system. In addition, the reads containing N bases and reads shorter than 150 bases were removed. The remaining reads were merged into single reads, and those with a length of 370–470 bases were removed using the VSEARCH29 fastq_mergepairs subcommand (version: version 2.4.3). After merging, we removed reads that were expected to have one or more base errors from the base quality. After removing chimeric reads using the VSEARCH29 uchime_denovo subcommand, the reads were clustered with 97% sequence identity and the taxa of each cluster were identified using the RDP Classifier (git commit hash: (commit hash: 701e229dde7cbe53d4261301e23459d91615999d))30. At this time, the result with a confidence value < 0.8, was treated as unclassified. The relative abundance of each taxon was then calculated by dividing the number of reads belonging to the cluster assigned to which the taxon was by the total number of reads.
Calculation of functional gene abundance from shotgun metagenomic sequencing
Adapter sequence and low-quality bases at the 3′-end (threshold = 30) were trimmed using Cutadapt (version: 1.13) from 2 × 100-base paired-end reads obtained using the Illumina HiSeq 2500 system. Reads containing N bases and reads shorter than 80 bp were removed. Reads from the host genome were also removed using BWA-MEM (version 0.7.15)31 with GRCh38. From the remaining reads (referred to as non-host reads), contigs were generated by metagenomic assembly using MEGAHIT (version 1.1.1)32. The coding regions of genes on these contigs were predicted using Prodigal (version 2.6.3)33. After mapping the non-host reads to contigs using BWA-MEM, the relative abundances of the genes were calculated from the average depth of the gene region. KEGG pathway annotations were mapped to genes using a sequence similarity search with MMseqs2 (version: 1d2579627f43662ecaaa0778bd348fc35048976a)34. We used 1E–10 as the E-value threshold for the similarity search.
In the first criterion for AMY1-CN, participants were divided into low (AMY1-CN range 3–10) and high (AMY1-CN range 11–24) groups, which were separated at the center line. In the second criterion for AMY1-CN, participants were divided into the low (AMY1-CN range 3–7), medium (range 8–13), and high (range 14–24) groups. The minimum copy number above 10 percentile and the maximum copy number below 90 percentile of AMY1-CN distribution was 7 and 14, respectively. Similarly, for analyzing AMY2A-CN, participants were divided into low (AMY2A-CN range 1–2) and high (AMY2A-CN range 3–5) groups at the center line. When divided into two groups, we analyzed the effect of high AMY-CNs on the low AMY-CNs group. When divided into three groups. We compared participants in the top and bottom 10% of AMY1A-CNs with the rest. Thus, we have analyzed the effect to groups with very high or low AMY-CNs relative to the subpopulation with a general number of AMY-CNs as control. We could not analyze the effect of AMY2B-CN because almost all patients had two AMY2B-CNs. For these groups, linear regression analysis for continuous values and logistic regression analysis for binary values using the above-explained covariates was applied to detect statistical significance. A p value less than 0.05 was considered to indicate statistical significance when considering the corrections for multiple tests.