
Combining convolutional neural networks and self-attention for … – Nature.com


Figure 4. The overall architecture of MBSaNet.

MBSaNet is proposed to improve the performance of classification models on the task of automatic recognition of multilabel fundus diseases. The main idea of MBSaNet is the explicit combination of convolutional layers and self-attention (SA) layers, which gives the model both the generalization ability of CNNs and the global feature modeling ability of Transformers18,43. Previous studies have demonstrated that the local prior of the convolutional layer makes it well suited to extracting local features from fundus images; however, we believe that long-range dependencies and a global receptive field are also essential for fundus disease identification, because even an experienced ophthalmologist cannot make an accurate diagnosis from a small part of a fundus image (e.g., the macula alone). Considering that the SA layer, with its global modeling ability, can capture long-range dependencies, MBSaNet adopts a building strategy similar to the CoAtNet18 architecture, vertically stacking convolutional blocks and self-attention modules. The overall framework of MBSaNet is shown in Figure 4, and Table 7 lists the sizes of the input and output feature maps at each stage of the model. The framework comprises two parts. The first is a feature extractor with five stages, Stage 0 to Stage 4, where Stage 0 is our proposed multiscale feature fusion stem (MFFS), Stages 1 to 3 are convolutional stages, and Stage 4 is an SA layer with relative position representations. The second part is a multilabel classifier that predicts the sample categories from the features extracted by this backbone. We use the MBConv block, which includes residual connections and an SE block27, as the basic building block in all convolutional stages, because it shares the same inverted bottleneck design as the feedforward network (FFN) block of Transformers. Unlike the regular MBConv block, MBSaNet replaces the max-pooling layer in the shortcut branch of the downsampling strategy with a convolutional layer of stride 2. MBSaNet is a custom network and is trained from scratch.
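To make the stage layout concrete, the following is a minimal PyTorch sketch of the vertically stacked design described above (an MFFS-style stem, three convolutional stages, a self-attention stage, and a multilabel head). The channel widths, block counts, and the use of a vanilla Transformer encoder in place of the relative-position SA layer are illustrative assumptions, not the exact configuration of MBSaNet.

```python
import torch
import torch.nn as nn

class MBSaNetSkeleton(nn.Module):
    """Illustrative skeleton of the stage layout described above.

    Stage 0 (stem placeholder) and Stages 1-3 are convolutional; Stage 4 is a
    self-attention stage; the head produces 8 multilabel logits. Channel widths
    and block counts are assumptions, not the paper's values.
    """
    def __init__(self, num_classes: int = 8):
        super().__init__()
        self.stage0 = nn.Sequential(                 # stand-in for the MFFS stem
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True))
        self.stage1 = self._conv_stage(64, 96)       # MBConv-style conv stages
        self.stage2 = self._conv_stage(96, 192)
        self.stage3 = self._conv_stage(192, 384)
        self.stage4 = nn.TransformerEncoder(         # stand-in for the relative SA stage
            nn.TransformerEncoderLayer(d_model=384, nhead=8, batch_first=True),
            num_layers=2)
        self.head = nn.Linear(384, num_classes)      # multilabel classifier

    @staticmethod
    def _conv_stage(cin: int, cout: int) -> nn.Sequential:
        return nn.Sequential(
            nn.Conv2d(cin, cout, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.stage3(self.stage2(self.stage1(self.stage0(x))))
        x = x.flatten(2).transpose(1, 2)             # (B, HW, C) tokens for the SA stage
        x = self.stage4(x)
        x = x.mean(dim=1)                            # global average pooling over tokens
        return self.head(x)                          # raw logits; apply sigmoid for probabilities
```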

The dataset was obtained from the International Competition on Ocular Disease Intelligent Recognition sponsored by Peking University. It contains real patient data collected from different hospitals and medical centers in China and was jointly launched by the joint laboratory of the Nankai University School of Computer Science and Beijing Shanggong Medical Information Technology Co., Ltd. The training set is a structured ophthalmology database that includes the ages of 3,500 patients, color fundus images of their left and right eyes, and diagnostic keywords from clinicians. The test set includes an off-site test set and an on-site test set; as in the training set, the number of samples per category is unbalanced. Therefore, we also constructed a balanced test set with 50 images per class by randomly sampling a total of 400 images from the training set. The specific details of the dataset are given in Table 8. The fundus images were recorded by various cameras, including Canon, Zeiss, and Kowa, with variable image resolutions. As illustrated in Figure 5(a), the data categorize patients into eight classes: normal (N), DR (D), glaucoma (G), cataract (C), AMD (A), hypertension (H), myopia (M), and other diseases/abnormalities (O). Two points should be noted. First, a patient may have one or more labels, as shown in Figure 5(b); that is, the task is a multidisease, multilabel image classification task. Second, as shown in Figure 5(c), the class labeled other diseases/abnormalities (O) contains images related to more than 10 different diseases, as well as low-quality images caused by factors such as lens blemishes and invisible optic discs, which greatly expands its intra-class variability. All methods were developed and all experiments were carried out in accordance with the relevant guidelines and regulations associated with this publicly available dataset.

Accuracy is the proportion of correctly classified samples among all samples and is the most basic evaluation metric in classification problems. Precision is the probability that a sample predicted to be positive is truly positive. Recall is the probability that a sample whose true label is positive is predicted to be positive by the model; given the specifics of the task, we use a micro-average of precision and recall over the categories in our experiments. AUC is the area under the ROC curve; the closer its value is to 1, the better the classification performance of the model, and it is often used to measure model stability. The Kappa coefficient is another index calculated from the confusion matrix; it measures the classification accuracy of the model and can also be used for consistency testing. Here, $p_0$ denotes the sum of the diagonal elements divided by the sum of all matrix elements, i.e., the accuracy, and $p_e$ denotes the sum, over all categories, of the products of the actual and predicted counts, divided by the square of the total number of samples. The F1_score, also known as the balanced F-score, is the harmonic (weighted) mean of precision and recall; given the category imbalance in the dataset, we use micro-averaging, which computes the metric globally by counting the total true positives, false negatives, and false positives. The closer the value is to 1, the better the classification performance of the model. The Final_score is the average of the F1_score, Kappa, and AUC.

$$Accuracy = \frac{TP+TN}{TP+FP+TN+FN}$$

(1)

$$Precision = \frac{TP}{TP+FP}$$

(2)

$$Recall = \frac{TP}{TP+FN}$$

(3)

$$F1\_score = \frac{2 \times Precision \times Recall}{Precision + Recall}$$

(4)

$$Kappa = \frac{p_0 - p_e}{1 - p_e}$$

(5)

$$Final\_score = \frac{F1\_score + Kappa + AUC}{3}$$

(6)
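The metrics in Formulas 1-6 can be computed with scikit-learn; the sketch below assumes `y_true` and `y_prob` are (N, 8) arrays of multilabel ground truth and sigmoid outputs, and the 0.5 threshold and label-wise flattening used for accuracy and Kappa are assumptions rather than the competition's exact protocol.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, cohen_kappa_score, roc_auc_score)

def evaluate(y_true: np.ndarray, y_prob: np.ndarray, threshold: float = 0.5) -> dict:
    """Compute the metrics of Formulas 1-6 for multilabel predictions.

    y_true, y_prob: arrays of shape (N, 8); the threshold is an assumption.
    """
    y_pred = (y_prob >= threshold).astype(int)

    acc = accuracy_score(y_true.ravel(), y_pred.ravel())          # Formula 1, label-wise
    precision = precision_score(y_true, y_pred, average="micro")  # Formula 2, micro-averaged
    recall = recall_score(y_true, y_pred, average="micro")        # Formula 3, micro-averaged
    f1 = f1_score(y_true, y_pred, average="micro")                # Formula 4, micro-averaged
    kappa = cohen_kappa_score(y_true.ravel(), y_pred.ravel())     # Formula 5, label-wise
    auc = roc_auc_score(y_true, y_prob, average="micro")          # area under the ROC curve
    final = (f1 + kappa + auc) / 3.0                              # Formula 6
    return {"accuracy": acc, "precision": precision, "recall": recall,
            "f1": f1, "kappa": kappa, "auc": auc, "final_score": final}
```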

The fundus image dataset contains some low-quality images, which are removed because they are not helpful for training. To minimize unnecessary interference with the feature extraction process caused by the extra noise from the black area of the fundus images, the redundant black area is cropped. We use the OpenCV library to load each image as a pixel array and use the edge coordinates of the retinal region to remove the black borders. After cropping, the fundus images are resized to 224×224, as shown in Figure 6. Data augmentation artificially generates different versions of a real dataset to increase its size; the augmented images are shown in Figure 7. Because the dataset must be expanded while retaining the main features of the original images, we use operations such as random rotation by 90°, contrast adjustment, and center cropping. Finally, global histogram equalization is applied to the original and augmented images so that the contrast is higher and the gray-value distribution is more uniform.
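A minimal OpenCV sketch of the cropping, resizing, and equalization steps is given below. The intensity threshold used to locate the retinal region, the function names, and equalizing only the luminance channel are assumptions for illustration, not the authors' exact pipeline.

```python
import cv2
import numpy as np

def crop_black_border(img: np.ndarray, tol: int = 10) -> np.ndarray:
    """Crop the redundant black area around the retinal region.

    tol is an assumed intensity threshold separating background from retina.
    """
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    mask = gray > tol                                # True inside the retinal region
    rows, cols = np.any(mask, axis=1), np.any(mask, axis=0)
    if not rows.any():                               # fully dark image: leave unchanged
        return img
    y0, y1 = np.where(rows)[0][[0, -1]]
    x0, x1 = np.where(cols)[0][[0, -1]]
    return img[y0:y1 + 1, x0:x1 + 1]

def preprocess(path: str) -> np.ndarray:
    """Load a fundus image, crop the black border, resize to 224x224, and equalize."""
    img = cv2.imread(path)                           # pixel array in BGR order
    img = crop_black_border(img)
    img = cv2.resize(img, (224, 224))
    # Global histogram equalization applied to the luminance channel (an assumption).
    ycrcb = cv2.cvtColor(img, cv2.COLOR_BGR2YCrCb)
    ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])
    return cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)
```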

Processing of the original training images.

The predictive ability of a classifier is closely related to its ability to extract high-quality features. In fundus multidisease identification, because several common eye diseases produce lesions with different characteristics in fundus images, the lesion areas vary in size and distribution. We therefore propose a feature fusion module that uses convolution kernels of different sizes to extract multiscale primary features in the input stage of the network and fuse them along the channel dimension. Feature extractors with convolution kernel sizes of 3×3, 5×5, 7×7, and 9×9 are used; since the convolution stride is set to 2, we pad the input image before each convolution so that all output feature maps have the same size. By employing convolution kernels with different receptive fields to widen the stem structure in the horizontal direction, features with a more local or more global bias are extracted from the original images. Batch normalization and ReLU activation are then applied to each branch, and the resulting feature maps are concatenated. The experimental results show that widening the stem structure in the horizontal direction yields higher-quality low-level image features at the primary stage.
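The sketch below illustrates this multiscale stem in PyTorch: parallel stride-2 convolutions with 3×3, 5×5, 7×7, and 9×9 kernels, each followed by batch normalization and ReLU, concatenated along the channel dimension. The per-branch channel count is an assumption.

```python
import torch
import torch.nn as nn

class MultiScaleStem(nn.Module):
    """Parallel stride-2 convolutions with different kernel sizes, fused by
    channel concatenation. The per-branch width is an assumption."""
    def __init__(self, in_ch: int = 3, branch_ch: int = 16):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                # Padding of k // 2 keeps all branch outputs the same spatial size.
                nn.Conv2d(in_ch, branch_ch, kernel_size=k, stride=2, padding=k // 2),
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True))
            for k in (3, 5, 7, 9)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Concatenate the multiscale feature maps along the channel dimension.
        return torch.cat([branch(x) for branch in self.branches], dim=1)

# Example: a 224x224 RGB image yields a (1, 64, 112, 112) fused feature map.
# out = MultiScaleStem()(torch.randn(1, 3, 224, 224))
```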

CNNs have been the dominant structure for many CV tasks. Traditionally, regular convolutional blocks such as ResNet blocks5 are well known in large-scale convolutional networks, while depthwise convolutions44, which can be expressed as Formula 7, are popular on mobile platforms because of their lower computational cost and smaller parameter count. Recent studies have shown that an improved inverted residual bottleneck block (MBConv)32,45, built on depthwise separable convolutions, can achieve both high accuracy and efficiency7. Inspired by the CoAtNet18 framework, we consider the connection between the MBConv block and the FFN module in the Transformer (both adopt the inverted bottleneck design: the feature map is first expanded to 4× the input channel size and, after the depthwise separable convolution, the 4×-wide feature map is projected back to the original channel size to allow the residual connection), and we mainly adopt the improved MBConv block, including the residual connection and the SE27 block, as the convolutional building block. A convolution with a 2×2 kernel and a stride of 2 on the shortcut branch makes its output feature map size match that of the residual branch. The experimental results show that this slightly improves performance. The convolutional building blocks we use are shown in Figure 8, and the downsampling implementation can be expressed as Formula 8.

$$y_i = \sum_{j \in \mathcal{L}(i)} w_{i-j} \odot x_j \qquad (\text{depthwise convolution})$$

(7)

where $x_i, y_i \in \mathbb{R}^{D}$ denote the input and output at position $i$, respectively, and $\mathcal{L}(i)$ denotes a local neighborhood of $i$, e.g., a 3×3 grid centered at $i$ in image processing.

$$x \longleftarrow \mathrm{Norm}(\mathrm{Conv}(x, \mathrm{stride}=2)) + \mathrm{Conv}(\mathrm{DepthConv}(\mathrm{Conv}(\mathrm{Norm}(x), \mathrm{stride}=2)))$$

(8)
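A hedged PyTorch sketch of the downsampling block in Formula 8 follows: the shortcut branch uses a 2×2, stride-2 convolution instead of max pooling, and the residual branch follows the inverted bottleneck (expand 4×, depthwise convolution, project back) with an SE block. The SE reduction ratio, activation placement, and even input sizes are assumptions.

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Channel attention (SE) block; a reduction ratio of 4 is an assumption."""
    def __init__(self, ch: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.fc(x)                        # rescale channels by learned weights

class MBConvDown(nn.Module):
    """Downsampling MBConv following Formula 8: stride-2 conv shortcut plus an
    inverted-bottleneck residual branch with an SE block (even spatial sizes assumed)."""
    def __init__(self, cin: int, cout: int, expansion: int = 4):
        super().__init__()
        hidden = cin * expansion
        self.shortcut = nn.Sequential(                       # Norm(Conv(x, stride=2))
            nn.Conv2d(cin, cout, kernel_size=2, stride=2),
            nn.BatchNorm2d(cout))
        self.residual = nn.Sequential(                       # Conv(DepthConv(Conv(Norm(x), stride=2)))
            nn.BatchNorm2d(cin),
            nn.Conv2d(cin, hidden, kernel_size=1, stride=2), # expand 4x and downsample
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden),  # depthwise conv
            nn.ReLU(inplace=True),
            SqueezeExcite(hidden),
            nn.Conv2d(hidden, cout, kernel_size=1))          # project back to cout channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.shortcut(x) + self.residual(x)
```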

In natural language processing and speech understanding, the Transformer design, whose crucial component is the SA module, has been widely used. SA extends the receptive field to all spatial positions and computes weights based on the re-normalized pairwise similarity between the pair $(x_i, x_j)$, as shown in Formula 9, where $\mathcal{G}$ indicates the global spatial space. Early work on stand-alone SA networks33 showed that diverse CV tasks may be performed satisfactorily using SA modules alone, albeit with some practical limitations. After pretraining on the large-scale JFT dataset, ViT11 applied the vanilla Transformer to ImageNet classification and produced outstanding results. However, with insufficient training data, ViT still trails well behind SOTA CNNs. This is mainly because typical Transformer architectures lack the translation equivariance18 of CNNs, a property that improves generalization on small datasets46. Therefore, we adopt a method similar to CoAtNet: a global static convolution kernel is summed with the adaptive attention matrix before softmax normalization, which can be expressed as Formula 10, where $(i, j)$ denotes any position pair and $w_{i-j}$ denotes the corresponding convolution weight. This improves the generalization ability of the Transformer-based architecture by introducing the inductive bias of CNNs.

$$y_i = \sum_{j \in \mathcal{G}} \underbrace{\frac{\exp\left(x_i^{\top} x_j\right)}{\sum_{k \in \mathcal{G}} \exp\left(x_i^{\top} x_k\right)}}_{A_{i,j}} x_j$$

(9)

$$y_i^{\text{pre}} = \sum_{j \in \mathcal{G}} \frac{\exp\left(x_i^{\top} x_j + w_{i-j}\right)}{\sum_{k \in \mathcal{G}} \exp\left(x_i^{\top} x_k + w_{i-k}\right)} x_j$$

(10)
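The sketch below illustrates the pre-softmax relative attention of Formula 10: a learned relative-position bias $w_{i-j}$ is added to the pairwise similarities before softmax. It is a single-head simplification over a flattened 1-D token sequence with no learned Q/K/V projections; the paper's 2-D relative position scheme and the 1/sqrt(d) scaling (not part of Formula 10) are noted as assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativeSelfAttention(nn.Module):
    """Single-head attention with a pre-softmax relative bias w_{i-j} (Formula 10).

    Uses 1-D relative offsets over the flattened token sequence; the 2-D case
    used in practice is reduced to 1-D here for brevity.
    """
    def __init__(self, dim: int, max_len: int):
        super().__init__()
        self.scale = dim ** -0.5                     # stabilizing scale, not in Formula 10
        # One learnable bias per possible offset i - j in [-(max_len-1), max_len-1].
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_len - 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, L, dim) token sequence with L <= max_len.
        b, L, _ = x.shape
        sim = torch.einsum("bid,bjd->bij", x, x) * self.scale   # pairwise x_i^T x_j
        offsets = torch.arange(L, device=x.device)
        idx = offsets[:, None] - offsets[None, :] + (L - 1)     # map i - j to [0, 2L-2]
        sim = sim + self.rel_bias[idx]                          # add w_{i-j} before softmax
        attn = F.softmax(sim, dim=-1)                           # attention weights of Formula 10
        return torch.einsum("bij,bjd->bid", attn, x)            # weighted sum over positions j
```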

The receptive field size is one of the most critical differences between SA and convolutional modules. In general, a larger receptive field provides more contextual information, which usually leads to higher model capacity, and the global receptive field has been a key motivation for employing SA mechanisms in vision. However, a larger receptive field also requires more computation: for global attention, the complexity is quadratic in the spatial size. Therefore, when designing the feature extraction backbone, considering the large computational overhead of the Transformer structure and the small amount of training data available for this practical task, we use mostly convolutional blocks and place only two SA layers, in Stage 4 of the feature extraction stage. Experimental results show that this achieves a good balance between generalization performance and feature modeling ability.

Convolutional building blocks.

The fundus disease recognition task is a multilabel classification problem, so it is unsuitable for training with traditional loss functions. Following the loss functions used in previous work16,40, all classified images can be represented as $X = \{x_1, x_2, \ldots, x_i, \ldots, x_N\}$, where each $x_i$ is associated with a ground-truth label $y_i$, $i = 1, \ldots, N$, and $N$ is the number of samples. We wish to find a classification function $F: X \longrightarrow Y$ that minimizes the loss function $L$. We use $N$ sets of labeled training data $(x_i, y_i)$ and one-hot encode each $y_i$ as $y_i = [y_i^1, y_i^2, \ldots, y_i^8]$, where the 8 values correspond to the 8 categories in the dataset. Drawing on the traditional problem-transformation approach to multilabel classification, we transform the multilabel classification problem into a binary classification problem for each label; the final loss is the average of the per-label losses over the samples. After studying weighted loss functions such as sample balancing and class balancing, we decided to use the weighted binary cross-entropy of Formula 11 as the loss function, where $W = (1, 1.2, 1.5, 1.5, 1.5, 1.5, 1.5, 1.2)$ denotes the loss weights, the positive class is 1, the negative class is 0, and $p(y_i)$ is the probability that sample $i$ is predicted to be positive.

$$L = -\frac{1}{N} \sum_{i=1}^{N} W \left( y_i \log\left(p(y_i)\right) + (1 - y_i) \log\left(1 - p(y_i)\right) \right)$$

(11)
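A hedged PyTorch sketch of this weighted binary cross-entropy follows, using the per-class weight vector W given above. It assumes the model outputs per-class probabilities (sigmoid outputs), and averaging over both samples and labels is an assumed reduction, since Formula 11 only makes the average over samples explicit.

```python
import torch

# Per-class loss weights from the text: W = (1, 1.2, 1.5, 1.5, 1.5, 1.5, 1.5, 1.2).
W = torch.tensor([1.0, 1.2, 1.5, 1.5, 1.5, 1.5, 1.5, 1.2])

def weighted_bce(p: torch.Tensor, y: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Weighted binary cross-entropy in the spirit of Formula 11.

    p: predicted probabilities of shape (N, 8) (sigmoid outputs),
    y: one-hot multilabel targets of shape (N, 8).
    Each label's loss is scaled by W, then averaged over samples and labels.
    """
    p = p.clamp(eps, 1 - eps)                        # avoid log(0)
    per_label = -(y * torch.log(p) + (1 - y) * torch.log(1 - p))
    return (W.to(p.device) * per_label).mean()
```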

After defining the loss function, we need to choose an appropriate optimization algorithm to learn the parameters. Different optimizers affect parameter training differently, so we mainly considered the effects of SGD and Adam on model performance. We performed multiple comparison experiments under the same conditions. The results showed that Adam significantly outperformed SGD in convergence and shortened the training time, possibly because, with SGD as the optimizer, the gradients are updated from individual samples at every iteration, which introduces additional noise; each update does not necessarily move in the direction of the global optimum, so the model may only converge to a local optimum, decreasing accuracy.
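For completeness, a small example of setting up the two optimizers compared above is given; the learning rates and momentum value are assumptions, not the settings used in the experiments.

```python
import torch

def make_optimizer(model: torch.nn.Module, use_adam: bool = True) -> torch.optim.Optimizer:
    """Return Adam (the optimizer adopted here) or SGD for comparison runs.

    Hyperparameters below are illustrative assumptions.
    """
    if use_adam:
        return torch.optim.Adam(model.parameters(), lr=1e-4)
    return torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
```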
