
Combining convolutional neural networks and self-attention for … – Nature.com


Figure 4. The overall architecture of MBSaNet.

MBSaNet is proposed to improve the performance of classification models on the task of automatic recognition of multilabel fundus diseases. The main idea of MBSaNet is the explicit combination of convolutional layers and self-attention (SA) layers, which gives the model both the generalization ability of CNNs and the global feature modeling ability of Transformers18,43. Previous studies have demonstrated that the local prior of the convolutional layer makes it well suited to extracting local features from fundus images; however, we believe that long-range dependencies and a global receptive field are also essential for fundus disease identification, because even an experienced ophthalmologist cannot make an accurate diagnosis from a small part of a fundus image (e.g., the macula alone). Considering that the SA layer, with its global modeling ability, can capture long-range dependencies, MBSaNet adopts a building strategy similar to the CoAtNet18 architecture, vertically stacking convolutional blocks and self-attention modules. The overall framework of MBSaNet is shown in Figure 4, and Table 7 lists the sizes of the input and output feature maps at each stage of the model. The framework comprises two parts. The first is a feature extractor with five stages, Stage 0 to Stage 4, where Stage 0 is our proposed multiscale feature fusion stem (MFFS), Stages 1 to 3 are convolutional stages, and Stage 4 is an SA layer with relative position representations. The second part is a multilabel classifier that predicts the sample categories from the features extracted by this backbone. We use the MBConv block, which includes residual connections and an SE block27, as the basic building block in all convolutional stages, because it shares the same inverted bottleneck design as the feedforward network (FFN) block of Transformers. Unlike the regular MBConv block, MBSaNet replaces the max-pooling layer in the shortcut branch of the downsampling strategy with a convolutional layer of stride 2. MBSaNet is a custom network and is trained from scratch.
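To make the stage layout concrete, the following is a minimal PyTorch sketch of the vertically stacked design described above (an MFFS-style stem, three convolutional stages, a self-attention stage, and a multilabel head). The channel widths, block counts, and the use of a vanilla Transformer encoder in place of the relative-position SA layer are illustrative assumptions, not the exact configuration of MBSaNet.

```python
import torch
import torch.nn as nn

class MBSaNetSkeleton(nn.Module):
    """Illustrative skeleton of the stage layout described above.

    Stage 0 (stem placeholder) and Stages 1-3 are convolutional; Stage 4 is a
    self-attention stage; the head produces 8 multilabel logits. Channel widths
    and block counts are assumptions, not the paper's values.
    """
    def __init__(self, num_classes: int = 8):
        super().__init__()
        self.stage0 = nn.Sequential(                 # stand-in for the MFFS stem
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True))
        self.stage1 = self._conv_stage(64, 96)       # MBConv-style conv stages
        self.stage2 = self._conv_stage(96, 192)
        self.stage3 = self._conv_stage(192, 384)
        self.stage4 = nn.TransformerEncoder(         # stand-in for the relative SA stage
            nn.TransformerEncoderLayer(d_model=384, nhead=8, batch_first=True),
            num_layers=2)
        self.head = nn.Linear(384, num_classes)      # multilabel classifier

    @staticmethod
    def _conv_stage(cin: int, cout: int) -> nn.Sequential:
        return nn.Sequential(
            nn.Conv2d(cin, cout, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.stage3(self.stage2(self.stage1(self.stage0(x))))
        x = x.flatten(2).transpose(1, 2)             # (B, HW, C) tokens for the SA stage
        x = self.stage4(x)
        x = x.mean(dim=1)                            # global average pooling over tokens
        return self.head(x)                          # raw logits; apply sigmoid for probabilities
```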

The dataset was obtained from the International Competition on Ocular Disease Intelligent Recognition sponsored by Peking University. It contains real patient data collected from different hospitals and medical centers in China and was jointly launched by the joint laboratory of the Nankai University School of Computer Science and Beijing Shanggong Medical Information Technology Co., Ltd. The training set is a structured ophthalmology database that includes the ages of 3,500 patients, color fundus images of their left and right eyes, and diagnostic keywords from clinicians. The test set includes an off-site test set and an on-site test set; as in the training set, the number of samples per category is unbalanced. Therefore, we also constructed a balanced test set with 50 images per class by randomly sampling a total of 400 images from the training set. The specific details of the dataset are given in Table 8. The fundus images were recorded by various cameras, including Canon, Zeiss, and Kowa, with variable image resolutions. As illustrated in Figure 5(a), the data categorize patients into eight classes: normal (N), DR (D), glaucoma (G), cataract (C), AMD (A), hypertension (H), myopia (M), and other diseases/abnormalities (O). Two points should be noted. First, a patient may have one or more labels, as shown in Figure 5(b); that is, the task is a multidisease, multilabel image classification task. Second, as shown in Figure 5(c), the class labeled other diseases/abnormalities (O) contains images related to more than 10 different diseases, as well as low-quality images caused by factors such as lens blemishes and invisible optic discs, which greatly expands its intra-class variability. All methods were developed and all experiments were carried out in accordance with the relevant guidelines and regulations associated with this publicly available dataset.

Accuracy is the proportion of correctly classified samples among all samples and is the most basic evaluation metric in classification problems. Precision is the probability that a sample predicted to be positive is truly positive. Recall is the probability that a sample whose true label is positive is predicted to be positive by the model; given the specifics of the task, we use a micro-average of precision and recall over the categories in our experiments. AUC is the area under the ROC curve; the closer its value is to 1, the better the classification performance of the model, and it is often used to measure model stability. The Kappa coefficient is another index calculated from the confusion matrix; it measures the classification accuracy of the model and can also be used for consistency testing. Here, $p_0$ denotes the sum of the diagonal elements divided by the sum of all matrix elements, i.e., the accuracy, and $p_e$ denotes the sum, over all categories, of the products of the actual and predicted counts, divided by the square of the total number of samples. The F1_score, also known as the balanced F-score, is the harmonic (weighted) mean of precision and recall; given the category imbalance in the dataset, we use micro-averaging, which computes the metric globally by counting the total true positives, false negatives, and false positives. The closer the value is to 1, the better the classification performance of the model. The Final_score is the average of the F1_score, Kappa, and AUC.

$$Accuracy = \frac{TP+TN}{TP+FP+TN+FN}$$

(1)

$$Precision = \frac{TP}{TP+FP}$$

(2)

$$Recall = \frac{TP}{TP+FN}$$

(3)

$$F1\_score = \frac{2 \times Precision \times Recall}{Precision + Recall}$$

(4)

$$Kappa = \frac{p_0 - p_e}{1 - p_e}$$

(5)

$$Final\_score = \frac{F1\_score + Kappa + AUC}{3}$$

(6)
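The metrics in Formulas 1-6 can be computed with scikit-learn; the sketch below assumes `y_true` and `y_prob` are (N, 8) arrays of multilabel ground truth and sigmoid outputs, and the 0.5 threshold and label-wise flattening used for accuracy and Kappa are assumptions rather than the competition's exact protocol.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, cohen_kappa_score, roc_auc_score)

def evaluate(y_true: np.ndarray, y_prob: np.ndarray, threshold: float = 0.5) -> dict:
    """Compute the metrics of Formulas 1-6 for multilabel predictions.

    y_true, y_prob: arrays of shape (N, 8); the threshold is an assumption.
    """
    y_pred = (y_prob >= threshold).astype(int)

    acc = accuracy_score(y_true.ravel(), y_pred.ravel())          # Formula 1, label-wise
    precision = precision_score(y_true, y_pred, average="micro")  # Formula 2, micro-averaged
    recall = recall_score(y_true, y_pred, average="micro")        # Formula 3, micro-averaged
    f1 = f1_score(y_true, y_pred, average="micro")                # Formula 4, micro-averaged
    kappa = cohen_kappa_score(y_true.ravel(), y_pred.ravel())     # Formula 5, label-wise
    auc = roc_auc_score(y_true, y_prob, average="micro")          # area under the ROC curve
    final = (f1 + kappa + auc) / 3.0                              # Formula 6
    return {"accuracy": acc, "precision": precision, "recall": recall,
            "f1": f1, "kappa": kappa, "auc": auc, "final_score": final}
```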

The fundus image dataset contains some low-quality images, which are removed because they are not helpful for training. To minimize unnecessary interference with the feature extraction process caused by the extra noise from the black area of the fundus images, the redundant black area is cropped. We use the OpenCV library to load each image as a pixel array and use the edge coordinates of the retinal region to remove the black borders. After cropping, the fundus images are resized to 224×224, as shown in Figure 6. Data augmentation artificially generates different versions of a real dataset to increase its size; the augmented images are shown in Figure 7. Because the dataset must be expanded while retaining the main features of the original images, we use operations such as random rotation by 90°, contrast adjustment, and center cropping. Finally, global histogram equalization is applied to the original and augmented images so that the contrast is higher and the gray-value distribution is more uniform.
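A minimal OpenCV sketch of the cropping, resizing, and equalization steps is given below. The intensity threshold used to locate the retinal region, the function names, and equalizing only the luminance channel are assumptions for illustration, not the authors' exact pipeline.

```python
import cv2
import numpy as np

def crop_black_border(img: np.ndarray, tol: int = 10) -> np.ndarray:
    """Crop the redundant black area around the retinal region.

    tol is an assumed intensity threshold separating background from retina.
    """
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    mask = gray > tol                                # True inside the retinal region
    rows, cols = np.any(mask, axis=1), np.any(mask, axis=0)
    if not rows.any():                               # fully dark image: leave unchanged
        return img
    y0, y1 = np.where(rows)[0][[0, -1]]
    x0, x1 = np.where(cols)[0][[0, -1]]
    return img[y0:y1 + 1, x0:x1 + 1]

def preprocess(path: str) -> np.ndarray:
    """Load a fundus image, crop the black border, resize to 224x224, and equalize."""
    img = cv2.imread(path)                           # pixel array in BGR order
    img = crop_black_border(img)
    img = cv2.resize(img, (224, 224))
    # Global histogram equalization applied to the luminance channel (an assumption).
    ycrcb = cv2.cvtColor(img, cv2.COLOR_BGR2YCrCb)
    ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])
    return cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)
```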

Processing of the original training images.

The predictive ability of a classifier is closely related to its ability to extract high-quality features. In fundus multidisease identification, because several common eye diseases produce lesions with different characteristics in fundus images, the lesion areas vary in size and distribution. We therefore propose a feature fusion module that uses convolution kernels of different sizes to extract multiscale primary features in the input stage of the network and fuse them along the channel dimension. Feature extractors with convolution kernel sizes of 3×3, 5×5, 7×7, and 9×9 are used; since the convolution stride is set to 2, we pad the input image before each convolution so that all output feature maps have the same size. By employing convolution kernels with different receptive fields to widen the stem structure in the horizontal direction, features with a more local or more global bias are extracted from the original images. Batch normalization and ReLU activation are then applied to each branch, and the resulting feature maps are concatenated. The experimental results show that widening the stem structure in the horizontal direction yields higher-quality low-level image features at the primary stage.
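The sketch below illustrates this multiscale stem in PyTorch: parallel stride-2 convolutions with 3×3, 5×5, 7×7, and 9×9 kernels, each followed by batch normalization and ReLU, concatenated along the channel dimension. The per-branch channel count is an assumption.

```python
import torch
import torch.nn as nn

class MultiScaleStem(nn.Module):
    """Parallel stride-2 convolutions with different kernel sizes, fused by
    channel concatenation. The per-branch width is an assumption."""
    def __init__(self, in_ch: int = 3, branch_ch: int = 16):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                # Padding of k // 2 keeps all branch outputs the same spatial size.
                nn.Conv2d(in_ch, branch_ch, kernel_size=k, stride=2, padding=k // 2),
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True))
            for k in (3, 5, 7, 9)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Concatenate the multiscale feature maps along the channel dimension.
        return torch.cat([branch(x) for branch in self.branches], dim=1)

# Example: a 224x224 RGB image yields a (1, 64, 112, 112) fused feature map.
# out = MultiScaleStem()(torch.randn(1, 3, 224, 224))
```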

CNNs have been the dominant structure for many CV tasks. Traditionally, regular convolutional blocks such as ResNet blocks5 are well known in large-scale convolutional networks, while depthwise convolutions44, which can be expressed as Formula 7, are popular on mobile platforms because of their lower computational cost and smaller parameter count. Recent studies have shown that an improved inverted residual bottleneck block (MBConv)32,45, built on depthwise separable convolutions, can achieve both high accuracy and efficiency7. Inspired by the CoAtNet18 framework, we consider the connection between the MBConv block and the FFN module in the Transformer (both adopt the inverted bottleneck design: the feature map is first expanded to 4× the input channel size and, after the depthwise separable convolution, the 4×-wide feature map is projected back to the original channel size to allow the residual connection), and we mainly adopt the improved MBConv block, including the residual connection and the SE27 block, as the convolutional building block. A convolution with a 2×2 kernel and a stride of 2 on the shortcut branch makes its output feature map size match that of the residual branch. The experimental results show that this slightly improves performance. The convolutional building blocks we use are shown in Figure 8, and the downsampling implementation can be expressed as Formula 8.

$$y_i = \sum_{j \in \mathcal{L}(i)} w_{i-j} \odot x_j \qquad (\text{depthwise convolution})$$

(7)

where $x_i, y_i \in \mathbb{R}^{D}$ denote the input and output at position $i$, respectively, and $\mathcal{L}(i)$ denotes a local neighborhood of $i$, e.g., a 3×3 grid centered at $i$ in image processing.

$$x \longleftarrow \mathrm{Norm}(\mathrm{Conv}(x, \mathrm{stride}=2)) + \mathrm{Conv}(\mathrm{DepthConv}(\mathrm{Conv}(\mathrm{Norm}(x), \mathrm{stride}=2)))$$

(8)
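A hedged PyTorch sketch of the downsampling block in Formula 8 follows: the shortcut branch uses a 2×2, stride-2 convolution instead of max pooling, and the residual branch follows the inverted bottleneck (expand 4×, depthwise convolution, project back) with an SE block. The SE reduction ratio, activation placement, and even input sizes are assumptions.

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Channel attention (SE) block; a reduction ratio of 4 is an assumption."""
    def __init__(self, ch: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.fc(x)                        # rescale channels by learned weights

class MBConvDown(nn.Module):
    """Downsampling MBConv following Formula 8: stride-2 conv shortcut plus an
    inverted-bottleneck residual branch with an SE block (even spatial sizes assumed)."""
    def __init__(self, cin: int, cout: int, expansion: int = 4):
        super().__init__()
        hidden = cin * expansion
        self.shortcut = nn.Sequential(                       # Norm(Conv(x, stride=2))
            nn.Conv2d(cin, cout, kernel_size=2, stride=2),
            nn.BatchNorm2d(cout))
        self.residual = nn.Sequential(                       # Conv(DepthConv(Conv(Norm(x), stride=2)))
            nn.BatchNorm2d(cin),
            nn.Conv2d(cin, hidden, kernel_size=1, stride=2), # expand 4x and downsample
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden),  # depthwise conv
            nn.ReLU(inplace=True),
            SqueezeExcite(hidden),
            nn.Conv2d(hidden, cout, kernel_size=1))          # project back to cout channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.shortcut(x) + self.residual(x)
```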

In natural language processing and speech understanding, the Transformer design, whose crucial component is the SA module, has been widely used. SA extends the receptive field to all spatial positions and computes weights based on the re-normalized pairwise similarity between the pair $(x_i, x_j)$, as shown in Formula 9, where $\mathcal{G}$ indicates the global spatial space. Early work on stand-alone SA networks33 showed that diverse CV tasks may be performed satisfactorily using SA modules alone, albeit with some practical limitations. After pretraining on the large-scale JFT dataset, ViT11 applied the vanilla Transformer to ImageNet classification and produced outstanding results. However, with insufficient training data, ViT still trails well behind SOTA CNNs. This is mainly because typical Transformer architectures lack the translation equivariance18 of CNNs, a property that improves generalization on small datasets46. Therefore, we adopt a method similar to CoAtNet: a global static convolution kernel is summed with the adaptive attention matrix before softmax normalization, which can be expressed as Formula 10, where $(i, j)$ denotes any position pair and $w_{i-j}$ denotes the corresponding convolution weight. This improves the generalization ability of the Transformer-based architecture by introducing the inductive bias of CNNs.

$$y_i = \sum_{j \in \mathcal{G}} \underbrace{\frac{\exp\left(x_i^{\top} x_j\right)}{\sum_{k \in \mathcal{G}} \exp\left(x_i^{\top} x_k\right)}}_{A_{i,j}} x_j$$

(9)

$$y_i^{\text{pre}} = \sum_{j \in \mathcal{G}} \frac{\exp\left(x_i^{\top} x_j + w_{i-j}\right)}{\sum_{k \in \mathcal{G}} \exp\left(x_i^{\top} x_k + w_{i-k}\right)} x_j$$

(10)
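The sketch below illustrates the pre-softmax relative attention of Formula 10: a learned relative-position bias $w_{i-j}$ is added to the pairwise similarities before softmax. It is a single-head simplification over a flattened 1-D token sequence with no learned Q/K/V projections; the paper's 2-D relative position scheme and the 1/sqrt(d) scaling (not part of Formula 10) are noted as assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativeSelfAttention(nn.Module):
    """Single-head attention with a pre-softmax relative bias w_{i-j} (Formula 10).

    Uses 1-D relative offsets over the flattened token sequence; the 2-D case
    used in practice is reduced to 1-D here for brevity.
    """
    def __init__(self, dim: int, max_len: int):
        super().__init__()
        self.scale = dim ** -0.5                     # stabilizing scale, not in Formula 10
        # One learnable bias per possible offset i - j in [-(max_len-1), max_len-1].
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_len - 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, L, dim) token sequence with L <= max_len.
        b, L, _ = x.shape
        sim = torch.einsum("bid,bjd->bij", x, x) * self.scale   # pairwise x_i^T x_j
        offsets = torch.arange(L, device=x.device)
        idx = offsets[:, None] - offsets[None, :] + (L - 1)     # map i - j to [0, 2L-2]
        sim = sim + self.rel_bias[idx]                          # add w_{i-j} before softmax
        attn = F.softmax(sim, dim=-1)                           # attention weights of Formula 10
        return torch.einsum("bij,bjd->bid", attn, x)            # weighted sum over positions j
```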

The receptive field size is one of the most critical differences between SA and convolutional modules. In general, a larger receptive field provides more contextual information, which usually leads to higher model capacity, and the global receptive field has been a key motivation for employing SA mechanisms in vision. However, a larger receptive field also requires more computation: for global attention, the complexity is quadratic in the spatial size. Therefore, when designing the feature extraction backbone, considering the large computational overhead of the Transformer structure and the small amount of training data available for this practical task, we use mostly convolutional blocks and place only two SA layers, in Stage 4 of the feature extraction stage. Experimental results show that this achieves a good balance between generalization performance and feature modeling ability.

Convolutional building blocks.

The fundus disease recognition task is a multilabel classification problem, so it is unsuitable for training with traditional loss functions. Following the loss functions used in previous work16,40, all classified images can be represented as $X = \{x_1, x_2, \ldots, x_i, \ldots, x_N\}$, where each $x_i$ is associated with a ground-truth label $y_i$, $i = 1, \ldots, N$, and $N$ is the number of samples. We wish to find a classification function $F: X \longrightarrow Y$ that minimizes the loss function $L$. We use $N$ sets of labeled training data $(x_i, y_i)$ and one-hot encode each $y_i$ as $y_i = [y_i^1, y_i^2, \ldots, y_i^8]$, where the 8 values correspond to the 8 categories in the dataset. Drawing on the traditional problem-transformation approach to multilabel classification, we transform the multilabel classification problem into a binary classification problem for each label; the final loss is the average of the per-label losses over the samples. After studying weighted loss functions such as sample balancing and class balancing, we decided to use the weighted binary cross-entropy of Formula 11 as the loss function, where $W = (1, 1.2, 1.5, 1.5, 1.5, 1.5, 1.5, 1.2)$ denotes the loss weights, the positive class is 1, the negative class is 0, and $p(y_i)$ is the probability that sample $i$ is predicted to be positive.

$$L = -\frac{1}{N} \sum_{i=1}^{N} W \left( y_i \log\left(p(y_i)\right) + (1 - y_i) \log\left(1 - p(y_i)\right) \right)$$

(11)
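A hedged PyTorch sketch of this weighted binary cross-entropy follows, using the per-class weight vector W given above. It assumes the model outputs per-class probabilities (sigmoid outputs), and averaging over both samples and labels is an assumed reduction, since Formula 11 only makes the average over samples explicit.

```python
import torch

# Per-class loss weights from the text: W = (1, 1.2, 1.5, 1.5, 1.5, 1.5, 1.5, 1.2).
W = torch.tensor([1.0, 1.2, 1.5, 1.5, 1.5, 1.5, 1.5, 1.2])

def weighted_bce(p: torch.Tensor, y: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Weighted binary cross-entropy in the spirit of Formula 11.

    p: predicted probabilities of shape (N, 8) (sigmoid outputs),
    y: one-hot multilabel targets of shape (N, 8).
    Each label's loss is scaled by W, then averaged over samples and labels.
    """
    p = p.clamp(eps, 1 - eps)                        # avoid log(0)
    per_label = -(y * torch.log(p) + (1 - y) * torch.log(1 - p))
    return (W.to(p.device) * per_label).mean()
```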

After defining the loss function, we need to choose an appropriate optimization algorithm to learn the parameters. Different optimizers affect parameter training differently, so we mainly considered the effects of SGD and Adam on model performance. We performed multiple comparison experiments under the same conditions. The results showed that Adam significantly outperformed SGD in convergence and shortened the training time, possibly because, with SGD as the optimizer, the gradients are updated from individual samples at every iteration, which introduces additional noise; each update does not necessarily move in the direction of the global optimum, so the model may only converge to a local optimum, decreasing accuracy.
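For completeness, a small example of setting up the two optimizers compared above is given; the learning rates and momentum value are assumptions, not the settings used in the experiments.

```python
import torch

def make_optimizer(model: torch.nn.Module, use_adam: bool = True) -> torch.optim.Optimizer:
    """Return Adam (the optimizer adopted here) or SGD for comparison runs.

    Hyperparameters below are illustrative assumptions.
    """
    if use_adam:
        return torch.optim.Adam(model.parameters(), lr=1e-4)
    return torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
```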
