The Future of Cardiac Imaging: Leveraging Synthetic Image Data for Improved Cardiac Function Quantification
In this post, I share my approach to a Kaggle competition in which I used synthetic data generation to achieve state-of-the-art results. I will describe how I applied the method and the benefits it brought to my overall performance in the competition.
Introduction
Our heart is an integral part of who we are, providing us with the ability to imagine, create, and explore life’s moments. Despite its importance, however, we often overlook the significance of heart health. In the United States alone, approximately 1,500 individuals are diagnosed with heart failure each day.
When it comes to determining cardiac function and detecting heart disease, measuring end-systolic (ES) and end-diastolic (ED) volumes is critical.
A Brief Overview of Medical Concepts
These volumes represent the size of one chamber of the heart at its fullest (end-diastole) and at its most contracted (end-systole) point in each heartbeat, and they are used to calculate the Ejection Fraction (EF).
The EF is the percentage of blood that’s ejected from the left ventricle with every heartbeat, and it is a key predictor of heart disease. Although there are various methods for measuring volumes or EF, Magnetic Resonance Imaging (MRI) is widely regarded as the most accurate way to evaluate the heart’s pumping function.
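To make the relationship between the two volumes and the EF concrete, here is the standard formula as a one-function sketch (the variable names are my own):

```python
def ejection_fraction(edv_ml: float, esv_ml: float) -> float:
    """Ejection fraction (%): the share of end-diastolic blood ejected per beat."""
    stroke_volume = edv_ml - esv_ml        # blood ejected in one beat (ml)
    return 100.0 * stroke_volume / edv_ml

# Example: EDV = 150 ml, ESV = 60 ml -> EF = 60%, inside the healthy range
print(ejection_fraction(150, 60))  # 60.0
```

This is why accurate EDV and ESV measurements matter: any error in either volume propagates directly into the EF estimate.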
Motivation behind this project
Despite the superiority of MRI in accurately measuring cardiac volumes and calculating ejection fraction, the process of analysis is a laborious and time-consuming one.
A competent cardiologist is required to manually examine MRI scans to determine EF, which can take up to 20 minutes to complete. This is a significant amount of time that could be better spent with patients. Improving the efficiency of this measurement process would not only allow doctors to diagnose heart conditions earlier, but it would also have far-reaching implications for the advancement of heart disease treatment as a field.
Other benefits of using AI software to compute EF, rather than relying solely on clinicians, include:
- Speed: 20 minutes versus a couple of seconds, which translates into lower costs and significantly less waiting time for patients
- Consistency: AI software is not subject to the same variations in interpretation that can arise when relying on human clinicians
- Accuracy: AI software can often detect nuances and patterns in medical data that may be difficult for humans to identify.
Short data description
The Kaggle Data Science Bowl Cardiac Challenge Data consists of CINE cardiac MRI studies, including a short-axis (SAX) stack, which we used for ventricular volume quantification.
We used 491 subject datasets for training, 187 for validation and the remaining 440 (same test set as in the original challenge) were reserved for testing.
Wait, CINE what?
CINE MRI data is a type of medical imaging data that captures a series of images of the heart as it beats. In CINE MRI, images are captured in rapid succession, typically at a rate of 25–30 images per second or fewer.
These images can then be combined to create a video-like sequence that shows the heart in motion. Visually, CINE MRI data can look similar to a video.
Each frame of the video shows a different “slice” of the heart, allowing doctors and researchers to see the heart from different angles and perspectives.
The image above shows the acquisition data from a single subject, which includes only the end-diastolic (ED) and end-systolic (ES) time points for the sake of simplicity.
Typically, there are an additional 7–8 time points between them. The acquisition consists of a total of 6 slices, with slice 1 located at the bottom of the heart and slice 6 at the top of the heart.
Methods
Until recently, methods for determining Left Ventricle (LV) volume from cardiac images were based on segmentation. This involved creating binary image masks where the LV was visible, counting the total number of pixels within the LV region, multiplying that by the image resolution, and then calculating the resulting volume.
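The segmentation-based pipeline described above can be sketched in a few lines (a simplified illustration with made-up spacing values, not the competition code):

```python
import numpy as np

def lv_volume_ml(masks, pixel_spacing_mm=(1.4, 1.4), slice_thickness_mm=8.0):
    """Volume from a stack of binary LV masks (one per SAX slice).

    masks: array of shape (n_slices, H, W), 1 inside the LV, 0 elsewhere.
    """
    pixel_area_mm2 = pixel_spacing_mm[0] * pixel_spacing_mm[1]
    voxel_mm3 = pixel_area_mm2 * slice_thickness_mm   # volume of one voxel
    n_lv_voxels = int(np.sum(masks))                  # count LV pixels over all slices
    return n_lv_voxels * voxel_mm3 / 1000.0           # mm^3 -> ml

# Toy example: 3 slices, each containing a 10x10-pixel LV region
stack = np.zeros((3, 64, 64))
stack[:, 20:30, 20:30] = 1
print(lv_volume_ml(stack))  # 300 voxels * (1.4 * 1.4 * 8) mm^3 = 4.704 ml
```

In practice the pixel spacing and slice thickness come from the DICOM metadata of each acquisition; the values above are placeholders.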
However, we wanted to try a novel approach where LV volumes are predicted directly from the images using regression, bypassing the need for explicit segmentation.
The challenge
The challenge we faced was that regression algorithms generally perform poorly in computer vision tasks, particularly when dealing with small datasets. This was further compounded by the highly unbalanced nature of our dataset, as can be seen from the distribution depicted in the image below.
As anticipated, when we initially ran a simple convolutional neural network (CNN), the model exhibited bias towards predicting end-diastolic volume (EDV) and end-systolic volume (ESV) values that resulted in an ejection fraction falling within the 50–70% range, which was where the majority of patients in our dataset were clustered.
This was a major setback, since our focus was on patients with pathological conditions, specifically those with ejection fractions outside the healthy 50–70% range.
We conducted several weeks of experiments, attempting various methods such as batch subsampling/oversampling, data modelling, hyperparameter tuning, class weighting, anomaly detection, ensemble methods and different model architectures — along with every other type of augmentation we could think of.
Despite all of our efforts, we ultimately found ourselves in a very similar position to where we began.
The solution
After analyzing the dataset more carefully (distribution, variety, total number of quality samples, etc.), we concluded that obtaining an unbiased result with this limited number of patients in a regression setting was unlikely. This led me to consider how I could incorporate more samples into the dataset, with a more uniform distribution.
Fast forward: after trying numerous GANs for generating realistic synthetic data (and failing), I stumbled upon a remarkable network called GauGAN, developed by Nvidia.
In short, GauGAN is an innovative GAN-based image synthesis model with the impressive ability to produce photo-realistic images from an input semantic layout, using a spatially-adaptive normalization technique implemented in SPADE blocks.
I will not go into details here about how GauGAN works, since it is not the scope of this article, but you can read more about it in other posts.
To summarize, it is basically the opposite of segmentation: you provide a mask as input, and it will generate the corresponding real image.
And this was exactly what I needed, because I could manipulate the segmentation masks to obtain the desired volumes and implicitly ejection fraction.
So I trained GauGAN on segmentation masks with three classes: Left Ventricle (the class that determines the volumes), Right Ventricle, and Myocardium. I added the last two classes to give the network extra contextual information.
And for the first time, surprisingly, the generated images looked very real: realistic enough that I could use them to continue the training process.
Then, I used segmentation masks from a few normal patients and applied two different computer vision algorithms to obtain slightly different masks:
- To simulate patients with a lower ejection fraction, I interpolated between the end-systolic (ES) and end-diastolic (ED) frames of each slice and used the interpolated frames as the new ED or ES.
- To simulate patients with a higher ejection fraction, I utilized affine transformations on the masks.
I defined input parameters for both processes so that I could generate segmentation masks corresponding to a specific ejection fraction value for each patient.
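To make the second transformation concrete, here is a minimal sketch of scaling a binary mask about its centroid, which shrinks (or enlarges) the simulated LV and, with it, the corresponding volume. This is a pure-NumPy nearest-neighbour warp of my own; the actual pipeline and its parameters may differ:

```python
import numpy as np

def scale_mask(mask, factor):
    """Scale a binary mask about its centroid by `factor` (area scales ~factor^2)."""
    cy, cx = np.argwhere(mask).mean(axis=0)   # mask centroid
    H, W = mask.shape
    out = np.zeros_like(mask)
    for y in range(H):
        for x in range(W):
            # Inverse mapping: which source pixel lands on this output pixel?
            sy = int(round(cy + (y - cy) / factor))
            sx = int(round(cx + (x - cx) / factor))
            if 0 <= sy < H and 0 <= sx < W and mask[sy, sx]:
                out[y, x] = 1
    return out

# Shrinking an ES mask lowers the simulated ESV, which raises the simulated EF.
es_mask = np.zeros((64, 64), dtype=np.uint8)
es_mask[22:42, 22:42] = 1                 # 400-pixel LV region
small = scale_mask(es_mask, 0.5)
print(es_mask.sum(), small.sum())         # area drops to roughly a quarter
```

Because the scaling factor maps directly to a volume (and hence EF) change, it can be swept over a range of values to fill the under-represented parts of the EF distribution.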
Using these parameters, I created approximately 22,000 masks, resulting in a more evenly distributed dataset for ejection fraction values.
This is what the new dataset distribution looks like now:
So not only is the EF distribution more uniform now, but the dataset is also roughly 20 times bigger.
The following step involved feeding all the generated masks into the GauGAN generator to produce realistic synthetic MRI scans.
Returning to the original task, I first trained a custom neural network (described in the Network architecture section below) using only synthetic data, and then fine-tuned the model on real data.
Results
The Kaggle competition had a winning team with a CRPS score of 0.00948, which translates to 12.0 ml RMS error for EDV, 10.2 ml for ESV, and 4.9 percentage points for ejection fraction.
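For context, the competition's CRPS metric compares a predicted cumulative distribution over volumes (0–599 ml) against the step function of the true volume. A rough sketch of the per-case score (my own simplified implementation, not the official scorer):

```python
import numpy as np

def crps(pred_cdf, true_volume_ml):
    """CRPS for one case: pred_cdf[v] ~ P(volume <= v ml), for v = 0..599."""
    v = np.arange(600)
    heaviside = (v >= true_volume_ml).astype(float)  # true step distribution
    return np.mean((pred_cdf - heaviside) ** 2)

# A perfectly sharp, correct prediction scores 0.
perfect = (np.arange(600) >= 120).astype(float)
print(crps(perfect, 120))  # 0.0
```

The final leaderboard score averages this quantity over the ED and ES predictions of all test cases, so sharper and better-centred distributions score lower.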
Interestingly, the fourth-placed team achieved the smallest ejection fraction error (4.7), even though its volume RMSE was slightly larger.
With our approach, we achieved a significant improvement of 21.2% compared to the previous state of the art, with 11.2 ml for EDV, 7.1 ml for ESV, and 3.7 percentage points for ejection fraction.
Using the same network trained on real data only, without the synthetic pre-training, we achieved only 23.7 ml for EDV, 12.6 ml for ESV, and 7.1 for ejection fraction.
Therefore, the conclusion is that the introduction of synthetic data played a crucial role and made a significant difference in improving the results.
Network architecture
The network was designed to process a stack of CINE MR slices of variable length and to output both EDV and ESV, which are then used to compute the EF.
The network input is a SAX stack of a varying number of slices, each consisting of one ED and one ES frame concatenated along the channel axis.
A 2D residual CNN is applied to every (ED, ES) pair in the first stage. The CNN is built from five residual blocks, each consisting of multiple 2D convolutional layers, ReLU activation functions, Batch Normalization, and Max Pooling layers. The first convolutional layer outputs 32 channels, and the channel count doubles at every block.
Before feeding the resulting features to the LSTM, they are flattened, and a linear network is used to reduce their dimensionality to 128 elements containing spatial information.
Then, a bidirectional LSTM network is applied to correlate the information between these feature vectors, resulting in a vector containing both spatial and temporal information.
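In terms of tensor shapes, the slice-wise pipeline can be traced as follows. This is a shape-bookkeeping sketch with random weights; the final feature-map size (4x4) and the use of 512 channels (32 doubled over five blocks) follow the description above, while everything else is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

n_slices = 6                                   # varies per patient; the LSTM absorbs this
cnn_feat = rng.random((n_slices, 512, 4, 4))   # per-slice residual-CNN output (4x4 assumed)

flat = cnn_feat.reshape(n_slices, -1)          # flatten -> (6, 8192)
W = rng.random((flat.shape[1], 128))           # linear dimensionality reduction
spatial = flat @ W                             # -> (6, 128) per-slice spatial features

# A bidirectional LSTM over the slice axis would then map (n_slices, 128)
# to (n_slices, 2 * hidden); here we only verify the shapes up to that point.
print(spatial.shape)  # (6, 128)
```

Because the LSTM runs along the slice axis, the same weights handle stacks of 6, 8, or any other number of slices without architectural changes.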
As a final step, a Bayesian ridge regressor is employed to predict the final EDV and ESV volumes.
The LSTM approach enables the proposed model to process a variable length of slices. The training of the model is performed using the Rectified Adam optimizer and RMSE loss function.
Conclusions
In conclusion, we consider this not only a novel approach to improving the accuracy of cardiac MRI volume quantification using synthetic data, but a general methodology that can be applied to other projects as well.
By using GauGAN to generate synthetic MRI scans, we were able to increase the size and diversity of our dataset, which allowed for better training of our deep neural network.
The results speak for themselves, as our approach achieved a 21.2% improvement in ejection fraction error compared to the previous state of the art.
If you want to read more details about this work, we have also published a detailed scientific paper about this project, which can be found here.