An image processing project using k-means clustering, random forest classification, and the U-Net CNN architecture for road segmentation.
Machine Learning
Python
Remote Sensing
Author
Alex Oh, Aiden Pape, and Liam Smith
Published
May 15, 2024
Introduction
Semantic segmentation is a process that takes an input image and returns an output image of the same size where each pixel belongs to a particular class of objects, not distinguishing between objects within a class. For this study, the resulting image is binary, true meaning that a pixel is part of the class we are trying to segment and false meaning a pixel is not part of the class.
The purpose of this project was to compare different methods to use satellite data to segment roads. Three machine learning methodologies were used and each had increasing complexity and computational demands. These methods were K-Means Clustering, Random Forests, and Deep Learning using the UNet architecture. K-Means clustering is an unsupervised machine learning algorithm that partitions the data in k clusters, where k is determined by the user based on how many categories you want to cluster. It iteratively updates the centroids of each cluster using mean coordinates until convergence is achieved. Random forest classification is an ensemble learning method involving many decision trees. Each decision tree is trained on a random subset of observations and features, and predictions of the random forest are based on a plurality vote of the decision trees. The UNet architecture for deep neural networks was used as the final method. UNet consists of two convolutional neural networks put together: an encoder and a decoder. The encoder analyzes and extracts features and important information from the input image and the decoder takes this information and generates the binary segmentation image.
To produce our model with k-means clustering, we consulted journal articles by Jinxin, Qixin, and Liguang (2006) and Maurya, Gupta, and Shukla (2011). To implement the random forest algorithm, we mainly used scikit-learn’s documentation, but to understand the algorithm, we read this article on decision trees and a follow up article on random forests. For context on deep learning with remote sensing imagery, we consulted Lv et al. (2023) and Fan et al. (2023).
Each method was implemented and trained, if necessary, and then segmentation metrics were tested to see how well each method was performing. The metrics used were precision, recall, and Dice Coefficient. In addition to comparing the performance of each method, we discuss the computational power, training data, and time needed in order to implement them. The dataset used was Massachusetts Road Dataset, which consists of 1110 training, 14 validation, and 49 testing satellite image and label pairs. Each image is 1500 by 1500 pixels at 1 meter resolution, resulting in each imaging capturing 2.25 square kilometers.
This report is a summary of our work, and does not include all code. In particular, training the U-Net model required substantial computational effort and we did not re-run the entire model in this notebook. For full documentation of our work, please visit our GitHub repository.
Methods
Packages and Functions
Before thoroughly describing our work, we import necessary packages and define a few functions.
Code
# Import packagesimport skimage.io as skioimport skimage.util as skuimport skimage.color as skolfrom skimage import filters, feature, transformimport skimage.morphology as skmimport skimage.draw as drawfrom skimage.color import rgb2grayfrom scipy.signal import convolve2dimport matplotlib.pyplot as pltimport mpl_toolkits.mplot3dimport numpy as npimport cv2from sklearn.cluster import KMeansfrom sklearn.metrics import confusion_matrixfrom scipy.ndimage import gaussian_filter, gaussian_laplacefrom sklearn.preprocessing import StandardScalerfrom sklearn.decomposition import PCAfrom sklearn import datasetsimport tensorflow.keras.backend as Kimport pandas as pdfrom sklearn.ensemble import RandomForestClassifierfrom sklearn.utils.random import sample_without_replacementfrom scipy.ndimage import gaussian_filterfrom scipy.ndimage import gaussian_laplacefrom scipy.ndimage import maximum_filterfrom scipy.ndimage import minimum_filterfrom scipy.ndimage import median_filter
Code
# Function to compute DICEsmooth=1def dice_coef(y_true, y_pred): intersection = np.sum(y_true * y_pred)return (2.* intersection + smooth) / (np.sum(y_true) + np.sum(y_pred) + smooth)# Function to print several accuracy metricsdef accuracy_metrics(y_true, y_pred):# Create confusion matrix C = confusion_matrix(y_true, y_pred, labels=(True, False))# Overall accuracy rate acc = (C[0,0] + C[1,1])/C.sum()# Recall recall = (C[0,0])/(C[0,0] + C[1,0])# Precision prec = (C[0,0])/(C[0,0] + C[0,1])# DICE dice = dice_coef(y_true, y_pred)# Print resultsprint("Confusion matrix:\n", C)print("Overall accuracy:", np.round(acc, 3), "\nPrecision:", np.round(recall, 3),"\nRecall", np.round(prec, 3), "\nDICE:", np.round(dice, 3))# Function to create input layerdef classify_gray(image):# Compute the standard deviation of the r, g, and b channels std_dev = np.std(image, axis =2)# Define a threshold for classifying gray pixels diff_threshold =6# Adjust as needed# Classify pixels as gray or not gray based on the standard deviation gray_mask = std_dev < diff_thresholdreturn gray_mask# Function to compute layers for additional model featuresdef compute_features(img, include_categorical =True): # Range of values (gray pixels will have low range) r = img.max(axis =2) - img.min(axis =2)if include_categorical:# Canny edge detection canny_edges_r = feature.canny(img[:,:,0], sigma=4) canny_edges_g = feature.canny(img[:,:,1], sigma=4) canny_edges_b = feature.canny(img[:,:,2], sigma=4)# Calculation Canny gradient image_gray = rgb2gray(img) canny_edges = feature.canny(image_gray, sigma=3)# Create disk disk = skm.disk(1)# Area closing for hough lines closed_edges = skm.dilation(canny_edges, footprint = disk) closed_edges = closed_edges *255 lines = transform.probabilistic_hough_line(closed_edges, threshold=5, line_length=25, line_gap=3) hough_lines = np.zeros(image_gray.shape, dtype=np.uint8)# Draw the detected lines on the canvasfor line in lines: p0, p1 = line# Draw line segment rr, cc = draw.line(p0[1], p0[0], p1[1], p1[0]) hough_lines[rr, cc] =255# Set the pixel values to white (255) along the line#create gray mask gray_mask = classify_gray(img) gray_mask = gray_mask.reshape((img.shape[0], img.shape[1])) img = np.dstack([img, canny_edges_r, canny_edges_g, canny_edges_b, gray_mask, hough_lines])# Gaussian blur sigma = 1 gaus_r_1 = gaussian_filter(img[:,:,0], sigma =1) gaus_g_1 = gaussian_filter(img[:,:,1], sigma =1) gaus_b_1 = gaussian_filter(img[:,:,2], sigma =1)# Gaussian blur sigma = 3 gaus_r_3 = gaussian_filter(img[:,:,0], sigma =3) gaus_g_3 = gaussian_filter(img[:,:,1], sigma =3) gaus_b_3 = gaussian_filter(img[:,:,2], sigma =3)# Gaussian blur sigma = 5 gaus_r_5 = gaussian_filter(img[:,:,0], sigma =5) gaus_g_5 = gaussian_filter(img[:,:,1], sigma =5) gaus_b_5 = gaussian_filter(img[:,:,2], sigma =5)# LoG blur sigma = .5 log_r_5 = gaussian_laplace(img[:,:,0], sigma =.5) log_g_5 = gaussian_laplace(img[:,:,1], sigma =.5) log_b_5 = gaussian_laplace(img[:,:,2], sigma =.5)# LoG blur sigma = .6 log_r_6 = gaussian_laplace(img[:,:,0], sigma =.6) log_g_6 = gaussian_laplace(img[:,:,1], sigma =.6) log_b_6 = gaussian_laplace(img[:,:,2], sigma =.6)# LoG blur sigma = .8 log_r_8 = gaussian_laplace(img[:,:,0], sigma =.8) log_g_8 = gaussian_laplace(img[:,:,1], sigma =.8) log_b_8 = gaussian_laplace(img[:,:,2], sigma =.8)# Add layers to modelreturn np.dstack([img, r, gaus_r_1, gaus_g_1, gaus_b_1, gaus_r_3, gaus_g_3, gaus_b_3, gaus_r_5, gaus_g_5, gaus_b_5, log_r_5, log_g_5, log_b_5, log_r_6, log_g_6, log_b_6, log_r_8, log_g_8, log_b_8])def identify_road_cluster(clustered_image, image_label): cluster_labels = np.unique(clustered_image) best_recall =0 best_cluster =-1for i in cluster_labels: cluster = (clustered_image==i) C = confusion_matrix(image_label.ravel(), cluster.ravel(), labels=(True, False))# # Overall accuracy rate# acc = (C[0,0] + C[1,1])/C.sum()# # Recall recall = (C[0,0])/(C[0,0] + C[1,0])# Precision# prec = (C[0,0])/(C[0,0] + C[0,1])if recall > best_recall: best_recall = recall best_cluster = ireturn best_cluster
Import Images
First, we import the images that we will use for most of our work. For the k-means and random forest sections, we use one image for training and another for testing, so we display those images here.
Code
# Read the datargb = skio.imread("../data/MA_roads/tiff/train/10828735_15.tiff")ans = skio.imread("../data/MA_roads/tiff/train_labels/10828735_15.tif") >0rgb_test = skio.imread("../data/MA_roads/tiff/train/21929005_15.tiff")ans_test = skio.imread("../data/MA_roads/tiff/train_labels/21929005_15.tif") >0# Display training data and correct outputfig, ax = plt.subplots(1, 2, figsize = (10, 6))skio.imshow(rgb, ax = ax[0])ax[0].set_title("Training Data")skio.imshow(ans, ax = ax[1])ax[1].set_title("Training Solution");# Display testing data and correct outputfig, ax = plt.subplots(1, 2, figsize = (10, 6))skio.imshow(rgb_test, ax = ax[0])ax[0].set_title("Testing Data")skio.imshow(ans_test, ax = ax[1])ax[1].set_title("Testing Solution");
Creating Image Filters
In our k-means and random forest sections, we will compute many features to help identify specific aspects of roads. In the previous section, we defined a function for creating this features. Let us take a closer look at a smaller area and then apply our filters to that area.
On the top left is the range of red, green and blue for each pixel. We chose this feature because roads are gray, and in the RGB color space, gray pixels have similar values of red, green and blue.
On the top middle is canny edges for the red channel. We created this feature hoping to detect the edges of roads, but as you can see, it detects edges in many objects other than roads. We include canny edges in the red, green and blue channels.
On the top right is the graymask created using standard deviation amongst the RGB channels and a threshold. It captures the roads, but also many other features.
On the bottom left is the hough transform image, which captures the long linear nature of roads but does not successfully detect every pixel.
On the bottom middle is the original image after gaussian blurring with \(\sigma = 3\). Our thought process here was that there might be some noise in the image leading random pixels to have the same R, G, and B values as roads. By blurring the image, we hoped to account for this by giving some weight to the values of nearby pixels. We include gaussian blur with \(\sigma = 1\), \(\sigma = 3\), and \(\sigma = 5\) for red, green and blue, hoping that our model might learn from multiple blurring radii.
On the bottom right is our image after the log of gaussian filter has been applied to the red, green and blue channels. We hopes to pick up on the width/frequency of roads with this filter, so we included this filter for \(\sigma = 0.5\), \(\sigma = 0.6\), and \(\sigma = 0.8\).
Principal Components Analysis
Principal Component Analysis (PCA) was utilized to assess the possibility of cutting down on the number of features input into the various models. Currently there are about 27 filtered and calculated features extracted from each 1500x1500x3 RGB image. When scaling this process, the ability to cut down on preprocessing steps and decrease the size of the image input into the models would provide large increases in the speed and storage needed to run this process. This is where PCA has the potential to help, PCA is a commonly used tool for dimensionality reduction across data science. PCA is especially helpful when a dataset has multiple potentially correlated and redundant features and it is unclear which features are most important. In this situation, many of the features generated from our satellite images may provide very similar information or are highly correlated. PCA would allow us to use less layers while retaining the maximum amount of variance.
K-Means Clustering
K-Means Clustering is a basic unsupervised machine learning algorithm that groups data points into k clusters, the value of k can be chosen by the user. The centroids of each cluster are initialized randomly as three data points and all points are assigned to the cluster nearest to them. Then the mean coordinates of all the points in each cluster is calculated and this value is now the new centroid for that cluster. Then the process is repeated until a maximum number of interactions is reached or the centroids spot changing significantly between each interaction. Although there were only two categories in this study; road and background, k was chosen to be three. This is because it was found by (Maurya, et al) to keep the clusters containing the roads from including other non-road features.
K-Means clustering was run on features extracted from the full 1500x1500 images, and two sets of features were used. First was all features discussed in the previous section and the second was the top 5 principal components accessed by PCA. We will compare the results of these two approaches to K-Means clustering and then compare the results of K-Means in general to the results from our other models.
Random Forest
Random forest is a highly flexible supervised machine learning algorithm that can perform both regression and classificiation and can take both continuous and categorical data as input. Random forest is an ensemble learning method that works by training a specified number of decision trees on the training data. Each decision tree is trained using a random subset of training data and features, and essentially is composed of a root node and a number of internal/decision nodes. Each data point is passed to the root of the decision tree, and at each decision node, either travels to the node’s left or right child depending on the value of one feature of the data point. The data point is passed through the tree until it reaches a terminal node, at which point it is classified or given a predicted value.
In the random forest model, classification predictions are created based on a majority vote of the decision trees, while regression predictions are created based on averaging the output of the decision trees. For our work, we consider each pixel as an observation and we use the random forest model to classify pixels as either road or non-road. Due to computational constraints, we did not train a model using the entirety of multiple images – we don’t even train a model on a single image. Each image has \(2,250,000\) pixels, and we ran this part of our analysis on our personal computers. Instead, we created our first model by training on the RGB channels of a subset of one image. In the second model, we trained on the RBG channels plus all of the additional layers mentioned above. We noticed that our model tended to produce far more accurate predictions on non-road pixels than road pixels, so we then sampled an image for an equal number of pixels of each class and trained both RGB and additional layer models on that input. Then, we attempted to prevent overfitting by fitting a random forest on the additional layers input reduced to 5 features via PCA. Finally, we trained both RGB and additional layers models on sampled data from multiple images. In the results section, we compare the performance of all of our models.
U-Net
A U-Net model is a convolutional neural network architecture which is designed for segmentation. The architecture is characterized by it’s symmetrical “U-shaped” design composed of an encoder path and a decoder path. The encoder path takes as input the original image and consists of multiple iterations of convolutional layers followed by max pooling layers. This path is designed to take the context and identify key features and reduce the spatial dimensions between each layer. The decoder path is unique to U-Net neural networks, as it attempts to construct a mask using the features learned in the encoder path and regaining the spatial dimensions lost. It consists of a series of up-sampling layers (often using transposed convolutions or interpolation) followed by concatenation with feature maps from the corresponding contracting path. Throughout this process, a skip connection is made between each layer of the two paths to preserve the high resolution between each layer. The final layer of the U-Net typically uses a 1x1 convolution followed by an activation function (e.g., sigmoid or softmax) to produce the final segmentation mask or output. This layer condenses the information learned by the network into the desired output format, such as a binary mask for semantic segmentation tasks.
Our U-Net model is comprised of 10 layers: 5 encoding layers and 5 decoding layers. We were limited by resources like time and memory, so instead of fitting the model with 1500 x 1500 pixel images, we used 128 x128 pixels. We also divided each 1500 x 1500 into 128 x 128 images, increasing the training size tremendously. Images and masks without roads were removed from the dataset. The model is optimized using the Adam algorithm with a learning rate of 0.01. Our loss is calculated using the Binary Cross-Entropy function. Like previous models, we measured the strength of our model using the dice coefficient.
Results
PCA
Read in our PCA example image and label and crop them
This image will be used to demostrate what is being done with PCA and to assess how well the dimension reduction and variance maximization of PCA group of road points and background points into distnguishable clusters. A much smaller image is used so that the data points don’t just appear as a large mass of the size of graph that is able to be shown. For this example, we picked a straight forward patch with a perpedicular road intersection.
Get all filters of the image
Code
pca_filters = compute_features(pca_img_cropped)
Show top 5 principal components
Code
pca_layers = pca_filters.reshape(pca_filters.shape[0] * pca_filters.shape[1], pca_filters.shape[2])# Standardize the featuresscaler = StandardScaler()pca_layers_scaled = scaler.fit_transform(pca_layers)# Initialize PCA and fit the scaled datapca_5 = PCA(n_components=5)# layers_pca = pca_10.fit_transform(pca_layers_scaled)pca_5_comps = pca_5.fit_transform(pca_layers_scaled)# Explained variance ratioexplained_variance_ratio_5 = pca_5.explained_variance_ratio_# Plotting the explained variance ratioplt.figure(figsize=(6, 4))plt.bar(range(1, len(explained_variance_ratio_5) +1), explained_variance_ratio_5, alpha=0.5, align='center')plt.ylabel('Explained Variance Ratio')plt.xlabel('Principal Components')plt.title('Explained Variance Ratio by Principal Components')plt.show()
This graph shows the percent of variance explained by the top 5 principal components, it can be interpretted that since the top component explains about 45% of variance, the second component explains about 15%, and so on. Therefore, in reducing our dimensions to just 5 we still maintain around 70-80% of the variance that our previous 27 features had.
In order to visualize the new dimensinos that PCA can output, we will use the top 3 components to plot our road and background points in 3D space
Code
# Initialize PCA and fit the scaled datapca_3 = PCA(n_components=3)# layers_pca = pca_10.fit_transform(pca_layers_scaled)pca_3_comps = pca_3.fit_transform(pca_layers_scaled)
There is some clear distinction between the road and backround points but also still a lot of overlap. This shows that the problem of segmenting roads will be difficult but possible.
K-Means Clustering
Load in test image, this image will be used to test k-Means and future methods
# Show cluster and original imagefig, axes = plt.subplots(1, 2, figsize=(10, 10), sharex=True, sharey=True)axes[0].imshow(pca_road_cluster, cmap='gray')axes[0].set_title('Road Cluster')axes[1].imshow(test_image)axes[1].set_title('Original Image')plt.show()
A clear visual difference can be seen with the K-Means clustering run on the PCA top-5 components. The resulting road clusters are more uniform, while the results with using all features were quite grainy. This observation impacts the metrics as well, with PCA layers having similar recall to all layers but better precision and therefore DICE coefficient.
Random Forest Results
Initial RGB Model
Recall that our initial random forest model uses a small subset of our training image, as displayed earlier. First, we train our model.
Code
# Flatten imagestrain_small_rgb = small_rgb.reshape(small_rgb.shape[0]*small_rgb.shape[1], 3)y_train = small_ans.reshape(small_ans.shape[0]*small_ans.shape[1])# Create modelRF = RandomForestClassifier()# Fit and output the performance of the modelmodel1 = RF.fit(train_small_rgb, y_train)# Predictions on training datamodel1_pred = model1.predict(train_small_rgb)# Confusion matrixaccuracy_metrics(y_train, model1_pred)
While we have a really good overall accuracy rate, we are correctly predicting only 69% of the actual road pixels. With a precision of 0.85, about 85% of our road predictions are actually roads.
While we still have a good overall accuracy rate, our predictions of roads is substantially worse. We have only classified 8.5% of the road pixels correctly, and only 11.3% of our road predictions were actually roads.
Our model did NOT generalize well! This looks terrible!
Initial Additional Layers Model
Now, let’s try training on the same region, but incorporating all of the features described above.
Code
# Train model# Flatten imagetrain_small_rgb_layers = small_rgb_layers.reshape(small_rgb_layers.shape[0]*small_rgb_layers.shape[1], small_rgb_layers.shape[2])# Create modelRF = RandomForestClassifier()# Fit and output the performance of the modelmodel2 = RF.fit(train_small_rgb_layers, y_train)# Predictions on training datamodel2_pred = model2.predict(train_small_rgb_layers)# Confusion matrixaccuracy_metrics(y_train, model2_pred)
Adding these filters to our model had negligible impact on our results. It improved the accuracy from 0.906 to 0.928 and the precision from 0.113 to 0.149, but the recall dropped from 0.085 to 0.045. This means that of the pixels that actually represent roads, we are only correctly classifying 4.7% of them. With perfect results on our training data and pitiful results on our testing data, it appears that incorporating these features in our training data led to severe overfitting! Visually, our results look slightly less random, even though the performance metrics are worse.
Sampling for Overfitting: RGB
We have fed a substantial amount of data, which ought to contain some useful information regarding roads, to our model In our training solution, this data was in fact useful, leading to virtually 100% accuracy. However, on the testing data for both the RGB model and the model with additional layers, our model correctly predicted less than 10% of our roads. Perhaps this means that our model is overfit to our training data. Since the vast majority of pixels in our training data represent non-roads, perhaps our model is overfit to the particularities of the non-road pixels in our training data. One way to address this issue is to randomly select an equal number of pixels of both classes, and then train the model on those pixels. Let’s try randomly picking 5000 road pixels and 5000 non-road pixels for our training data and 5000 of each for our testing data and evaluating our model’s performance.
First, let’s use this method on a model with just RGB layers.
Code
# Flatten training imagestrain_rgb = rgb.reshape(rgb.shape[0]*rgb.shape[1], 3)y_train = ans.reshape(ans.shape[0]*ans.shape[1])# Subset training data by labely_train_true = y_train[y_train]y_train_false = y_train[~y_train]train_rgb_true = train_rgb[y_train]train_rgb_false = train_rgb[~y_train]# Sample indices of each labeltrue_indices = sample_without_replacement(y_train_true.shape[0], 10000)false_indices = sample_without_replacement(y_train_false.shape[0], 10000)# Create modified training datay_train_mod = np.concatenate([y_train_true[true_indices[:5000]], y_train_false[false_indices[:5000]]])train_rgb_mod = np.concatenate([train_rgb_true[true_indices[:5000]], train_rgb_false[false_indices[:5000]]])# Create modified testing datay_test_mod = np.concatenate([y_train_true[true_indices[5000:]], y_train_false[false_indices[5000:]]])test_rgb_mod = np.concatenate([train_rgb_true[true_indices[5000:]], train_rgb_false[false_indices[5000:]]])
Code
# Create modelRF = RandomForestClassifier()# Fit and output the performance of the modelmodel3 = RF.fit(train_rgb_mod, y_train_mod)# Predictions on training datamodel3_pred = model3.predict(train_rgb_mod)# Confusion matrixaccuracy_metrics(y_train_mod, model3_pred)
While this model does not have 100% training accuracy like the additional layers model, it has improved significantly over the original RGB model. Most notably, the training recall has improved from less than 70% to roughly 99%. Let’s see if we maintain this performance when we make predictions on our testing data.
Code
# Predictions on testing datamodel3_test_pred = model3.predict(test_rgb_mod)# Confusion matrixaccuracy_metrics(y_test_mod, model3_test_pred)
Our results are encouraging! Our overall accuracy, precision, and recall are all approximatly 0.75. In the original RGB model, the overall accuracy was over 90%, while the precision and recall were roughly 10%. By balancing the amount of training data in each class, we were able to balance the different accuracy metrics, improving our predictions of roads at the expense of our predictions of non-roads. Perhaps if we incorporate our additional layers into the model, these balance improvements will translate to balanced and higher accuracy metrics.
While our results above are encouraging, our training and testing data were both drawn from the same image, so our model may have overtrained to this image. Let’s form predictions and compute accuracy metrics on a different image.
Surprisingly, the overall accuracy is higher in the testing image than in the training image! The recall is still over 70%, indicating that we are capturing most pixels representing roads correctly. With a much lower precision, we must be predicting road pixels frequently where there are not actually roads.
Since we are working with an entire image, we can inspect our results!
It looks like we tend to label pixels as roads when they in reality represent other human features like buildings. We also exaggerate the width of some roads.
Sampling for Overfitting: Additional Layers
Now let’s try the same sampling technique but after producing all of our features.
Code
# Create additional featurestrain_rgb_mod_layers = compute_features(rgb)# Flatten training image with extra layerstrain_rgb_2 = train_rgb_mod_layers.reshape(train_rgb_mod_layers.shape[0]*train_rgb_mod_layers.shape[1], train_rgb_mod_layers.shape[2])# Subset training data by labeltrain_rgb_true_2 = train_rgb_2[y_train]train_rgb_false_2 = train_rgb_2[~y_train]# Create modified training datatrain_rgb_mod_2 = np.concatenate([train_rgb_true_2[true_indices[:5000]], train_rgb_false_2[false_indices[:5000]]])# Create modified testing datatest_rgb_mod_2 = np.concatenate([train_rgb_true_2[true_indices[5000:]], train_rgb_false_2[false_indices[5000:]]])
Code
# Create modelRF = RandomForestClassifier()# Fit and output the performance of the modelmodel4 = RF.fit(train_rgb_mod_2, y_train_mod)# Predictions on training datamodel4_pred = model4.predict(train_rgb_mod_2)# Confusion matrixaccuracy_metrics(y_train_mod, model4_pred)
It appears that there were some errors on our testing data. Going from the RGB model to the additional layers model, our overall accuracy improved from 0.755 to 0.812, the precision improved from 0.749 to 0.796, and the recall improved from 0.767 to 0.84. These are the most accurate road predictions yet!
While our training and testing data contained none of the same pixels, they were both drawn from the same image, so it is possible that they were overtrained to our particular image of choice. Perhaps a more valid testing metric would involve testing our model on pixels from a different image. Let’s form predictions and compute accuracy metrics on the entirety of another image.
Similar to the RGB model, the results on the testing image were largely similar to the results on the previous image, except for the precision dropping by over 50% The overall accuracy is over 85%, but the recall is now 65.9% and the precision is now 26.6%. While this is certainly not perfect, the precision and recall are still a substantial improvement over the models without sampling. However, the recall was actually slightly higher in the sampled RGB model, indicating that the RGB model generalized better in terms of predicting road pixels. Perhaps there are tactics we can use to combat overfitting.
Also, since we are now working with a complete image, we can once again inspect a full image illustrating our predictions versus the truth.
Once again, our model tends to incorrectly predict roads where there are other human features like buildings, and it exaggerates the width of some roads.
PCA for Overfitting
It appears that our additional layers model with sampled training data is overfit to our training data, as we have excellent performance on the training data but subpar performance on the new image. Perhaps our model suffers from overfitting because we have constructed so many features, many of which are similar to one another. To help combat this issue, we fit our model again below, but after applying Principal Component Analysis and selecting the most important components.
First, we apply Principal Component Analysis to our training data and plot the percent variance explained by each component. Note that PCA is only applicable for continuous features, so we cannot include binary features such as canny edges in this model.
Code
# Add layers to modeltrain_pca_layers = compute_features(rgb, include_categorical =False) # Flatten training image with extra layerstrain_pca_layers_flat = train_pca_layers.reshape(train_pca_layers.shape[0]*train_pca_layers.shape[1], train_pca_layers.shape[2])# Standardize the featuresscaler = StandardScaler()train_pca_layers_scaled = scaler.fit_transform(train_pca_layers_flat)# Initialize PCA and fit the scaled datapca = PCA(n_components=train_pca_layers.shape[2])layers_pca = pca.fit_transform(train_pca_layers_scaled)# Explained variance ratioexplained_variance_ratio = pca.explained_variance_ratio_# Plotting the explained variance ratioplt.figure(figsize=(8, 6))plt.bar(range(1, len(explained_variance_ratio) +1), explained_variance_ratio, alpha=0.5, align='center')plt.ylabel('Explained Variance Ratio')plt.xlabel('Principal Components')plt.title('Explained Variance Ratio by Principal Components')plt.show()
Apparently, over 50% of the variance in our data can be explained by the first component! The second component only accounts for about 11.5% of the variance in the data, and the numbers continue to drop after that.
Code
explained_variance_ratio[0:5].sum()
0.843420909342875
The first 5 components account for over 84% of the variation in our data. Let’s try only retaining the first 5 components for our model and seeing whether our performance improves.
Code
# Initialize PCA and fit the scaled datapca = PCA(n_components=5)layers_pca = pca.fit_transform(train_pca_layers_scaled)# Subset training data by labellayers_pca_true = layers_pca[y_train]layers_pca_false = layers_pca[~y_train]# Create modified training datalayers_pca_mod_train = np.concatenate([layers_pca_true[true_indices[:5000]], layers_pca_false[false_indices[:5000]]])# Create modified testing datalayers_pca_mod_test = np.concatenate([layers_pca_true[true_indices[5000:]], layers_pca_false[false_indices[5000:]]])# Create modelRF = RandomForestClassifier()# Fit and output the performance of the modelmodel5 = RF.fit(layers_pca_mod_train, y_train_mod)# Predictions on training datamodel5_pred = model5.predict(layers_pca_mod_train)# Confusion matrixaccuracy_metrics(y_train_mod, model5_pred)
On the testing data, all of our model’s performance metrics are lower than its non-PCA counterpart, although not by that much. Let’s test on an entirely new image.
Code
# Add layers to modeltest_pca_layers = compute_features(rgb_test, include_categorical =False)# Flatten testing imagestest_pca_layers_flat = test_pca_layers.reshape(test_pca_layers.shape[0]*test_pca_layers.shape[1], 22)# Standardize the featurestest_pca_layers_scaled = scaler.fit_transform(test_pca_layers_flat)# Project onto principal componentslayers_pca_test = pca.transform(test_pca_layers_scaled)
Code
# Predictions on testing datamodel5_test_pred_2 = model5.predict(layers_pca_test)# Confusion matrixaccuracy_metrics(y_test, model5_test_pred_2)
Again, on the testing image, all of our model’s performance metrics are lower than its non-PCA counterpart. In this scenario, it appears that the components explaining very little variation in the data were actually somewhat useful for predictions. Note that we tried this with a variety of number of retained components, and we found that the model’s performance improved as we increased the number of components.
Visually, our predictions appear somewhat worse that those from our non-PCA additional layers model. There is overall a lot more noise in our predictions, and interestingly, we are not predicting a road in much of the massive highway.
Sampling from Multiple Images: RGB
PCA did not help us generalize to new testing data, but perhaps there is another approach we could take. Earlier, we created our training data by sampling from a single image. Perhaps we could sample from multiple images, mitigating bias from working with a single image. Below, we select 10 images and sample from them to train our model.
Code
# Store id's of imagesimgs = ["10828735_15", "10228675_15", "10228705_15", "10228720_15", "10228735_15", "10528675_15", "10528750_15", "10978720_15", "11128825_15", "12028750_15"]# Initialize arrays to store training datay_train = np.array([])rgb_train = np.zeros((0,3))# Sample from each imagefor img in imgs: rgb = skio.imread("../data/MA_roads/tiff/train/"+ img +".tiff") ans = skio.imread("../data/MA_roads/tiff/train_labels/"+ img +".tif") >0# Flatten training images rgb_flat = rgb.reshape(rgb.shape[0]*rgb.shape[1], 3) ans_flat = ans.reshape(ans.shape[0]*ans.shape[1])# Subset training data by label ans_true = ans_flat[ans_flat] ans_false = ans_flat[~ans_flat] rgb_true = rgb_flat[ans_flat] rgb_false = rgb_flat[~ans_flat]# Sample indices of each label true_indices = sample_without_replacement(ans_true.shape[0], 5000) false_indices = sample_without_replacement(ans_false.shape[0], 5000)# Create modified training data y_train = np.concatenate([y_train, ans_true[true_indices], ans_false[false_indices]]) rgb_train = np.concatenate([rgb_train, rgb_true[true_indices], rgb_false[false_indices]])# Create modelRF = RandomForestClassifier()# Fit and output the performance of the modelmodel6 = RF.fit(rgb_train, y_train)# Predictions on training datamodel6_pred = model6.predict(rgb_train)# Confusion matrixaccuracy_metrics(y_train, model6_pred)
These metrics are actually somewhat lower than the training metrics for the RGB model sampled from a single image. But our true question is whether the model generalizes better to the testing data – more specifically, to our new testing image.
Code
# Predictions on testing datamodel6_test_pred = model6.predict(flat_rgb_test)# Confusion matrixaccuracy_metrics(y_test, model6_test_pred)
In the RGB model trained on sampling data from one image, our results were as follows.
Overall accuracy: 0.829
Precision: 0.225
Recall 0.715
DICE: 0.343
This model has slightly decreased in all metrics except for the recall. Considering the additional computational power required to train this model, it does not appear to be advantageous over the other model.
Per usual when working with all of our layers, we have virtually perfect results on our training data. Next, we test the model on our usual testing image.
Code
# Predictions on testing datamodel7_test_pred = model7.predict(test_rgb_3)# Confusion matrixaccuracy_metrics(y_test, model7_test_pred)
This is the best performance we have seen so far. The biggest difference between this model and the additional layers model with training data sampled from a single image is that the recall has increased from 0.658 to 0.823.
This is definitely the cleanest output we have seen so far, which is reflective of our model’s high recall value. The model continues to erroneously label human features that are not roads as roads, and it has labelled virtually the entire highway as road instead of just its centerlines. Differentiating between the centerline and the road surface seems to be a very difficult problem.
Comparison of Random Forest Results
We fit a lot of random forest models, so we summarized their accuracy metrics in the following table. Note that our testing metrics listed here are the results when we tested on an entirely new image.
Model
RGB
More Layers
RGB Sampled
More Layers Sampled
PCA Sampled
RGB Multiple Images
More Layers Multiple Images
Training Accuracy
0.953
1.0
0.989
1.0
1.0
0.956
1.0
Training Precision
0.853
1.0
0.985
1.0
1.0
0.95
1.0
Training Recall
0.69
1.0
0.992
1.0
1.0
0.964
1.0
Training Dice
0.763
1.0
0.989
1.0
1.0
0.957
1.0
Testing Accuracy
0.906
0.928
0.829
0.868
0.733
0.816
0.846
Testing Precision
0.113
0.149
0.225
0.27
0.123
0.224
0.264
Testing Recall
0.085
0.045
0.715
0.658
0.537
0.789
0.823
Testing Dice
0.097
0.069
0.343
0.383
0.2
0.348
0.4
For a quick summary of each model’s performance, we look at the testing Dice score. We noticed a major bump in performance when we started sampling equal portions of training data, with the dice coefficient of both the RGB and additional layers models jumping from less than 0.1 to above 0.3. We attempted to account for overfitting by introducing PCA, but this cut our Dice coefficient in half. Finally, creating our training data by sampling from multiple images resulted in the highest performance, particularly in the additional layers model, where the Dice coefficient jumped to 0.4. Our final model was our best performing random forest model by every metric except testing accuracy.
U-Net Results
Because of the high computational demans of the U-Net model, we did not re-run the model in this notebook. For full documentation of our code, please see our GitHub repository.
We managed to get solid predictions for the small 128 x 128 images. The precision got up to 0.445, the recall went to 0.938, and the dice coefficient was 0.604. We wanted to compare our results with the previous models, so we ran our predictions on the test set (also divided into 128 x 128 pixel images), and reconstructed the 128 x 128 masks to get 1408 x 1408 masks which resemble the original image. These final predictions were less accurate than the original 128 x 128 masks with a precision metric of 0.06, a recall of 0.197, and a dice coefficient of 0.092. Individually the predictions made by the U-Net were pretty accurate, but when reconstructed to the greater original image size, the cumulative prediction must have lost enough information to lower its accuracy metrics.
Overall Result Comparison
Which method worked best: k-means, random forest, or U-Net? We include the following table to facilitate comparison of results. We trained many different models in the preceding sections, but we only report the best performing model for each method in the table below.
Model
K-Means
Random Forest
U-Net
Testing Precision
0.187
0.264
0.06
Testing Recall
0.661
0.823
0.197
Testing Dice
0.291
0.4
0.092
Essentially, with increasing complexity of the algorithm and computational requirements came increased performance. K-Means clustering required the least computational power but also achieved the worst results, achieving a Dice coefficient of 0.291 after we reduced the number of features using PCA. Random forest required more computational power, especially when we began sampling training data from multiple images, achieving an improved Dice score of 0.4. Finally, our U-Net model achieved even better results when we considered small subsets of our testing images with a dice score of 0.604, but poor results when applied to the entire image with a Dice score of 0.092. The U-Net models in the literature and Kaggle datasets have achieved by far the best results. Unfortunately, achieving such high performance with U-Net requires substantial computational power for model training. Overall, we conclude that U-Net can achieve the best performance when appropriate resources are available. In the absence of high performance computing capabilities, random forest may suffice.
Additionally, it is interesting to note that we had mixed results with the effectiveness of PCA. For K-Means Clustering, using the top-5 components helped the model perform better, but for Random Forest, using PCA led to decreased performance. PCA was not applicable to deep neural networks.
Accessibility
Due the high requirements to run these models, especially U-Net, and the large amount of data needed, these methods are not very accessible to many individuals. Luckily this is not a task that many individuals would look to perform but more a task that would be performed by large organization like governments. Considering accessibility at this scale brings in the question of access to large datasets and high performance computers, as many countries may not have access to such resources and therefore would not be able to implement these methods on a large scale.
There are some positive accessibility points with these algorithms as well, being that no one has to physically be in the place where you hope to segment roads. This means that in areas with difficult to access roads, one can still segment the road network. Some examples of when this could be extremely helpful are climate disasters and war time aid. With climate disasters, roads may be destroyed like during the floods in Vermont last summer. Knowing what roads were destroyed without having to go out into the field where danger from flood is still high could be extremely helpful for disaster response. In the example of war, especially in places that are being heavily bombed, it may be hard to get humanitarian aid workers into certain areas due to the destruction of roads. These algorithms could help assess with roads are still open without having to put anyone on the field and in danger.
There are some other accessibility points such as ability to interpret the image segmentation. In an area where the visual interpretation is essential, making sure color blind inclusive colors are used is very important.
Ethical Considerations
As mentioned above, some countries and organizations may not have the resources to train large models like U-Net. Resources like data centers and HPCs are essential and many places may not have access to them. Looking deeper, if a country like the US trains a UNet model for segmenting roads on a dataset including road in the US, that model could be used by other countries but it may be much less accurate if there road infrastructure looks different than that in the US. Furthermore, if large labeled datasets of roads only exist with roads from developed countries, this would hinder any country that may have less built infrastructure as the training data will not align with the data being input into the model.
Further ethical considerations around data across fields is the exploitative nature in which many labeled datasets are created. Large tech companies often out-source their data labeling overseas and pay people very little for the data that makes all this possible. They make huge profits off of the models that are only possible with this labeled data and the associated labor.
Going into remote sensing and the high resolution of satellite images, privacy concerns may be raised. These satellite images will contain sections consisting of private problems and invasion of personal privacy should always be a consideration when using remote sensing on inhabited areas.
Reflection
Schedule and Obstacles
We each stayed on schedule for our tasks in general, but some tasks were much more difficult than expected and we ended up deciding we would save them for future work. Some tasks just ended up being more time consuming, like training the U-Net. Lots of time was spent waiting for a model to train just have an error or horrible predictions. Looking back, we should have learned how to connect to Middlebury’s ADA cluster to train our model. We ended up ditching LiDAR data because it was more difficult to process than we realized and would put an even heavily burden on the amount of training data we were using. We also felt like we already had a plethora of topics to discuss and compare, so it wasn’t necessary to add even another aspect to the project.
Future Work
Future work we are interested in pursuing is of course integrating LiDAR data into each of our models and comparing the metrics on how well roads were segmented. We are very curious to see if this additional data would help our models and are interested in learning about LiDAR data. Another path of future work is performing more data engineering and augmentation on our training data for input into U-Net. We know from reading and seeing models trained on Kaggle that very accurate segmentations are possible and we had a lot of room for improvement with our model.
References
Fan, Zhiyong, Yu Liu, Min Xia, Jianmin Hou, Fei Yan, and Qiang Zang. 2023. “ResAt-UNet: A u-Shaped Network Using ResNet and Attention Module for Image Segmentation of Urban Buildings.”IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 16: 2094–2111.
Jinxin, Cao, Shi Qixin, and Sun Liguang. 2006. “A Methodology for Automatic Detection and Extraction of Road Edges from High Resolution Remote Sensing Images.” In 2006 IEEE International Conference on Industrial Technology, 69–74. IEEE.
Lv, Jinna, Qi Shen, Mingzheng Lv, Yiran Li, Lei Shi, and Peiying Zhang. 2023. “Deep Learning-Based Semantic Segmentation of Remote Sensing Images: A Review.”Frontiers in Ecology and Evolution 11: 1201125.
Maurya, Rohit, PR Gupta, and Ajay Shankar Shukla. 2011. “Road Extraction Using k-Means Clustering and Morphological Operations.” In 2011 International Conference on Image Information Processing, 1–6. IEEE.