This study investigates the effectiveness of state-of-the-art deep learning models trained on high-resolution single-band satellite images in estimating site-level industrial development over time in the People's Republic of China.

These techniques are contrasted with Visible Infrared Imaging Radiometer Suite (VIIRS) nighttime lights as a proxy for site development.


Because data on development in inaccessible regions are often insufficient or difficult to obtain, remote sensing is an important tool for interested stakeholders to collect information on economic growth. To date, no studies have utilized deep learning to estimate industrial growth at the level of individual sites. In this study, we harness high-resolution panchromatic imagery to estimate development over time at 419 industrial sites in the People's Republic of China using a multi-tier computer vision framework. We present two methods for approximating development: (1) structural area coverage estimated through a Mask R-CNN segmentation algorithm, and (2) imputing development directly with visible and infrared radiance from the Visible Infrared Imaging Radiometer Suite (VIIRS). Labels generated from these methods are comparatively evaluated and tested. On a dataset of 2,078 50 cm resolution images spanning 19 years, the results indicate that two dimensions of industrial development can be estimated using high-resolution daytime imagery: (a) the total square meters of industrial development (average error of 0.021 square kilometers), and (b) the radiance of lights (average error of 9.8 milliwatts per square meter-steradian). Trend analysis reveals that estimates from a CNN-LSTM trained on Mask R-CNN labels track ground truth measurements most closely. The Mask R-CNN approach estimates positive growth at every site from the oldest image to the most recent, with an average change of 4,084 square meters.


Satellite imagery analysis using deep learning methods, specifically convolutional neural networks (CNNs), has grown in popularity since 2012, with uses extending into the estimation of population, wealth, poverty, conflict, migration, education, and infrastructure, among other applications. These techniques have broadly illustrated that harnessing satellites to remotely track development over time in otherwise data sparse regions is a potentially effective strategy.

One currently untested application of deep learning with satellite imagery is the identification and monitoring of industrial sites (e.g., factories, power plants, ports). The development of industrial sites is of broad interest, as it can serve as a proxy for everything from economic development to the projection of soft power. Because of its interrelationship with national security or proprietary corporate interests, information on such large-scale development is often undocumented or difficult for interested parties to obtain openly. This article focuses on testing our capability to automatically detect and monitor industrial sites within China using high-resolution panchromatic satellite imagery. Largely unrecorded in structured open source text information, the size and extent of industrial sites in China can be observed through routine or targeted satellite collection. From select sources, many locations appear, on average, at least yearly in cloud-free high-resolution imagery from satellite-based sensors over the past 15 years; some locations of interest have temporal granularity as fine as one day.

To date, no work has explored the use of machine learning methods trained on satellite imagery to estimate, and monitor over time, the development of particular economic industries at the scale of individual sites. In this work, the primary research question we seek to answer is, "How accurately can a convolutional deep learning system estimate development at industrial sites from high-resolution panchromatic imagery?" This article focuses on 2,078 images across 419 unique, known sites in the People's Republic of China from 2002 to 2021 (imagery access was supplied by NGA). Each image is labeled using a Mask R-CNN (MR-CNN) to estimate structure footprint coverage, based on a transfer learning parameter tuning approach with a subset of 182 manually digitized images. The MR-CNN's predictions on each of the 2,078 images are then used as labels for a CNN-LSTM (long short-term memory) model that estimates the total area covered by structures. The ultimate goal of this study is to explore the potential of a technique that accepts a single satellite image as input and estimates a single, total building footprint metric for that image (i.e., we do not seek to estimate the spatial location of buildings, only their total coverage). This approach enables users to identify regions where rapid industrialization (or de-industrialization) may be occurring for more detailed, qualitative analyses.

Data & Methodology

The methodology and data utilized in this study are summarized as follows:

  1. An MR-CNN is trained to measure the total area covered by structures using building location data from Shanghai (N=4,582 images) and fine-tuned with 182 high-resolution (ground sample distance ~0.5 m; a total of 545 million building pixels and 6.62 billion non-building pixels) satellite images at known factories, power stations, and ports in China
  2. The MR-CNN results are used to label 2,078 high-resolution (~0.5 m) images
  3. 1,822 of the resulting image-label pairs are used to train a CNN-LSTM, with the goal of accurately estimating the total footprint of development at an industrial site over time (but, notably, not the explicit pixel locations of that development)
  4. The effectiveness of the CNN-LSTM in estimating total structure area is tested on the remaining 256 images
  5. Results are compared to training the CNN-LSTM using low-resolution satellite-derived nighttime light (radiance) values as labels, serving as a secondary proxy for development

An overview of the deep learning methodology and architectures is illustrated in Fig. 1.

Sites of interest were selected using the GeoNames database as of September 2021, searching for "factory", "power station", and "port" within the country of China to filter results. This resulted in a sample of 419 industrial sites, encompassing 215 factories, 148 power stations (hydroelectric dams, coal plants, converter stations, etc.), and 56 ports. As seen in Fig. 2, the sites cover a diverse geographic range throughout China's mainland borders. For each of these locations, image scenes were retrieved from G-EGD, owned and operated by Maxar Technologies. Image tiles were selected from Maxar imagery based on the oldest archived strip available, spacing instances evenly to the newest strip, limiting cloud cover, and minimizing the off-nadir angle. Images were then cropped from the tiles based on an 800x800 meter (640,000 square meter) geographic square centered around the location's coordinates.

There are 2,078 total images covering 419 unique sites. Fig. 3 displays selections from two sites that underwent rapid growth. Fig. 4 shows the number of observations per site, with a median of five. These observations are relatively evenly distributed over time, with the number of instances per year shown in Fig. 5. Image resolution varies from 30 to 60 cm, with a median resolution of 50 cm. Ninety-five percent of the images are single-band panchromatic, while the remaining 5% are RGB. For input into the models, the RGB images were converted to single-band black and white images using the National Television System Committee (NTSC) luminance conversion formula, Y = 0.299R + 0.587G + 0.114B, where for a given pixel with components normalized 0 to 1, R is the red component, G is the green component, B is the blue component, and Y is the final converted luminance (also 0 to 1).
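Under the stated formula, the conversion can be sketched as follows (the function name and vectorized form are our own):

```python
import numpy as np

def rgb_to_luminance(rgb):
    """Apply the NTSC formula Y = 0.299*R + 0.587*G + 0.114*B.

    rgb: array of shape (..., 3) with components normalized to [0, 1].
    Returns the luminance Y, also in [0, 1].
    """
    return rgb @ np.array([0.299, 0.587, 0.114])
```

Because the three coefficients sum to exactly 1.0, a pure-white pixel maps to a luminance of 1.0 and the output stays within [0, 1].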

To train the MR-CNN, first, the algorithm, pre-trained on the MS COCO (Microsoft Common Objects in Context) dataset, is trained on 4,582 30 cm resolution panchromatic Worldview-3 images of Shanghai, China, and their associated building footprint masks. The images were upscaled from 640x640 pixels to 1024x1024. Each image depicts an area of 200x200 meters. Our MR-CNN hyperparameters were adopted from an MR-CNN trained on RGB images in the same Shanghai dataset, with only the image dimensions and region proposal network (RPN) anchor scales changed to accommodate the larger image size. Fig. 6A shows a typical example of the model's performance on a site in the Maxar dataset. At this stage, the model struggles to identify industrial-specific structures. Therefore, to improve detection performance, the MR-CNN is fine-tuned on 182 Maxar images (resized to 1024x1024) and their associated masks. Of the 182, 37 are held out for validation. The masks are created by manually geocoding the structures contained in the images. The 182 images were selected from across the latitudinal spectrum of the dataset, with the most recent example of each site chosen to maximize the number of structures available to geocode. Due to the frequent difficulty in distinguishing industrial structures from commercial and residential ones, all structures present in the images are geocoded (buildings, silos, warehouses, piers, dams, electrical power relays, and antennas); roads and bare pavement are excluded. A visual example of a geocoding is shown in Fig. 6B. The average precision, AP, at defined intersection over union (IoU) thresholds, T, is used to evaluate performance. IoU is defined as the area of overlap between the prediction and the mask divided by the area of their union.
The total area in square meters covered by structures in an image is then calculated using the number of pixels in the image classified as structure and the known geographic area depicted in each image (800x800 m). Overlaps in prediction masks were accounted for and removed in the calculation of structure pixels. With labels now generated, the CNN-LSTM model can be trained.
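The two quantities just described, the IoU used for evaluation and the pixel-count-to-area conversion, can be sketched as follows (function names are our own; the conversion assumes the fixed 800x800 m scene described above):

```python
import numpy as np

def mask_iou(pred, truth):
    """Intersection over union of two boolean masks:
    area of overlap divided by area of the union."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    union = np.logical_or(pred, truth).sum()
    return np.logical_and(pred, truth).sum() / union if union else 0.0

def structure_area_m2(structure_pixels, width_px, height_px,
                      scene_w_m=800.0, scene_h_m=800.0):
    """Convert a count of structure-classified pixels into square meters,
    given that every image depicts an 800 x 800 m geographic square."""
    m2_per_pixel = (scene_w_m / width_px) * (scene_h_m / height_px)
    return structure_pixels * m2_per_pixel
```

For example, at 50 cm resolution (a 1600x1600 pixel image), each pixel covers 0.25 square meters, so four structure pixels correspond to one square meter.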

As shown in Fig. 1B, the backbone of the CNN-LSTM is a ResNet50 pre-trained on 10-band, 10x10 meter resolution Sentinel-2 satellite imagery. During training and testing, each batch input into the network contains all (and only) the images of a single site. For input into the CNN, each image is downscaled through bilinear interpolation to a height of 516 pixels and a width of 426 pixels. These values are derived from the median image dimensions divided by 3.75 (a number converged upon through trial and error with GPU memory capacity). A second experiment is conducted in which only the oldest and most recent instances of each site are used, allowing for image dimensions of 1004x841. As illustrated in Fig. 1B, the CNN portion outputs to an LSTM, after which dimensions are reduced through three fully-connected layers. The final output represents the estimates of the total structural area for each instance of the site. Adam optimization and L1 loss are utilized to optimize parameters across the full architecture. A 75-12.5-12.5% split is carried out, with 1,556 images used in training, 256 in validation, and 256 in testing.

For a researcher interested in monitoring the development of industrial sites, a simpler approach might be to examine nighttime light intensity. Nighttime lights (NTL) have proven to be accurate for urban mapping and for estimating wealth and poverty at the city, state, and country levels. Here, we contrast the accuracy of a CNN-LSTM trained on nighttime light radiance labels with that of the previously presented MR-CNN labeling approach. Our nighttime lights-derived strategy is based on the Visible Infrared Imaging Radiometer Suite (VIIRS). Nightly, VIIRS collects visible and infrared global observations of Earth's land, atmosphere, and oceans at approximately 500x500 meter resolution at the equator (Fig. 7). Monthly VIIRS nighttime light composites are available beginning in April 2012, so only images in our dataset captured after this time are analyzed by the model (1,367 total; 988 training; 183 validation; 196 testing). For our analysis, a nighttime light label is determined by finding the maximum pixel value (radiance) of the monthly average that overlaps with the daytime image crop. Only VIIRS pixels that have at least half of their area overlapping with the daytime image are selected.
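Given a per-pixel radiance grid and each VIIRS pixel's fractional overlap with the daytime crop, the labeling rule above reduces to a masked maximum (a sketch; the function and array names are our own):

```python
import numpy as np

def ntl_label(radiance, overlap_frac):
    """Nighttime light label for one daytime image crop.

    radiance:     2-D monthly-average VIIRS radiance (mW per m^2-steradian)
    overlap_frac: 2-D fraction of each VIIRS pixel's area inside the crop
    Returns the maximum radiance over pixels with at least 50% overlap.
    """
    return float(radiance[overlap_frac >= 0.5].max())
```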

To directly compare nighttime light estimations with the MR-CNN-based estimations, a linear model of the form A = m*NTL + b is fit to approximate structural area, where NTL is either the predicted nighttime light value from the CNN-LSTM or the raw nighttime light value, m is the slope, and b is the y-intercept. To build the linear model, values for area are obtained from 181 ground truth, geocoded images (with one site excluded because its image predates 2012).
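The linear model A = m*NTL + b can be fit by ordinary least squares; a sketch with synthetic stand-in values follows (the real fit uses the 181 ground truth areas, which are not reproduced here):

```python
import numpy as np

# Hypothetical stand-in data; the study fits against 181 geocoded areas.
ntl = np.array([1.0, 5.0, 12.0, 30.0])   # radiance, mW per m^2-steradian
area = 2000.0 * ntl + 10000.0            # structural area, square meters

m, b = np.polyfit(ntl, area, 1)          # least-squares slope and intercept
predicted_area = m * ntl + b             # A = m*NTL + b
```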


Fig. 6C shows the MR-CNN result on a geocoded image after fine-tuning. Fig. 6D is a random result on an image not geocoded or included in fine-tuning. The average precision, AP, at various IoU thresholds, T, on the 37 validation images is shown in Table 1. Table 2 shows the AP at an IoU threshold of 0.3 grouped by image year on the entire 182-image geocoded dataset.

Using the process outlined in "Data & methodology", the total area of structures contained in each image in the Maxar dataset is computed using the MR-CNN. Across the 2,078 images, the average structural area is 48,106 square meters with a standard deviation of 27,753 square meters. The average difference between the most and least recent instance of each site is 4,894 square meters. This compares to an average structure area of 82,909 square meters in the geocoded dataset, with a standard deviation of 72,963 square meters.

Training the CNN-LSTM on the entire Maxar dataset yields a test loss of 21,399 square meters. Fig. 8 shows the validation loss trend during training. Training on only the oldest and most recent instances of each site with larger images generates a slightly improved loss of 20,890 square meters.

Determination of nighttime light values, as outlined in "Data & methodology", generates an average label of 13.0 milliwatts per square meter-steradian with a standard deviation of 21.5 milliwatts per square meter-steradian. The average difference between the labels on the most and least recent instance of each site is 3.1 milliwatts per square meter-steradian. Using these labels for CNN-LSTM training on the dataset produces an L1 loss of 9.8 milliwatts per square meter-steradian on the test dataset.

Fig. 9 shows how well NTL (raw labels and CNN-LSTM predictions) approximates structural area through linear regression. The R-squared values of the fit lines are 0.10 and 0.01 for Figs. 9A and 9B, respectively.


To compare four area estimation strategies against ground truth: (1) raw NTL labels used directly, (2) raw MR-CNN labels used directly, (3) the MR-CNN-based CNN-LSTM, and (4) the NTL-based CNN-LSTM, each technique is evaluated on the geocoded dataset from 2018 to 2021 (169 images corresponding to 169 sites). Fig. 10 displays the averaged estimations, grouped by year. Table 3 shows the percent change in area estimated by each technique.

Of particular note for our application is that, across our ground truth dataset, the total amount of developed land decreased. This likely reflects highly publicized industrial greening efforts being undertaken nationwide. As we are most interested in detecting these types of trends, it is important that the algorithm used for change detection is capable of identifying the correct trend (i.e., we focus on our ability to accurately predict the trend of development, not necessarily the accuracy of any individual absolute measurement of developed square meters). The average trends predicted by each technique are shown in Fig. 10; of particular note is that the MR-CNN-based CNN-LSTM is the only method that predicts the decline contained in the ground truth.

Using the MR-CNN-based CNN-LSTM on all 419 sites, Fig. 11 shows the estimated relative development change in two regions in China from the oldest to the most recent instance of each site. Every site is estimated by the model as having growth in development. The average change is +4,084 square meters.


Although the MR-CNN-based CNN-LSTM was the only technique able to accurately predict the negative trend in developed industrial land, this study also opens the door to numerous future inquiries. The MR-CNN precision results suggest limitations in the use of single-band satellite imagery with current state-of-the-art segmentation models. Our precision values contrast with other examples of object detection from high-resolution satellite imagery, where average precision values upwards of 0.94 at an IoU threshold of 0.5 have been reported. The relative ineffectiveness of the MR-CNN in this study most likely derives from the fact that the images contain a single band, as opposed to the multispectral imagery most often examined in the literature. The differentiated color information encoded in multispectral imagery gives the model more information to distinguish between similarly shaped features. For example, the inclusion of color is likely a key input in identifying buildings (e.g., grey), ponds/lakes (blue), and fields/woodland (green). A common inaccuracy during inference is the prediction of fields, forests, and ponds as industrial structures.

The labels and CNN-LSTM results derived from nighttime lights data show that 500x500 meter resolution VIIRS nighttime lights have difficulty capturing differentiation in development at the sub-km scale. Highlighting this is the fact that every NTL prediction by the CNN-LSTM on the 181-image geocoded set fell under 2.02 milliwatts per square meter-steradian (close to the median), despite there being significantly more range to the underlying values, as seen in Figs. 9A&B. This suggests the daytime images contain no significant features correlating with their respective nighttime light values.

Finally, this study would benefit from a much larger set of geocoded images, both for MR-CNN training and for trend analysis of the techniques. Increasing the number of geocoded images, such as geocoding two instances of each site, would provide more robust information for the MR-CNN to train on and, at least as importantly, allow ground truth development to be tracked at specific sites for contrast with the estimation techniques.


In answering the research question, "How accurately can a convolutional deep learning system estimate development at industrial sites from high-resolution panchromatic imagery?", the work presented in this article provides three core contributions to the literature. First, we offer a comparison of the effectiveness of various techniques for remotely estimating development at industrial sites using deep learning and satellite data. Second, we find that the resolution of current nighttime light sensors is generally insufficient to resolve development at the scale of individual industrial sites. Third, we provide evidence that panchromatic imagery is comparatively ill-suited to computer vision remote sensing (CVRS) object detection tasks.

A CNN-LSTM is able to resolve structural area and radiance to approximately 0.021 square kilometers and 10 milliwatts per square meter-steradian, respectively, at the tested industrial sites in China. The estimations from the NTL-based model approximate structural area with an R-squared of 0.01, while the raw labels alone approximate it with a higher R-squared of 0.10 (Fig. 9). As seen in Fig. 10, the hand-geocoded information reveals that more recent images of industrial sites have less structural area on average, a result driven either by the small sample size or by actual deindustrialization efforts in China. MR-CNN-based CNN-LSTM estimations reflect this trend, but NTL labels, MR-CNN labels, and NTL-based CNN-LSTM estimations do not. The labels generated by the MR-CNN, and the resulting predictions by the CNN-LSTM trained on those labels, indicate widespread growth over time when each site is tracked individually (average change of 4,084 square meters), as illustrated in Fig. 11.

Code Sharing

Please reach out via the LinkedIn profile link in the author section if you would like to experiment with the project code.

Look Ahead

The contributions of this study suggest meaningful directions for related future work. One such direction would be to test the methods detailed here with multispectral imagery, if available for these locations, or at locations where multispectral imagery is more readily accessible. Second, using the methods and imagery in this study, there may be significant improvements in MR-CNN (or related algorithms, e.g., U-Net) performance with a substantial addition to the number of carefully geocoded sites. A greater number of geocoded sites would also increase the sample size of ground truth observations, allowing more robust trend analysis of the various prediction techniques. Finally, topographic data such as digital surface models (DSMs) may be particularly useful as an additional model input to capture change in the built environment.

Things to Watch

• Remote high-resolution economic growth monitoring
• Vision transformers with satellite imagery