ERA5-Land provides global, hourly, high-resolution information on climate variables, produced by the Copernicus Climate Change Service (C3S) at the European Centre for Medium-Range Weather Forecasts (ECMWF). It contains most meteorological variables that we use, including wind, temperature, precipitation, and many more. The ERA5-Land dataset covers the period from 1950 to 5 days before the current date and is updated daily. A detailed description of the dataset can be found here: Overview.
However, ERA5-Land is a gridded dataset at 0.1° x 0.1° spatial resolution, which may not be what we need. Depending on the research question and the model, we often need to re-grid the data to a different spatial resolution.
Method 1: Re-grid GeoTIFF data using Python
Special thanks to Dr. Shanti Shwarup Mahto for his contribution to this section!
This method is intended to be used after downloading ERA5 data using Method 2 outlined in ERA5 Data Download.
This code allows you to re-grid the GeoTIFF data you downloaded and output it in netCDF format, which is what we use most of the time. The process runs locally, so remember to download all the data you need from your Google Drive and put it in the correct folder before running the code below. Here we show the process of re-gridding the ERA5-Land data to a 0.05° grid.
import rasterio
import numpy as np
from netCDF4 import Dataset
import os
from rasterio.warp import calculate_default_transform, reproject, Resampling
from tqdm import tqdm  # For progress bar

# Define the interval you want in degrees
interval = 0.05

# Define input and output folders
input_folder = 'ERA5_Hourly_raw'
output_folder = 'ERA5_Hourly_' + str(interval)

# Ensure the output directory exists
if not os.path.exists(output_folder):
    os.makedirs(output_folder)

# Define the desired latitude and longitude range
lat_min, lat_max = 0.125, 29.975
lon_min, lon_max = 95.025, 109.975

# Create target longitude and latitude arrays
lon = np.arange(lon_min, lon_max + interval, interval)
lat = np.arange(lat_max, lat_min - interval, -interval)  # Reverse latitude to correct upside-down issue

# Process each .tif file in the input folder
tif_files = [f for f in os.listdir(input_folder) if f.endswith('.tif')]
for tif_file in tqdm(tif_files, desc="Converting files"):
    print(f"Processing {tif_file}...")
    input_path = os.path.join(input_folder, tif_file)
    output_path = os.path.join(output_folder, tif_file[:-4] + '.nc')
    year = tif_file[9:-8]  # Extract the year from file names like ERA5Land_YYYYMMDD.tif

    # Open the .tif file using rasterio
    with rasterio.open(input_path) as src:
        # Define the transform for the desired resolution and bounds
        dst_transform, width, height = calculate_default_transform(
            src.crs, src.crs, len(lon), len(lat),
            left=lon_min, bottom=lat_min, right=lon_max, top=lat_max)

        # Create an empty array for the reprojected data
        param = np.empty((src.count, height, width), dtype=np.float32)

        # Reproject each band using bilinear interpolation
        for i in tqdm(range(src.count), desc="Reprojecting bands", leave=False):
            reproject(source=src.read(i + 1),
                      destination=param[i],
                      src_transform=src.transform,
                      src_crs=src.crs,
                      dst_transform=dst_transform,
                      dst_crs=src.crs,
                      resampling=Resampling.bilinear)

    # Define time array (one entry per band)
    time = np.arange(param.shape[0])

    # Create the NetCDF file
    with Dataset(output_path, 'w', format='NETCDF4') as nc:
        # Create dimensions
        nc.createDimension('longitude', len(lon))
        nc.createDimension('latitude', len(lat))
        nc.createDimension('time', len(time))

        # Create variables
        longitude = nc.createVariable('longitude', 'f4', ('longitude',))
        latitude = nc.createVariable('latitude', 'f4', ('latitude',))
        times = nc.createVariable('time', 'i4', ('time',))
        param_var = nc.createVariable('2m_temperature', 'f4',
                                      ('time', 'latitude', 'longitude'),
                                      zlib=True, complevel=4)

        # Assign data to variables
        longitude[:] = lon
        latitude[:] = lat
        times[:] = time
        param_var[:, :, :] = param

        # Add attributes
        longitude.units = 'degrees_east'
        latitude.units = 'degrees_north'
        times.units = f'days since {year}-01-01'
        param_var.units = 'degree Celsius'

    print(f'NetCDF file created for {tif_file}')
Method 2: Re-grid the data using Climate Data Operators in Linux
Climate Data Operators (CDO) is a collection of command-line operators for manipulating and analyzing climate data, developed by the Max Planck Institute for Meteorology. CDO is an incredibly powerful tool for processing climate data, carrying out complex operations in just a line or two. For more information, check out Overview. If you are interested in a comprehensive guide to CDO, check out the tutorial here: User Guide. Unfortunately, CDO only works in a Linux environment, so you need to set up a Linux environment to use it and figure out a way to transfer files between Linux and Windows.
Installing CDO in Linux is easy:
sudo apt-get install cdo
In Windows, you will have to build it from source. Instructions are given here: CDO for Windows. I have not tested this, so feel free to give it a try and update the Lab Manual if it works!
In CDO, we use the remapbil function to re-grid the data, which performs a bilinear interpolation. This can be done in a single line:
cdo remapbil,targetgrid ifile ofile
The targetgrid part is slightly nuanced. Essentially, you will need a file (often .txt) that contains the number of rows and columns, the cell size, and the coordinates of the lower-left corner. Kindly approach me or Dr. Shanti for the grid description file. You can read more about it here: Re-gridding with CDO.
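For reference, the grid description file for a regular latitude-longitude grid uses CDO's standard keywords. A rough sketch matching the 0.05° domain from Method 1 (the values below are derived from that example; adjust them to your own domain) would look like this:

gridtype = lonlat
xsize    = 300
ysize    = 598
xfirst   = 95.025
xinc     = 0.05
yfirst   = 0.125
yinc     = 0.05

Save it as, for example, targetgrid.txt and pass the file name as the targetgrid argument in the command above.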
ERA5-Land provides global information on climate variables, produced by the Copernicus Climate Change Service (C3S) at the European Centre for Medium-Range Weather Forecasts (ECMWF). It is a gridded dataset at 0.1° x 0.1° spatial resolution and an hourly temporal resolution, and it contains most meteorological variables that we use, including wind, temperature, precipitation, and many more. The ERA5-Land dataset covers the period from 1950 to 5 days before the current date and is updated daily. A detailed description of the dataset can be found here: Overview.
Why do I need a script to download it?
Although there is a website for downloading the data, you will find that you cannot multi-select Year or Month using the website.
Method 1: Downloading data using Climate Data Store API
Disclaimer: Currently there seems to be a rather small limit on how much data you can download at once. If you need to bulk-download data, e.g., 10+ years, it is recommended to take the detour outlined in Method 2.
First, you will need an ECMWF account. You can register an account for free here: Registration. This will give you a personal API key. Install the Climate Data Store API in your local environment just like any other Python library:
pip install cdsapi
To set up the API, follow the instructions here: API Setup.
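In short, the setup boils down to creating a plain-text file named .cdsapirc in your home directory that holds the API URL and your personal key. The exact values depend on the current version of the Climate Data Store, so treat the following as a sketch and copy the real lines from your CDS profile page:

url: https://cds.climate.copernicus.eu/api
key: <your-personal-access-token>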
You can now request and download data using a Python script like the one below. The official website (Website) contains a helpful tool that generates the Python code for you, but you need to change some of the parameters due to the multi-select limit mentioned above.
import cdsapi

dataset = "reanalysis-era5-land"
request = {
    "variable": [
        "2m_temperature",
        "10m_u_component_of_wind",
        "10m_v_component_of_wind",
        "total_precipitation"
    ],  # Check the variable name on the official website
    "year": ["2021", "2022", "2023"],  # list of years
    "month": ["01", "02", "03", "04", "05", "06",
              "07", "08", "09", "10", "11", "12"],  # list of months
    "day": ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10",
            "11", "12", "13", "14", "15", "16", "17", "18", "19", "20",
            "21", "22", "23", "24", "25", "26", "27", "28", "29", "30",
            "31"],  # list of days
    "time": ["00:00", "01:00", "02:00", "03:00", "04:00", "05:00",
             "06:00", "07:00", "08:00", "09:00", "10:00", "11:00",
             "12:00", "13:00", "14:00", "15:00", "16:00", "17:00",
             "18:00", "19:00", "20:00", "21:00", "22:00", "23:00"],  # list of timestamps
    "data_format": "netcdf",
    "download_format": "zip",
    "area": [90, -180, -90, 180]  # North, West, South, East
}

client = cdsapi.Client()
client.retrieve(dataset, request).download()
Method 2: Downloading data using Google Earth Engine
The Google Earth Engine method is slightly more complicated, but it allows us to bypass the download limit in Method 1. You can find out more about the dataset here: Catalog. Unfortunately, Google’s website only gives you instructions in JavaScript, so let’s take a look at how to download the data in Python.
Step 1: Create an Earth-Engine-Enabled Google Cloud Project
Google requires a Cloud Project to use the Google Earth Engine authentication flow. Create one here: Create Google Cloud Project. Remember the name of your project; it will be needed later when calling the API.
You will then need to enable the Google Earth Engine API for the project you just created here: Enabling API for Your Project. Make sure you are signed in and double-check the project name in the upper left-hand corner.
Step 2: Authenticate inside Python
First install the Google Earth Engine API in your Python environment:
pip install earthengine-api
You will then need to import and authenticate the API. Run the following code:
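A minimal authentication block using the standard Earth Engine calls looks like this; replace your-project-name with the Cloud project you created in Step 1:

import ee

ee.Authenticate()  # Prompts you to follow a URL and paste back an authorization code
ee.Initialize(project='your-project-name')  # The Cloud project from Step 1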
You will be prompted to follow a URL in your Python IDE to generate a code, and then paste it into a box in your Python IDE. After that, you are done! One thing to note: the authentication expires after being idle for one week, so if you come back to your project after a while, you may need to authenticate again.
Step 3: Download the data
Special thanks to Dr. Shanti Shwarup Mahto for his contribution to this section!
After authenticating the Google Earth Engine API, you can download the data using the code below. Remember to run the authentication block before this!
import ee
import os
from datetime import datetime, timedelta

# Define the Area of Interest (AOI)
geometry = [95, 30, 110, 0]  # Top-left (lon, lat) and Bottom-right (lon, lat) coordinates
geometry = ee.Geometry.Rectangle(geometry)

# Change this to your working directory
dtdr = '/'
os.chdir(dtdr)
data_download_directory1 = 'ERA5_Hourly_raw'  # Folder name in your Google Drive; if the folder does not exist, it will be created.

start_date = datetime(2009, 7, 1)
end_date = datetime(2018, 12, 31)

current_date = start_date
while current_date <= end_date:
    next_date = current_date + timedelta(days=1)
    dataset = ee.ImageCollection('ECMWF/ERA5_LAND/HOURLY') \
        .filterDate(current_date.strftime('%Y-%m-%d'), next_date.strftime('%Y-%m-%d')) \
        .filterBounds(geometry)
    count = dataset.size().getInfo()
    print(f"Number of images between {current_date.strftime('%Y-%m-%d')} and {next_date.strftime('%Y-%m-%d')}: {count}")

    # Select the variable and convert from Kelvin to degrees Celsius
    def process_image(image):
        layer1 = image.select('temperature_2m').subtract(273.15)
        return layer1.copyProperties(image, ['system:time_start'])

    dataset = dataset.map(process_image)
    dataset = dataset.map(lambda image: image.clip(geometry))

    # Make a stack by combining all 24 hourly images into a single multi-band image
    hourly_stack = dataset.toBands()
    print(current_date)

    # ================================= Daily stack export parameters
    export_params1 = {
        'image': hourly_stack,
        'description': f"ERA5Land_{current_date.strftime('%Y%m%d')}",
        'scale': 11000,
        'fileFormat': 'GeoTIFF',
        'region': geometry,
        'crs': 'EPSG:4326',
        'folder': data_download_directory1,
        'maxPixels': 1e13,
        'formatOptions': {'cloudOptimized': True}
    }

    # Export the image as a cloud-optimized GeoTIFF to Google Drive
    task = ee.batch.Export.image.toDrive(**export_params1)
    task.start()
    print(f"Exporting daily file: {current_date.strftime('%Y-%m-%d')}")

    current_date = current_date + timedelta(days=1)
A caveat of this is that the data is downloaded to your Google Drive instead of locally, so you will need to retrieve it from your Google Drive. Another caveat is that this code only creates a series of export requests; it does not download the files directly. This means that the files are not downloaded (yet!) when the code has finished running. Depending on the size of your request, your internet speed, and the Google Earth Engine server, you may need to wait a significant while (more than an hour) before all the requested files show up in your Google Drive. Finally, this code downloads files in GeoTIFF format, which may or may not be what you need. To post-process this, refer to the tutorial on How to Re-grid ERA5 Climate Data.
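If you would rather monitor the export tasks from Python than from the Earth Engine web console, a small sketch using the task-listing API looks like this:

import ee

# Assumes ee.Authenticate()/ee.Initialize() have already been run.
# Print the description and state (READY, RUNNING, COMPLETED, FAILED) of recent tasks.
for task in ee.batch.Task.list()[:10]:
    status = task.status()
    print(status.get('description'), '-', status.get('state'))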
A first introduction to Python virtual environments
Edited by: Phumthep Bunnak
The problem: dependency hell
Imagine you’re juggling two Python projects simultaneously. Your first project, analyzing historical stock data, relies on a specific version of the Pandas package (version 1.5.3, let’s say). Your second project, a cutting-edge machine learning model, demands the latest features of Pandas 2.1.0.
Here’s where things get tricky: a critical function in Pandas has changed between these two versions. In 1.5.3, the function expects a certain argument order; in 2.1.0, the order is completely different. Trying to run both projects in the same environment can lead to errors, crashes, and headaches. This is a classic example of “dependency hell”.
The solution: virtual environments
A virtual environment in Python is a self-contained directory that houses a specific Python installation (interpreter/version) and a set of packages independent of other Python environments. In other words, each virtual environment has its own:

Python interpreter. This makes it easy to have multiple Python versions on your machine simultaneously.

Specific versions of Python packages. A Python package (e.g., Pandas) is not shared across environments.
Python virtual environments are isolated workspaces for your projects, preventing package and version conflicts. This isolation extends to software installed in other virtual environments and the default Python installation that might come with your operating system. Each environment is disposable and not tracked by version control systems like Git [1]. You can customize each virtual environment to match a project’s specific requirements, ensuring ease of deployment and reproducibility across machines. Having a distinct virtual environment for each project helps maintain a clean and streamlined project workflow, free of dependency issues.
Several tools can create and manage virtual environments, but conda shines for our research group’s coding projects because it has a large community of users, making it easy to find support. Ultimately, the choice of virtual-environment manager depends on your project needs.
Why choose Anaconda over Pip (and when to use both)
If Python comes with pip, a perfectly functional package installer, why bother with conda? The short answer is that conda offers capabilities beyond pip’s scope. While pip solely manages Python packages, conda manages not only Python packages but also Python versions themselves, along with non-Python dependencies like C/C++ libraries often required by scientific computing or data analysis tools. Using pip outside a virtual environment can lead to conflicts with other system-wide Python applications, as pip installs packages globally. In contrast, conda ensures that your project’s dependencies remain isolated and don’t interfere with other installations. Additionally, conda’s intelligent solver automatically identifies and resolves conflicts among packages, saving time by avoiding the need to manually install and remove multiple dependencies.
It’s important to note that conda and pip are not mutually exclusive and can be used together effectively. In fact, a conda environment comes with pip pre-installed, enabling their simultaneous use in some workflows. The recommended practice is to first install all necessary packages using conda. Anaconda boasts curated channels with a wide selection of Python packages. However, if a specific package is not available through Anaconda channels, you can easily switch to the bundled pip within your conda environment to install the package from the PyPI ecosystem.
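As a sketch of this workflow (the environment and package names are only illustrative):

conda activate my_project_env
conda install -c conda-forge pandas
pip install some-pypi-only-package  # fall back to pip when a package is missing from conda channels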
Creating Virtual Environments with Anaconda (3 Ways)
We will now explore three cases for creating a virtual environment using Anaconda.
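For instance, the most basic case, creating a fresh environment from the command line, typically looks like this (the environment name and version numbers are illustrative):

conda create --name stock_analysis_v1 python=3.10
conda activate stock_analysis_v1
conda install pandas=1.5.3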
An environment.yml file is similar to requirements.txt but often contains more detailed dependency information.
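As an illustration, a minimal environment.yml (the names and versions are examples) might look like the following, and the environment can then be built with conda env create -f environment.yml:

name: stock_analysis_v1
channels:
  - conda-forge
dependencies:
  - python=3.10
  - pandas=1.5.3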
Best Practices
Keep your projects organized and prevent conflicts by having one environment per project
Use descriptive names for environments (e.g., stock_analysis_v1) to avoid confusion
Keep an updated environment.yml or requirements.txt file when developing a project
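A common way to keep that file current is to export the active environment; the --from-history flag records only the packages you explicitly requested, which gives a leaner, more portable file:

conda env export > environment.yml
conda env export --from-history > environment.yml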
[Bonus] Using pip to install a local project to a conda environment
Instead of just installing packages from online sources, pip allows you to install Python projects directly from your local machine. This setup is especially useful when you have developed your own packages or are working with code not yet published online. Here’s how to do it within a conda environment.
Activate an environment (or create one if you haven’t)
conda activate <your_environment_name>
Navigate to your project folder. Use your terminal to move to the root directory of your local project containing the setup.py or pyproject.toml file.
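If your project does not have one yet, a minimal pyproject.toml for a setuptools-based project could look like this (the project name and version are placeholders):

[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

[project]
name = "my_local_package"
version = "0.1.0"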
Install with pip using the following command:
pip install -e .
The -e flag (short for “editable”) installs your project in “development mode.” This means that any changes you make to your project’s code will be immediately reflected in the conda environment without needing to reinstall. With these steps, you can develop and use your local Python package!
Links
[1] For more details: https://docs.python.org/3/library/venv.html
A brief overview on my experience as a Section Editor for JWRPM
These months mark the end of my experience as a Section Editor for the Journal of Water Resources Planning and Management. Here’s a list of common issues I found in the papers I handled over the past years: