General programming

Markdown

Markdown is a lightweight markup language for creating formatted text using a plain-text editor. It is widely used for blogging, including our own Lab manual! You can use Markdown for many other things, such as creating slides and other types of presentation material (example).

To do list

  • Go through this guide that will introduce you to Markdown. If you want to practice your Markdown skills, consider writing a post or webpage for this manual!

Introduction to Python and R

In your daily work you will certainly make use of a general-purpose language, such as Matlab, Python, Julia, or R (with the caveat that R is geared more toward statistical analysis and data visualization). In our lab, the two most commonly used languages are Python and R. Being familiar with one (or both) of them is therefore important.

To do list

  • Install Python on your computer. The easiest way to do so is to install Anaconda, an open-source distribution of the Python and R programming languages for data science that aims to simplify package management and deployment. Anaconda, together with its graphical interface Anaconda Navigator, allows you to easily manage programming languages and packages.

  • Read the Anaconda documentation.

  • Familiarize yourself with Python. There are hundreds of guides available online; our suggestion is to use this Python tutorial.

  • Note that R can also be installed as stand-alone software (i.e., independently of Anaconda). If you plan to install it that way, simply visit the R Project for Statistical Computing website. We also recommend installing RStudio, an integrated development environment for R.

  • Familiarize yourself with RStudio and R.


Python coding standard

When working on a long-term collaboration project with others, adopting a consistent coding standard is key. This practice not only improves the readability and maintainability of the codebase, saving you from future headaches, but also develops industry-valued skills.

To do list

  • Familiarize yourself with the Google Python Style Guide. While the choice of a specific style guide is subjective, the Google Python Style Guide offers a comprehensive set of common rules along with clear explanations and examples (a small illustration follows this list).
  • Streamline your coding workflow by configuring your favorite IDE to automatically format your code with a tool such as Black.
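As a small illustration of what code written in that spirit looks like, here is a sketch of a short function with descriptive lower_case names, type annotations, and a Google-style docstring. The function itself is hypothetical; refer to the style guide for the authoritative rules.

def mean_annual_inflow(inflows_m3: list[float], num_years: int) -> float:
    """Computes the mean annual inflow volume.

    Args:
        inflows_m3: Inflow volumes in cubic meters, one entry per time step.
        num_years: Number of years spanned by the record.

    Returns:
        The mean annual inflow volume in cubic meters.

    Raises:
        ValueError: If num_years is not positive.
    """
    if num_years <= 0:
        raise ValueError("num_years must be positive.")
    return sum(inflows_m3) / num_years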

GitHub

GitHub is a platform that lets developers create, store, manage, and share their code. It is built on Git, providing distributed version control along with access control, bug tracking, feature requests, task management, continuous integration, and wikis for every project. It is commonly used to host open-source software development.

Our team relies heavily on GitHub, and all our software projects are hosted on the CIS Lab GitHub page, including the group manual. The expectation is that, over time, you will become a proficient GitHub user.

To do list

  • Open an account on GitHub.

  • Ask Stefano to give you access to the CIS Lab GitHub page.

  • Take the online training provided on the GitHub website. We recommend proceeding as follows:

    • Start with Introduction to GitHub
    • Take all modules on First day on GitHub
    • Take all modules on First week on GitHub (Code with Codespaces and Code with Copilot are optional)

How to organize a (GitHub) repository

A well-organized GitHub repository improves code readability, promotes collaboration, and supports reproducibility. Below are general guidelines for structuring a research-oriented repository.

your-project/
├── README.md           # Project overview and usage
├── LICENSE             # Open-source license (e.g., MIT, GPL)
├── .gitignore          # Files/folders to exclude from version control
├── environment.yml     # Conda environment (or use requirements.txt for pip)
├── src/                # Source code
│   └── main.py
├── notebooks/          # Jupyter or R Markdown notebooks
│   └── analysis.ipynb
├── data/
│   ├── raw/            # Raw input data (never modified)
│   └── processed/      # Cleaned data used in analysis
├── docs/               # Additional documentation (e.g., for GitHub Pages)
│   └── index.md
├── tests/              # Unit and integration tests
│   └── test_main.py
└── results/            # Figures, tables, model outputs, logs

Key Files and Their Roles

  • README.md: Overview, installation instructions, usage examples.
  • LICENSE: Declares how the code can be reused or modified.
  • .gitignore: Prevents tracking of temporary or large files (e.g., *.Rhistory, *.pyc, data/).
  • requirements.txt / environment.yml: Capture project dependencies (use renv for R projects).

Best Practices

  • Use meaningful file and function names.
  • Comment non-trivial code and include docstrings or roxygen documentation.
  • Do not commit large datasets. Instead, store them externally or regenerate them from raw sources.
  • Include at least minimal automated tests (tests/) and instructions on running them (see below).
  • Use GitHub Issues and Pull Requests to track progress and code changes.

To do list


GitHub Actions (Optional but Powerful)

GitHub Actions is a built-in automation tool that lets you run tasks whenever code is pushed or a pull request is opened, or on a custom schedule. For example, you can use Actions to:

  • Automatically run tests when someone pushes a change
  • Check that your code follows formatting rules
  • Deploy a website or update documentation
  • Re-run simulations on a schedule (e.g., daily or weekly)

A GitHub Action is configured in a YAML file (e.g., .github/workflows/test.yml) and can be set up to run code in Python, R, or Bash.

Example Use Case

A simple test.yml workflow might:

  1. Set up an R or Python environment
  2. Install your package or scripts
  3. Run unit tests from the tests/ folder
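For concreteness, here is a minimal sketch of what such a workflow file could look like, assuming a Python project whose dependencies are listed in requirements.txt and whose tests live in tests/ (adapt names and versions to your project):

# .github/workflows/test.yml (illustrative sketch)
name: tests

on:
  push:
    branches: [main]
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      # Check out the repository and set up a Python environment
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      # Install the project dependencies plus the test runner
      - name: Install dependencies
        run: pip install -r requirements.txt pytest
      # Run the unit tests in the tests/ folder
      - name: Run tests
        run: pytest tests/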

To do list

Start here:


Introduction to unit testing

Ensuring software reliability is an integral part of high-quality research, and unit tests help ensure the software behaves as expected. Unit testing focuses on verifying that the building blocks (small, modular units such as functions, classes, and methods) work correctly. Many open-source projects report their test coverage, i.e., the percentage of the codebase exercised by tests. While high coverage does not guarantee the software is bug-free, it gives users confidence that the code has been tested thoroughly. Ultimately, unit testing helps us catch bugs early in development, provides confidence when making code changes, and implicitly documents how individual code units are expected to behave.

In Python, popular unit testing frameworks include unittest and pytest. These frameworks automate the process of running tests and reporting results.
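To give a flavor of what a pytest-style test looks like, here is a minimal, self-contained sketch. The function and file names are hypothetical; in a real project the function under test would live in src/ and be imported by the test module.

# tests/test_stats.py (hypothetical example)
import pytest


def moving_average(values, window):
    """Return the simple moving averages of `values` for the given window size."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [sum(values[i:i + window]) / window for i in range(len(values) - window + 1)]


def test_moving_average_basic():
    # Windows [1, 2, 3] and [2, 3, 4] average to 2.0 and 3.0
    assert moving_average([1, 2, 3, 4], 3) == [2.0, 3.0]


def test_moving_average_rejects_bad_window():
    # An invalid window size should raise a clear error
    with pytest.raises(ValueError):
        moving_average([1, 2, 3], 0)

Running pytest from the repository root will discover and run these tests automatically.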

To do list

  • Read a tutorial on unit testing in Python here.
  • Familiarize yourself with how unit tests are written in practice by looking at PowNet 2.0.

Writing Software Documentation

Well-documented code accelerates collaboration, reduces onboarding time, and enhances reproducibility. In our lab, we pair in-code documentation with external documentation hosted on ReadTheDocs, using GitHub as the source of truth.

Types of Documentation

  1. Docstrings / Roxygen Comments
    These live inside the code and explain what each function or class does, what it expects as input, and what it returns.
    • Python: Use triple-quoted docstrings in either NumPy or Google format (a short NumPy-style example follows this list).
    • R: Use #' comments above each function (with {roxygen2}) to generate help files.
  2. README.md
    Every repository should have a clear and minimal README that includes:
    • A short project description
    • How to install dependencies
    • How to run key scripts or reproduce key figures
  3. Narrative Documentation
    For larger projects, we maintain full documentation websites that include:
    • Overview of the model or tool
    • Installation and setup instructions
    • Usage examples and tutorials
    • Auto-generated API reference from in-code docstrings
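A Google-style docstring is illustrated in the Python coding standard section above; for contrast, here is a short sketch of the NumPy-style layout (the function itself is hypothetical):

def storage_to_level(storage_m3, area_m2):
    """Convert reservoir storage to an approximate water level.

    Parameters
    ----------
    storage_m3 : float
        Storage volume in cubic meters.
    area_m2 : float
        Reservoir surface area in square meters, assumed constant.

    Returns
    -------
    float
        Approximate water level in meters.
    """
    return storage_m3 / area_m2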

GitHub & ReadTheDocs: How It Works

We use ReadTheDocs to host our documentation and GitHub to store the code and doc sources. ReadTheDocs connects to GitHub and builds the documentation automatically every time a change is pushed to the main branch (or another branch if configured). Each project includes a docs/ folder (for Sphinx) and a configuration file (conf.py). The docs are written in reStructuredText (.rst) or Markdown, and can include embedded code blocks, figures, equations, and links to the API.

To enable this:

  • The repo must be public (or have a ReadTheDocs subscription if private).
  • You must activate the repo on ReadTheDocs.org, link it to the GitHub repo, and configure a build environment (e.g., via requirements.txt or a .readthedocs.yaml file).
  • Optional: Use versioned docs (e.g., stable vs. dev branches).
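To give a rough idea of what the Sphinx side involves, here is a minimal docs/conf.py sketch; the project name, path, and theme are placeholders, and the real file is usually generated by sphinx-quickstart and then tailored to the project.

# docs/conf.py (illustrative sketch with placeholder values)
import os
import sys

# Make the package importable so autodoc can find the docstrings
sys.path.insert(0, os.path.abspath("../src"))

project = "your-project"
author = "CIS Lab"

extensions = [
    "sphinx.ext.autodoc",   # pull the API reference from docstrings
    "sphinx.ext.napoleon",  # parse NumPy- and Google-style docstrings
]

html_theme = "sphinx_rtd_theme"  # Read the Docs theme (requires the sphinx-rtd-theme package)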

Lab Example: PowNet

The PowNet repository uses this setup effectively. Its documentation is automatically built and hosted at https://pownet.readthedocs.io.

It includes:

  • A clear system overview
  • Installation instructions
  • Tutorials on how to run dispatch models
  • An API reference auto-generated from Python docstrings

Best Practices

  • Write docstrings as you code (not afterward).
  • Include usage examples, figures, or outputs where appropriate.
  • Keep narrative docs concise but complete.
  • Make sure your README links to the full docs.

Linux for research

GNU/Linux is the operating system that powers most of our research group’s computing clusters. Mastering the command line is required for running computational experiments. If you want a first dive into Linux, check out our blog post here. For a deeper dive, consider taking a short course on Bash scripting.

To do list


Cluster basics

The Cornell University Center for Advanced Computing (CAC) provides several computing resources. As part of the EWRS concentration, we have access to Hopper, a cluster of 22 compute nodes (c0001-c0022), each with dual 20-core Intel Xeon Gold 5218R CPUs @ 2.1 GHz and 192 GB of RAM. This is likely the first cluster you will use.

To do list (Getting started with Hopper)

  • To use Hopper, submit the request form to CAC. Also email Professor Vivek Srikrishnan to ask for his approval of the request.

  • While waiting for the approval, read and understand this guide to get started with Hopper.


Large-scale computing

Some of the computational experiments in our lab, such as simulation-optimization or uncertainty analysis, can take hours or days to complete if run sequentially. Large-scale computing refers to strategies that divide this work across multiple cores, CPUs, or even machines, enabling experiments to finish faster and scale to more realistic problem sizes.

One of the most widely used technologies for this is MPI (Message Passing Interface). MPI enables programs to run in parallel by distributing tasks to separate processes that communicate with each other. In our lab, we typically do not write raw MPI code. Instead, we use high-level libraries like mpi4py (Python) or configure batch jobs via SLURM on a cluster like Hopper.
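To make the pattern concrete, here is a minimal mpi4py sketch of the embarrassingly parallel setup we most often need: each process runs a disjoint subset of scenarios and rank 0 collects the results. The scenario count, file name, and run_scenario function are placeholders.

# run_scenarios.py (hypothetical) -- launch with: mpiexec -n 4 python run_scenarios.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()  # this process's ID, 0 .. size-1
size = comm.Get_size()  # total number of processes


def run_scenario(scenario_id):
    """Stand-in for an expensive simulation of one scenario."""
    return scenario_id ** 2


# Each rank takes every size-th scenario out of 100
my_results = [run_scenario(s) for s in range(rank, 100, size)]

# Gather the per-rank result lists on rank 0
all_results = comm.gather(my_results, root=0)
if rank == 0:
    flat = [r for chunk in all_results for r in chunk]
    print(f"Collected {len(flat)} results")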

When you might need this

  • Running thousands of ensemble simulations or inflow scenarios
  • Training multiple policies in parallel
  • Performing grid search over large parameter spaces

Tools and Languages

  • mpi4py: Python interface to MPI (used in some of our reservoir tools)
  • joblib or multiprocessing: For simpler shared-memory parallelism on a single node (see the sketch after this list)
  • SLURM: Job scheduler used on Hopper to queue and run parallel tasks
  • sbatch: SLURM command used to submit batch jobs (including job arrays) to the scheduler
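For workloads that fit on a single node, a minimal joblib sketch of the same idea looks like this; the evaluate function and parameter grid are placeholders.

# Shared-memory parallelism with joblib on one node (hypothetical example)
from joblib import Parallel, delayed


def evaluate(param):
    """Stand-in for an expensive model evaluation."""
    return param ** 2


params = range(20)

# n_jobs=-1 uses all cores available on the machine
results = Parallel(n_jobs=-1)(delayed(evaluate)(p) for p in params)
print(results)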

Example from our lab

In the PowNet repository, we use cluster computing to run large-scale power system dispatch simulations. These jobs are often parallelized across multiple scenarios using SLURM job arrays, which are configured with Bash scripts and submitted to Cornell’s Hopper cluster. While the actual SLURM scripts may not be included in the public repo, they are used internally and follow Hopper conventions. Refer to the Hopper guide for templates and examples.

To do list (Getting started with parallel computing)


