General programming
Markdown
Markdown is a lightweight markup language for creating formatted text using a plain-text editor. It is widely used for blogging, including our own Lab manual! You can use Markdown for many other things, such as creating slides and other types of presentation material (example).
To do list
- Go through this guide that will introduce you to Markdown. If you want to practice your Markdown skills, consider writing a post or webpage for this manual!
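To give a flavour of the syntax, here are a few common Markdown constructs (the link target is a placeholder):

```markdown
# Heading level 1
## Heading level 2

Some *italic* and **bold** text, `inline code`, and a [link](https://example.com).

- A bulleted item
- Another item

1. A numbered item
2. A second numbered item
```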
Introduction to Python and R
In your daily work you will certainly make use of a general-purpose language, such as Matlab, Python, Julia, or R (with the caveat that R is better suited to statistical analysis and data visualization). In our lab, the two commonly used languages are Python and R. Being familiar with one (or both) of them is therefore important.
To do list
- Install Python on your computer. The easiest way to do so is to install Anaconda, an open-source distribution of the Python and R programming languages for data science that aims to simplify package management and deployment. Anaconda, together with its interface Anaconda Navigator, allows you to easily manage programming languages and packages.
- Read the Anaconda documentation.
- Familiarize yourself with Python. There are hundreds of guides available online; our suggestion is to use this Python tutorial.
- Note that R can also be installed as a stand-alone software (i.e., independent of Anaconda). If you are planning to install it in such a way, simply visit the R Project for Statistical Computing website. We also recommend installing RStudio, an integrated development environment for R.
Python coding standard
When working on a long-term collaboration project with others, adopting a consistent coding standard is key. This practice not only improves the readability and maintainability of the codebase, saving you from future headaches, but also develops industry-valued skills.
To do list
- Familiarize yourself with the Google Python Style Guide. While the choice of a specific style guide is subjective, Google Python Style Guide offers a comprehensive set of common rules along with clear explanations and examples.
- Streamline your coding workflow by configuring your favourite IDE to automatically format your code with tools such as Black.
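For instance, Black can be configured per-project in `pyproject.toml`; a minimal sketch (the line length shown is Black's default, and the target version is illustrative):

```toml
[tool.black]
line-length = 88
target-version = ["py311"]
```

With this in place, running `black .` from the repository root formats all Python files consistently.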
GitHub
GitHub is a developer platform that allows developers to create, store, manage, and share their code. It is built on Git, providing distributed version control plus access control, bug tracking, software feature requests, task management, continuous integration, and wikis for every project. It is commonly used to host open-source software development.
Our team relies heavily on GitHub, and all our software projects are hosted on the CIS Lab GitHub page, including the group manual. The expectation is that, over time, you will become a proficient GitHub user.
To do list
- Open an account on GitHub.
- Ask Stefano to give you access to the CIS Lab GitHub page.
- Take the online training provided on the GitHub website. We recommend proceeding as follows:
  - Start with Introduction to GitHub
  - Take all modules in First day on GitHub
  - Take all modules in First week on GitHub (Code with Codespaces and Code with Copilot are optional)
How to organize a (GitHub) repository
A well-organized GitHub repository improves code readability, promotes collaboration, and supports reproducibility. Below are general guidelines for structuring a research-oriented repository.
Recommended Directory Structure
```
your-project/
├── README.md           # Project overview and usage
├── LICENSE             # Open-source license (e.g., MIT, GPL)
├── .gitignore          # Files/folders to exclude from version control
├── environment.yml     # Conda environment (or use requirements.txt for pip)
├── src/                # Source code
│   └── main.py
├── notebooks/          # Jupyter or R Markdown notebooks
│   └── analysis.ipynb
├── data/
│   ├── raw/            # Raw input data (never modified)
│   └── processed/      # Cleaned data used in analysis
├── docs/               # Additional documentation (e.g., for GitHub Pages)
│   └── index.md
├── tests/              # Unit and integration tests
│   └── test_main.py
└── results/            # Figures, tables, model outputs, logs
```
Key Files and Their Roles
- `README.md`: Overview, installation instructions, usage examples.
- `LICENSE`: Declares how the code can be reused or modified.
- `.gitignore`: Prevents tracking of temporary or large files (e.g., `*.Rhistory`, `*.pyc`, `data/`).
- `requirements.txt` / `environment.yml`: Captures project dependencies. Use `renv` for R.
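As an illustration, a minimal `environment.yml` might look like the following sketch (the project name and package list are placeholders, not a lab standard):

```yaml
name: your-project
channels:
  - conda-forge
dependencies:
  - python=3.11
  - numpy
  - pandas
  - pip
  - pip:
      - some-pip-only-package  # placeholder for packages unavailable on conda
```

Running `conda env create -f environment.yml` then recreates the same environment on any machine.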
Best Practices
- Use meaningful file and function names.
- Comment non-trivial code and include docstrings or roxygen documentation.
- Do not commit large datasets. Instead, store them externally or regenerate them from raw sources.
- Include at least minimal automated tests (`tests/`) and instructions on running them (see below).
- Use GitHub Issues and Pull Requests to track progress and code changes.
To do list
- Read this write-up on organizing research code: How to structure a Python data science project
- Check out a good example, PowNet!
GitHub Actions (Optional but Powerful)
GitHub Actions is a built-in automation tool that lets you run tasks every time code is pushed, a pull request is opened, or on a custom schedule. For example, you can use Actions to:
- Automatically run tests when someone pushes a change
- Check that your code follows formatting rules
- Deploy a website or update documentation
- Re-run simulations on a schedule (e.g., daily or weekly)
A GitHub Action is configured in a YAML file (e.g., `.github/workflows/test.yml`) and can be set up to run code in Python, R, or Bash.
Example Use Case
A simple `test.yml` workflow might:
- Set up an R or Python environment
- Install your package or scripts
- Run unit tests from the `tests/` folder
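The steps above can be sketched as a workflow file like the following (the Python version and the `tests/` path are illustrative assumptions):

```yaml
# .github/workflows/test.yml
name: tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run unit tests
        run: pytest tests/
```

Once this file is pushed, GitHub runs the workflow automatically on every push and pull request, and the results appear in the repository's Actions tab.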
To do list
Start here:
Introduction to unit testing
Ensuring software reliability is an integral part of high-quality research, and unit tests help ensure the software behaves as expected. Unit testing focuses on verifying that the building blocks (small modular units such as functions, classes, and methods) work correctly. Many open-source projects report their test coverage, i.e., the percentage of the codebase exercised by tests. While high coverage does not guarantee the software is bug-free, it lends confidence to users that the code has been thoroughly tested. Ultimately, unit testing helps us squash bugs early in development, provides confidence when making code changes, and implicitly documents the way individual code units are expected to work.
In Python, popular unit testing frameworks include `unittest` and `pytest`. These frameworks automate the process of running tests and reporting results.
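As a minimal sketch of what a pytest test file looks like (the function under test is a hypothetical illustration, not lab code; in a real project it would be imported from your package):

```python
# tests/test_reservoir.py -- illustrative example.
# In a real project you would import the function instead, e.g.:
# from src.reservoir import release_volume

def release_volume(storage: float, demand: float) -> float:
    """Release the demanded volume, limited by the available (non-negative) storage."""
    return min(max(storage, 0.0), demand)

def test_release_meets_demand_when_storage_sufficient():
    assert release_volume(storage=100.0, demand=30.0) == 30.0

def test_release_limited_by_storage():
    assert release_volume(storage=10.0, demand=30.0) == 10.0

def test_release_never_negative():
    assert release_volume(storage=-5.0, demand=30.0) == 0.0
```

Running `pytest tests/` from the repository root discovers every `test_*` function and reports which pass or fail.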
To do list
- Read a tutorial on unit testing in Python here.
- Familiarize yourself with implementing unit tests by looking at PowNet 2.0
Writing Software Documentation
Well-documented code accelerates collaboration, reduces onboarding time, and enhances reproducibility. In our lab, we pair in-code documentation with external documentation hosted on ReadTheDocs, using GitHub as the source of truth.
Types of Documentation
- Docstrings / Roxygen Comments
  These live inside the code and explain what each function or class does, what it expects as input, and what it returns.
  - Python: Use triple-quoted docstrings in either NumPy or Google format.
  - R: Use `#'` comments above each function (with `{roxygen2}`) to generate help files.
- README.md
  Every repository should have a clear and minimal README that includes:
  - A short project description
  - How to install dependencies
  - How to run key scripts or reproduce key figures
- Narrative Documentation
  For larger projects, we maintain full documentation websites that include:
  - Overview of the model or tool
  - Installation and setup instructions
  - Usage examples and tutorials
  - Auto-generated API reference from in-code docstrings
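As a sketch of the Google docstring format mentioned above (the function itself is a hypothetical illustration, not lab code):

```python
def simulate_storage(inflow, release, initial_storage=0.0):
    """Simulate reservoir storage over time via a simple mass balance.

    Args:
        inflow: Sequence of inflow volumes, one per time step.
        release: Sequence of release volumes, one per time step.
        initial_storage: Storage at the start of the simulation.

    Returns:
        A list with the storage at the end of each time step
        (storage is floored at zero).

    Raises:
        ValueError: If inflow and release have different lengths.
    """
    if len(inflow) != len(release):
        raise ValueError("inflow and release must have the same length")
    storage, trajectory = initial_storage, []
    for q_in, q_out in zip(inflow, release):
        storage = max(storage + q_in - q_out, 0.0)
        trajectory.append(storage)
    return trajectory
```

Tools like Sphinx (with the `napoleon` extension) can parse this format directly into the auto-generated API reference.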
GitHub & ReadTheDocs: How It Works
We use ReadTheDocs to host our documentation and GitHub to store the code and doc sources. ReadTheDocs connects to GitHub and builds the documentation automatically every time a change is pushed to the `main` branch (or another branch if configured). Each project includes a `docs/` folder (for Sphinx) and a configuration file (`conf.py`). The docs are written in reStructuredText (`.rst`) or Markdown, and can include embedded code blocks, figures, equations, and links to the API.
To enable this:
- The repo must be public (or have a ReadTheDocs subscription if private).
- You must activate the repo on ReadTheDocs.org, link it to the GitHub repo, and configure a build environment (e.g., `requirements.txt` or `readthedocs.yml`).
- Optional: Use versioned docs (e.g., stable vs. dev branches).
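A minimal ReadTheDocs configuration file (v2 format) might look like the following sketch; the OS image, Python version, and `docs/requirements.txt` path are illustrative assumptions:

```yaml
# .readthedocs.yaml
version: 2

build:
  os: ubuntu-22.04
  tools:
    python: "3.11"

sphinx:
  configuration: docs/conf.py

python:
  install:
    - requirements: docs/requirements.txt
```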
Lab Example: PowNet
The `PowNet` repository uses this setup effectively. Its documentation is automatically built and hosted at https://pownet.readthedocs.io.
It includes:
- A clear system overview
- Installation instructions
- Tutorials on how to run dispatch models
- An API reference auto-generated from Python docstrings
Best Practices
- Write docstrings as you code (not afterward).
- Include usage examples, figures, or outputs where appropriate.
- Keep narrative docs concise but complete.
- Make sure your README links to the full docs.
Linux for research
GNU/Linux is the operating system behind most of our research group’s computing clusters. Mastering the command line is required for running computational experiments. If you want a first dive into Linux, check our blog post here. A deeper dive into this topic requires taking a short course on Bash scripting.
To do list
- Read our tutorial
- Take a Bash Scripting Tutorial.
Cluster basics
The Cornell University Center for Advanced Computing (CAC) provides several computing resources. As part of the EWRS concentration, we have access to Hopper, a cluster of 22 compute nodes (c0001-c0022), each with dual 20-core Intel Xeon Gold 5218R CPUs @ 2.1 GHz and 192 GB of RAM. This is likely the first cluster you will use.
To do list (Getting started with Hopper)
- To use Hopper, submit the request form to CAC. Also email Professor Vivek Srikrishnan to ask for his approval of the request.
- While waiting for the approval, read and understand this guide to get started with Hopper.
Large-scale computing
Some of the computational experiments in our lab, such as simulation-optimization or uncertainty analysis, can take hours or days to complete if run sequentially. Large-scale computing refers to strategies that divide this work across multiple cores, CPUs, or even machines, enabling experiments to finish faster and scale to more realistic problem sizes.
One of the most widely used technologies for this is MPI (Message Passing Interface). MPI enables programs to run in parallel by distributing tasks to separate processes that communicate with each other. In our lab, we typically do not write raw MPI code. Instead, we use high-level libraries like `mpi4py` (Python) or configure batch jobs via SLURM on a cluster like Hopper.
When you might need this
- Running thousands of ensemble simulations or inflow scenarios
- Training multiple policies in parallel
- Performing grid search over large parameter spaces
Tools and Languages
- `mpi4py`: Python interface to MPI (used in some of our reservoir tools)
- `joblib` or `multiprocessing`: For simpler parallelism on shared memory
- SLURM: Job scheduler used on Hopper to submit parallel tasks
- `sbatch`: Command-line tool to launch jobs across nodes
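As a minimal sketch of shared-memory parallelism with the standard library `multiprocessing` module (the scenario function here is a stand-in for a real, expensive model run):

```python
from multiprocessing import Pool

def run_scenario(inflow_scale: float) -> float:
    """Stand-in for an expensive simulation: builds a toy inflow series
    scaled by `inflow_scale` and returns its mean."""
    inflows = [10.0 * inflow_scale * (1 + 0.1 * t) for t in range(5)]
    return sum(inflows) / len(inflows)

if __name__ == "__main__":
    # e.g., dry, average, and wet inflow scalings run in parallel processes
    scenarios = [0.8, 1.0, 1.2]
    with Pool(processes=3) as pool:
        results = pool.map(run_scenario, scenarios)
    for scale, mean_inflow in zip(scenarios, results):
        print(f"scale={scale:.1f} -> mean inflow {mean_inflow:.2f}")
```

`Pool.map` behaves like the built-in `map`, but fans the calls out across worker processes; this pattern fits embarrassingly parallel workloads such as independent ensemble members.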
Example from our lab
In the `PowNet` repository, we use cluster computing to run large-scale power system dispatch simulations. These jobs are often parallelized across multiple scenarios using SLURM job arrays, which are configured with Bash scripts and submitted to Cornell’s Hopper cluster. While the actual SLURM scripts may not be included in the public repo, they are used internally and follow Hopper conventions. Refer to the Hopper guide for templates and examples.
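A SLURM job-array script of this kind typically looks like the following sketch. The job name, resource values, and the `run_scenario.py` driver script are illustrative assumptions; check the Hopper guide for the cluster's actual conventions and partitions:

```bash
#!/bin/bash
#SBATCH --job-name=dispatch
#SBATCH --array=0-99            # 100 array tasks, one per scenario
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=02:00:00
#SBATCH --output=logs/%x_%a.out # %x = job name, %a = array index

# Each array task runs one scenario, indexed by SLURM_ARRAY_TASK_ID
python run_scenario.py --scenario "$SLURM_ARRAY_TASK_ID"
```

Submitting the script with `sbatch run_array.sh` launches all 100 tasks, which the scheduler runs in parallel as nodes become available.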
To do list (Getting started with parallel computing)
- Read the SLURM section in the Hopper guide
- If using Python, read the mpi4py tutorial