Introduction to Pseudocode#

What you will learn

  • Be able to approach a coding task with a modular, systematic approach.

  • Be able to write pseudocode.

Writing pseudocode is a powerful tool that you can use to plan, organize, and structure your code without worrying about the syntax of a specific programming language or the specific steps. Writing pseudocode helps you focus on clearly expressing the logical steps that your code needs to perform to process your data.

Writing pseudocode before diving into actual coding will help you write better code from the start and hopefully reduce the number of times your code gets refactored or rewritten.

Benefits of Pseudocode#

  1. Clarifies logic: Helps you outline your code’s structure without getting bogged down by specific syntax.

  2. Easier collaboration: Allows you to communicate your plan for designing a workflow with people in different roles (if you are working collaboratively), in different scientific domains, or across different programming languages.

  3. Quick debugging: You can often quickly identify problem areas or logical errors when writing pseudocode and consider how to address them before writing actual code. This will save you time!

Using pseudocode with LLMs#

LLMs (like ChatGPT) can assist in converting pseudocode into actual Python code. By writing clear pseudocode, you can prompt the LLM to generate Python code that follows your logic, helping you focus on problem-solving and testing the generated code rather than obscure syntax.

Example: processing Crossref data from JOSS papers#

You have inherited the code below; it’s quite messy and hard to read

from datetime import datetime

# Example list of papers with nested title, citations, and weird date format
ps = [
    {"title": ["P1"], "pub_date": "2023/05/10", "citations": [5]},
    {"title": ["P2"], "pub_date": "2022/04/12", "citations": [3]},
    {"title": ["P3"], "pub_date": "2021/03/15", "citations": [8]},
]

# Process each paper manually

# P1
p = ps[0]
if p["citations"]:
    pd1 = datetime.strptime(p["pub_date"], "%Y/%m/%d")
    cit1 = p["citations"][0]  # Access first element directly
    print({"tit": p["title"][0], "pd": pd1})

# P2
x = ps[1]
if x["citations"]:
    pd2 = datetime.strptime(x["pub_date"], "%Y/%m/%d")
    cit2 = x["citations"][0]  # Access first element directly
    print({"t": x["title"][0], "pd": pd2})

# P3
z = ps[2]
if z["citations"]:
    p3 = datetime.strptime(z["pub_date"], "%Y/%m/%d")
    cit3 = z["citations"][0]  # Access first element directly
    print({"ttl": z["title"][0], "pdate": p3})

# Manually calculate the mean number of citations
total_citations = cit1 + cit2 + cit3
mean_citations = total_citations / 3

print(f"Mean number of citations: {mean_citations}")
{'tit': 'P1', 'pd': datetime.datetime(2023, 5, 10, 0, 0)}
{'t': 'P2', 'pd': datetime.datetime(2022, 4, 12, 0, 0)}
{'ttl': 'P3', 'pdate': datetime.datetime(2021, 3, 15, 0, 0)}
Mean number of citations: 5.333333333333333

Before tackling the code above, break things down a bit using English rather than Python! The goal of the code above is to process citation data. In this case, you have a list of dictionaries to process.

Step 1: Write pseudocode#

In the code above, each dictionary is processed individually.

Ask yourself:

  1. What steps are repeated in the code above?

  2. Is there a better way to process the data more efficiently?

Your pseudocode might look something like this:

Open a list of Python dictionary objects.
Process the data in each list:
    * Extract the date
    * Extract the number of citations
    * Store the data in some format that makes it easier to process(don't print it)

Step 2: generate pseudocode that begins to consider Python syntax#

Next, you can choose whether you want to write cleaner Python code yourself or generate your code using an LLM. Below, you begin to flesh out the pseudocode, considering the types of Python data structures that might be most useful for storing the data.

In this case, Pandas is a great option because it has a built-in mean method and handles tabular data well.

Open a list of Python dictionary objects.
Create a Python loop to process the data in each list:
* Extract the date
* Extract the number of citations
* Add the cleaned data to a list
Turn the list into a Pandas DataFrame
Calculate mean citations

Step 3: refine your pseudocode further#

Now that you have pseudocode, you can begin to fill in the code gaps! Keep your pseudocode steps. Add the code required to perform each step.

from datetime import datetime

# Example list of papers with nested title, citations, and weird date format
pubs = [
    {"title": ["P1"], "pub_date": "2023/05/10", "citations": [5]},
    {"title": ["P2"], "pub_date": "2022/04/12", "citations": [3]},
    {"title": ["P3"], "pub_date": "2021/03/15", "citations": [8]},
]

# It's straightforward to convert a list to a DataFrame.
# Create / initialize an empty list

all_pubs = []
# Create a Python loop to process each publication the list
for pub in pubs:
    # * Extract the date and add print statements for checks
    pub_date = datetime.strptime(pub["pub_date"], "%Y/%m/%d")
    print(pub_date)
    # * Extract the number of citations
    citation_count = pub["citations"][0]
    print(citation_count)
    # * Add the cleaned data to a list
    all_pubs.append({"pub_date": pub_date, "citation_count": citation_count})
    print(all_pubs)
# Turn the list into a Pandas DataFrame
# Calculate mean publications using pandas.mean
2023-05-10 00:00:00
5
[{'pub_date': datetime.datetime(2023, 5, 10, 0, 0), 'citation_count': 5}]
2022-04-12 00:00:00
3
[{'pub_date': datetime.datetime(2023, 5, 10, 0, 0), 'citation_count': 5}, {'pub_date': datetime.datetime(2022, 4, 12, 0, 0), 'citation_count': 3}]
2021-03-15 00:00:00
8
[{'pub_date': datetime.datetime(2023, 5, 10, 0, 0), 'citation_count': 5}, {'pub_date': datetime.datetime(2022, 4, 12, 0, 0), 'citation_count': 3}, {'pub_date': datetime.datetime(2021, 3, 15, 0, 0), 'citation_count': 8}]

Based on the above, you can begin to clean up your workflow even further. You no longer need print statements.

Tip

It can also be helpful to use a code formatter as you go to keep your code consistent. If you are working in Jupyter Lab, Jupyter Lab code formatter is a great option..

from datetime import datetime

import pandas as pd

# Example list of papers with nested title, citations, and weird date format
pubs = [
    {"title": ["P1"], "pub_date": "2023/05/10", "citations": [5]},
    {"title": ["P2"], "pub_date": "2022/04/12", "citations": [3]},
    {"title": ["P3"], "pub_date": "2021/03/15", "citations": [8]},
]

all_pubs = []
for pub in pubs:
    pub_date = datetime.strptime(pub["pub_date"], "%Y/%m/%d")
    citation_count = pub["citations"][0]

    all_pubs.append({"pub_date": pub_date, "citation_count": citation_count})

# Turn the list into a Pandas DataFrame
all_pubs_df = pd.DataFrame(all_pubs)
all_pubs_df.head()
pub_date citation_count
0 2023-05-10 5
1 2022-04-12 3
2 2021-03-15 8
# Calculate mean publications using pandas.mean
mean_citations = all_pubs_df["citation_count"].mean()
mean_citations
np.float64(5.333333333333333)

Using the above pseudocode, ChatGPT provided the following code.

from datetime import datetime
import pandas as pd

def extract_data(papers):
    """
    Extract publication date and number of citations from a list of papers.

    Parameters
    ----------
    papers : list of dict
        List of paper dictionaries containing 'title', 'pub_date', and 'citations'.

    Returns
    -------
    pd.DataFrame
        A DataFrame containing the extracted information.
    """
    processed_data = []

    for paper in papers:
        if paper["citations"]:
            pub_date = datetime.strptime(paper["pub_date"], "%Y/%m/%d")
            citations = paper["citations"][0] 
            processed_data.append({
                "title": paper["title"][0],
                "publication_date": pub_date,
                "citations": citations
            })

    return pd.DataFrame(processed_data)

# Example list of papers with nested title, citations, and date format
papers = [
    {"title": ["P1"], "pub_date": "2023/05/10", "citations": [5]},
    {"title": ["P2"], "pub_date": "2022/04/12", "citations": [3]},
    {"title": ["P3"], "pub_date": "2021/03/15", "citations": [8]},
]

# Extract the data into a DataFrame
processed_papers = extract_data(papers)

# Example of further processing or analysis (without printing)
# e.g., Calculate the mean number of citations
mean_citations = processed_papers["citations"].mean()

# Store processed data and mean citations for further analysis
processed_data_store = {
    "processed_papers": processed_papers,
    "mean_citations": mean_citations
}

You now have a cleaner, working script!#

By following the above steps, you now have a clean working workflow that you can adapt and refactor even further. Let’s pretend that you know you will need to add two things to your workflow:

  • You know that your data is going to be in a JSON file, rather than provided to you as a Python dictionary.

  • You also know that you will need to process multiple files.

Have a look at the code below, consider using pseudocode to identify areas that you could clean up even further to support a multi-file workflow.

from datetime import datetime

# Open the data in .JSON format rather than via an example 
pubs = [
    {"title": ["P1"], "pub_date": "2023/05/10", "citations": [5]},
    {"title": ["P2"], "pub_date": "2022/04/12", "citations": [3]},
    {"title": ["P3"], "pub_date": "2021/03/15", "citations": [8]},
]
# if there are multiple files, I will need a processing step for each file
all_pubs = []
for pub in pubs:
    # Could this step be a function or multiple functions that cleans the data?
    pub_date = datetime.strptime(pub["pub_date"], "%Y/%m/%d")
    citation_count = pub["citations"][0]
    all_pubs.append({"pub_date": pub_date, "citation_count": citation_count})

all_pubs_df = pd.DataFrame(all_pubs)
mean_citations = all_pubs_df["citation_count"].mean()