--- jupytext: text_representation: extension: .md format_name: myst format_version: 0.13 jupytext_version: 1.16.4 kernelspec: display_name: Python 3 (ipykernel) language: python name: python3 --- ```{raw-cell} --- editable: true hideCode: false hidePrompt: false raw_mimetype: '' slideshow: slide_type: '' --- --- title: "Learn to Write Pseudocode for Python Programming" excerpt: "Pseudcode can help you design data workflows by listing the individual workflow steps in plain language, so the focus is on the overall data process rather than on the specific code needed. Learn best practices for writing pseudocode when building data-processing workflows." last_modified: --- ``` +++ {"editable": true, "slideshow": {"slide_type": ""}} (intro-write-pseudocode)= # Introduction to Pseudocode :::{admonition} What you will learn :class: tip * Be able to approach a coding task with a modular, systematic approach. * Be able to write pseudocode. ::: Writing pseudocode is a powerful tool that you can use to plan, organize, and structure your code without worrying about the syntax of a specific programming language or the specific steps. Writing pseudocode helps you focus on clearly expressing the logical steps that your code needs to perform to process your data. Writing pseudocode before diving into actual coding will help you write better code from the start and hopefully reduce the number of times your code gets refactored or rewritten. ## Benefits of Pseudocode 1. **Clarifies logic**: Helps you outline your code’s structure without getting bogged down by specific syntax. 2. **Easier collaboration**: Allows you to communicate your plan for designing a workflow with people in different roles (if you are working collaboratively), in different scientific domains, or across different programming languages. 3. **Quick debugging**: You can often quickly identify problem areas or logical errors when writing pseudocode and consider how to address them before writing actual code. This will save you time! ## Using pseudocode with LLMs LLMs (like ChatGPT) can assist in converting pseudocode into actual Python code. By writing clear pseudocode, you can prompt the LLM to generate Python code that follows your logic, helping you focus on problem-solving and testing the generated code rather than obscure syntax. ## Example: processing Crossref data from JOSS papers You have inherited the code below; it's quite messy and hard to read ```{code-cell} ipython3 --- editable: true slideshow: slide_type: '' --- from datetime import datetime # Example list of papers with nested title, citations, and weird date format ps = [ {"title": ["P1"], "pub_date": "2023/05/10", "citations": [5]}, {"title": ["P2"], "pub_date": "2022/04/12", "citations": [3]}, {"title": ["P3"], "pub_date": "2021/03/15", "citations": [8]}, ] # Process each paper manually # P1 p = ps[0] if p["citations"]: pd1 = datetime.strptime(p["pub_date"], "%Y/%m/%d") cit1 = p["citations"][0] # Access first element directly print({"tit": p["title"][0], "pd": pd1}) # P2 x = ps[1] if x["citations"]: pd2 = datetime.strptime(x["pub_date"], "%Y/%m/%d") cit2 = x["citations"][0] # Access first element directly print({"t": x["title"][0], "pd": pd2}) # P3 z = ps[2] if z["citations"]: p3 = datetime.strptime(z["pub_date"], "%Y/%m/%d") cit3 = z["citations"][0] # Access first element directly print({"ttl": z["title"][0], "pdate": p3}) # Manually calculate the mean number of citations total_citations = cit1 + cit2 + cit3 mean_citations = total_citations / 3 print(f"Mean number of citations: {mean_citations}") ``` Before tackling the code above, break things down a bit using English rather than Python! The goal of the code above is to process citation data. In this case, you have a list of dictionaries to process. ### Step 1: Write pseudocode In the code above, each dictionary is processed individually. Ask yourself: 1. What steps are repeated in the code above? 1. Is there a better way to process the data more efficiently? Your pseudocode might look something like this: ```md Open a list of Python dictionary objects. Process the data in each list: * Extract the date * Extract the number of citations * Store the data in some format that makes it easier to process(don't print it) ``` ### Step 2: generate pseudocode that begins to consider Python syntax Next, you can choose whether you want to write cleaner Python code yourself or generate your code using an LLM. Below, you begin to flesh out the pseudocode, considering the types of Python data structures that might be most useful for storing the data. In this case, Pandas is a great option because it has a built-in mean method and handles tabular data well. ```md Open a list of Python dictionary objects. Create a Python loop to process the data in each list: * Extract the date * Extract the number of citations * Add the cleaned data to a list Turn the list into a Pandas DataFrame Calculate mean citations ``` ### Step 3: refine your pseudocode further Now that you have pseudocode, you can begin to fill in the code gaps! Keep your pseudocode steps. Add the code required to perform each step. ```{code-cell} ipython3 --- editable: true slideshow: slide_type: '' --- from datetime import datetime # Example list of papers with nested title, citations, and weird date format pubs = [ {"title": ["P1"], "pub_date": "2023/05/10", "citations": [5]}, {"title": ["P2"], "pub_date": "2022/04/12", "citations": [3]}, {"title": ["P3"], "pub_date": "2021/03/15", "citations": [8]}, ] # It's straightforward to convert a list to a DataFrame. # Create / initialize an empty list all_pubs = [] # Create a Python loop to process each publication the list for pub in pubs: # * Extract the date and add print statements for checks pub_date = datetime.strptime(pub["pub_date"], "%Y/%m/%d") print(pub_date) # * Extract the number of citations citation_count = pub["citations"][0] print(citation_count) # * Add the cleaned data to a list all_pubs.append({"pub_date": pub_date, "citation_count": citation_count}) print(all_pubs) # Turn the list into a Pandas DataFrame # Calculate mean publications using pandas.mean ``` +++ {"editable": true, "slideshow": {"slide_type": ""}} Based on the above, you can begin to clean up your workflow even further. You no longer need print statements. :::{tip} It can also be helpful to use a code formatter as you go to keep your code consistent. If you are working in Jupyter Lab, [Jupyter Lab code formatter is a great option.](https://jupyterlab-code-formatter.readthedocs.io/). ::: ```{code-cell} ipython3 --- editable: true slideshow: slide_type: '' --- from datetime import datetime import pandas as pd # Example list of papers with nested title, citations, and weird date format pubs = [ {"title": ["P1"], "pub_date": "2023/05/10", "citations": [5]}, {"title": ["P2"], "pub_date": "2022/04/12", "citations": [3]}, {"title": ["P3"], "pub_date": "2021/03/15", "citations": [8]}, ] all_pubs = [] for pub in pubs: pub_date = datetime.strptime(pub["pub_date"], "%Y/%m/%d") citation_count = pub["citations"][0] all_pubs.append({"pub_date": pub_date, "citation_count": citation_count}) # Turn the list into a Pandas DataFrame all_pubs_df = pd.DataFrame(all_pubs) all_pubs_df.head() ``` ```{code-cell} ipython3 --- editable: true slideshow: slide_type: '' --- # Calculate mean publications using pandas.mean mean_citations = all_pubs_df["citation_count"].mean() mean_citations ``` +++ {"editable": true, "slideshow": {"slide_type": ""}} :::{dropdown} Using the above pseudocode, ChatGPT provided the following code. ```python from datetime import datetime import pandas as pd def extract_data(papers): """ Extract publication date and number of citations from a list of papers. Parameters ---------- papers : list of dict List of paper dictionaries containing 'title', 'pub_date', and 'citations'. Returns ------- pd.DataFrame A DataFrame containing the extracted information. """ processed_data = [] for paper in papers: if paper["citations"]: pub_date = datetime.strptime(paper["pub_date"], "%Y/%m/%d") citations = paper["citations"][0] processed_data.append({ "title": paper["title"][0], "publication_date": pub_date, "citations": citations }) return pd.DataFrame(processed_data) # Example list of papers with nested title, citations, and date format papers = [ {"title": ["P1"], "pub_date": "2023/05/10", "citations": [5]}, {"title": ["P2"], "pub_date": "2022/04/12", "citations": [3]}, {"title": ["P3"], "pub_date": "2021/03/15", "citations": [8]}, ] # Extract the data into a DataFrame processed_papers = extract_data(papers) # Example of further processing or analysis (without printing) # e.g., Calculate the mean number of citations mean_citations = processed_papers["citations"].mean() # Store processed data and mean citations for further analysis processed_data_store = { "processed_papers": processed_papers, "mean_citations": mean_citations } ``` ::: +++ ### You now have a cleaner, working script! By following the above steps, you now have a clean working workflow that you can adapt and refactor even further. Let's pretend that you know you will need to add two things to your workflow: * You know that your data is going to be in a JSON file, rather than provided to you as a Python dictionary. * You also know that you will need to process multiple files. Have a look at the code below, consider using pseudocode to identify areas that you could clean up even further to support a multi-file workflow. ```{code-cell} ipython3 from datetime import datetime # Open the data in .JSON format rather than via an example pubs = [ {"title": ["P1"], "pub_date": "2023/05/10", "citations": [5]}, {"title": ["P2"], "pub_date": "2022/04/12", "citations": [3]}, {"title": ["P3"], "pub_date": "2021/03/15", "citations": [8]}, ] # if there are multiple files, I will need a processing step for each file all_pubs = [] for pub in pubs: # Could this step be a function or multiple functions that cleans the data? pub_date = datetime.strptime(pub["pub_date"], "%Y/%m/%d") citation_count = pub["citations"][0] all_pubs.append({"pub_date": pub_date, "citation_count": citation_count}) all_pubs_df = pd.DataFrame(all_pubs) mean_citations = all_pubs_df["citation_count"].mean() ``` +++ {"editable": true, "slideshow": {"slide_type": ""}} :::{todo} ## Add Multiple data files to your workflow Above you begin to think about the steps associated with creating a workflow for a single list of dictionaries. Using pseudocode helps you think through your logic clearly, while LLMs can assist by generating Python code based on your structure. This process is especially helpful when working on tasks like processing JOSS CrossRef data, where filtering, extracting, and calculating values are essential steps. ::: ```{code-cell} ipython3 ```