Chapter 25 Leveraging ChatGPT for R Programming Assistance

25.1 Introduction

This class is a bit different to previous classes. Instead of a lecture, this will be a workshop with some discussion and plenty of hands-on use of new tools. Specifically, we’ll be focussing on using ChatGPT, a Large Language Model (LLM), to help with R programming tasks.

25.2 Overview

  • What is a Large Language Model (LLM) and how does it work?
  • The ethics of using LLMs
  • Use cases in R coding

25.3 What is a Large Language Model (LLM) and how does it work?

LLMs are a type of machine learning model designed to process and generate human-like language. They work with language in a way that is similar to how humans use it, but on a much larger scale and with the ability to process vast amounts of data quickly. These models use a neural network architecture, a type of algorithm loosely modeled on the structure of the human brain. LLMs are trained on massive datasets of text, such as books, news articles, and social media posts, in order to learn patterns and relationships within language. This process is known as pre-training, and it allows LLMs to develop a deep understanding of the structure and nuances of language. It is incredibly computationally intensive and costs MILLIONS of dollars to create a model. For example, OpenAI is reported to have spent an estimated $4.6 million to train GPT-3, using 285,000 processor cores and 10,000 graphics cards.

After an LLM has been pre-trained, it is fine-tuned for specific tasks. A finished model can be used for tasks such as answering questions, summarising text or generating new text. When you ask ChatGPT a question, it uses its knowledge of language to understand the meaning of the question, and then generates a response based on what it has learned from the training data.

25.4 Limitations

25.4.1 Limited knowledge

While LLMs like ChatGPT have impressive capabilities, they also have limitations. One of the biggest concerns with LLMs is the potential for bias in their training data: in a nutshell, an LLM can be expected to reflect any bias that is present in its training dataset. Another important limitation, which matters particularly for a rapidly developing field like R and its many packages, is that the LLM has no knowledge of developments that occur after its training data was collected (its knowledge cut-off). Its usefulness for, say, packages released after that cut-off will therefore be limited. In short, I would expect ChatGPT to be less useful for newer and less-used packages, because there will have been less input data for ChatGPT systems to work with.

25.4.2 Hallucination

Another important limitation is that LLMs can “hallucinate”. In the context of LLMs “hallucination” refers to a phenomenon where the model generates text that is not grounded in reality. Essentially, the model generates text that appears to be “imagined” or “made up” rather than being based on facts. For example, if you ask about something that the LLM has no knowledge of it will still often produce an answer, but the answer will be close to nonsense!

To illustrate a hallucination try the following prompt in ChatGPT:

Explain the actuary package in R. 
Summarise what it is for and give an example of basic use. 
Describe any limitations.

There is no such package in R, so the entire answer is a hallucination.

25.4.3 Numerical ability

Most LLMs are not good with numbers. This could have important consequences when asking for help with maths and statistics, so bear it in mind!

To illustrate that, try the following prompt.

Write a sentence about dogs that contains 8 + 2 words.

You might need to try it a few times, but you should see that ChatGPT is not good at counting words. It would fail at a task such as asking it to write a paragraph with a certain word count.

Now try this:

How many Rs are there in the word Strawberry?

25.5 Ethics of using LLMs in education

The use of tools like ChatGPT in student coursework raises ethical concerns. On the one hand, students can use these tools to enhance their learning and understanding of complex concepts; for example, they can generate helpful insights and help with writing assignments. On the other hand, they can encourage plagiarism or cheating, which could result in NOT learning.

Some guidelines:

  • Use ChatGPT as a supplement to your own learning and understanding of the course material. For example, by asking it to explain difficult concepts.
  • Don’t rely solely on ChatGPT to complete your assignments or study for exams (remember the hallucinations!).
  • When using ChatGPT to generate solutions or code, be sure that you understand and can explain the reasoning behind the output. Don’t simply copy and paste answers without understanding.
  • Give credit where credit is due. If you use ChatGPT to generate code or solutions, acknowledge the role of ChatGPT in your work.
  • Be aware of the limitations of ChatGPT. Double-check your work and verify any information you obtain from ChatGPT.
  • Use ChatGPT to enhance your learning experience, but don’t use it to cheat. Plagiarism and academic dishonesty are taken very seriously at SDU and can negatively impact your education and career.

The Guidelines for SDU Biology Master's students are here

Despite these issues, LLMs like ChatGPT can be very useful in R programming. In the next part of the workshop we’ll cover some use cases. I provide some examples here but during the workshop we’ll work through others that are relevant to the course. You could also try using your own project code.

25.6 Use cases in R

  • Finding errors (debugging)
  • Explaining complex code
  • Interpreting output (e.g. what does this model summary mean?)
  • Improving code (finding better ways to do things)
  • Translating code from another language
  • Solving modelling problems (e.g. I have this set-up, how do I model it?)

25.6.1 Finding errors

It can be frustrating to deal with error messages in R. Use ChatGPT to find the error and make it work.

A suitable prompt could be:

The following code doesn’t work. Please help me debug it.

mod1 <- glm(noffspring ~ weight, data = fox family = poisson)
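
For reference, the problem here is just a missing comma before the family argument; the corrected call (using the fox data frame as elsewhere in this chapter) looks like this:

# Corrected: a comma was missing between data = fox and family = poisson
mod1 <- glm(noffspring ~ weight, data = fox, family = poisson)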

25.6.2 Explaining code

You can ask for help understanding what code does:

I don’t understand what this code does. Please explain it to me. 
I'm a total beginner in using R and in statistics in general


mod1 <- glm(noffspring ~ weight, data = fox, family = poisson)

25.6.3 Interpreting output

You can ask for help interpreting model outputs.

I have the following model output summary. Please explain it to me:
I'm a total beginner in using R and in statistics in general


Analysis of Variance Table

Response: lengthMM
              Df Sum Sq Mean Sq F value    Pr(>F)    
genotype       1 426.02  426.02  99.575 7.135e-13 ***
diet           1 111.02  111.02  25.949 7.064e-06 ***
genotype:diet  1  54.19   54.19  12.665 0.0009073 ***
Residuals     44 188.25    4.28                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
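
For context, a table like this is what R prints from a call along the lines of the sketch below (the data frame name flies and its variables are assumed here purely for illustration):

# Hypothetical data frame 'flies' with columns lengthMM, genotype and diet
mod2 <- lm(lengthMM ~ genotype * diet, data = flies)
anova(mod2)   # prints an Analysis of Variance Table like the one above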

25.6.4 Translating code (e.g. from Python/Matlab to R)

You can ask it to translate code from one language to another. This can be useful if you find code written in, e.g., Python or Matlab and you need to make it work with the rest of your analysis.

Translate the following Python code into R. 

import numpy as np
from scipy import stats 
from numpy.random import seed
from numpy.random import randn
from numpy.random import normal
from scipy.stats import ttest_1samp

seed(1)
sample = normal(150, 10, 20)
print('Sample: ', sample)

t_stat, p_value = ttest_1samp(sample, popmean=155)
print("T-statistic value: ", t_stat)
print("P-Value: ", p_value)

25.6.5 Solving modelling problems

You can describe your data to ChatGPT and ask it to help you with modelling choices. This is probably quite error prone, so it is good practice to ask follow-up questions to make sure you understand WHY it is advising certain approaches. In a nutshell, be suspicious and treat ChatGPT as an advisor/sparring partner rather than blindly listening to it. Engage your own brain!
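
An illustrative prompt (entirely made up for this example) might be:

I have counted parasites on 60 fish from two species, and I also measured each fish's body length. 
The counts look overdispersed. I am working in R with Tidyverse packages. 
What kind of model would be appropriate, and why?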

25.6.6 Helping with documentation/comments

Sometimes you will have written a long script without bothering to comment it. This can be troublesome to “future you” and to collaborators who will struggle to understand what your code is doing. You can ask ChatGPT to add (or improve) comments to your code. Try it on scripts you have already written during the course.
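
As a small illustration, here is the kind of commented version ChatGPT might return for the model-fitting code used earlier in this chapter (the comments are illustrative):

# Fit a Poisson GLM: number of offspring modelled as a function of body weight
mod1 <- glm(noffspring ~ weight, data = fox, family = poisson)

# Show coefficient estimates, standard errors and significance tests
summary(mod1)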

25.6.7 Finding alternative/better ways

Sometimes you will have written a script which works but is too complicated and hard to follow. You could ask ChatGPT if there is an easier/better/alternative way to do it. Try it with some code you have already written.
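
For example, you could paste in a clunky loop like the first version below and ask for a simpler alternative; a typical suggestion would be the vectorised second version (illustrative only, using the built-in mtcars data):

# Original: calculate the mean of each column with a loop
col_means <- numeric(ncol(mtcars))
for (i in seq_along(mtcars)) {
  col_means[i] <- mean(mtcars[[i]])
}

# Suggested alternative: one vectorised call does the same job
col_means <- colMeans(mtcars)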

25.7 Some tips

Set package constraints: When using ChatGPT to help solve R problems, it often suggests using various R packages, including some that may not exist. To avoid confusion and ensure the advice aligns with your needs, it’s helpful to specify which packages you want to use.

For instance, if you’re focusing on data handling with the Tidyverse (which we use in the course), you can guide ChatGPT by adding “...Please use Tidyverse packages” to your request. If you prefer to work without additional packages, simply say “... use base R”.

Be Specific with Your Questions: Clearly describe your problem, including the desired outcome and any error messages or unexpected behaviour you encounter. The more details you provide, the more precise the help can be.

Share Code Snippets: Include relevant code snippets when asking for help. This allows for more accurate advice, whether you’re troubleshooting an issue, debugging, or seeking to improve your code.

State Your R Experience Level: Mention your familiarity with R. This helps in tailoring explanations to your level, whether you’re a beginner or an advanced user.

Iterative Approach: Use an iterative approach by testing small pieces of advice and asking follow-up questions. This is particularly useful for complex problems, but also for simpler questions when you don’t quite grasp the explanation or answer given.

Consider Reproducibility: When sharing data, use small, reproducible examples (often called a “reprex”) to illustrate your issue. This helps in providing more targeted help.
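
A reprex can be very small indeed; for example (entirely made up), a few rows of data created within the script itself plus the code that behaves unexpectedly:

# Minimal reproducible example: tiny data frame built in the script itself
dat <- data.frame(
  group  = c("a", "a", "b", "b"),
  weight = c(2.1, 2.5, 3.0, NA)
)

# Problem: mean() returns NA and I don't understand why
mean(dat$weight)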