Open-ended survey questions have always been a double-edged sword.
We know they yield lots of useful details about “why”, and we’d like to use them more in our research, but they’re such a pain to deal with, aren’t they?
The recent explosion of generative AI technologies has offered a glimmer of hope to those facing this pain. Indeed, many now believe that, with the application of generative AI, analyzing survey verbatim data is a solved problem - or will be within the next few years.
So, how true is this? Is it time to fire your coders and let ChatGPT do all the work? Can generative AI automatically analyze your verbatim data instantly, and at the push of a button?
Coding is the traditional process used to turn a set of unstructured verbatim text responses from a survey into something that can be analyzed and measured quantitatively.
To do this, firstly we need to extract the themes within the data. Secondly, we need to quantify the incidence of these themes so we can measure them.
Above all, it’s important we do this with accuracy and precision. The themes need to be precise enough to be actionable (e.g. “politeness of the receptionist”, not simply “customer service”) and they need to be quantified correctly (i.e. if 50 people mentioned the rude receptionist then the data needs to show a count of 50).
Without accuracy, our data is unreliable and dangerous for use in decision-making. Without precision, our analysis will lack nuance and won't be meaningful.
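To make the "quantify" step concrete, here is a minimal Python sketch of counting theme incidence once every response has been tagged. The response IDs and themes are hypothetical examples, not real survey data, and the function name `theme_incidence` is just an illustrative choice.

```python
from collections import Counter

# Hypothetical coded data: each response is tagged with every theme that applies.
coded_responses = {
    "r1": {"politeness of the receptionist", "waiting time"},
    "r2": {"politeness of the receptionist"},
    "r3": {"waiting time"},
    "r4": set(),  # a response where no theme applies
}

def theme_incidence(coded):
    """Count how many responses mention each theme."""
    counts = Counter()
    for themes in coded.values():
        # themes is a set, so a response adds at most 1 to any theme's count
        counts.update(themes)
    return counts

counts = theme_incidence(coded_responses)
total = len(coded_responses)
for theme, n in counts.most_common():
    print(f"{theme}: {n} of {total} responses ({n / total:.0%})")
```

Storing the codes as a set per response keeps the count honest: someone who mentions the receptionist three times in one answer still counts as one person.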
Large language AI models, like ChatGPT, LLaMA and Bard, are very good at summarizing text. In many ways that is their primary purpose - to take a vast amount of text on the internet and condense it into a representative model.
So, if you pass a set of verbatims to one of these models and ask it to summarize the main themes found within, it’ll do a decent job. It will return a list of rich, human-like phrases which generally encapsulate the main themes it finds. You can even ask follow-up questions, request examples, and generally dig deeper into the data.
If your goal is simply to get a high-level read of your data and derive a sense of the main themes then it’s a really great tool. Arguably, it’s more effective than the commonly used approach of simply skimming through a set of verbatims trying to “get the gist” of what they contain.
The problem with generating a list of the “main themes” is that this is still more qualitative than quantitative. If your goal is to produce a robust quantitative analysis then things get a bit trickier.
In order to quantify your themes (i.e. exactly how many people said Theme X vs Theme Y) you need to “code” each response and specifically tag each verbatim with each theme that applies.
It turns out that generative AI is not so good at performing this fiddly task.
In our experiments, given a typical set of verbatims and a specified codeframe, GPT was able to autocode only around 10%-20% of verbatims at the accuracy levels required by real-world market researchers.
So, if you try to use it to quantify the themes in your data, you will find gaps where it has missed themes and places where it has miscategorized them. And, often, you will find that the results are not consistently repeatable - which is clearly a problem when you’re trying to do quantitative research and you need comparability and reliability.
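If you want to check this on your own data, one simple approach is to compare the codes from an AI run against a human-coded reference, and against a second run on the same verbatims. The sketch below is one possible way to do that, not the method behind the figures in this article. The data is made up, and "exact match" (the full set of codes must be identical) is an assumed, fairly strict definition of agreement.

```python
def exact_match_rate(codes_a, codes_b):
    """Share of verbatims whose set of codes is identical in both codings."""
    shared_ids = codes_a.keys() & codes_b.keys()
    if not shared_ids:
        return 0.0
    same = sum(1 for vid in shared_ids if codes_a[vid] == codes_b[vid])
    return same / len(shared_ids)

# Hypothetical codes: a human reference plus two AI runs on the same verbatims.
human = {"v1": {"price"}, "v2": {"service", "price"}, "v3": {"delivery"}, "v4": set()}
run_1 = {"v1": {"price"}, "v2": {"service"}, "v3": {"delivery"}, "v4": set()}
run_2 = {"v1": {"price"}, "v2": {"service", "price"}, "v3": {"service"}, "v4": set()}

# Accuracy: how often does run 1 agree with the human coder?
accuracy_run_1 = exact_match_rate(run_1, human)

# Repeatability: how often do two runs agree with each other?
repeatability = exact_match_rate(run_1, run_2)

print(f"Run 1 vs human: {accuracy_run_1:.0%}")
print(f"Run 1 vs run 2: {repeatability:.0%}")
```

In this toy data, run 1 misses a theme on v2 and run 2 miscategorizes v3, which are the two kinds of error described above.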
You don’t need to take my word for it. OpenAI has made ChatGPT available for free to everyone, so you can easily try this same experiment for yourself. You will find that it’s easy to get a broad summary, but hard to get an accurate quantitative analysis.
GPT came into the public consciousness in early 2023, largely because GPT3 was so much more powerful and effective than GPT2 (which barely registered on the public radar).
The numbers above are based on our experiments using GPT3, which leads to the obvious question - how do the results change if we use GPT4 instead? Is it any better at autocoding verbatims? Can we extrapolate a path into the future based on the progression from GPT3 to GPT4?
In our experiments, GPT4 is able to autocode around 30% - 40% of verbatims, at an acceptable level of accuracy.
This is a very interesting finding because a) it shows that the technology is definitely improving and is able to offer increasing support to the coding process, and b) it shows that it is still a long way from doing a perfect job automatically. If something like GPT is able to autocode 40% of your verbatims accurately that's great, but the majority of your verbatims are still uncoded. To tackle that, you need to involve people in the process.
“Human in the loop”, “AI augmentation”, “human-led” - whatever you want to call it, the message is the same: AI has arrived but we need to view it as an assistant, rather than a complete human replacement.
AI can provide a useful starting point, but it’s people who have the domain knowledge, nuanced understanding and appreciation of client objectives that allow them to appropriately interpret, refine and curate AI output.
As an example, suppose you needed to write a press release to announce your new range of vegan-friendly snacks. You could ask ChatGPT to write it for you, but would you take the output blind and send it out? Of course not. You would take the output, add your brand's tone of voice, refine it, pass it through a quality control process, finesse it - and then send it out.
The same is true during the coding process. Generative AI is very useful for producing a set of initial themes, and maybe autocoding some of those themes, but eventually, if you want an acceptable level of quality, you need to involve people so they can do what they do best with your data.
Behind the scenes, codeit builds a custom machine-learning model trained on the coded data created up to that point.
This model can then be used to autocode the remaining uncoded data (or new uncoded data that is imported later).
Whilst tools like GPT may be useful for kick-starting the coding process, it is only once a real person is involved in curating the output and teaching the system that we find a custom machine-learning model can significantly outperform off-the-shelf generative AI.
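To illustrate the general idea of training on human-coded data and then autocoding the rest, here is a toy Python sketch. It is not codeit's actual model: it assumes a simple naive Bayes text classifier, one theme per response (real coding is usually multi-label), a hypothetical confidence threshold of 0.8, and made-up training examples. Anything the model is not confident about is left for a person to review.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

def train(coded):
    """coded: list of (text, theme) pairs produced by human coders."""
    word_counts = defaultdict(Counter)
    theme_counts = Counter()
    for text, theme in coded:
        theme_counts[theme] += 1
        word_counts[theme].update(tokenize(text))
    return word_counts, theme_counts

def predict(model, text):
    """Return (most likely theme, probability) via naive Bayes with add-one smoothing."""
    word_counts, theme_counts = model
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(theme_counts.values())
    words = [w for w in tokenize(text) if w in vocab]  # ignore unseen words
    log_scores = {}
    for theme, n_docs in theme_counts.items():
        total_words = sum(word_counts[theme].values())
        score = math.log(n_docs / total_docs)
        for w in words:
            score += math.log((word_counts[theme][w] + 1) / (total_words + len(vocab)))
        log_scores[theme] = score
    # Convert log scores into probabilities that sum to 1.
    top = max(log_scores.values())
    weights = {t: math.exp(s - top) for t, s in log_scores.items()}
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())

def autocode(model, uncoded, threshold=0.8):
    """Autocode confident predictions; queue everything else for human review."""
    auto, review = {}, []
    for text in uncoded:
        theme, prob = predict(model, text)
        if prob >= threshold:
            auto[text] = theme
        else:
            review.append(text)
    return auto, review

# Hypothetical human-coded examples and new uncoded verbatims.
coded = [
    ("the receptionist was rude", "staff politeness"),
    ("rude staff at the desk", "staff politeness"),
    ("waited an hour to be seen", "waiting time"),
    ("long wait before my appointment", "waiting time"),
]
uncoded = ["receptionist was very rude", "parking was expensive"]

model = train(coded)
auto, review = autocode(model, uncoded)
print("Autocoded:", auto)
print("Needs review:", review)
```

The threshold is the design choice that matters here. Raising it means fewer verbatims are autocoded but those that are should be more reliable, and each response a person codes from the review queue becomes new training data.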
To summarize, you can expect the following levels of effectiveness and accuracy from GPT vs a custom machine-learning model on a typical survey:
                   Autocoding    Accuracy
ChatGPT            10% - 20%     80%
GPT4               30% - 40%     90%
Custom ML Model    60%
It seems then that generative AI probably isn't a magic silver bullet that automatically and fully solves your coding challenges.
Instead, the most effective approach for real-world market research projects is to blend together generative AI, human interaction, and machine-learning.
A typical user will progress through the following process when coding verbatims in codeit:
Definitely not! AI is clearly a useful tool that can supercharge your coding team BUT you still need people involved in the process.
You need software that puts this technology in the hands of coders to speed up the process of coding while still retaining the levels of accuracy and precision required by the real-world research industry.
Clearly the world of AI is evolving very quickly and many big tech companies are investing heavily in this area. However, we should be wary of assuming that AI can now do the job of coding for us perfectly, and with no need for human intervention.
To get meaningful and actionable results you need human-led AI software that blends cutting-edge tools with human oversight for maximum efficiency and accuracy.
At codeit we are confident that our human-led AI approach is the best solution when it comes to verbatim coding. To prove it we are happy to give interested users a 30-day free trial.
Let's navigate the rise of generative AI, together.