Open-ended survey questions have always been a double-edged sword. We
know they yield lots of useful detail about the “why”, and we’d like to use them
more in our research, but they’re such a pain to deal with, aren’t they?
The recent explosion of generative AI technologies has offered a glimmer of hope to
those facing this pain. Indeed, many now believe that, with the application of
generative AI, analyzing survey verbatim data is now a solved problem, or will
be within the next few years.
But how true is this? Is it time to fire your coders and let ChatGPT do all the
work? Can generative AI really analyze your verbatim data instantly, at the
push of a button?
Coding is the traditional process used to turn a
set of unstructured verbatim text responses from a survey into something that
can be analyzed and measured quantitatively.
To do this, firstly we need to extract the themes within the data. Secondly, we
need to quantify the incidence of these themes so we can measure them.
Above all, it’s important we do this with accuracy and precision. The themes
need to be precise enough to be actionable (e.g. “politeness of the
receptionist”, not simply “customer service”) and they need to be quantified
correctly (i.e. if 50 people mentioned the rude receptionist then the data
needs to show a count of 50).
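To make the mechanics of that quantification step concrete, here is a minimal sketch. The responses and theme names are hypothetical, chosen only to mirror the receptionist example above:

```python
from collections import Counter

# Hypothetical coded data: each verbatim is tagged with every theme that applies.
coded_responses = [
    {"text": "The receptionist was rude to me",
     "themes": ["rudeness of the receptionist"]},
    {"text": "Rude receptionist, but great coffee",
     "themes": ["rudeness of the receptionist", "quality of refreshments"]},
    {"text": "Lovely waiting room",
     "themes": ["comfort of the waiting room"]},
]

# Quantify the incidence of each theme across all responses.
theme_counts = Counter(
    theme for response in coded_responses for theme in response["themes"]
)

print(theme_counts["rudeness of the receptionist"])  # 2
```

The point of the sketch is that counting is trivial once every verbatim carries correct tags; the hard part, discussed below, is getting those tags right in the first place.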
Without accuracy, our data is unreliable and dangerous to use in decision-making.
Without precision, our analysis will lack nuance and won't be meaningful.
Large language models, like ChatGPT, LLaMA and Bard, are very good at summarizing text. In many ways, that is their primary
purpose: to take a vast amount of text from the internet and condense it
into a representative model.
So, if you pass in a set of verbatims and ask it to summarize the main themes
found within, it’ll do a decent job. It will return a list of rich, human-like
phrases which generally encapsulate the main themes it finds. You can even ask follow-up
questions, request examples, and generally dig deeper into the data.
If your goal is simply to get a high-level read of your data and derive a
sense of the main themes then it’s a really great tool. Arguably, it’s more
effective than the commonly used approach of simply skimming through a set of
verbatims trying to “get the gist” of what they contain.
The problem with generating a list of the “main themes” is this is still more
qualitative than quantitative. If your goal is to produce a robust quantitative
analysis then things get a bit trickier.
In order to quantify your themes (i.e. exactly how many people said Theme X vs
Theme Y) you need to “code” each response and specifically tag each verbatim
with each theme that applies.
It turns out that generative AI is not so good at performing this fiddly task.
In our experiments, given a typical set of verbatims and a specified
codeframe, GPT will only be able to autocode around 10%-20% of verbatims at the
accuracy levels required by real-world market researchers.
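For readers who want to try this kind of autocoding themselves, the general shape of the task looks something like the sketch below. The prompt wording, the JSON reply contract and the simulated model output are all assumptions for illustration, not a tested recipe:

```python
import json

def build_autocode_prompt(codeframe, verbatim):
    """Assemble a prompt asking an LLM to tag one verbatim against a fixed codeframe.
    (Illustrative wording only; real prompts need iteration and testing.)"""
    themes = "\n".join(f"- {t}" for t in codeframe)
    return (
        "Tag the survey response below with every theme from this codeframe that applies.\n"
        f"Codeframe:\n{themes}\n"
        f'Response: "{verbatim}"\n'
        "Reply with a JSON list of matching theme names only."
    )

def parse_autocode_reply(reply, codeframe):
    """Keep only tags that actually exist in the codeframe. Models sometimes
    invent near-miss theme names, which is one source of miscategorization."""
    return [t for t in json.loads(reply) if t in codeframe]

codeframe = ["politeness of the receptionist", "waiting time", "quality of refreshments"]
prompt = build_autocode_prompt(
    codeframe, "The receptionist was lovely but I waited an hour"
)
reply = '["politeness of the receptionist", "waiting time", "friendliness"]'  # simulated model output
print(parse_autocode_reply(reply, codeframe))  # ['politeness of the receptionist', 'waiting time']
```

Note that the parsing step silently drops "friendliness" because it is not in the codeframe; in practice such invented or near-miss tags are exactly the gaps and miscategorizations described above.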
So, if you try to use it to quantify the themes in your data, you will find
there are gaps where it has missed themes. You will see it will miscategorize
themes in some places. And, often, you will find that the results you get are
not consistently repeatable - which is clearly a problem when you’re trying to
do quantitative research and you need comparability and reliability.
You don’t need to take my word for it. OpenAI has made ChatGPT available for free
to everyone, so you can easily try this experiment for yourself. You will
find that it’s easy to get a broad summary, but hard to get an accurate, quantified analysis.
GPT came into the public consciousness in early 2023, largely because GPT3 was so much more powerful and effective than GPT2 (which barely registered on the public radar).
The numbers above are based on our experiments using GPT3, which leads to an obvious question: how do the results change if we use GPT4 instead? Is it any better at autocoding verbatims? Can we extrapolate a path into the future based on the progression from GPT3 to GPT4?
In our experiments, GPT4 is able to autocode around 30% - 40% of verbatims, at an acceptable level of accuracy.
This is a very interesting finding because: a) it shows that the technology is definitely improving and can offer increasing support to the coding process; and b) it is still a long way from doing a perfect job automatically. If something like GPT can autocode 40% of your verbatims accurately, that's great, but the majority of your verbatims are still uncoded. To tackle that, you need to involve people in the process.
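To put those percentages in perspective, here is a back-of-the-envelope calculation. The survey size is chosen purely for illustration; the autocode rates come from the experiments described above:

```python
def manual_workload(total_verbatims, autocode_rate):
    """Number of verbatims still needing a human coder after AI autocoding."""
    return total_verbatims - round(total_verbatims * autocode_rate)

# A hypothetical 1,000-response survey at the observed autocode rates.
print(manual_workload(1000, 0.15))  # GPT3 at ~15%: 850 verbatims left for humans
print(manual_workload(1000, 0.40))  # GPT4 at the top of its range: 600 left
```

Even at GPT4's best, well over half the workload remains, which is why the rest of this piece focuses on how people and AI share the job.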
“Human in the loop”, “AI augmentation”, “human-led” - whatever you want to call it, the
message is the same: AI has arrived but we need to view it as an assistant,
rather than a complete human replacement.
AI can provide a useful starting point, but it’s people, with their domain
knowledge, nuanced understanding and appreciation of client objectives, who
can appropriately interpret, refine and curate AI output.
As an example, suppose you needed to write a press release to announce your new
range of vegan-friendly snacks. You could ask ChatGPT to write it for you, but
would you take the output blind and send it out? Of course not. You would take
the output, add your brand's tone of voice, refine it, pass it through a quality control process, finesse it - and then send it out.
The same is true during the coding process. Generative AI is very useful for
producing a set of initial themes, and maybe autocoding some of those themes,
but eventually, if you want an acceptable level of quality, you need to involve
people so they can do what they do best with your data.
Behind the scenes, codeit builds a custom machine-learning model trained on the coded data created up to that point.
This model can then be used to autocode the remaining uncoded data (or new uncoded data that is imported later).
Whilst tools like GPT may be useful for kick-starting the coding process, it is only once a real person is involved in curating the output and teaching the system that we find a custom machine-learning model can significantly outperform off-the-shelf generative AI.
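The sketch below illustrates the general idea with a deliberately simple stand-in. It "learns" a per-theme vocabulary from human-coded examples and uses word overlap to tag uncoded verbatims. This is not codeit's actual model (a real system would train a proper multi-label classifier), but it shows how human-curated coding becomes training data:

```python
from collections import defaultdict

def train_theme_lexicons(coded):
    """Toy stand-in for a custom model: learn, per theme, the words seen
    in human-coded examples."""
    lexicons = defaultdict(set)
    for text, themes in coded:
        words = set(text.lower().split())
        for theme in themes:
            lexicons[theme] |= words
    return lexicons

def autocode(text, lexicons, min_overlap=2):
    """Tag an uncoded verbatim with every theme sharing enough vocabulary."""
    words = set(text.lower().split())
    return sorted(t for t, lex in lexicons.items() if len(words & lex) >= min_overlap)

# Hypothetical human-coded examples, used as training data.
coded = [
    ("the receptionist was rude", ["rudeness of the receptionist"]),
    ("rude and unhelpful receptionist", ["rudeness of the receptionist"]),
    ("waited two hours to be seen", ["waiting time"]),
]
lexicons = train_theme_lexicons(coded)
print(autocode("such a rude receptionist", lexicons))  # ['rudeness of the receptionist']
```

The key property, even in this toy version, is that every verbatim a human codes makes the model better at coding the next one, which is what lets a custom model pull ahead of an off-the-shelf LLM on a specific study.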
To summarise, you can expect the following levels of effectiveness and accuracy on a typical survey (proportion of verbatims autocoded at an acceptable level of accuracy):

GPT3: 10% - 20%
GPT4: 30% - 40%
Custom ML Model: significantly higher, once trained on human-curated coding
It seems, then, that generative AI isn't a silver bullet that automatically and fully solves your coding challenges.
Instead, the most effective approach for real-world market research projects is to blend together generative AI, human interaction, and machine-learning.
In codeit, a typical user blends exactly these elements: generative AI suggests an initial set of themes and autocodes what it can, a coder curates and extends that coding, and a custom machine-learning model trained on the curated data autocodes the rest.
So, should you fire your coders and hand everything over to AI? Definitely not! AI is clearly a useful tool that can supercharge your coding team, but you still need people involved in the process.
You need software that puts this technology in the hands of coders, speeding up the
coding process while retaining the levels of accuracy and precision required by the
real-world research industry.
Clearly the world of AI is evolving very quickly and many big tech companies are investing heavily in this area. However, we should be wary of assuming that AI can now do the job of coding for us perfectly, and with no need for human intervention.
To get meaningful and actionable results you need human-led AI software that blends cutting-edge tools with human oversight for maximum efficiency and accuracy.
At codeit we are confident that our human-led AI approach is the best solution when it comes to verbatim coding. To prove it we are happy to give interested users a 30-day free trial.
Let's navigate the rise of generative AI, together.