GDPR-compliant hybrid interpretation
A workshop report on local interpretation with four LLMs on a MacBook Pro with Gemma 3, Qwen 3, Mistral 3.1 and Llama 3.3
by Dr. Thorsten Dresing, May 15, 2025
1. AI & qualitative research – goodbye, data protection?
As a qualitative researcher and co-author of the paper “Hybrid interpretation of text-based data with dialogically integrated LLMs” (Krähnke, Pehl & Dresing, 2025), I see the potential of Large Language Models (LLMs) as additional “sparring partners” in the interpretation process. Interesting passages in the text can thus be explored not only alone or in the classic interpretation group, but also in dialogic exchange with several LLMs. The well-known cloud services such as Gemini, ChatGPT, Claude and others open up a promising heuristic extension; they are easily accessible and often free of charge.
However, this enthusiasm has been clouded from the very beginning by a central hurdle: data protection. For all major providers, data must be transferred to servers outside the EU. Even services within the EU often rely on cloud infrastructure from American companies, which, due to legislation such as the US CLOUD Act, places at least a big question mark over GDPR compliance. Sending interview excerpts or other sensitive research data to external servers outside the scope of the GDPR, or into an unclear legal situation, without an individual data processing agreement is not only an ethical no-go for researchers but, in the worst case, a criminal offence. This is no trivial dilemma: this vast space of possibilities – together with the scientific curiosity and, for some questions, the time savings it promises – cannot simply be entered as long as the data is subject to data protection.
I did not want to accept this discrepancy. My key question was: can hybrid interpretation be realized completely locally – while maintaining data protection? This gave rise to a set of fundamental challenges: How can not just one, but three or four different, powerful LLMs be installed and operated sensibly on a standard (albeit very good) computer? Which LLMs are available free of charge, where do you get such models, and to what extent are they particularly suitable for hybrid interpretation? What does the necessary hardware cost? And above all: is it worth all the effort? As of May 2025, is it possible to use various LLMs for hybrid interpretation in compliance with data protection regulations without sacrificing the differentiation and depth of responses compared to large online LLMs such as ChatGPT? What initially appeared to be a clear “Forget it, it’ll never work” gradually turned into an “Oh, it could work like that”, until the final email to my colleagues on May 9, 2025: “I did it :)”.
The challenge was to find affordable hardware with the highest possible performance and, at the same time, to identify the most capable free open-source LLMs that can be used sensibly on the selected computer configuration. And once both were found, to check whether the results of hybrid interpretation are useful and comparable with those of the large commercial models.
In the following, I present my technical setup, explain the criteria by which I tested and selected the LLMs and how I then use them, and finally share observations from a total of fourteen interpretation sequences – for example, on the specific characteristics of the different LLMs, how they complement each other, why the order in which they are used matters, and where the limits of the chosen setup lie for more extensive tasks.
2. Technical equipment – LLMs locally on the MacBook Pro: does that work (well) at all?
Modern LLMs are computationally intensive. The large, well-known models such as ChatGPT, Claude or Gemini never run on standard computers, but only on specialized NVIDIA hardware, where a single H100 GPU card can cost around 30,000 euros – and often several of them are required. The size of a model is usually specified as its number of parameters: the information points in a gigantic network that must potentially all be taken into account for every query. The more parameters, the more computing power (ideally from GPUs) and fast RAM the LLM requires. According to estimates, the flagship models have over 1,000 billion parameters – which is out of the question here, because models of that size cannot run on a notebook and are not open source anyway.
Hardware – MacBook Pro M3 Max, 48 GB RAM
Select the egg (the LLM) or the hen (the computer) first? I started with the hen, i.e. with the question of what hardware is probably needed as a minimum, while trying not to stretch my budget too much. The decisive factor in my search was the most important quantity of all, and there is no way around it: as much RAM (not hard disk space!) as possible. The larger the RAM, the larger the LLM used can be, i.e. the more parameters it can have. Open-source LLMs are available with “only” 1 billion parameters, right up to the largest ones such as Llama 3 with 405 billion – and this has the most significant influence on the quality of the LLM output. Roughly speaking, the parameters of the model can be related to the amount of memory: a mid-range LLM with 30 billion parameters will require about 30 GB of free memory. Many notebooks today have 8 GB or 16 GB; even 32 GB would be too little for a 30B model, because the operating system and other applications also need resources. So there must be more. The largest RAM currently available in notebooks is 64 GB, very rarely 128 GB, but such machines are very expensive, often 5,000 euros and more.
My solution lay somewhere in between: a “discontinued model” offered the best price/performance ratio. My choice fell on a MacBook Pro with M3 Max processor (16-core CPU, 40-core GPU) and 48 GB RAM, purchased as a returned item from Notebooksbilliger.de for 2,715 euros including VAT. The performance and the amount of memory are exactly right for the setting described below, and my tests showed that more RAM is not absolutely necessary (though possible) for this setup, while less RAM prevents meaningful hybrid interpretation, because the smaller models often produce less useful output and sometimes quite a lot of nonsense. However, it also became apparent that even 48 GB RAM reaches its limits when processing very large input prompts, for example ones that include several complete interview transcripts for a comprehensive analysis (which is not the focus of this workshop report). More or fewer GPU and CPU cores make computing processes faster or slower; more RAM makes them possible at all. So don’t buy anything with less than 48 GB for a setup similar to the one described here. For more ambitious projects that regularly require very large context windows (e.g. because 10 interviews are to be combined), an investment in systems with significantly more RAM (e.g. a Mac Studio M3 Ultra with 24-core CPU, 60-core GPU and 96 GB or more) is necessary.
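To make the relationship between parameter count, model file size and RAM more tangible, here is a small back-of-the-envelope sketch. The overhead figures for the operating system, other applications and the model’s working memory (KV cache) are rough assumptions for illustration, not measured values.

```python
# Rough RAM-budget check for running a local GGUF model (a sketch, not a rule):
# the model file must fit into RAM alongside the OS, other apps and the KV cache.
def fits_in_ram(model_file_gb: float, installed_ram_gb: int,
                os_and_apps_gb: float = 8.0, kv_cache_gb: float = 4.0) -> bool:
    """True if the model leaves enough headroom; the overhead values are assumptions."""
    return model_file_gb + os_and_apps_gb + kv_cache_gb <= installed_ram_gb

print(fits_in_ram(30, 32))  # False: a ~30 GB file for a 30B model does not fit into 32 GB RAM
print(fits_in_ram(34, 48))  # True: a ~34 GB file (e.g. Llama 3.3 70B q3_K_XL) just fits into 48 GB
```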
Software – LM Studio
Ok, now I have the computer, but how do I get an LLM onto it? The easiest way is to use software that displays extensive selection lists of available LLMs (primarily in GGUF format), which can then be downloaded and installed. It was also important to me to have a user interface very similar to the ChatGPT web interface, as I am already familiar with it. I found all this in the free LM Studio software (for Windows and Mac). It offers an intuitive user interface and makes it much easier to search for, download and use various LLMs. Importantly, LM Studio can distribute the computing load flexibly between CPU and GPU and thus make optimum use of the available resources. In my tests I also found that a number of settings in LM Studio make everything run a little faster – or run at all:
- Maximize GPU offload: The model layers were offloaded as far as possible to the powerful 40-core GPU.
- Assign CPU threads efficiently: Again, I mostly set this to maximum, although I then reduced it a little when malfunctions occurred, which often solved the problem.
- Context window (n_ctx): With approx. 7,500 – 15,000 tokens, deeper conversations could also be held. This is sufficient for a few rounds of interpretation of short text excerpts as source material, but not for summaries of many interviews or the simultaneous analysis of several longer documents in a single prompt, which was not the primary goal of the testing here. Later tests showed that, for example, a prompt with 13,500 tokens (which also contained five short interviews) already overloaded the Llama 3.3 70B q3_K_XL model in the local setup and processing was no longer possible.
- High “Prompt Batch Size” (n_batch): An increase to 4,096 tokens (instead of the default 512) to process long initial prompts efficiently and reduce the initial processing time.
- Activate “Flash Attention”: If available for a specific model, for acceleration.
- Basics: Apple’s “Metal” GPU acceleration was active; “mlock” was activated for stability under high load.
- Dealing with unwanted text artifacts in LLM responses: Occasionally the LLMs get caught in an endless loop and reproduce their entire output again. This becomes visible through small markers in the text, such as im_start tags or the word “Assistant” followed by unintended text. To rectify this behavior, go to the settings of the loaded model (often found in the right-hand panel under “Stop Strings” or similar) and define stop strings: explicitly add the unwanted artifacts (e.g. im_start, <|im_start|>, “Assistant”, “Assistant:”) to the list of stop strings. This cancels text generation as soon as the model attempts to output one of these elements, resulting in cleaner and more accurate responses. (A small sketch of how such settings can also be passed to LM Studio’s local server follows after this list.)
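For those who prefer to script such requests instead of (or in addition to) using the graphical interface: LM Studio also provides a local, OpenAI-compatible server. The following is only a minimal sketch under assumptions – it presumes the server is running at its default address (http://localhost:1234/v1), the openai Python package is installed, and the model identifier matches the one LM Studio displays; the prompts and the identifier shown here are placeholders, not the prompts used in my runs.

```python
# Minimal sketch: querying a model loaded in LM Studio via its local,
# OpenAI-compatible server (default address http://localhost:1234/v1).
# Assumptions: the server is running, the model identifier matches the name
# shown in LM Studio, and the prompts below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")  # any key string works locally

response = client.chat.completions.create(
    model="gemma-3-27b-it",  # hypothetical identifier – use the one LM Studio shows
    messages=[
        {"role": "system", "content": "You are a careful partner in a hybrid interpretation session ..."},
        {"role": "user", "content": "Please give a first interpretation of the following passage: ..."},
    ],
    # Stop strings passed per request mirror the GUI setting described above
    # (support for individual parameters may vary by model and runtime).
    stop=["<|im_start|>", "im_start", "Assistant:"],
    temperature=0.7,
    max_tokens=1024,  # keep well below the configured context window (n_ctx)
)
print(response.choices[0].message.content)
```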
LLMs – Qwen 3, Gemma 3, Mistral 3.1 and Llama 3.3
The method of hybrid interpretation (Krähnke, Pehl & Dresing, 2025) thrives on multi-perspectivity. I therefore didn’t want one perfect model, but a team of differently trained LLMs. The 48 GB RAM of my computer meant an upper limit of around 34 GB per selected model file. All LLMs in LM Studio are free of charge – crazy! With the models listed, you not only have a choice between different numbers of parameters, but also between different compression variants (quantizations). Completely confusing at first: a model like Llama 3.3 70B is available, for example, as q8 with 70 GB (much too big) or as q3_K_XL with 34 GB (just about fits). How do you find the right gems among the thousands available?
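As a rough orientation in this quantization jungle, the following sketch estimates GGUF file sizes from the parameter count and the approximate bits per weight. The bits-per-weight values are approximations I use for orientation only, not official figures; real files deviate by a few gigabytes depending on architecture and variant.

```python
# Rough estimate of GGUF file sizes from parameter count and quantization level.
# The bits-per-weight values are approximations for orientation only;
# actual files deviate by a few gigabytes depending on model and variant.
BITS_PER_WEIGHT = {"q8_0": 8.5, "q6_K": 6.6, "q4_K_M": 4.8, "q3_K_XL": 3.8}

def file_size_gb(params_billion: float, quant: str) -> float:
    """Approximate size of the quantized weights in gigabytes."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8

print(f"Llama 3.3 70B q8_0:    ~{file_size_gb(70, 'q8_0'):.0f} GB")     # ~74 GB – much too big for 48 GB RAM
print(f"Llama 3.3 70B q3_K_XL: ~{file_size_gb(70, 'q3_K_XL'):.0f} GB")  # ~33 GB – just about fits
```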
My criteria were the best possible quality (models with 25+ billion parameters) combined with a response speed that was still acceptable to me – because the larger the model, the more slowly the text output is generated. I set myself a target window of about 10 tokens per second, which is roughly 3–7 words per second. Yes, I know, this is much slower than ChatGPT in the web interface, which can achieve speeds of 250 tokens/s and more on NVIDIA H100 or comparable hardware. If you want that: buy an H100 and you’ll get your high speed 😉 For the hybrid research process, where I want to read, understand and think about all the answers, the local speed of 10 tokens/s is perfectly acceptable.
After extensive testing and many failures, in which I ran our standard prompt (see link) with a text passage, the following quartet emerged as the “sweet spot”. Please note the quantization information below (such as q3_K or q6) and load the corresponding variant. Larger models would no longer make sense in the chosen setting (Llama 3.3 70B in q4 ran at 1.3 tokens/s) or would not run at all; smaller models give away response quality with only a small gain in generation speed. This trade-off becomes particularly relevant if the available context window is almost exhausted by very extensive inputs (e.g. several interviews). In that case, smaller models with a correspondingly lower number of parameters, or models specifically optimized for large context windows, would inevitably have to be used, provided they can run on the hardware:
- Llama 3.3 70B Instruct (q3_K_XL, ~34 GB, ~7.5 tokens/s): (USA) Meta’s model, highly compressed, yet with the largest knowledge base in the test field.
- Qwen 3 32B Instruct (q6, ~24 GB, ~12 tokens/s): (China) Alibaba’s model, surprisingly capable, brand new from April 2025.
- Gemma 3 27B Instruct-Turbo (q8_0, ~27 GB, ~12 tokens/s): (USA) Google’s open source model. Also brand new from April 2025.
- Mistral Small 3.1 24B Instruct 2503 (q8_0, ~24 GB, ~15 tokens/s): (Europe) A well-known classic in a new version.
All four ran stably and already showed the potential for different “thinking styles” in preliminary tests, which is essential for hybrid interpretation. Incidentally, DeepSeek was not the only model to drop out of my tests due to a lack of response quality – which really surprised me, as this model in particular is otherwise very hyped.
3. The experiment
In order to test the interaction of the models, I carried out fourteen interpretation runs with a short standard interview excerpt. First, I used the four LLMs (Gemma 3, Qwen 3, Mistral 3.1, Llama 3.3) in an intuitively chosen order 1, starting with Gemma 3, which was instructed by a specific system prompt (see Appendix), followed by the other models, each of which received a standardized moderation prompt asking for a differentiated discussion of the previous results. After nine runs showed that Qwen 3 had a strong tendency towards early theorizing, I tested a modified order 2 in five further runs, in which Qwen 3 was only used at the end, in order to promote a more material-oriented, bottom-up development of knowledge.
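For readers who want to reproduce such a sequence, here is a minimal sketch of how a run could be scripted against LM Studio’s local, OpenAI-compatible server. The model identifiers, the system prompt and the moderation wording are placeholders and not the prompts actually used (those are linked under “Further sources”); each model must of course be downloaded and loadable in LM Studio.

```python
# Sketch of a multi-LLM interpretation run via LM Studio's local server.
# Assumptions: server at the default address, placeholder model identifiers
# and prompts – the actual standard/moderation prompts are linked in the sources.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

SYSTEM_PROMPT = "You are a partner in a hybrid interpretation session ..."  # placeholder
MODERATION_PROMPT = ("Discuss the previous interpretations critically, stay close "
                     "to the text and add your own differentiated reading:\n\n{previous}")

ORDER_1 = ["gemma-3-27b-it", "qwen3-32b", "mistral-small-3.1-24b", "llama-3.3-70b"]  # placeholder IDs

def run_sequence(passage: str, model_order: list[str]) -> list[str]:
    """Feed the passage to the first model, then pass the accumulated discussion on."""
    contributions: list[str] = []
    for i, model in enumerate(model_order):
        if i == 0:
            user_msg = f"Please give a first interpretation of this passage:\n\n{passage}"
        else:
            user_msg = MODERATION_PROMPT.format(previous="\n\n---\n\n".join(contributions))
        reply = client.chat.completions.create(
            model=model,  # the corresponding model must be loaded/loadable in LM Studio
            messages=[{"role": "system", "content": SYSTEM_PROMPT},
                      {"role": "user", "content": user_msg}],
            temperature=0.7,
        )
        contributions.append(f"[{model}]\n{reply.choices[0].message.content}")
    return contributions
```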
4. Results and initial ideas on profile types of the LLMs
The analysis of the responses showed surprisingly consistent characteristic tendencies of the individual models and interesting differences in the dynamics of result generation through the various deployment sequences. I used the data to develop initial ideas for “profile types” of the LLMs. Here are my suggestions based on the processes:
- Gemma 3 (The reliable initiator): As a starting LLM, always laid a solid foundation close to the text with multiple first interpretations on linguistic “hotspots” (dating, “actually”, “attempt”). Often focused on ambivalences in P1’s statements, e.g. the juxtaposition of P1 as a “pragmatic coordinator” versus the possibility that the wording could “signal uncertainty”.
- Mistral Small 3.1 (The subtle differentiator): Tackled existing interpretations in a nuanced way, often with a view to psychological dimensions (self-perception, insecurity, pride) or strategic communication, and searched for “nuances”. In order 2, directly after Gemma 3, it shaped the discourse early on with this perspective. For example, it interpreted P1’s pragmatic presentation as a possible “strategy for self-stylization” or the simplicity as an “expression of excessive demands”.
- Llama 3.3 (The pragmatic consolidator): Tended to ground discussions pragmatically, question complexity and test interpretations for robustness. Often emphasized professionalism and responsibility. For example, despite counter-arguments, it maintained its interpretation of P1’s “professionalism and commitment” and further differentiated it by referring to the importance of cooperation.
- Qwen 3 (The theoretical-conceptual architect): Brought in the most consistently explicit theoretical references (Goffman, Foucault, discourse theory, etc.) and specialized terminology. It occasionally elevated the discourse to an abstract level early on with concepts such as “institutionalized self-description”. At the end of sequence 2, it often functioned as a theoretical synthesizer, linking P1’s self-reflection to the “process of self-knowledge”, for example.
Observations on dynamics:
- Qwen 3 early in the process (order 1): Often led to rapid conceptualization and “academic” discourse. The challenge was to maintain closeness to the text given Qwen 3’s high degree of theorization. Mistral Small 3.1 and Llama 3.3 then refocused on detailed work in order to relate the abstract concepts back to the material or to differentiate them.
- Qwen 3 late in the process (order 2): Allowed a longer, detail-oriented exploration of psychological and strategic aspects by Mistral Small 3.1 and Llama 3.3. The development had a stronger “bottom-up” effect. Qwen 3’s theoretical input at the end often served to bundle and classify an already rich case analysis, which sometimes had the effect of external supervision.
- Conclusion: Qwen 3’s position was formative: used early, it is a “theory engine”; used late, it is more of a “theory roof”.
Comparative conclusions:
- Type of knowledge development varies: Order 1 tends towards quick abstraction, order 2 towards deeper detailed exploration before theorizing. It would also be conceivable to proceed entirely without Qwen 3 if no explicit theorization is desired – on the other hand, theoretical references are an enrichment, and as a human being I can (and must) only deepen what I consider appropriate in the further course anyway. No sequence is “better” per se; the choice depends on the research focus.
- LLM profiles remain recognizable: The clearly distinguishable basic tendencies of the models were evident in both sequences, but were modulated by position and precursor contributions.
- High reliability, but caution with citation: There were no serious errors or “hallucinations” in any of the runs. All LLMs remained within the scope of the task. Verbatim quotations were always correct, but references to paragraph numbers were sometimes wrong. This is not a problem in the context of hybrid interpretation, as the total amount of text is manageable and known in advance, and it can probably be resolved with adapted prompts and/or a different pre-structuring of the text material (see the small sketch after this list).
- Stable “hotspots” – Fruitful variance: The key linguistic passages in the text were consistently recognized as relevant. The variety of interpretations confirms the value of the multi-LLM approach.
- Comparison with ChatGPT and co.: When it came to identifying linguistic anomalies and interpretive perspectives, all discussions and contributions of the local LLMs were on a level comparable to the large online models. In the context of the hybrid interpretation tested here, which focuses on short text excerpts, this is unreservedly helpful for one’s own analysis process. In terms of conceptualization and theorization, however, the large models are currently better.
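As mentioned under the citation point above, a simple pre-structuring of the text material is one conceivable remedy for the shaky paragraph references. The following sketch shows one possible format (explicitly numbered paragraphs); it is an illustration under assumptions, not the structuring actually used in the runs.

```python
# Sketch of a simple pre-structuring step: explicitly numbering the paragraphs
# of the source text before it goes into the prompt, so that the models can
# reference them unambiguously. The format is an assumption for illustration.
def number_paragraphs(text: str) -> str:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return "\n\n".join(f"[Paragraph {i}] {p}" for i, p in enumerate(paragraphs, start=1))

excerpt = "First paragraph of the interview excerpt.\n\nSecond paragraph of the interview excerpt."
print(number_paragraphs(excerpt))
# [Paragraph 1] First paragraph of the interview excerpt.
#
# [Paragraph 2] Second paragraph of the interview excerpt.
```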
The experiments show that local LLMs can be used productively for hybrid interpretation. Both model selection and sequencing influence the dynamics of hybrid interpretations. The LLMs do not act uniformly, but with characteristic strengths, and a conscious design of the workflow deliberately brings different, productively contrasting perspectives into the hybrid interpretation setting. However, it is important to emphasize that these positive results relate to the use case presented here: the focused interpretation of short text segments. Transferability to scenarios with significantly larger amounts of data in the prompt (e.g. the analysis of several entire interviews) is only possible to a limited extent with the 48 GB system used here, as the models reach their context limits or smaller, potentially less powerful model variants would have to be used (-> the next workshop report on the use of a Mac Studio with 96 GB is already in progress…).
5. My personal interim conclusion and outlook: Is the local use of LLMs worthwhile in qualitative research?
After fourteen intensive test runs with my AI quartet on the MacBook Pro, I ask myself the question: Was it worth the effort? Is the local, GDPR-compliant hybrid interpretation not only a technical feasibility study, but also a practical enrichment for qualitative researchers? My clear interim conclusion is: Yes, especially for the use case tested here of detailed, dialogic interpretation of manageable amounts of text, with some important limitations and considerations for more extensive tasks.
I was pleasantly surprised:
- The quality and sophistication of the LLM contributions: Despite the local setup and the necessary compromises in model size (due to quantization), all four models consistently provided plausible, close-to-text and often very stimulating interpretations. The fear that local models might fall significantly behind the quality of the large cloud services was not confirmed for this specific use case. Their ability to formulate nuanced counterarguments and develop their own differentiated interpretations was impressive.
- The “personalities” of the models: The multi-LLM approach has proven to be extremely fruitful. The different “thinking styles” of Gemma, Mistral, Llama and Qwen have actually led to a multi-perspective illumination of the text, which I would not have expected on my own or even in a homogeneous LLM constellation. Each model brought its own specific strengths to the table, contributing to a richer overall picture.
- The stability and absence of “hallucinations”: There were no serious factual errors, nonsensical text productions or the dreaded “hallucinations” in any of the fourteen runs. The LLMs always remained focused on the task and the text. The observed deficiencies in the exact citation of paragraph numbers are more of a technical detail that appears solvable through adapted prompts or text editing and does not call into question the fundamental fidelity to the text.
- The role of sequencing: The realization that the order of the LLMs has a noticeable influence on the dynamics and the type of knowledge development was an important methodological learning process. It shows that hybrid interpretation is not only a question of model selection, but also of process design.
The pragmatic side: the hurdles and joys of working locally
- The setup effort: The initial setup, the selection of suitable models and quantizations as well as the fine-tuning of the settings in LM Studio require time, patience and a certain amount of technical experimentation. It is not “plug-and-play” like the web services.
- The speed: At 7 to 15 tokens per second, response generation is significantly slower than with the cloud giants. However, for an interactive research process, where you have to carefully read and reflect on the answers anyway, I found this speed to be absolutely acceptable and not a hindrance. Sometimes the little “pause for thought” was even welcome.
- The hardware requirements: A powerful system with plenty of RAM is required. 48 GB seems to be a good lower limit for the setup described here and comparable tasks. However, for research projects that regularly require the analysis of several longer documents simultaneously or the use of models with very large context windows (e.g. over 100k tokens), more powerful systems with significantly more RAM (e.g. 96GB, 128GB or more) will have to be considered. This is an investment, but as shown, “discontinued models” can also be a very good and economically sensible solution.
- The joy of data sovereignty: The greatest benefit is undoubtedly the certainty that sensitive research data never leaves your own computer. This aspect of GDPR compliance and ethical responsibility is non-negotiable for me and, despite the additional effort involved, makes the local approach the only real option for many qualitative research projects.
Who is it for? And what are the next steps?
In my opinion, the approach of local hybrid interpretation outlined here offers considerable potential for various target groups:
- Researchers working on their final theses (Bachelor’s, Master’s, PhD): Who often act as “lone wolves” and do not always have direct access to established interpretation groups. LLMs can be valuable sparring partners here.
- Small research teams: Who would like to expand their diversity of perspectives through the use of LLMs.
- Teachers of qualitative methods: who want to demonstrate a new, practical method of “thinking with and about text” to students and guide them towards the critical and reflective use of AI.
- All qualitative researchers: Who are curious about the possibilities of LLMs, but have the highest demands on data protection and data control.
Of course, this is only a first workshop report. Many questions remain unanswered and require further systematic investigation. How do other LLM combinations and sequences behave? Could these local LLMs also be suitable for other qualitative approaches beyond hybrid interpretation – for example, qualitative content analysis, the documentary method or others? The question of how the quality criteria of qualitative research could be applied systematically to the LLM contributions, in order to evaluate and compare their quality in an even more differentiated way, also seems exciting to me. And how does the approach scale with significantly longer text passages or a larger number of documents to be analyzed? The context window is an important influencing factor here. My first informal tests with the simultaneous analysis of five short transcripts (approx. 13,500 tokens of input) already showed that the Llama 3.3 70B model on the 48 GB system was overwhelmed and could no longer process them. This underlines the need for further testing and potentially more powerful hardware or smaller models for such extensive use cases.
My personal outlook: Generative AI technology is only at the beginning of its development. The models will become more powerful, the software for local operation more user-friendly and hopefully the data protection framework for use in research will also become clearer. The approach presented here is an attempt to make this development usable for qualitative research in a constructive and methodologically reflective way, whereby the balance between local feasibility, data volume and desired model performance must be constantly re-explored. The aim is not to replace human interpretation work, but to supplement and enrich it with new, diverse perspectives. LLMs used locally can help us recognize our own blind spots, consider alternative interpretations, and ultimately arrive at deeper and more robust research findings. The path is exciting and I invite everyone to join us, experiment and share the experience gained.
Further sources
- This blog post is based on: Krähnke, U., Pehl, T., & Dresing, T. (2025). Hybrid interpretation of text-based data with dialogically integrated LLMs: On the use of generative AI in qualitative research… https://nbn-resolving.org/urn:nbn:de:0168-ssoar-99389-7
- Instructions for hybrid interpretation: https://www.audiotranskription.de/wp-content/uploads/2025/02/audiotranskription_Einfuehrung-in-die-hybride-Interpretation-mit-drei-LLMs-2.pdf
- An analysis example based on hybrid interpretation: Download DOCX
- The standard prompt we use (we did not assign names for the locally used LLMs, so simply remove that part from the prompt): https://www.audiotranskription.de/hybrides-interpretieren/prompt/
- The LM Studio software used: https://lmstudio.ai/
Citation of this article:
Dresing, T. (2025, May 15). GDPR-compliant, hybrid interpretation: A workshop report on local interpretation with four LLMs on a MacBook Pro with Gemma 3, Qwen 3, Mistral 3.1 and Llama 3.3. audiotranskription.de. Retrieved on [date of access], from https://audiotranskription.de/llm-lokal-und-dsgvo-konform-nutzen
Abstract
This workshop report addresses the challenge of using large language models (LLMs) in qualitative research in compliance with data protection regulations. In view of the GDPR problems of common cloud-based AI services, the feasibility and benefits of a fully local, hybrid interpretation method are investigated. The report documents in detail the technical setup on a standard MacBook Pro (M3 Max, 48 GB RAM) using the LM Studio software, as well as the selection and configuration of a quartet of open-source LLMs (Gemma 3 27B, Qwen 3 32B, Mistral Small 3.1 24B, Llama 3.3 70B) with different quantizations. Based on fourteen interpretation runs with a standard interview excerpt and two varied LLM sequences, characteristic “profile types” of the models and the influence of the order of use on the dynamics of the results are analyzed. The results show that local LLMs can provide high-quality, differentiated and stable interpretative contributions comparable to those of large online models in identifying linguistic salience and generating interpretive perspectives, although cloud models remain superior in theorizing. The article concludes with a positive assessment of the practicability of the local approach, which offers researchers a data-safe and methodologically enriching alternative for hybrid interpretation, and outlines future research needs.