Creating a Conversational Framework Between
LLMs for Measuring Deception in Three-Party
Dialogue Scenarios

Σταματίου, Σπυρίδων; Stamatiou, Spyridon

dc.contributor.author	Σταματίου, Σπυρίδων	el
dc.contributor.author	Stamatiou, Spyridon	en
dc.date.accessioned	2025-12-09T06:24:02Z
dc.date.available	2025-12-09T06:24:02Z
dc.identifier.uri	https://dspace.lib.ntua.gr/xmlui/handle/123456789/63022
dc.identifier.uri	http://dx.doi.org/10.26240/heal.ntua.30718
dc.rights	Attribution-NonCommercial-NoDerivs 3.0 Greece	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/3.0/gr/	*
dc.subject	Μηχανική Μάθηση	el
dc.subject	Μεγάλα Γλωσσικά Μοντέλα	el
dc.subject	Prompt Engineering	en
dc.subject	Εξαπάτηση	el
dc.subject	Conversational AI	en
dc.subject	Deception	en
dc.subject	Large language models	en
dc.subject	Machine learning	en
dc.title	Creating a Conversational Framework Between LLMs for Measuring Deception in Three-Party Dialogue Scenarios	en
dc.contributor.department	Εργαστήριο Συστημάτων Τεχνητής Νοημοσύνης και Μάθησης	el
heal.type	bachelorThesis
heal.classification	Artificial Intelligence (AI)	en
heal.language	en
heal.access	free
heal.recordProvider	ntua	el
heal.publicationDate	2025-07-03
heal.abstract	Τα Μεγάλα Γλωσσικά Μοντέλα (LLMs) έχουν σημειώσει ταχεία πρόοδο τα τελευταία χρόνια, επιδεικνύοντας εντυπωσιακές ικανότητες στην κατανόηση και παραγωγή φυσικής γλώσσας. Καθώς τα μοντέλα εξελίσσονται, η δυνατότητα των LLMs να επιδίδονται σε συμπεριφορές εξαπάτησης, είτε σκόπιμα είτε αναδυόμενα, έχει εγείρει σημαντικά ερωτήματα σχετικά με τη διαφάνεια, την ερμηνευσιμότητα και τις ηθικές προεκτάσεις της χρήσης τους. Προηγούμενες έρευνες έχουν δείξει ότι τα LLMs μπορούν να μιμηθούν την ανθρώπινη επικοινωνία σε βαθμό που καθιστά όλο και πιο δύσκολη τη διάκριση μεταξύ ανθρώπου και μηχανής σε επικοινωνιακά περιβάλλοντα. Βασιζόμενη σε αυτό το υπόβαθρο, η παρούσα διπλωματική εργασία εισάγει ένα πειραματικό πλαίσιο για τη μελέτη της παραπλάνησης και της ανίχνευσης μεταξύ LLMs σε ελεγχόμενα περιβάλλοντα διαλόγου. Στο πλαίσιο αυτό, τρία LLMs αναλαμβάνουν ρόλους με τα ονόματα Alice, Bob και Charlie και καλούνται να συμμετάσχουν σε δομημένους διαλόγους μεταξύ τριών ατόμων. Κάθε μοντέλο λαμβάνει ρητές οδηγίες να συμπεριφέρεται σαν να ήταν άνθρωπος, έχοντας δύο στόχους: να αποκρύψει την ταυτότητά του ενώ προσπαθεί να ανιχνεύσει άλλα LLMs. Τα μοντέλα οργανώνονται σε ομάδες βάσει του μεγέθους των παραμέτρων τους και συμμετέχουν σε διαλόγους διαφόρων μεγεθών. Μετά από κάθε διάλογο, κάθε μοντέλο ψηφίζει για την ταυτότητα των άλλων δύο συμμετεχόντων, συνοδευόμενη από μια εξήγηση που αιτιολογεί κάθε ταξινόμηση. Τέλος γίνεται αναπαράσταση των ψήφων, καθώς και των εξηγήσεων, οι οποίες αναπαρίστανται ως ραβδογράμματα των στρατηγικών συλλογισμού που χρησιμοποίησαν τα LLMs όταν προσπαθούν να ανιχνεύσουν ή να παραπλανήσουν άλλα μοντέλα. Τα αποτελέσματα παρουσίαζαν διακυμάνσεις, με τα περισσότερα από τα κορυφαία μοντέλα στις μικρότερες ομάδες να καταγράφουν κατά μέσο όρο ποσοστά ανίχνευσης AI ~50%. 
Το καλύτερο μοντέλο τελευταίας γενιάς, Claude 3.7 Sonnet, παρουσίασε ποσοστά ανίχνευσης AI που κυμάνθηκαν από 19.08% σε σύντομες συνομιλίες έως και 66.17% σε μεγαλύτερες διάρκειες διαλόγου. Για να αξιολογηθεί η επίδραση της κατασκευής περσόνας στην αποτελεσματικότητα της εξαπάτησης, το πείραμα επαναλαμβάνεται με τα μοντέλα να καλούνται να υιοθετήσουν ανθρώπινες περσόνες. Στη συνέχεια, τα αποτελέσματα συγκρίνονται προκειμένου να εκτιμηθεί αν η ενισχυμένη σχεδίαση περσόνας βελτιώνει την ικανότητα των μοντέλων να παραπλανούν. Ιδιαίτερα στα μεγάλα μοντέλα υπήρξε μεγάλη επιτυχία, με το Claude 3.7 Sonnet και το Llama 3.1 (405B) να καταφέρνουν να αποφύγουν έως και 100% την ανίχνευση σε ορισμένα πειραματικά setups.	el
heal.abstract	Large Language Models (LLMs) have advanced rapidly in recent years, demonstrating impressive capabilities in natural language understanding, generation, and multi-turn conversation. These models can not only respond fluently and contextually to prompts but also simulate human-like behavior in dialogue. As models evolve, the potential for LLMs to engage in deceptive behavior, whether intentional or emergent, has raised important questions about their transparency, interpretability, and the ethical implications of their deployment. Prior research has shown that LLMs can mimic human discourse to a degree that makes distinguishing between human and machine increasingly difficult, especially in open-ended or strategic communication settings. Building upon this foundation, the present thesis introduces an experimental framework for studying deception and detection among LLMs in controlled conversational environments. In this framework, three LLMs are assigned the roles of Alice, Bob, and Charlie and are prompted to engage in structured three-person dialogues. Each model is explicitly instructed to behave as if it were human, with two competing goals: concealing its own identity while trying to detect other LLMs. The models are organized into groups based on their parameter size and engage in multi-turn conversations of varying lengths. After each conversation, every model casts a vote on the identity (human or AI) of each of the other two participants, along with a natural language explanation justifying each classification. These explanations are collected and categorized, resulting in visual representations of the reasoning strategies used by LLMs when attempting to detect or deceive others. Results without persona prompts varied, with most of the top-performing models in the smaller model groups averaging AI detection rates of about 50%. 
The best performer among the state-of-the-art models, Claude 3.7 Sonnet, ranged from 19.08% AI detection in shorter conversations up to 66.17% in longer ones. To assess the influence of persona construction on deception effectiveness, the experiment is repeated with models prompted to adopt human-like personas. The results are then compared to evaluate whether enhanced persona engineering improves the models’ ability to deceive or alters their judgment when classifying others. The larger models in particular showed significant success, with Claude 3.7 Sonnet and Llama 3.1 (405B) avoiding detection up to 100% of the time in certain experimental setups.	en
heal.advisorName	Στάμου, Γεώργιος
heal.committeeMemberName	Βουλόδημος, Αθανάσιος
heal.committeeMemberName	Σταφυλοπάτης, Ανδρέας-Γεώργιος
heal.academicPublisher	Εθνικό Μετσόβιο Πολυτεχνείο. Σχολή Ηλεκτρολόγων Μηχανικών και Μηχανικών Υπολογιστών. Τομέας Τεχνολογίας Πληροφορικής και Υπολογιστών. Εργαστήριο Συστημάτων Τεχνητής Νοημοσύνης και Μάθησης	el
heal.academicPublisherID	ntua
heal.numberOfPages	183
heal.fullTextAvailability	false