Learning Science
Explore research on how people learn and on educational technologies
Sat, Nov 15
Uncovering Strategic Egoism Behaviors in Large Language Models
Large language models (LLMs) face growing trustworthiness concerns (e.g., deception), which hinder their safe deployment in high-stakes decision-making scenarios. In this paper, we present the first systematic investigation of strategic egoism (SE), a form of rule-bounded self-interest in which models pursue short-term or self-serving gains while disregarding collective welfare and ethical considerations. To quantitatively assess this phenomenon, we introduce SEBench, a benchmark comprising 160 scenarios across five domains. Each scenario features a single-role decision-making context with psychologically grounded choice sets designed to elicit self-serving behaviors. These behavior-driven tasks assess egoistic tendencies along six dimensions, such as manipulation, rule circumvention, and self-interest prioritization. Building on this, we conduct extensive experiments across five open-source and two commercial LLMs and observe that strategic egoism emerges universally across models. Surprisingly, we find a positive correlation between egoistic tendencies and toxic language behaviors, suggesting that strategic egoism may underlie broader misalignment risks.
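As a rough illustration of the reported egoism-toxicity correlation, the sketch below computes a Pearson correlation over per-model scores. The model names, score values, and aggregation scheme are hypothetical placeholders, not SEBench data or the paper's analysis pipeline.

```python
# Minimal sketch: correlating egoistic tendency with toxic-language behaviour.
# All numbers below are hypothetical placeholders, not results from the paper.
import numpy as np

models = ["model_a", "model_b", "model_c", "model_d", "model_e"]
egoism_score = np.array([0.31, 0.45, 0.28, 0.52, 0.40])    # assumed: mean SE rate over scenarios
toxicity_score = np.array([0.12, 0.21, 0.10, 0.27, 0.18])  # assumed: mean toxicity of generations

# Pearson correlation between the two per-model score vectors.
r = np.corrcoef(egoism_score, toxicity_score)[0, 1]
print(f"Pearson r between egoism and toxicity: {r:.3f}")
```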
Reinforcing Trustworthiness in Multimodal Emotional Support Systems
In today's world, emotional support is increasingly essential, yet it remains challenging for both those seeking help and those offering it. Multimodal approaches to emotional support show great promise by integrating diverse data sources to provide empathetic, contextually relevant responses, fostering more effective interactions. However, current methods have notable limitations: they often rely solely on text, convert other data types into text, or provide only emotion recognition, thus overlooking the full potential of multimodal inputs. Moreover, many studies prioritize response generation without accurately identifying critical emotional support elements or ensuring the reliability of outputs. To overcome these issues, we introduce MultiMood, a new framework that (i) leverages multimodal embeddings from video, audio, and text to predict emotional components and to produce responses aligned with professional therapeutic standards. To improve trustworthiness, we (ii) incorporate novel psychological criteria and apply Reinforcement Learning (RL) to optimize large language models (LLMs) for consistent adherence to these standards. We also (iii) analyze several advanced LLMs to assess their multimodal emotional support capabilities. Experimental results show that MultiMood achieves state-of-the-art performance on the MESC and DFEW datasets, while RL-driven trustworthiness improvements are validated through human and LLM evaluations, demonstrating its superior capability in applying a multimodal framework in this domain.
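To make the multimodal-embedding idea concrete, here is a minimal late-fusion sketch over precomputed video, audio, and text embeddings. The fusion layout, embedding dimensions, and number of emotion classes are assumptions for illustration, not the MultiMood architecture.

```python
# Minimal late-fusion sketch (PyTorch): concatenate modality embeddings,
# then classify the emotional component. Dimensions are assumed, not MultiMood's.
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, d_video=512, d_audio=256, d_text=768, n_emotions=7):
        super().__init__()
        self.proj = nn.Linear(d_video + d_audio + d_text, 512)
        self.classifier = nn.Linear(512, n_emotions)

    def forward(self, v, a, t):
        # Fuse the three modality embeddings and predict emotion logits.
        fused = torch.relu(self.proj(torch.cat([v, a, t], dim=-1)))
        return self.classifier(fused)

head = FusionHead()
v = torch.randn(4, 512)   # video embeddings (batch of 4)
a = torch.randn(4, 256)   # audio embeddings
t = torch.randn(4, 768)   # text embeddings
logits = head(v, a, t)    # shape (4, 7): per-class emotion logits
```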
Leveraging Large Language Models for Identifying Knowledge Components
Knowledge Components (KCs) are foundational to adaptive learning systems, but their manual identification by domain experts is a significant bottleneck. While Large Language Models (LLMs) offer a promising avenue for automating this process, prior research has been limited to small datasets and has been shown to produce superfluous, redundant KC labels. This study addresses these limitations by first scaling a "simulated textbook" LLM prompting strategy (using GPT-4o-mini) to a larger dataset of 646 multiple-choice questions. We found that this initial automated approach performed significantly worse than an expert-designed KC model (RMSE 0.4285 vs. 0.4206) and generated an excessive number of KCs (569 vs. 101). To address the issue of redundancy, we proposed and evaluated a novel method for merging semantically similar KC labels based on their cosine similarity. This merging strategy significantly improved the model's performance; a model using a cosine similarity threshold of 0.8 achieved the best result, reducing the KC count to 428 and improving the RMSE to 0.4259. This demonstrates that while scaled LLM generation alone is insufficient, combining it with a semantic merging technique offers a viable path toward automating and refining KC identification.
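The merging step described above lends itself to a short sketch: embed the LLM-generated KC labels, compute pairwise cosine similarities, and group any labels above the 0.8 threshold. The choice of embedding model, the example labels, and the union-find grouping are illustrative assumptions, not necessarily the study's implementation.

```python
# Sketch of merging semantically similar KC labels by cosine similarity.
# Embedding model and example labels are assumed for illustration.
import numpy as np
from sentence_transformers import SentenceTransformer

def merge_kc_labels(labels, threshold=0.8):
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(labels)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize rows
    sim = emb @ emb.T                                       # pairwise cosine similarity

    # Union-find: any pair at or above the threshold ends up in the same cluster.
    parent = list(range(len(labels)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if sim[i, j] >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i, label in enumerate(labels):
        clusters.setdefault(find(i), []).append(label)
    return list(clusters.values())

# Redundant LLM-generated labels collapse into one KC cluster.
print(merge_kc_labels(["adding fractions", "fraction addition", "solving linear equations"]))
```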
Owlgorithm: Supporting Self-Regulated Learning in Competitive Programming through LLM-Driven Reflection
We present Owlgorithm, an educational platform that supports Self-Regulated Learning (SRL) in competitive programming (CP) through AI-generated reflective questions. Leveraging GPT-4o, Owlgorithm produces context-aware, metacognitive prompts tailored to individual student submissions. Integrated into a second- and third-year CP course, the system's reflective prompts adapted to student outcomes: guiding deeper conceptual insight for correct solutions and structured debugging for partial or failed ones. Our exploratory assessment of student ratings and TA feedback revealed both promising benefits and notable limitations. While many found the generated questions useful for reflection and debugging, concerns were raised about feedback accuracy and classroom usability. These results suggest advantages of LLM-supported reflection for novice programmers, though refinements are needed to ensure reliability and pedagogical value for advanced learners. From our experience, several key insights emerged: GenAI can effectively support structured reflection, but careful prompt design, dynamic adaptation, and usability improvements are critical to realizing its potential in education. We offer specific recommendations for educators using similar tools and outline next steps to enhance Owlgorithm's educational impact. The underlying framework may also generalize to other reflective learning contexts.
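A minimal sketch of outcome-adapted reflective prompting with GPT-4o is shown below, using the standard OpenAI Python client. The prompt wording, verdict categories, and function shape are illustrative assumptions, not Owlgorithm's actual templates or architecture.

```python
# Sketch: generate reflective questions that adapt to the submission outcome.
# Prompt text and verdict handling are assumptions, not Owlgorithm's templates.
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

def reflective_questions(problem: str, code: str, verdict: str) -> str:
    if verdict == "accepted":
        focus = "ask questions that deepen conceptual insight and explore alternative approaches"
    else:
        focus = "ask questions that guide structured debugging of the failing cases"
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "You are a metacognitive tutor for competitive programming. "
                        f"Given a problem and a student submission, {focus}. "
                        "Do not reveal the solution."},
            {"role": "user",
             "content": f"Problem:\n{problem}\n\nSubmission ({verdict}):\n{code}"},
        ],
    )
    return response.choices[0].message.content
```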
Bytes of a Feather: Personality and Opinion Alignment Effects in Human-AI Interaction
Interactions with AI assistants are increasingly personalized to individual users. Because AI personalization is dynamic and machine-learning-driven, we have limited understanding of how it affects interaction outcomes and user perceptions. We conducted a large-scale controlled experiment in which 1,000 participants interacted with AI assistants that took on certain personality traits and opinion stances. Our results show that participants consistently preferred to interact with models that shared their opinions. Participants also found opinion-aligned models more trustworthy, competent, warm, and persuasive, corroborating an AI-similarity-attraction hypothesis. In contrast, we observed no or only weak effects of AI personality alignment, with introverted models rated as less trustworthy and competent by introverted participants. These findings highlight opinion alignment as a central dimension of AI personalization and user preference, while underscoring the need for a more grounded discussion of the limits and risks of personalized AI.