Thoughts as a social science researcher comparing ChatGPT 5.1 (Thinking), Gemini 3 Pro, and NotebookLM
By Holidays in Europe / November 27, 2025
Evaluating AI Language Models: An In-Depth Comparative Analysis of ChatGPT 5.1, Gemini 3 Pro, and NotebookLM
As a researcher deeply engaged in socio-legal studies, I recently undertook a comprehensive examination of three prominent AI language models—ChatGPT 5.1, Gemini 3 Pro, and NotebookLM—to understand their capabilities and potential applications within academic research. Over approximately 90 minutes, I conducted a structured assessment, focusing on their proficiency in analyzing complex scholarly materials and extracting relevant concepts.
Methodology
I began by selecting a lengthy, well-known socio-legal academic article that I am intimately familiar with. My objective was to evaluate each model’s ability to disentangle specific concepts and compile an exhaustive list of pertinent examples. I employed identical prompts across all three platforms:
- Requesting a detailed breakdown of key concepts.
- Asking for a comprehensive enumeration of supporting examples.
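The post does not say how each model's list was checked against the article, so the following is only a minimal sketch of one way such a comparison could be tallied, assuming a hand-built reference list of examples. The function names and the sample items are hypothetical placeholders, not data from this test, and exact string matching is a simplification: real items would usually need manual judgment to decide whether two phrasings refer to the same example.

```python
def normalize(item: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(item.lower().split())


def score_extraction(reference: list[str], extracted: list[str]) -> dict:
    """Compare one model's extracted items against a hand-built reference list."""
    ref = {normalize(i) for i in reference}
    got = {normalize(i) for i in extracted}
    found = ref & got
    return {
        # Share of reference items the model actually listed.
        "recall": len(found) / len(ref) if ref else 0.0,
        # Reference items the model overlooked.
        "missed": sorted(ref - got),
        # Items the model listed that are not in the reference list.
        "extra": sorted(got - ref),
    }


# Placeholder data only: not items from the article or from any model's output.
reference = ["Example A", "Example B", "Example C", "Example D"]
model_output = ["example a", "Example  C", "Example E"]

result = score_extraction(reference, model_output)
print(result)
```

Running the same function on each platform's output would give a comparable recall figure and a list of missed items per model, which is roughly the kind of comparison reported below.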
Performance Outcomes
Initial Concept Extraction
- ChatGPT 5.1 (Thinking): The model completed the task within 11 minutes but overlooked a significant portion of relevant items—suggesting limitations in processing scope or domain familiarity.
- Gemini 3 Pro: Delivered a thorough and accurate extraction in under a minute, capturing a broad array of items.
- NotebookLM: Also produced a comprehensive list swiftly, matching Gemini’s performance closely.
Interestingly, Gemini marginally outperformed NotebookLM despite the latter's specialization in analyzing user-supplied documents.
Extended Analysis with My Dissertation
Encouraged by initial results, I uploaded my own dissertation, which elaborates extensively on the same thematic area. When asked to identify additional items that the initial article omitted:
- NotebookLM edged out Gemini in detecting nuanced or overlooked details.
- ChatGPT 5.1 continued to miss a significant category, indicating possible domain or contextual limitations.
Categorization and Conceptual Clarity
Next, I asked all three models to suggest categorizations for the collated examples, aiming to illustrate how a main concept could be disentangled effectively:
- ChatGPT 5.1: Provided numerous categories, but they were overly abstract and lacked practical utility.
- Gemini 3 Pro: Delivered precise and insightful categorization.
- NotebookLM: Performed well, offering solid categorization with a slightly simpler structure.
Reflections on Model Performance and Limitations
Seeking insights into ChatGPT’s oversight, I inquired about why it failed to include certain