Evaluating AI Language Models: An In-Depth Comparative Analysis of ChatGPT 5.1, Gemini 3 Pro, and NotebookLM

As a researcher in socio-legal studies, I recently compared three prominent AI tools (ChatGPT 5.1, Gemini 3 Pro, and NotebookLM) to understand their capabilities and potential applications in academic research. Over approximately 90 minutes, I ran a structured assessment of how well each one analyzes complex scholarly material and extracts relevant concepts.

Methodology

I began by selecting a lengthy, well-known socio-legal academic article that I know intimately. My objective was to evaluate each model’s ability to disentangle specific concepts and compile an exhaustive list of pertinent examples. I used identical prompts across all three platforms:

  • Requesting a detailed breakdown of key concepts.
  • Asking for a comprehensive enumeration of supporting examples.

Performance Outcomes

Initial Concept Extraction

  • ChatGPT 5.1 (Thinking): Completed the task within 11 minutes but overlooked a significant portion of the relevant items, which suggests limitations in processing scope or domain familiarity.
  • Gemini 3 Pro: Delivered a thorough and accurate extraction in under a minute, capturing a broad array of items.
  • NotebookLM: Also produced a comprehensive list swiftly, matching Gemini’s performance closely.

Interestingly, Gemini marginally outperformed NotebookLM, even though NotebookLM is built specifically for analyzing user-supplied documents.
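
A comparison like this can be made more repeatable by scoring each model's list against a hand-compiled reference list. The following is a minimal sketch of that idea, not the procedure used in this test: the function names `normalize` and `coverage` and the placeholder items are assumptions for illustration only, and exact string matching will miss paraphrased items that a human reader would count as hits.

```python
def normalize(item: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences match."""
    return " ".join(item.lower().split())


def coverage(reference: list[str], extracted: list[str]) -> tuple[float, list[str]]:
    """Return the share of reference items a model found, plus the ones it missed."""
    ref = {normalize(r) for r in reference}
    got = {normalize(e) for e in extracted}
    missed = sorted(ref - got)
    recall = len(ref & got) / len(ref) if ref else 0.0
    return recall, missed


# Placeholder data: stand-ins for a hand-built reference list and one model's output.
reference_items = ["item a", "item b", "item c", "item d"]
model_items = ["Item A", "item   c", "something unrelated"]

recall, missed = coverage(reference_items, model_items)
print(f"recall={recall:.2f}, missed={missed}")
```

Running the same function over each model's output gives a comparable coverage figure per model, with the missed items listed for manual review.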

Extended Analysis with My Dissertation

Encouraged by initial results, I uploaded my own dissertation, which elaborates extensively on the same thematic area. When asked to identify additional items that the initial article omitted:

  • NotebookLM edged out Gemini in detecting nuanced or overlooked details.
  • ChatGPT 5.1 continued to miss a significant category, indicating possible domain or contextual limitations.

Categorization and Conceptual Clarity

Next, I asked all three models to suggest categories for the collated examples, to show how the main concept could be disentangled effectively:

  • ChatGPT 5.1: Provided numerous categories, but they were overly abstract and lacked practical utility.
  • Gemini 3 Pro: Delivered precise and insightful categorization.
  • NotebookLM: Performed well, offering solid categorization with a slightly simpler structure.

Reflections on Model Performance and Limitations

Seeking insights into ChatGPT’s oversight, I inquired about why it failed to include certain
