Aicosoft - AI & Technology News, Insights & Innovation

Let's be honest: we’ve all been there. You’re deep into a research project, and your screen looks like a digital hurricane hit it. You have 27 browser tabs open, a folder overflowing with PDFs you swear you'll read, and you're trying to piece it all together into something coherent.

What if you could have an assistant to do the heavy lifting? Not just a chatbot that can answer trivia, but a real research partner. An AI that can read those web pages, digest those PDFs, understand the charts inside them, and then write up a summary for you.

It sounds like science fiction, but it's more achievable than you might think. Today, we're going to roll up our sleeves and build exactly that: a "Swiss Army Knife" research agent. This isn't about fancy prompts; it's about giving an AI a set of real tools to solve real problems, from start to finish.

First Things First: Setting Up Our Workshop

Before we can build anything, we need to lay out our tools and materials. In the world of code, that means setting up our environment and grabbing the necessary libraries. Think of it as preparing your workbench.

We'll be using Python, and we need to install a few key packages. The star of the show is smolagents, which helps us orchestrate our AI. We'll also need tools for web scraping (trafilatura), web searching (duckduckgo-search), reading PDFs (pypdf, pymupdf), and even creating Word documents (python-docx).

Here’s the quick setup command. The leading % is notebook syntax (Jupyter or Colab); in a regular terminal, drop it and run pip directly:

%pip -q install -U smolagents openai trafilatura duckduckgo-search pypdf pymupdf python-docx pillow tqdm

Next, we need to handle our API keys. We're working with OpenAI's models, so we need a key for that. We'll also set up an optional one for Serper, a paid Google Search API that can give us really high-quality search results. The code below prompts for each key with getpass, so nothing is echoed to the screen and you never have to paste a secret directly into your script. It's a much safer way to handle your secrets.

import os, re, json, base64, getpass
from typing import List, Dict, Any
import requests
import trafilatura
from duckduckgo_search import DDGS
from pypdf import PdfReader
import fitz
from docx import Document
from docx.shared import Pt
from datetime import datetime
from openai import OpenAI
from smolagents import CodeAgent, OpenAIModel, tool

# Securely get the OpenAI API key
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Paste your OpenAI API key (hidden): ").strip()
print("OPENAI_API_KEY set:", "YES" if os.environ.get("OPENAI_API_KEY") else "NO")

# Optionally get the Serper API key for better search results
if not os.environ.get("SERPER_API_KEY"):
    serper = getpass.getpass("Optional: Paste SERPER_API_KEY for Google results (press Enter to skip): ").strip()
    if serper:
        os.environ["SERPER_API_KEY"] = serper
print("SERPER_API_KEY set:", "YES" if os.environ.get("SERPER_API_KEY") else "NO")

client = OpenAI()

With our keys loaded and our libraries installed, our workshop is ready to go.

Giving Our Agent a Library Card (and Internet Access)

An agent can't research in a vacuum. It needs access to information. Our first job is to give it two fundamental abilities: searching the web and reading files we give it.

Let's start with web search. We'll build a flexible function that can either use the powerful, paid Serper API (if you provided a key) or fall back to the free and very capable DuckDuckGo search. This way, our agent always has a way to find information online.

Then, for those PDFs you've already downloaded, we’ll create a simple way to upload them. If you're running this in a Google Colab notebook, the code below makes it super easy to upload files directly.

# A simple way to upload PDFs in Google Colab
try:
    from google.colab import files
    os.makedirs("/content/pdfs", exist_ok=True)
    uploaded = files.upload()
    for name, data in uploaded.items():
        if name.lower().endswith(".pdf"):
            with open(f"/content/pdfs/{name}", "wb") as f:
                f.write(data)
    print("PDFs in /content/pdfs:", os.listdir("/content/pdfs"))
except Exception as e:
    print("Upload skipped:", str(e))

# The function our agent will use to search the web
def web_search(query: str, k: int = 6) -> List[Dict[str, str]]:
    serper_key = os.environ.get("SERPER_API_KEY", "").strip()
    # Use Serper/Google if we have a key
    if serper_key:
        # ... (API call logic from original article)
        pass
    # Fallback to DuckDuckGo
    out = []
    with DDGS() as ddgs:
        for r in ddgs.text(query, max_results=k):
            out.append({
                "title": r.get("title",""),
                "url": r.get("href",""),
                "snippet": r.get("body",""),
            })
    return out

# A tool to fetch the actual text from a URL
def fetch_url_text(url: str) -> Dict[str, Any]:
    try:
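        # fetch_url takes no timeout argument; the download timeout comes from trafilatura's settings
        downloaded = trafilatura.fetch_url(url)
        # fetch_url returns None when the download fails
        text = trafilatura.extract(downloaded) if downloaded else None
        # ... (error handling from original article)
        return {"url": url, "ok": bool(text), "text": text or ""}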
    except Exception as e:
        return {"url": url, "ok": False, "error": str(e), "text": ""}

Now our agent can not only find sources but also read the full text from them. It’s like going from just reading book titles to being able to open them up and read the pages.
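
The Serper branch of web_search is elided in the listing above. As a rough sketch of what it might look like, the helper below uses only the standard library and maps results onto the same title/url/snippet shape as the DuckDuckGo fallback. The endpoint `https://google.serper.dev/search`, the `X-API-KEY` header, and the `organic` list in the response are assumptions to check against Serper's current documentation, and the names `serper_search` and `normalize_serper` are hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict, List

SERPER_URL = "https://google.serper.dev/search"  # assumed endpoint; verify in Serper's docs


def normalize_serper(payload: Dict[str, Any], k: int = 6) -> List[Dict[str, str]]:
    """Map a Serper-style response onto the title/url/snippet shape used by web_search."""
    results = []
    for item in payload.get("organic", [])[:k]:
        results.append({
            "title": item.get("title", ""),
            "url": item.get("link", ""),
            "snippet": item.get("snippet", ""),
        })
    return results


def serper_search(query: str, api_key: str, k: int = 6) -> List[Dict[str, str]]:
    """POST the query to Serper and return normalized results (hypothetical helper)."""
    body = json.dumps({"q": query, "num": k}).encode("utf-8")
    req = urllib.request.Request(
        SERPER_URL,
        data=body,
        headers={"X-API-KEY": api_key, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return normalize_serper(json.load(resp), k)


# Offline check of the normalization step with a made-up payload
sample = {"organic": [{"title": "Example", "link": "https://example.com", "snippet": "Hi"}]}
print(normalize_serper(sample))
```

Inside web_search, the Serper branch would then need to return the result of `serper_search(query, serper_key, k)` so that execution does not fall through to the DuckDuckGo code.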

Going Deeper: Teaching the Agent to Read PDFs (and See Images)

Okay, our agent can browse the web. But what about those dense, information-packed PDFs? We need to teach it how to crack those open and understand what's inside. This involves two key skills: extracting the text and—this is the really cool part—analyzing the images.

First, we'll write a function to pull all the text out of a PDF, page by page.

But here's where we level up. Most scripts would just ignore images, but charts, graphs, and diagrams often contain the most important information! So, we'll also write a function to extract every image from the PDF and save it.
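
The image-extraction function is only stubbed out in this walkthrough, so here is a minimal sketch of one way it could work, assuming PyMuPDF (imported as fitz). It walks the first few pages, pulls each embedded image by its cross-reference number, and saves it to disk. The function name `extract_pdf_images_sketch`, the `image_filename` helper, and the file naming scheme are all hypothetical, and the default output folder is a placeholder.

```python
import os
from typing import List


def image_filename(pdf_path: str, page_index: int, img_index: int, ext: str) -> str:
    """Build a predictable file name such as 'paper_p1_img2.png' (hypothetical naming scheme)."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    return f"{stem}_p{page_index + 1}_img{img_index + 1}.{ext}"


def extract_pdf_images_sketch(pdf_path: str, out_dir: str = "extracted_images",
                              max_pages: int = 10) -> List[str]:
    """Save every embedded image from the first max_pages pages and return the file paths."""
    import fitz  # PyMuPDF; imported here so the helper above works without it

    os.makedirs(out_dir, exist_ok=True)
    saved: List[str] = []
    with fitz.open(pdf_path) as doc:
        for page_index in range(min(len(doc), max_pages)):
            page = doc[page_index]
            for img_index, img in enumerate(page.get_images(full=True)):
                xref = img[0]  # first tuple item is the image's cross-reference number
                info = doc.extract_image(xref)  # dict with the raw bytes and file extension
                name = image_filename(pdf_path, page_index, img_index, info["ext"])
                path = os.path.join(out_dir, name)
                with open(path, "wb") as f:
                    f.write(info["image"])
                saved.append(path)
    return saved


print(image_filename("/content/pdfs/paper.pdf", 0, 1, "png"))  # paper_p1_img2.png
```

Note that this only finds embedded raster images. Charts drawn as vector graphics would not show up here and would need the page rendered to an image instead.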

Then, we’ll connect it to a vision-capable AI model (like GPT-4's vision model). This allows our agent to literally look at a chart and answer a question about it, like "What is the trend shown in this graph?"

# Function to read text from a PDF
def read_pdf_text(pdf_path: str, max_pages: int = 30) -> Dict[str, Any]:
    reader = PdfReader(pdf_path)
    pages = min(len(reader.pages), max_pages)
    text = "\n\n".join([reader.pages[i].extract_text() or "" for i in range(pages)])
    return {"pdf_path": pdf_path, "pages_read": pages, "text": text.strip()}

# Function to pull images out of a PDF
def extract_pdf_images(pdf_path: str, out_dir: str = "/content/extracted_images", max_pages: int = 10) -> List[str]:
    # ... (code from original to extract and save images)
    pass

# The magic function: analyze an image with a vision model
def vision_analyze_image(image_path: str, question: str) -> Dict[str, Any]:
    with open(image_path, "rb") as f:
        img_bytes = f.read()
    
    # This uses the OpenAI API to send an image and a question
    resp = client.chat.completions.create(
        model="gpt-4o",  # gpt-4-vision-preview has been retired; any vision-capable chat model works here
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": f"Answer concisely and accurately.\n\nQuestion: {question}"},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{base64.b64encode(img_bytes).decode()}"},
                    },
                ],
            }
        ]
    )
    return {"image_path": image_path, "answer": resp.choices[0].message.content}

This is a huge step. Our agent is no longer blind to visual data. It can synthesize information from both text and images, creating a much richer understanding of the source material.
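
One detail worth tightening: vision_analyze_image labels every image as PNG in its data URL, but images pulled out of PDFs are often JPEGs. A small sketch of a fix, using the standard library's mimetypes module to guess the type from the file extension, might look like this. The helper name `image_to_data_url` is hypothetical, and it falls back to PNG when the extension is unknown.

```python
import base64
import mimetypes


def image_to_data_url(image_path: str, img_bytes: bytes) -> str:
    """Encode image bytes as a data URL, guessing the MIME type from the file extension."""
    mime, _ = mimetypes.guess_type(image_path)
    if mime is None or not mime.startswith("image/"):
        mime = "image/png"  # fall back to the type the original code assumes
    encoded = base64.b64encode(img_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


print(image_to_data_url("chart.jpeg", b"abc"))  # data:image/jpeg;base64,YWJj
```

The return value can be passed as the "url" field of the image_url entry in the request.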

From Raw Data to Polished Report

All this data is useless if it just sits there. The final piece of the puzzle is teaching our agent how to communicate its findings. We want a professional, easy-to-read report.

We'll equip the agent with two output tools: one to write a clean Markdown file and another to convert that Markdown into a polished .docx Word document. This gives us a final product that’s ready to be shared with a colleague or dropped into a presentation.
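
The two output functions are not shown in this walkthrough, so the following is only a sketch of how they could work. It assumes the report uses simple Markdown (headings, dash or asterisk bullets, and plain paragraphs) and that python-docx is installed for the conversion step. The function names and the `markdown_to_blocks` helper are hypothetical, and anything fancier, such as tables, links, or bold text, is passed through as plain text.

```python
import os
from typing import List, Tuple


def write_markdown(path: str, content: str) -> str:
    """Write the report text to path, creating parent folders, and return the path."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
    return path


def markdown_to_blocks(md: str) -> List[Tuple[str, str]]:
    """Classify each non-empty line as ('h1'..'h3', text), ('bullet', text) or ('p', text)."""
    blocks = []
    for line in md.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            level = min(len(line) - len(line.lstrip("#")), 3)  # cap deep headings at h3
            blocks.append((f"h{level}", line.lstrip("#").strip()))
        elif line.startswith(("- ", "* ")):
            blocks.append(("bullet", line[2:].strip()))
        else:
            blocks.append(("p", line))
    return blocks


def write_docx_from_markdown(md_path: str, docx_path: str) -> str:
    """Convert a simple Markdown file to DOCX; only headings, bullets and paragraphs are handled."""
    from docx import Document  # python-docx, imported lazily

    with open(md_path, encoding="utf-8") as f:
        blocks = markdown_to_blocks(f.read())
    doc = Document()
    for kind, text in blocks:
        if kind.startswith("h"):
            doc.add_heading(text, level=int(kind[1]))
        elif kind == "bullet":
            doc.add_paragraph(text, style="List Bullet")
        else:
            doc.add_paragraph(text)
    os.makedirs(os.path.dirname(docx_path) or ".", exist_ok=True)
    doc.save(docx_path)
    return docx_path


print(markdown_to_blocks("# Title\n\n- point one\nSome text."))
```

Splitting the parsing from the DOCX writing keeps the fragile part (interpreting Markdown) easy to check on its own.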

To make these functions usable by our agent, we wrap them as "tools." The @tool decorator from the smolagents library is a neat way to tell the agent, "Hey, this is a skill you can use." We'll do this for all the functions we've built so far—web search, PDF reading, vision analysis, and report writing.

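# We'll wrap all our functions like this one.
# smolagents reads each tool's type hints and docstring, so every argument needs a description under "Args:".
@tool
def t_web_search(query: str, k: int = 6) -> str:
    """Searches the web and returns results as a JSON string.

    Args:
        query: The search query.
        k: Maximum number of results to return.
    """
    return json.dumps(web_search(query, k))

@tool
def t_read_pdf_text(pdf_path: str, max_pages: int = 30) -> str:
    """Reads and returns the text from a PDF file as a JSON string.

    Args:
        pdf_path: Path to the PDF file.
        max_pages: Maximum number of pages to read.
    """
    return json.dumps(read_pdf_text(pdf_path, max_pages=max_pages))

@tool
def t_vision_analyze_image(image_path: str, question: str) -> str:
    """Analyzes an image and answers a question about it.

    Args:
        image_path: Path to the image file.
        question: The question to answer about the image.
    """
    return json.dumps(vision_analyze_image(image_path, question))

# ... and so on for all our other functions, including writing files.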

By turning each capability into a distinct tool, we give the agent a clear set of actions it can choose from. It's not just guessing; it's reasoning about which tool to use at each step of the research process.
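
The "and so on" covers tools this walkthrough never shows, including the PDF listing tool the agent is later given. As a minimal sketch, assuming uploaded PDFs live in /content/pdfs as in the upload snippet earlier, the underlying function could be as simple as the following. The name `list_pdfs` is hypothetical.

```python
import os
from typing import List


def list_pdfs(folder: str = "/content/pdfs") -> List[str]:
    """Return sorted paths of the PDF files in folder, or an empty list if it does not exist."""
    if not os.path.isdir(folder):
        return []
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.lower().endswith(".pdf")
    )


print(list_pdfs())
```

Wrapping it as a tool would follow the same pattern as the others: a decorated function with a docstring that returns `json.dumps(list_pdfs())`. Returning an empty list for a missing folder lets the agent carry on with web sources alone when nothing was uploaded.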

Bringing It All Together: The Agent's Brain

We’ve built all the parts: the hands for fetching data, the eyes for seeing images, and the voice for writing reports. Now, it's time to connect them to a brain.

We'll use the CodeAgent from smolagents. We give it the list of all the tools we just created. Then, we write a clear, step-by-step prompt that acts as its mission briefing. This isn't just a vague "research this topic" command. We're giving it a structured plan.

Here's how we set up the agent and give it its marching orders:

# Initialize the model we want the agent to use
model = OpenAIModel(model_id="gpt-4-turbo") # Or your preferred model

# Create the agent and give it the complete toolbox
agent = CodeAgent(
    tools=[
        t_web_search,
        t_fetch_url_text,
        t_list_pdfs,
        t_read_pdf_text,
        t_extract_pdf_images,
        t_vision_analyze_image,
        t_write_markdown,
        t_write_docx_from_markdown,
    ],
    model=model,
)

# The research topic and the prompt that guides the agent
topic = "Build a research brief on the most reliable design patterns for tool-using agents (2024-2026), focusing on evaluation, citations, and failure modes."

prompt = f"""
You are a Swiss Army Knife Research Agent.

Your research question is: {topic}

Follow these steps carefully:
1. First, check if there are any local PDFs available using `t_list_pdfs`. Decide if they seem relevant to the topic.
2. Perform a web search using `t_web_search` to find current articles and papers on the topic.
3. Review the search results and use `t_fetch_url_text` to read the content of the most promising URLs.
4. If you found relevant PDFs, use `t_read_pdf_text` to extract their content. Also, use `t_extract_pdf_images` to see if there are any figures.
5. If you extracted any images that look like charts or diagrams, use `t_vision_analyze_image` to understand what they show.
6. Synthesize all the information you've gathered from the web and PDFs.
7. Write a comprehensive report in a Markdown file named '/content/report/report.md'.
8. Finally, convert your Markdown report into a DOCX file at '/content/report/report.docx'.
"""

# Run the agent and let it do its work!
out = agent.run(prompt)

And that's it. You press "run," and the agent gets to work. It will follow your instructions, calling the right tools in the right order, thinking step-by-step, and ultimately producing a finished report in both Markdown and DOCX formats, ready for you to download.
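
Agents do not always finish every step, so it can be worth confirming that both files were actually written before you go looking for them. The check below is a small sketch that uses the two paths named in the prompt; the helper name `missing_outputs` is hypothetical.

```python
import os
from typing import Iterable, List


def missing_outputs(paths: Iterable[str]) -> List[str]:
    """Return the paths that do not exist or are empty files."""
    return [p for p in paths if not os.path.isfile(p) or os.path.getsize(p) == 0]


expected = ["/content/report/report.md", "/content/report/report.docx"]
problems = missing_outputs(expected)
print("Missing or empty:", problems if problems else "none")
```

If anything is listed, rerunning the agent or reading its step log is the usual next move.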

So, what have we really built here? We've created more than just a clever chatbot. We've designed a reliable research assistant that can navigate the messy world of information, distinguish between different types of data, and produce a structured, useful output. It's a blueprint for building AI systems that are less like conversational toys and more like trustworthy partners in getting real work done.

Beyond Chat: How to Build a "Swiss Army Knife" AI Research Assistant
