Introduction
In this post I go into some detail about how the different Copilots in Microsoft 365 operate in practice and show that not all Copilots are created equal. This information is useful from a technical perspective, but also from a staff training perspective. When you roll out Microsoft Copilot, people within the organisation need to understand that the Copilots within the Office applications are not all the same and are each tuned in different ways.
A Copilot is a Copilot is a Copilot?
When I first heard about Microsoft Copilot and saw the similar looking Copilot frame on the right side of the screen, I figured it was probably just a common interface that could access the data from the application you had open at the time. However, after actually getting the opportunity to play with Microsoft Copilot in the various apps, it has become clear that it is actually a lot more complex than that. Each of the Copilots within the apps has been tailored to respond in a context that makes sense for the type of application that you’re using. This has been achieved by the engineers at Microsoft using various methods of prompt engineering and orchestration in the background.
I thought it would be useful to demonstrate the differences in the way the various Copilots in different apps respond to the exact same prompt. For this demo I have chosen an innocuous query that is deliberately not explicit and could be interpreted in different ways. The query I chose was “Tell me about the weather in Melbourne”. This is not the kind of prompt you would really use in practice, but it is something I’ve chosen to highlight the differences in the way each Copilot responds.
Let's start by querying the OpenAI ChatGPT 3.5 model to see how this foundation model interprets the request on its own. This offers a baseline for comparing the difference that the exact same prompt makes when asked of the various Copilots.
1. ChatGPT 3.5
You will see here that the ChatGPT foundation model has interpreted this question as a request to know the specific temperature in Melbourne right now. This is because I wasn’t explicit enough in what I asked the model, so I didn’t get back any general information about expected temperature ranges in Melbourne.
In setting up the ChatGPT model, the OpenAI team appear to have designed the system to fail gracefully in these cases where it thinks it's being asked for data that's more current than it knows about. This is an unfortunate trait of foundation models: they only know information up to the point when their training was completed. It is interesting that it did not respond with some more generic information about the expected temperatures throughout the year or historical information about the weather, though (keep this in mind when we get to the Word Copilot example).
2. Bing Chat
The method used here is called a Retrieval Augmented Generation (RAG) framework, where the foundation model isn't asked for the answer to the question directly. Instead, Bing will first retrieve some reputable sources for the kind of information being requested and provide that data as part of the prompt to the foundation model (also often referred to as grounding the model with data). The foundation model here has been used to interpret the retrieved data instead of using its own “knowledge” from the data it was trained on. In this case, Bing is functioning as an orchestration engine that retrieves data and compiles it into an expanded prompt that is sent to the ChatGPT model in addition to your original query.
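To make the idea a bit more concrete, below is a minimal sketch of the RAG pattern in Python. Everything in it, the search_web() helper, the prompt wording and the model name, is my own illustrative assumption, not how Bing actually implements it:

# Minimal RAG sketch (illustrative only); search_web() is a stand-in helper.
from openai import OpenAI

client = OpenAI()

def search_web(query: str, max_results: int = 3) -> list[str]:
    # Stand-in for a real retrieval step; returns canned snippets here.
    return ["Melbourne has a temperate oceanic climate with four distinct seasons."][:max_results]

def answer_with_rag(question: str) -> str:
    # 1. Retrieve: pull a few reputable snippets for the query.
    snippets = search_web(question)
    # 2. Ground: compile the retrieved text into an expanded prompt.
    grounding = "\n\n".join(snippets)
    prompt = (
        "Answer the question using ONLY the sources below.\n\n"
        f"Sources:\n{grounding}\n\nQuestion: {question}"
    )
    # 3. Generate: the foundation model interprets the retrieved data
    #    rather than relying on its training data alone.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content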
3. M365 Chat
When I asked the M365 Chat interface within Teams this question, it responded that it couldn’t find the answer to the question and recommended that I use a web search. This is because the M365 Chat Copilot uses a similar Retrieval Augmented Generation (RAG) framework to Bing. Rather than searching the Internet for information on the weather in Melbourne, it attempted a Semantic Index search (Reference: https://learn.microsoft.com/en-us/microsoftsearch/semantic-index-for-copilot) across the documents, emails, chats and other data within my Office 365 tenant. I didn’t actually have any information within my tenancy on this topic at the time. As a result, M365 Chat was unable to get any information to pass on to the foundation model to provide an answer. What is interesting to me here is that it didn’t just ask the foundation model to have a go at telling me about the weather in Melbourne, but instead apologised for not being able to find any documents about this.
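As a rough guess at what that orchestration decision might look like (purely illustrative, not Microsoft's actual logic; the hits would come from the Semantic Index and the generate callable would wrap the foundation model):

# Illustrative orchestration sketch only.
def m365_chat_orchestrator(question: str, semantic_index_hits: list[str], generate) -> str:
    if not semantic_index_hits:
        # Nothing in the tenant to ground on: apologise rather than asking
        # the foundation model to improvise an answer.
        return "Sorry, I couldn't find anything about that in your organisation."
    grounding = "\n\n".join(semantic_index_hits)
    # Hand a grounded prompt to the foundation model, as in the RAG sketch above.
    return generate(f"Answer using only these sources:\n{grounding}\n\nQuestion: {question}")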
Note: In this case, the Microsoft 365 Chat Copilot was configured to only have access to internal documents and was not enabled for searching the Internet for data. This is a setting that administrators have control over: https://learn.microsoft.com/en-us/microsoft-365-copilot/manage-public-web-access
Of course, had I had documents that contained information on the weather in Melbourne it would have been able to answer me. Below is an example of the output when there is a document containing information about the weather in Melbourne. You will see here that the RAG approach has been used to retrieve the data and the document is referenced below the response:
What is also interesting about the previous response is that this information was actually generated in Word from a later example that I ran for this blog post. The data being displayed here is actually an interpretation of information previously generated by the model. I find this interesting, because when data like this keeps getting recycled through these models over time, will there start to be a degradation in the quality of the information? Like a photocopy of a photocopy. Here’s an interesting article that goes into some more detail on what the result of this could be in the long term: https://cosmosmagazine.com/technology/ai/training-ai-models-on-machine-generated-data-leads-to-model-collapse/. Always take care to check the information a Copilot outputs before using it.
4. Microsoft Word
Microsoft Word is usually used to create longer form documents, so Microsoft has tuned the way the foundation model is prompted when you ask it questions in Word. When asked about the weather in Melbourne, the model responded with more of a Wikipedia style response, attempting to go into depth about what the climate is like in Melbourne throughout the year.
This is a stark difference to the way the ChatGPT foundation model tried to answer this question. This happens by design, as Microsoft realises that this is more likely what you want in a Word document, rather than wanting to know the temperature right now. The way they do this is by taking the original query and adding additional (“system prompt”) information to it before sending it to the foundation model. This allows them to change the output to be more like what you might want in a Word document. It’s not clear exactly what Microsoft is including in the prompt that it sends to the foundation model, as you never get to see this additional information.
If you play around enough with ChatGPT you can see that adding additional text like “provide an extended response similar to a reference encyclopaedia” will cause the model to give outputs more like this. I don’t believe it’s documented anywhere exactly what Microsoft adds to the prompts to get these responses, as the prompt engineering is a bit of secret sauce.
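To illustrate the general idea, here is what adding a system-style instruction looks like when calling the model directly via the OpenAI API. The encyclopaedia instruction below is my own guess at the kind of thing that could be added, not Microsoft's actual hidden prompt:

# Illustrative only: the system message is my own guess, not Microsoft's hidden prompt.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        # Hidden instructions added by the app change the style of the answer.
        {"role": "system", "content": "You are drafting a document. Provide an "
         "extended response similar to a reference encyclopaedia."},
        # The user's original query stays the same.
        {"role": "user", "content": "Tell me about the weather in Melbourne"},
    ],
)
print(response.choices[0].message.content)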
5. PowerPoint
The PowerPoint Copilot is an even more interesting topic, as it doesn’t just produce text, it will also add pictures and make design choices when producing its output. You can see that for our example weather query it produced a nice picture of Melbourne’s botanical gardens and skyline, created a meaningful heading and added some dot points about the weather in Melbourne. It looks pretty impressive as an output to such a basic query:
This is all the more impressive when you have some understanding of what’s going on in the background for the PowerPoint Copilot. There is an interesting paper I found, written by some of the research staff at Microsoft, about how this works. It can be found here: https://arxiv.org/abs/2306.03460
TLDR: For apps like PowerPoint the Copilot needs to be able to tell the application itself how to style the page in addition to just generating text. This kind of thing could be done with a scripting language that the foundation model produces (like GitHub Copilot does), however, this method is prone to syntax errors. The researchers at Microsoft found that it was safer to create a specialised domain specific language for describing the layout of a document (more like a declarative language, similar to what is used for Terraform or PowerShell Desired State Configuration). The language, in this case, is called Office Domain Specific Language (ODSL) and is designed to use a minimal number of tokens (words) and be easy to describe as an input to a foundation model. Here’s an example of the language:
# Inserts new "Title and Content" slides after provided ones.
slides = insert_slides(precededBy=slides, layout="Title and Content")
When the prompt is sent to the model it includes schema information about the ODSL language and the format of the desired response. The model will then respond with a description of what each slide should look like in the desired ODSL format. The response is thoroughly checked and validated to make sure it has the right format, and then translated into a lower-level language by an interpreter program which gets executed by PowerPoint. It is both very cool and a little crazy that the foundation models are powerful enough to do these kinds of things.
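As a very rough sketch of that pipeline (my own simplification with hypothetical helper names; the real schema, validation and interpreter are far more involved), the orchestration might look something like this:

# Rough sketch of the flow described in the ODSL paper; not Microsoft's actual code.
import re

ODSL_SCHEMA = 'insert_slides(precededBy, layout) -> slides   # plus the rest of the schema'
ODSL_LINE = re.compile(r'^\s*(#.*|\w+\s*=\s*\w+\(.*\))\s*$')  # a comment or one API call

def validate_odsl(program: str) -> list[str]:
    # Check every line matches the expected shape before anything is executed.
    lines = [line for line in program.splitlines() if line.strip()]
    bad = [line for line in lines if not ODSL_LINE.match(line)]
    if bad:
        raise ValueError(f"Model produced invalid ODSL: {bad}")
    return lines

def run_powerpoint_copilot(user_query: str, model_call) -> None:
    # 1. The prompt carries the ODSL schema alongside the user's request.
    prompt = f"{ODSL_SCHEMA}\n\nRequest: {user_query}\nRespond only in ODSL."
    # 2. The model answers in ODSL, 3. the answer is validated, and
    # 4. each statement is handed to the interpreter that drives PowerPoint.
    for statement in validate_odsl(model_call(prompt)):
        interpret_and_execute(statement)  # hypothetical lower-level translation step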
6. Outlook
When you write an email to your colleagues you don’t really want to be known as the person that writes the dreaded War and Peace novel-length emails. Fortunately, Microsoft are aware of this and took it into account when designing the Outlook Copilot. This Copilot is designed to produce output that looks, in both format and content, like an email. You can see below that the simple weather in Melbourne prompt actually created what looks and reads like an email. I must admit it did take a bit of artistic licence and go on a bit more of a ramble than I would have liked in this case though:
7. Excel
The Excel Copilot is once again quite different from the other Copilots. Asking it about the weather is not exactly what it’s supposed to be used for, but I asked it anyway, because, why not?:
In Excel, the Copilot is more for creating formulas and reasoning over the data that is in your spreadsheets. In the current preview version, the Copilot will only work on data that is in a defined table. This is likely because the data needs to be ordered in such a way that it can be sent as part of a prompt to the foundation model. In doing this, the data needs to retain all the column and row information while also keeping the token count low enough to be processed. I’m not sure how Microsoft could process an entire very large spreadsheet (with the potential complexity of multiple pages, scattered data, etc.) through the foundation models given their current token limits. Until they figure this out, we may be stuck with only processing data that is in defined smaller tables for the time being.
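As an illustration of the kind of trade-off involved (my own sketch, not Microsoft's approach), a defined table could be serialised into prompt text while preserving the row and column structure and respecting a rough token budget:

# Illustrative only: one way a table could be serialised for a prompt.
def table_to_prompt(headers: list[str], rows: list[list[str]], max_tokens: int = 2000) -> str:
    lines = [" | ".join(headers)]
    for row in rows:
        lines.append(" | ".join(str(cell) for cell in row))
        # Very rough token estimate (~4 characters per token) to stay under the limit.
        if sum(len(line) for line in lines) / 4 > max_tokens:
            lines.append("... (table truncated to fit the prompt)")
            break
    return "Here is the table data:\n" + "\n".join(lines)

print(table_to_prompt(["City", "Month", "Avg max (C)"],
                      [["Melbourne", "January", "26"], ["Melbourne", "July", "14"]]))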
If you are wondering what the Excel Copilot can actually do
though, here’s an example of how you could ask the Excel Copilot to reason over
the data in a table and give you an answer:
8. Microsoft Whiteboard
The Microsoft Whiteboard Copilot has another take on what it produces based on our modest weather question. It produced a bunch of sticky notes for various things that the weather could be in Melbourne. This is more contextualized toward a brainstorming type of session, which is common when using a Whiteboard:
This is, once again, a fun and different take on how a foundation model can be used to produce a more context-aware output for the application at hand.
The Wrap Up
As you can see, the Copilots across the Microsoft Office apps are all very different beasts, and this is something that people within your organisation should understand in order to get the most out of the Copilot product set. This is certainly something to keep in mind when training staff on the potential use cases and determining which Copilot is right for the task at hand. Cheers!