
The AI Coding Technology Behind GitHub Copilot: How to Make GPT Understand Your Code Better

AI coding tools are gradually changing our programming habits and experience, and GitHub Copilot is the most prominent example of this change. The product uses powerful algorithms to select the most relevant code snippets and comments from multiple sources, based on the developer's context and needs, and turns them into coding suggestions. In this article, we'll dig into the technical ideas behind GitHub Copilot, drawing on Copilot prompts obtained through reverse engineering, in the hope of helping anyone building AI-assisted coding tools. This content is partly translated and organized from Microsoft's blog, and we also discuss other researchers' findings and views.

GitHub Copilot uses algorithms to select relevant code snippets and comments, and relies on contextual understanding to generate coding suggestions. To do this, GitHub has developed a sophisticated prompt engineering strategy that prioritizes information about the developer's context in its prompt library, and it is experimenting with a vector database to create a customized coding experience for developers who work in private repositories or with proprietary code.

To make developers using GitHub Copilot feel like they are collaborating with another programmer, GitHub's machine learning experts have been researching, developing, and testing new features, many of which focus on enhancing the AI programmer's context understanding. This is because good communication is crucial for collaborative programming, and inferring context is essential for achieving good communication.

To uncover the work behind the scenes, the original author asked GitHub researchers and engineers about their efforts to improve Copilot's context understanding. Here is what they discovered.

From OpenAI's Codex Model to GitHub Copilot

When OpenAI released GPT-3 in June 2020, GitHub knew that developers would benefit from a product that applied this kind of model specifically to coding. So it provided input to OpenAI to help build Codex, a descendant of GPT-3 and the LLM that would power GitHub Copilot. The pair programming tool launched as a technical preview in June 2021 and became generally available in June 2022 as the world's first at-scale generative AI coding tool.

To ensure the model had the best information to make optimal predictions quickly, GitHub's machine learning (ML) researchers conducted a lot of work known as prompt engineering (explained in detail below) so that the model would provide contextually relevant responses with low latency.

Although GitHub is always experimenting with new models, Codex was the first truly robust generative AI model available, and GitHub's machine learning engineer David Slater stated: "The practical experience we gain from iterating on model and prompt improvements is invaluable."

All these experiments ultimately led to a pair programming tool that "freed up developers' time so they could focus on more meaningful work." The tool is even a huge help when starting new projects or files from scratch, as it provides a starting point that developers can adjust and improve as needed, according to GitHub's machine learning researcher Alice Li.

Why Context Matters

Developers use details like pull requests, project folders, and open issues to determine the context of their code. For generative AI coding tools, Copilot's engineers need to teach the model which information to use so it can do the same thing.

Transformer LLMs excel at making connections and big-picture reasoning. Generative AI coding tools are powered by large language models (LLMs) trained on massive amounts of code and natural language. Today's state-of-the-art LLMs are transformers, which let the model establish connections between the text a user inputs and the output the model generates. This is why today's generative AI tools provide more contextually relevant responses than earlier AI models.

However, AI needs to be told which information is relevant to your code. Currently, transformers that are fast enough to support GitHub Copilot can only handle about 6,000 characters at a time. While this is sufficient to advance and accelerate tasks like code completion and code change summarization, the limited character count means not all of a developer's code can be used as context.

Thus, Copilot's challenge is figuring out not only which data to provide to the model but also how to best sort and input it for optimal suggestions.

How GitHub Copilot Understands Your Code

It all comes down to prompts, which are compilations of IDE code and relevant context assembled for the model. These prompts are generated by background algorithms at any moment while you're coding. That is why GitHub Copilot can generate suggestions whether you're writing lines of code, have just finished a comment, or are working through some complex code.

  • Here’s how prompt creation occurs: A set of algorithms first selects relevant code snippets or comments from the current file and other sources. Then, these snippets and comments are prioritized, filtered, and assembled to form the final prompt.
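To make this concrete, here is a minimal sketch of such an assembly step. It is an illustrative reconstruction, not Copilot's actual code: the helper names, the priority table, and the rough 6,000-character budget (mentioned earlier in this article) are all assumptions.

```python
# Illustrative sketch only: the priorities, helpers, and character budget are
# assumptions for exposition, not Copilot's real implementation.
from dataclasses import dataclass

MAX_PROMPT_CHARS = 6000  # rough context budget discussed above

# Higher number = more important; loosely mirrors the PromptElementKind enum shown later.
PRIORITY = {
    "BeforeCursor": 3,
    "SimilarFile": 2,
    "ImportedFile": 2,
    "PathMarker": 1,
    "LanguageMarker": 1,
}

@dataclass
class PromptElement:
    kind: str   # e.g. "BeforeCursor", "SimilarFile"
    text: str

def assemble_prompt(candidates: list[PromptElement]) -> str:
    """Prioritize, filter, and concatenate candidate snippets into one prompt."""
    # 1. Prioritize: most important element kinds first.
    ordered = sorted(candidates, key=lambda el: PRIORITY.get(el.kind, 0), reverse=True)

    # 2. Filter: keep elements only while they still fit in the character budget.
    kept, used = [], 0
    for el in ordered:
        if used + len(el.text) <= MAX_PROMPT_CHARS:
            kept.append(el)
            used += len(el.text)

    # 3. Assemble: emit the surviving elements, least important context first,
    #    so the text closest to the cursor ends up nearest the end of the prompt.
    kept.sort(key=lambda el: PRIORITY.get(el.kind, 0))
    return "\n".join(el.text for el in kept)
```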

GitHub Copilot's context understanding is constantly maturing. The first version could only consider the file you were working on in the IDE as relevant context. But the Copilot team knew context went beyond that. Now, just a year later, they are experimenting with algorithms to consider your entire codebase to generate customized suggestions.

Here's how they got here:

  • Prompt engineering is the delicate art of creating prompts so that the model offers the most useful predictions to users. Prompts tell LLMs, including GitHub Copilot, what data to process, and in what order, to contextualize your code. Much of this work takes place in the so-called prompt library, where experts collaborate with algorithms to extract and prioritize various pieces of information about the developer's context, creating prompts to be processed by the GitHub Copilot model.

  • Adjacent tabs is a technique that allows GitHub Copilot to process all files open in a developer's IDE, not just the single file being worked on. By opening all files relevant to their project, developers automatically invoke GitHub Copilot to scan all of that data, find matches between the code surrounding the cursor and the code in those open files, and add those matches to the prompt.

In developing adjacent tabs, the GitHub Next team and in-house ML researchers ran A/B tests to determine the best parameters for matching code in the IDE against code in open tabs. They found that setting a very low threshold for including a match actually produced the best coding suggestions.

By including every little bit of context, adjacent tabs delivered a roughly 5% relative increase in users accepting GitHub Copilot's suggestions.
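To make the adjacent tabs idea more tangible, here is a minimal sketch under stated assumptions: the sliding window, the token-overlap (Jaccard) score, and the threshold value are illustrative guesses, not the plugin's real matching algorithm. It scores windows of each open file against the code around the cursor and keeps anything above a deliberately low bar, echoing the A/B finding above.

```python
# Illustrative sketch of "adjacent tabs" matching; the windowing, Jaccard scoring,
# and threshold are assumptions for exposition, not Copilot's actual algorithm.
def tokenize(code: str) -> set[str]:
    return set(code.split())

def best_matches(cursor_context: str, open_files: dict[str, str],
                 window_lines: int = 30, threshold: float = 0.1):
    """Return (filename, snippet, score) tuples for windows similar to the code near the cursor."""
    target = tokenize(cursor_context)
    matches = []
    for name, text in open_files.items():
        lines = text.splitlines()
        step = max(1, window_lines // 2)
        for start in range(0, max(1, len(lines) - window_lines + 1), step):
            snippet = "\n".join(lines[start:start + window_lines])
            candidate = tokenize(snippet)
            if not candidate or not target:
                continue
            score = len(target & candidate) / len(target | candidate)  # Jaccard similarity
            if score >= threshold:  # a low bar lets in "every little bit of context"
                matches.append((name, snippet, score))
    return sorted(matches, key=lambda m: m[2], reverse=True)
```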

  • The Fill-In-the-Middle (FIM) paradigm broadened the context aperture further. Before FIM, only the code before the cursor was put into the prompt, while the code after the cursor was ignored. (On GitHub, code before the cursor is called the prefix, and code after the cursor is the suffix). With FIM, we can tell the model which part of the prompt is the prefix and which is the suffix.

Even if you're creating a file from scratch and only have a skeleton, Copilot understands that coding isn't linear or sequential. Thus, as you jump around the file, FIM can help GitHub Copilot provide better coding suggestions for the part where your cursor is or for what should appear between the prefix and suffix.

Based on A/B testing, FIM delivered a 10% relative performance boost, meaning developers accepted 10% more of the suggestions shown to them. Thanks to optimal cache usage, adjacent tabs and FIM run in the background without adding any latency.
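The prefix/suffix split itself is easy to illustrate. The sketch below is an assumption-level reconstruction: it simply splits the current document at the cursor and flags FIM the way the reverse-engineered prompt later in this article does (the `isFimEnabled` field), without claiming anything about the model's actual sentinel tokens.

```python
# Minimal sketch: build a FIM-style request by splitting the file at the cursor.
# The dictionary shape mirrors the reverse-engineered prompt shown below;
# everything else is an illustrative assumption.
def build_fim_request(document: str, cursor_offset: int) -> dict:
    prefix = document[:cursor_offset]   # code before the cursor
    suffix = document[cursor_offset:]   # code after the cursor
    return {
        "prefix": prefix,
        "suffix": suffix,
        # FIM ("insert mode") is only useful when there is code after the cursor.
        "isFimEnabled": bool(suffix.strip()),
    }

# Example: the cursor sits on the blank body line, above an existing return statement.
doc = "def area(r):\n    \n    return result\n"
request = build_fim_request(doc, doc.index("\n    return"))
# request["prefix"] holds the signature and blank body line; request["suffix"] holds the return.
```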


Perhaps it is easier to understand with practical code examples. For example, in the reverse-engineered GitHub Copilot plugin code, the following content was included when constructing prompts:

exports.Priorities  = 
  exports.PromptWishlist  = 
  exports.PromptElementRanges  = 
  exports.PromptChoices  = 
  exports.PromptBackground  = 
  exports.PromptElementKind  = 
    undefined;
const M_prompt_parsing_utils_maybe = require("prompt-parsing-utils");
const M_tokenizer_maybe = require("tokenizer");
var i;
!(function (e) {
  e.BeforeCursor = "BeforeCursor";
  e.AfterCursor = "AfterCursor";
  e.SimilarFile = "SimilarFile";
  e.ImportedFile = "ImportedFile";
  e.LanguageMarker = "LanguageMarker";
  e.PathMarker = "PathMarker";
})((i = exports.PromptElementKind || (exports.PromptElementKind = {})));

// The main purpose of this code is to define an enumeration describing the kinds of prompt elements used internally by GitHub Copilot, and to import modules that appear to be related to processing those elements. `PromptElementKind` is an enumeration object that names the different types of information a prompt may draw on while you code: the code before or after the cursor, similar files, imported files, language markers, and path markers. These elements, such as relevant code snippets or comments selected from the current file and other sources, are then prioritized, filtered, and assembled into the final prompt, which tells the model which data to process, and in what order, to contextualize the code.

A practical example of a Copilot prompt might look like this:

{
  "prefix": "# Path: codeviz\\app.py\n# Compare this snippet from codeviz\\predictions.py:\n# import json\n# import sys\n# import time\n# from manifest import Manifest\n#\n# sys.path.append(__file__ + \"/..\")\n# from common import module_codes, module_deps, module_categories, data_dir, cur_dir\n#\n# gold_annots = json.loads(open(data_dir / \"gold_annotations.js\").read()",
  "suffix": "if __name__ == '__main__':\r\n    app.run(debug=True)",
  "isFimEnabled": true,
  "promptElementRanges": [
    { "kind": "PathMarker", "start": 0, "end": 23 },
    { "kind": "SimilarFile", "start": 23, "end": 2219 },
    { "kind": "BeforeCursor", "start": 2219, "end": 3142 }
  ]
}


As you can see, this prompt includes a prefix and a suffix. Copilot then sends this prompt (after some formatting) to the model. In this case, Copilot is invoking Codex in "insert mode" (also known as fill-in-the-middle or FIM mode) because the suffix is not empty.

Enhancing Semantic Understanding
--------------------------------

Today, Copilot is attempting to use **vector databases to create personalized coding experiences for developers working in private repositories or with proprietary code**. Generative AI coding tools use something called embeddings to retrieve information from vector databases.

*   **What is a vector database?** It is a database that indexes high-dimensional vectors.

*   **What are high-dimensional vectors?** They are mathematical representations of objects. Because these vectors can simulate objects across multiple dimensions, they can capture the complexity of the object. When properly used to represent code snippets, they can represent the semantic intent of the code, not just its syntax.

*   **What is an embedding?** In the context of coding and LLMs, an embedding is how code snippets are represented as high-dimensional vectors. Because LLMs "know" about programming and human language, they can capture both the syntax and semantics of code in these vectors.

**Here's how they work together:**

*   Algorithms will create embeddings for all snippets in a repository (potentially billions) and store them in a vector database.

*   Then, when you code, algorithms will embed snippets within the IDE.

*   Algorithms will then perform approximate matches between the IDE snippet embeddings and the embeddings stored in the vector database, also in real-time. The vector database allows the algorithm to quickly search for approximate matches on vectors (not just exact matches), even if it stores billions of embeddings.
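A stripped-down sketch of this retrieval flow is below. It is not GitHub's implementation: the embedding function is a stand-in for a real code-aware embedding model, and a brute-force cosine-similarity search stands in for an actual vector database, which would rely on approximate nearest-neighbor indexes to scale to billions of vectors.

```python
# Illustrative retrieval flow: embed repository snippets, index them, then match
# the snippet around the cursor against the index. The embedding function and the
# brute-force index are stand-ins, not Copilot's actual components.
import numpy as np

def embed(snippet: str) -> np.ndarray:
    """Stand-in embedding; a real system would call a code-aware embedding model."""
    vec = np.zeros(256)
    for token in snippet.split():
        vec[hash(token) % 256] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

class VectorIndex:
    """Toy in-memory 'vector database' using exact cosine similarity.
    A production system would use approximate nearest-neighbor search instead."""
    def __init__(self):
        self.snippets: list[str] = []
        self.vectors: list[np.ndarray] = []

    def add(self, snippet: str) -> None:
        self.snippets.append(snippet)
        self.vectors.append(embed(snippet))

    def query(self, snippet: str, k: int = 3) -> list[tuple[str, float]]:
        q = embed(snippet)
        scores = [float(q @ v) for v in self.vectors]   # cosine similarity of unit vectors
        top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
        return [(self.snippets[i], scores[i]) for i in top]

# Index snippets from the repository, then look up context for the code being edited.
index = VectorIndex()
for repo_snippet in ["def load_config(path): ...",
                     "class HttpClient: ...",
                     "def save_config(path, data): ..."]:
    index.add(repo_snippet)
print(index.query("def read_config(file): ..."))
```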

Developers are familiar with using hash codes to retrieve data, which usually look for exact character matches, explained GitHub's senior ML researcher Alireza Goudarzi. "But embeddings – because they are derived from data-trained LLMs – create semantic proximity between code snippets and natural language prompts."

Read the following three sentences and determine which two are semantically most similar.

*   **Sentence A**: The king moved and captured the pawn.

*   **Sentence B**: The king was crowned at Westminster Abbey.

*   **Sentence C**: Two white rooks are still in the game.

The answer is sentences A and C because both are about chess. While sentences A and B are syntactically or structurally similar due to "the king" being the subject, they are semantically different as "the king" is used in different contexts.

Here's how each statement would be translated into Python. Notice that despite their semantic differences, fragments A and B have syntactic similarities, while fragments A and C have semantic similarities.

### Fragment A:

if king.location() == pawn.location():
    board.captures_piece(king, pawn)


### Fragment B:

if king.location() == "Westminster Abbey":
    king.crown()


### Fragment C:

if len([r for r in board.pieces("white") if r.type == "rook"]) == 2:
    return True
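If you want to check this intuition empirically, the short sketch below computes pairwise similarities with an off-the-shelf embedding model. The `sentence-transformers` package and the `all-MiniLM-L6-v2` model are assumptions chosen for illustration, not whatever GitHub uses internally; the exact scores depend on the model, but a semantically aware one is expected to place sentences A and C closer together than A and B.

```python
# Empirical check of the chess example using an off-the-shelf embedding model.
# sentence-transformers is an illustrative assumption; Copilot's own models differ.
import numpy as np
from sentence_transformers import SentenceTransformer

sentences = {
    "A": "The king moved and captured the pawn.",
    "B": "The king was crowned at Westminster Abbey.",
    "C": "Two white rooks are still in the game.",
}

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = {label: model.encode(text) for label, text in sentences.items()}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print("A vs B:", cosine(emb["A"], emb["B"]))
print("A vs C:", cosine(emb["A"], emb["C"]))  # expected to be the higher of the two
```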


As mentioned, GitHub is still refining its retrieval algorithms and is designing this feature for enterprise customers, particularly those who want a personalized coding experience over private repositories and who have explicitly opted in to using it.

We can further discuss this issue by combining reverse engineering of Copilot's code. To handle large-scale natural language processing tasks, Copilot utilizes a combination of Cushman and ONNX models on the client side. Specifically, Copilot transforms the output of the Cushman model into vector representations and then uses vector similarity calculations to match the most relevant local files.

Besides on-the-fly vectorization and similarity matching, Copilot also performs local similarity calculations and token management to better handle large-scale natural language processing tasks. For example, a snippet like the following appears in the reverse-engineered Copilot code:

// Reverse-engineered helper: toggles automatic correlation-context tracking on or off.
e.prototype.useAutoCorrelation = function (e, t) {
  if (e && !this._isAutoCorrelating) {
    M_correlation_context_manager.CorrelationContextManager.enable(t);
  } else if (!e && this._isAutoCorrelating) {
    M_correlation_context_manager.CorrelationContextManager.disable();
  }
  this._isAutoCorrelating = e;
};
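As for the token management mentioned above, the sketch below shows one hedged interpretation of what it can look like: trimming a prompt to a fixed token budget so the text nearest the cursor survives. The open-source `tiktoken` tokenizer is used purely as a stand-in; the actual plugin bundles its own tokenizer module (see the `require("tokenizer")` line earlier).

```python
# Sketch of trimming a prompt to a token budget. tiktoken is a stand-in tokenizer;
# the real plugin ships its own tokenizer module.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumption: any BPE tokenizer would do here

def trim_to_token_budget(prompt: str, max_tokens: int) -> str:
    """Keep only the last `max_tokens` tokens, so text nearest the cursor is preserved."""
    tokens = enc.encode(prompt)
    if len(tokens) <= max_tokens:
        return prompt
    return enc.decode(tokens[-max_tokens:])
```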

Summary and Review

Last year, the Copilot team conducted quantitative research on GitHub Copilot and found that developers who use it code up to 55% faster. That means developers feel more efficient, complete repetitive tasks more quickly, and can focus more on satisfying work. But the work doesn't stop there.

GitHub’s product and R&D teams, including GitHub Next, have been collaborating with the Microsoft Azure AI platform to continue improving GitHub Copilot's context understanding. Much of the work that helps GitHub Copilot understand your code occurs behind the scenes. As you write and edit your code, GitHub Copilot responds in real-time by generating prompts (or, based on your actions in the IDE, prioritizing and sending relevant information to the model) to provide the best coding suggestions continually.

Learn More

  • GitHub Copilot X envisions the future of AI-powered software development. Discover what's new.
  • Learn how the LLMs powering GitHub Copilot are becoming more capable.
  • Read the corresponding research to learn how GitHub Copilot affects developer productivity.

Resources Related to Copilot Reverse Engineering:

  • https://github.com/thakkarparth007/copilot-explorer
  • https://github.com/saschaschramm/github-copilot

For more interesting AI experiments and insights, please visit my AI experiments and thoughts website https://yunwei37.github.io/My-AI-experiment/ and GitHub repo: https://github.com/yunwei37/My-AI-experiment
