GSOC 2026 Project Ideas

See our page on applying for GSoC 2026: https://github.com/aboutcode-org/aboutcode/wiki/GSOC-2026

There are between 100 and 200 million open source projects and repos out there. Not all of them are equal. Some are much more useful than others, and some could be safely ignored. For instance, the linux kernel is more important, used and popular than a 1st year computer student school assignment project. The goal of this project is to determine when a project is popular and what are the most popular projects. If we do not know what code is used, we can spend a lot of resources to index less used code.

There are some simple approaches to this, using available statistics for downloads or Github stars, but that is not satisfying alone.

An idea would be to consider multiple factors to rank popularity and usage.

For instance: create a (current and updated) graph of dependencies and compute something like a pagerank but for packages
Then create with a metric on the freshness of the code like when last release and how much downloaded or based on git activity (excluding bots). This would grow for used code and decay for declining packages
Then combine this with the dependencies "connectedness"

Or, just a use the graph connections and no download stats, just a giant graph on top of purldb

Or something like this:

Finding strongly connected components
Relate packages ignoring versions
Find most connected
Discount distant connections, boost closest
Apply decay based on version freshness or git activity

The approach would be to start small with a single ecosystem as PoC and then extend this to all packages types.

Ideally, this should be exposed in PurlDB API and integrated in data collection operations.

Priority: High

Size: Large

Difficulty Level: Advanced

Tags:

Python
Django
PostgreSQL
Popularity

Mentors:

@pombredanne
@JonoYang
@AyanSinhaMahapatra

VulnerableCode project ideas

There are two main categories of projects for VulnerableCode:

A. COLLECTION: this category is to mine and collect or infer more new and improved data. This includes collecting new data sources, inferring and improving existing data or collecting new primary data (such as finding a fix commit of a vulnerability)
B. USAGE: this category is about using and consuming the vulnerability database and includes the API proper, the GUI, the integrations, and data sharing, feedback and curation.

VulnerableCode: Process unstructured data sources for vulnerabilities (Category A)

Code Repositories:

https://github.com/aboutcode-org/vulnerablecode

Description:

The project would be to provide a way to effectively mine unstructured data sources for possible unreported vulnerabilities.

For a start this should be focused on a few prominent repos. This project could also find Fix Commits.

Some sources are:

mailing lists
changelogs
reflogs of commit
bug and issue trackers

This requires systems to "understand" vulnerability descriptions: as often security advisories do not provide structured information on which package and package versions are vulnerable. The end goal is creating a system which would infer vulnerable package name and version(s) by parsing the vulnerability description using specialized techniques and heuristics.

There is no need to train a model from scratch, we can use AI models pre-trained on code repositories (maybe https://github.com/bigcode-project/starcoder?) and then fine-tune on some prepared datasets of CVEs in code.

We can either use NLP/machine Learning and automate it all, potentially training data masking algorithms to find these specific data (this also involved creating a dataset) but that's going to be super difficult.

We could also start to craft a curation queue and parse as much as we can to make it easy to curate by humans and progressively also improve some mini NLP models and classification to help further automate the work.

References: https://github.com/aboutcode-org/vulnerablecode/issues/251

Priority: Medium

Size: Large

Difficulty Level: Advanced

Tags:

Python
Django
PostgreSQL
Security
Vulnerability
NLP
AI/ML

Mentors:

@pombredanne
@tg1999
@keshav-space
@Hritik14
@AyanSinhaMahapatra

Related Issues:

https://github.com/aboutcode-org/vulnerablecode/issues/251

ScanCode.io project ideas

ScanCode.io: Create file-system tree view for project scans

Code Repositories:

https://github.com/aboutcode-org/scancode.io

Description:

When large packages/containers are scanned in scancode.io it is useful to have a tree-view to explore thorugh the file-tree for that package/container to look into scan data for a particular subset of the file-tree/directory or to research more into detections and detection issues.

This would be something similar to what we have at scancode-workbench for example: https://scancode-workbench.readthedocs.io/en/develop/ui-reference/directory-tree.html

I.e. we need the following features:

To be able to toggle showing the directory contents from the directory icon
Show nested directory contents in a tree like structure
Have this view ideally in a pane left to the table-view of resources
Show only info from the selected directory in the table-view of resources

Note that we do have a ProjectCodebaseView in the projects page currently in scancode.io but this is fairly limited as it only lets you browse through the codebase one directory at a time (only shows the files/directories in one directory), and lets you navigate to directories in the current directory or the parent directory from there.

Priority: High

Size: Large

Difficulty Level: Intermediate

Tags:

Python
Django
UI/UX
File-system
Navigation

Mentors:

@tdruez
@pombredanne
@AyanSinhaMahapatra

Related Issues:

https://github.com/aboutcode-org/scancode.io/issues/697

ScanCode Toolkit project ideas

Mark required phrases for rules automatically using NLP/AI:

Code Repositories:

https://github.com/aboutcode-org/scancode-toolkit

Description:

Required phrases are present in rules to make sure the rule is not matched to text in a case where the required phrase is not present in the text, which would be a false-positive detection.

We are marking required phrases automatically based on what is present in other rules and license attributes, but this still leaves a lot of rules without them. See https://github.com/aboutcode-org/scancode-toolkit/pull/3924 where we are also adding a script to add required phrases as individual rules if applicable and also adding required phrases added to other rules.

research and choose a model pre-trained on code (StarCoder?)
use the dataset of current SCTK rules to train a model
Mark required phrases in licenses automatically with the model
Test required phrase additions, improve and iterate
Bonus: Create a minimal UI to review rule updates massively

Priority: Medium

Size: Medium

Difficulty Level: Advanced

Tags:

Python
ML/AI
Licenses

Mentors:

@AyanSinhaMahapatra
@tg1999
@pombredanne

Related Issues:

https://github.com/aboutcode-org/scancode-toolkit/issues/2878

Our Project ideas

Here are some project related attributes you need to keep in mind while looking into prospective project ideas, see also: guidance on finding the right project:

Project Priority

The repositories/projects are not sorted in order of importance, instead we have a explicit priority mentioned for each project idea and this can be: Low, Medium or High.
This doesn't mean we will always consider a project proposal with a higher priority idea over a relatively lower priority one, no matter the merit of the proposal. This is only one metric of selection, mostly to prioritize important projects.
You can also suggest your own project ideas/discuss changes/updates/enhancements based on the provided ideas, but you need to really know what you are doing here and have lots of discussions with the maintainers.

Project Length

There are three project lengths:

Small (~90 hours)
Medium (~175 hours)
Large (~350 hours)

If you are proposing an idea from this ideas list, it should match what is listed here, and additionally please have a discussion with the mentors about your proposed length and timeline. Please also open a discussion about the same, if not already present, at https://github.com/aboutcode-org/aboutcode/discussions/categories/gsoc or discuss this in the respective issues.

We have marked our ideas with medium/large based on general estimates, but this could vary. In a few cases they are both used to mark a project as it can be both. We have made conscious effort to make sure projects are not too large, have clear deliverables and can be finished successfully, but still note that these are complex projects and you're likely underestimating the complexity (and how much we'll bug you to make sure everything is up to our standards).

You must discuss your proposal and the size of project you are proposing with a mentor as otherwise we cannot consider your proposal fairly.

We likely would only select medium/large project ideas only as the small projects are too small to get familiar with and contribute meaningfully to any of our projects.

Please also note that there is a difference in the stipend based on what you select, and it would not be fair if you're selecting and working on a large project, but getting paid for a medium one (or vice-versa).

Project Tags

Here are all the tags we use for specific projects, feel free to search this page using these if you only want to look into projects with specific technical background.

[Django], [PostgreSQL], [Web], [DataStructures], [Scanning], [Javascript], [UI], [LiveServer] [API], [Metadata], [PackageManagers], [SBOM], [Security], [BinaryAnalysis], [Scraping], [NLP], [Social], [Communication], [Review], [Decentralized/Distributed], [Curation]

Project Difficulty Level

We are generally using three levels of difficulty to characterize the projects:

Easy
Intermediate
Advanced

If it is a difficult project it means there is significant domain knowledge required to be able to tackle this project successfully, and you must have prior verifiable experience on this (in the form of open source contributions, either on the same topic in our repos, or elsewhere). You must also consult with mentors/maintainers early, ask a lot of domain specific questions and must be ready to research and tackle greenfield projects in certain cases if you choose a project in this difficulty category.

Most other intermediate projects do not require this much domain knowledge and can easily be acquired during proposal writing/contributing, if you're familiar with the tech stack used in the project. But these are still not straight-forward and requires lots of feedback from the mentors. Most projects fall in this category.

There are also easy projects which only require honest time and effort from the participant, and decent knowledge about the tech stack/problem.

Questions?

Please feel free to chime in at https://github.com/aboutcode-org/aboutcode/discussions/133 or in our GSoC 2026 chatroom at https://matrix.to/#/#aboutcode-org_gsoc2026:gitter.im if you have any questions related to AboutCode's participation in GSoC or anything in this page.

http://aboutcode.org/

Uh oh!

GSOC 2026 Project Ideas

Table of Contents

Project Ideas Index

Abbreviations and acronyms used here

PURLdb project ideas

PURLdb - PopularCode - Find and track actually used and most popular open source code

VulnerableCode project ideas

VulnerableCode: Process unstructured data sources for vulnerabilities (Category A)

ScanCode.io project ideas

ScanCode.io: Create file-system tree view for project scans

ScanCode Toolkit project ideas

Mark required phrases for rules automatically using NLP/AI:

Our Project ideas

Project Priority

Project Length

Project Tags

Project Difficulty Level

Questions?

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!