Building a map of the GeminiNet
Introduction
The small web is nice. It’s really nice. It reminds me of that era when corporations were too focused on traditional media such as television or radio and left us alone on the internet where we roamed forums, chatted on IRC or made dirty webpages with HTML and CSS we learnt through weird txt files on the internet.
Gemini is also very nice. Webpages feel fresh and discussion forums have high conversation quality. You can discuss stuff without trolls (mostly) and without well-organized public opinion management groups. There are no ads either, because corporations have little to no incentive to spend money on a protocol where you can’t put flashy videos or client-side scripts to track consumption habits (well, they’re not part of the protocol specification anyway).
However, I feel that unlike other decentralized services such as Mastodon or Lemmy which you can browse through your usual browser, Gemini has not received a lot of love. Despite a lot of effort by the community and capsules (the Gemini version of webpages) that exist to track the development, it’s not super intuitive and I wanted to see how it was shaped. That’s why I attempted to build a map of Gemini capsules and how they’re interconnected with each other.
Methodology
The crawling
First of all I want to apologize to capsule owners. If someone made a lot of requests to your address, it was possibly me. Hope I didn’t bring down your server or anything.
I think it’s important to clarify that this is a partial crawl. It’s not supposed to be a complete picture. I started crawling at gemini://geminiprotocol.net and went through the known capsule lists hosted by communities, but I’m sure there are a lot of capsules that banned me so I couldn’t continue, and others that simply are not listed there.
Something very very nice about Gemini is that the specification is really small. So small, in fact, that anyone with a bit of time can make a rudimentary client and connect to all the available capsules. This is something you absolutely cannot do with the modern net in a semipractical way since it’s packed with edge cases, security concerns, headers and modes that you have to handle gracefully.
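To give an idea of how small "rudimentary" can be, here is a minimal sketch in Python (not the Rust client used for the crawl). It relies only on what the protocol defines: a request is the URL followed by CRLF over TLS on port 1965, and the response starts with a two-digit status, a space and a meta string. The function names are made up for this example, and skipping certificate verification is an assumption made to keep it short; a real client would do trust-on-first-use instead.

```python
import socket
import ssl
from urllib.parse import urlsplit


def build_request(url: str) -> bytes:
    """A Gemini request is just the absolute URL followed by CRLF."""
    return url.encode("utf-8") + b"\r\n"


def parse_header(line: bytes) -> tuple[int, str]:
    """Split a response header '<status><space><meta>' into its parts."""
    text = line.decode("utf-8").rstrip("\r\n")
    status, _, meta = text.partition(" ")
    if len(status) != 2 or not status.isdigit():
        raise ValueError(f"malformed header: {text!r}")
    return int(status), meta


def fetch(url: str, timeout: float = 10.0) -> tuple[int, str, bytes]:
    """Fetch one URL and return (status, meta, body)."""
    parts = urlsplit(url)
    host, port = parts.hostname, parts.port or 1965
    # Assumption: many capsules use self-signed certificates, so this sketch
    # disables CA verification instead of implementing trust-on-first-use.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            tls.sendall(build_request(url))
            with tls.makefile("rb") as stream:
                status, meta = parse_header(stream.readline())
                # Only 2x (success) responses carry a body.
                body = stream.read() if 20 <= status < 30 else b""
    return status, meta, body
```

Calling `fetch("gemini://geminiprotocol.net/")` should return the status, the MIME type and the page body. Redirects (3x), input prompts (1x) and client certificates (6x) are left out of this sketch.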
The crawler I made to crawl (…) the Geminisphere was basically a stripped-down version of the client I made for myself. It’s a simple Rust client that supports most of the specification (I would like to think), so it generally works.
In fact, I made 3 versions of that client:
* The first version was for crawling the capsules and writing the information into a SQLite database.
* The second version was for normalizing the database.
* The third version was for fetching the root of each hostname.
I probably didn’t need the third. Oh well.
There were some edge cases with crawling I didn’t expect. The GeminiNet is estimated to have around 1,042,214 URLs among 4,723 (!) capsules according to gemini.bortzmeyer.org/software/lupa/stats.gmi. It’s really small! However, the problem is precisely that it’s small. A lot of those servers are hosted on a Raspberry Pi or an old laptop, and they can’t handle a lot of requests.
Going easy on those servers is hard when a lot of capsules have loops and CGI URLs. They host text-based applications and games, so I had to revise the database from time to time to prune URLs and hostnames. Git repositories were another pain in the ass since they tend to be quite big. At one point the database reached 45GB, with 80% of that being trash URLs.
And there are a lot of trash URLs. The Gemtext specification states that:
All the following examples are valid link lines:
=> gemini://example.org/
=> gemini://example.org/ An example link
=> gemini://example.org/foo Another example link at the same host
=> foo/bar/baz.txt A relative link
=> gopher://example.org:70/1 A gopher link
…which means that a lot of capsules do things like => wikipedia.org, expecting you to handle it gracefully even though, without a scheme, it is technically a relative link. I mean, that’s okay. There aren’t that many. Still annoying though.
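The link lines quoted above can be parsed with a few lines of code. This is a sketch in Python, not the crawler's actual code: `parse_link_line` follows the gemtext rule (`=>`, optional whitespace, a URL, then an optional label), while `looks_like_bare_hostname` and its `FILE_EXTENSIONS` set are a made-up heuristic for spotting targets like `wikipedia.org` that were probably meant as external hosts.

```python
import re

# "=>", optional whitespace, a URL, then optionally whitespace and a label.
LINK_LINE = re.compile(r"^=>[ \t]*(\S+)(?:[ \t]+(.*))?$")
SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")

# Assumption: a short, hand-picked list; a real crawler would need more.
FILE_EXTENSIONS = {"gmi", "gemini", "txt", "md", "html", "jpg", "png", "zip"}


def parse_link_line(line: str):
    """Return (target, label) for a gemtext link line, or None otherwise."""
    match = LINK_LINE.match(line.rstrip("\r\n"))
    if match is None:
        return None
    return match.group(1), (match.group(2) or "").strip()


def looks_like_bare_hostname(target: str) -> bool:
    """Guess whether a scheme-less target was meant as a hostname."""
    if SCHEME.match(target) or "/" in target:
        return False
    head, dot, tail = target.rpartition(".")
    if not dot or not head:
        return False
    # Heuristic: an alphabetic last label that is not a known file extension.
    return tail.isalpha() and tail.lower() not in FILE_EXTENSIONS
```

With this, `=> wikipedia.org` is flagged as a probable hostname while `=> foo/bar/baz.txt` stays an ordinary relative link. The heuristic will misfire on file names with unusual extensions, so its output is better treated as a hint than as a verdict.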
There are also search engines and archiving capsules that can really bloat your crawl if you’re not careful.
Finally, you have the dots. Ahhh, the dots. gemini://foo.com/bar/../bar/../bar/./bar/../bar. Here’s where the normalization client comes in.
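For illustration, this is roughly what collapsing the dots looks like. It is a Python sketch that follows the idea of the dot-segment removal in RFC 3986, not the actual normalization client; lowercasing the host, dropping the default port 1965 and discarding fragments are extra assumptions about what "normalizing" should cover.

```python
from urllib.parse import urlsplit, urlunsplit


def remove_dot_segments(path: str) -> str:
    """Collapse '.' and '..' segments, in the spirit of RFC 3986 section 5.2.4."""
    output = []
    segments = path.split("/")
    for index, segment in enumerate(segments):
        last = index == len(segments) - 1
        if segment == ".":
            if last:
                output.append("")  # keep the trailing slash
        elif segment == "..":
            if len(output) > 1:  # never climb above the root
                output.pop()
            if last:
                output.append("")
        else:
            output.append(segment)
    return "/".join(output)


def normalize_url(url: str) -> str:
    """Return a canonical form of a Gemini URL for deduplication."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Drop the port when it is Gemini's default, 1965.
    netloc = host if parts.port in (None, 1965) else f"{host}:{parts.port}"
    path = remove_dot_segments(parts.path) or "/"
    # Fragments never reach the server, so they are dropped here.
    return urlunsplit((parts.scheme, netloc, path, parts.query, ""))
```

The example above, `gemini://foo.com/bar/../bar/../bar/./bar/../bar`, comes out as `gemini://foo.com/bar/bar`. Userinfo and IPv6 literals are not handled in this sketch.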
Summarizing the capsules
Once I had a neat small database of around 10GB (before pruning some trash parts), I configured a local llama.cpp instance with Qwen3.5 8B to summarize the content of those capsules and their subdomains. The parts I was most interested in were the description and categories, but I also wanted to know each capsule’s language and more detailed categories, so I settled on this format in the end:
{
    "hostname": ...,
    "language": ..., # two letter code
    "category": hosting|personal|bbs|forum|others, # didn't want to complicate myself too much
    "subcategories": [ ... ],
    "description": ...
}
I think you can learn a lot of useful information about a capsule from that. Then I searched the database for subdomains and expanded the database (the JSON one, not the SQLite one) so that each entry looks like this:
{
    "hostname": ...,
    "links": [
        {
            # the websites that hostname points to
        },
        ...
    ],
    "size": ..., # the internal number of URLs
    "error": ..., # error message or null if it was ok
    "redirect_to": ..., # redirect address or null if it was ok
    "language": ...,
    "category": ...,
    "subcategories": ...,
    "description": ...,
    "subdomains": [
        # same as its parent except this one doesn't have further subdomains
        # for simplicity
    ]
}
For simplicity I considered subdomains to be anything that has a dot before the TLD.
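One way to read that rule is "any hostname with more labels than a name plus its TLD". The Python sketch below implements that reading; it is an interpretation, not the code used for the map, and `split_subdomain` with its `suffix_labels` parameter is a hypothetical helper.

```python
def split_subdomain(hostname: str, suffix_labels: int = 1):
    """Return (subdomain, parent), or (None, hostname) when there is no subdomain.

    suffix_labels is the number of labels in the public suffix:
    1 for "org", 2 for something like "org.za".
    """
    labels = hostname.lower().rstrip(".").split(".")
    parent_size = suffix_labels + 1  # the registered name plus its suffix
    if len(labels) <= parent_size:
        return None, ".".join(labels)
    subdomain = ".".join(labels[:-parent_size])
    parent = ".".join(labels[-parent_size:])
    return subdomain, parent
```

The simple rule has a known blind spot: with the default of one suffix label, `apple2.org.za` is treated as a subdomain of `org.za`. Handling that properly needs a public suffix list, which this sketch deliberately leaves out.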
Curating the information
All in all the descriptions seemed reasonable, although I’m sure there are some weird ones around. A lot of the tags were very similar (say, blog/blogs, or site/personal webpage), so I spent some time curating them. Categories don’t make much sense if there’s going to be only 1 member in them, after all.
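The curation itself was manual, but the two rules described here (merge near-duplicate tags, drop tags with a single member) are easy to sketch in Python. The `ALIASES` table below is a hypothetical example built from the two pairs mentioned above; the real list of merges would be much longer.

```python
from collections import Counter

# Hypothetical alias table: variant spelling -> canonical tag.
ALIASES = {"blogs": "blog", "site": "personal webpage"}


def canonical(tag: str) -> str:
    """Lowercase a tag and map known variants to their canonical form."""
    cleaned = tag.strip().lower()
    return ALIASES.get(cleaned, cleaned)


def curate(capsules: dict[str, list[str]], min_members: int = 2) -> dict[str, list[str]]:
    """Merge aliases, then drop tags used by fewer than min_members capsules."""
    merged = {
        host: sorted({canonical(tag) for tag in tags})
        for host, tags in capsules.items()
    }
    usage = Counter(tag for tags in merged.values() for tag in tags)
    return {
        host: [tag for tag in tags if usage[tag] >= min_members]
        for host, tags in merged.items()
    }
```

The input is assumed to be a mapping from hostname to its list of subcategories. Tags that end up with only one capsule are removed from that capsule rather than reassigned, which is the simplest possible policy.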
If you see your capsule and want to change something about it, please make a pull request and I’ll accept it with no issues (if you want to make an issue on the repo, that’s ok too).
Presenting the information in a visual way
Which was half the point of doing all this to begin with.
I picked a force-directed graph for the representation since it looked like a reasonable way of understanding the size and outgoing links of capsules, as well as having different levels of detail. I suck at front-end so I’m sorry if it doesn’t work on phones.
You can see the map here.
Some statistics
Basic stats
The database post-pruning ended up being 6.7GB. You can download it here. The JSON ended up being 34MB. I’m sure it can be compressed in some way. I’m sorry if you’re low on data.
Some of the websites with the most URLs
| Hostname | Size | Short description |
|---|---|---|
| yesterweb.org | 817645 | Legacy capsule hosting |
| taz.de | 396572 | German newspaper |
| uploadedlobster.com | 177068 | Rob’s personal blog. Has lots of recordings. |
| jsreed5.org | 142163 | Mostly a replication of OEIS |
| techrights.org | 140173 | Open source advocacy group |
| hellomouse.net | 128470 | Capsule hosting |
| gemi.dev | 106069 | Mailing lists, search engine and archive |
| apple2.org.za | 102432 | Gopher hole for mirrors.apple2.org.za |
| snork.ca | 65101 | Blog with a mirror of textfiles.com |
| mediocregopher.com | 62164 | Personal blog |
| tuxmachines.org | 62132 | Open source advocacy |
| geminispace.org | 49706 | Probably one of the biggest BBS on Gemini |
| cthulhu28.space | 40986 | Directory and personal blog |
| freeshell.de | 36016 | Personal blog |
| spam.works | 22448 | Tons of mirrors of text files |
| pollux.casa | 20585 | Hosting for Gemini capsules |
Resource types
I ignored common ones such as .gmi. It makes sense that .txt is common as well since it’s frequently used as a replacement for .gmi. Other than that, pictures (jpg, jpeg, png) account for most resources, followed by .shtml pages and zip files.
| Extension | Count | Percent |
|---|---|---|
| txt | 264230 | 42.33% |
| jpg | 114116 | 18.28% |
| jpeg | 84290 | 13.50% |
| shtml | 59649 | 9.56% |
| png | 57768 | 9.25% |
| zip | 14936 | 2.39% |
|  | 12582 | 2.02% |
| gif | 2729 | 0.44% |
| html | 2262 | 0.36% |
| mp3 | 2192 | 0.35% |
| tar.gz | 1760 | 0.28% |
| ogg | 1451 | 0.23% |
| wav | 1234 | 0.20% |
| xml | 1083 | 0.17% |
| webp | 1037 | 0.17% |
| tar | 829 | 0.13% |
| svg | 738 | 0.12% |
| mp4 | 364 | 0.06% |
| gz | 328 | 0.05% |
| json | 328 | 0.05% |
| csv | 80 | 0.01% |
| webm | 60 | 0.01% |
| mov | 42 | 0.01% |
| appimage | 38 | 0.01% |
| rar | 25 | 0.00% |
| ico | 21 | 0.00% |
| bmp | 21 | 0.00% |
| avi | 5 | 0.00% |
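A table like the one above can be produced with a short tally over the crawled URLs. This is a Python sketch under a few assumptions: `SKIPPED` only contains `gmi` (the text says "common ones" without listing them), `tar.gz` is the only compound extension recognized, and percentages are taken over the counted extensions rather than over all URLs.

```python
from collections import Counter
from urllib.parse import urlsplit

SKIPPED = {"gmi"}        # assumption: the extensions left out of the table
COMPOUND = ("tar.gz",)   # extensions that span two dots


def extension_of(url: str):
    """Return the lowercase extension of the last path segment, or None."""
    name = urlsplit(url).path.rsplit("/", 1)[-1].lower()
    for compound in COMPOUND:
        if name.endswith("." + compound):
            return compound
    stem, dot, ext = name.rpartition(".")
    if not dot or not stem or not ext:
        return None  # directories, dotfiles and names without extension
    return ext


def extension_table(urls):
    """Return (extension, count, percent) rows sorted by count, descending."""
    counts = Counter(
        ext for ext in map(extension_of, urls) if ext and ext not in SKIPPED
    )
    total = sum(counts.values())
    return [(ext, n, round(100 * n / total, 2)) for ext, n in counts.most_common()]
```

Using `urlsplit` first means query strings such as `photo.jpg?x=1` do not pollute the extension.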
Languages
It’s clear that the Geminisphere is predominantly English. Other than that, Western languages seem to be the most common.
| Language | Capsule Number | Percentage over total |
|---|---|---|
| en | 3694 | 63.99% |
| unknown | 1734 | 30.04% (errors, redirections, etc.) |
| es | 96 | 1.66% |
| fr | 71 | 1.23% |
| de | 43 | 0.74% |
| ru | 30 | 0.51% |
| pt | 15 | 0.25% |
| it | 12 | 0.20% |
| pl | 12 | 0.20% |
| ja | 10 | 0.17% |
| cs | 4 | 0.06% |
| fi | 4 | 0.06% |
| hu | 4 | 0.06% |
| zh | 4 | 0.06% |
| ca | 3 | 0.05% |
| el | 3 | 0.05% |
| eo | 3 | 0.05% |
| he | 3 | 0.05% |
| sv | 3 | 0.05% |
| eu | 2 | 0.03% |
| ga | 2 | 0.03% |
| ko | 2 | 0.03% |
| nl | 2 | 0.03% |
| no | 2 | 0.03% |
| sl | 2 | 0.03% |
| tr | 2 | 0.03% |
| uk | 2 | 0.03% |
| bs | 1 | 0.01% |
| fa | 1 | 0.01% |
| gl | 1 | 0.01% |
| hy | 1 | 0.01% |
| la | 1 | 0.01% |
| lt | 1 | 0.01% |
| sk | 1 | 0.01% |
| sr | 1 | 0.01% |
For comparison, these are the language statistics from gemini.bortzmeyer.org/software/lupa/stats.gmi as of April 2026:
| Language | URL Number | Percentage over total |
|---|---|---|
| en | 514827 | 62.24% |
| unknown | 305044 | 35.04% |
| de | 10289 | 1.18% |
| es | 2832 | 0.32% |
| fr | 2581 | 0.29% |
| ru | 538 | 0.06% |
| sv | 115 | 0.01% |
| sgs | 98 | 0.01% |
Note that my statistics are based on the hostname root instead of URLs, while Bortzmeyer’s are based on header tags. However, they are similar.
External protocols
| Scheme/Protocol | Count |
|---|---|
| gopher | 176031 |
| finger | 83254 |
| telnet | 19234 |
| gwit | 18389 |
| nex | 17430 |
| spartan | 13832 |
| ftp | 4000 |
| git | 3278 |
| scroll | 3239 |
| irc | 2986 |
| news | 2746 |
| guppy | 1858 |
| xmpp | 1644 |
| moz-extension | 1445 |
| ninep | 1342 |
| chrome-extension | 952 |
| misfin | 782 |
| snews | 767 |
| ssh | 667 |
| scorpion | 597 |
| text | 589 |
| mailto | 492 |
| file | 331 |
| nntp | 307 |
| ttps | 256 |
| ipfs | 163 |
| titan | 85 |
| gophers | 72 |
| ircs | 63 |
| ttp | 21 |
| applewebdata | 21 |
| sftp | 20 |
I omitted those with fewer than 20 occurrences since they were not very common. It seems that Gopher is still the most popular alternative to Gemini. This makes sense since one of the main goals of Gemini was to provide a more modern alternative to Gopher. Finger and Telnet are also there.
Some people left file:// with local paths, which indicates that Gemini users are also humans.
Lots of mail addresses too. I don’t know what’s with the Mozilla and Chrome extensions.
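Counting schemes is the simplest of these statistics. The Python sketch below shows one way to do it; excluding `gemini`, `http` and `https` through the `skip` parameter is an assumption based on those schemes not appearing in the table, and the cutoff of 20 matches the one mentioned above.

```python
import re
from collections import Counter

SCHEME = re.compile(r"^([A-Za-z][A-Za-z0-9+.-]*):")


def scheme_counts(targets, skip=("gemini", "http", "https"), min_count=20):
    """Count link targets per URL scheme, keeping those seen at least min_count times."""
    counts = Counter()
    for target in targets:
        match = SCHEME.match(target)
        if match is None:
            continue  # relative link, no scheme to count
        scheme = match.group(1).lower()
        if scheme not in skip:
            counts[scheme] += 1
    return [(scheme, n) for scheme, n in counts.most_common() if n >= min_count]
```

A side effect of parsing this way is that a mistyped link such as `ttps://example.org/` is counted as its own scheme, which would explain the `ttps` and `ttp` rows.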
Wrap-up
Well, that was fun.
I’m bad with wrap-ups. Hopefully this helped increase the visibility of Gemini. Having alternative networks like this one is very important at a time when countries are even planning to set barriers to simply using a computer or installing your own operating system.
If you haven’t tried Gemini, I suggest having a look at the specification and planning for a weekend project. I might never build a web browser, but I would like to think this helped me understand a bit more of how all this works.
Of course, I know not everyone is able or willing. For those people, I recommend this short introduction to Gemini, which explains it in a much better way than I could.