Building a map of the GeminiNet

The map in question (see here!)

Introduction

The small web is nice. It’s really nice. It reminds me of that era when corporations were too focused on traditional media such as television or radio and left us alone on the internet where we roamed forums, chatted on IRC or made dirty webpages with HTML and CSS we learnt through weird txt files on the internet.

Gemini is also very nice. Webpages feel fresh and discussion forums have high conversation quality. You can discuss stuff without trolls (mostly) and without well-organized public opinion management groups. There are no ads either, because corporations have little to no incentive to spend money on a protocol where you can’t put flashy videos or client-side scripts to track consumption habits (well, they’re not part of the protocol specification anyway).

However, I feel that unlike other decentralized services such as Mastodon or Lemmy which you can browse through your usual browser, Gemini has not received a lot of love. Despite a lot of effort by the community and capsules (the Gemini version of webpages) that exist to track the development, it’s not super intuitive and I wanted to see how it was shaped. That’s why I attempted to build a map of Gemini capsules and how they’re interconnected with each other.

Methodology

The crawling

First of all I want to apologize to capsule owners. If someone made a lot of requests to your address, it was possibly me. Hope I didn’t bring down your server or anything.

I think it’s important to clarify that this is a partial crawl. It’s not supposed to be a complete picture. I started crawling at gemini://geminiprotocol.net and went through the known capsule lists hosted by communities, but I’m sure a lot of capsules banned me before I could continue, and others simply are not listed there.

Something very very nice about Gemini is that the specification is really small. So small, in fact, that anyone with a bit of time can make a rudimentary client and connect to all the available capsules. This is something you absolutely cannot do with the modern net in a semipractical way since it’s packed with edge cases, security concerns, headers and modes that you have to handle gracefully.
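To give an idea of how small "rudimentary" can be, here is a minimal Python sketch of a single Gemini request. It is not the crawler described in this post (that one is written in Rust); the names `fetch` and `parse_header` are made up for illustration, and skipping certificate verification is a simplifying assumption rather than good practice.

```python
import socket
import ssl
from urllib.parse import urlsplit


def parse_header(header_line: str) -> tuple[int, str]:
    """Split a response header such as '20 text/gemini' into (status, meta)."""
    status, _, meta = header_line.strip().partition(" ")
    if len(status) != 2 or not (status.isascii() and status.isdigit()):
        raise ValueError(f"malformed Gemini header: {header_line!r}")
    return int(status), meta


def fetch(url: str, timeout: float = 10.0) -> tuple[int, str, bytes]:
    """Fetch one Gemini URL and return (status, meta, body)."""
    parts = urlsplit(url)
    host, port = parts.hostname, parts.port or 1965  # 1965 is the default port

    # Assumption: certificate checks are disabled because many capsules use
    # self-signed certificates; a real client would pin them (TOFU) instead.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE

    with socket.create_connection((host, port), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            # The whole request is the absolute URL followed by CRLF.
            tls.sendall(url.encode("utf-8") + b"\r\n")
            response = b""
            while chunk := tls.recv(4096):
                response += chunk

    # The first line is "<status> <meta>"; everything after it is the body.
    header, _, body = response.partition(b"\r\n")
    status, meta = parse_header(header.decode("utf-8"))
    return status, meta, body
```

Calling `fetch("gemini://geminiprotocol.net/")` should return the status, the MIME type (or redirect target, depending on the status) and the raw body. Redirects, input prompts and client certificates are left out on purpose.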

The crawler I made to crawl (…) the Geminisphere was basically a stripped-down version of the client I made for myself. It’s a simple Rust client that supports most of the specification (I would like to think), so it generally works.

In fact, I made 3 versions of that stripped-down client:

* The first version crawled the capsules and wrote the information into a SQLite database.
* The second version normalized the database.
* The third version fetched the root of each hostname.

I probably didn’t need the third. Oh well.

There were some edge cases with crawling I didn’t expect. The Gemininet is estimated to have around 1,042,214 URLs across 4,723 (!) capsules according to gemini.bortzmeyer.org/software/lupa/stats.gmi. It’s really small! However, the problem is precisely that it’s small. A lot of those servers are hosted on a Raspberry Pi or an old laptop and you can’t make a lot of requests to them.

Going easy on them is hard when a lot of capsules have loops and CGI URLs. They host text-based applications and games, so I had to revise the database from time to time to prune URLs and hostnames. Git repositories were another pain in the ass since they tend to be quite big. At one point the database reached 45GB, with 80% of that being trash URLs.
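One way to keep both problems in check is to give every host a request budget. The sketch below is only an illustration of that idea, not what the crawler actually did: the class name `HostBudget`, the cap of 5000 URLs and the 2 second delay are all made-up values.

```python
from collections import defaultdict
from urllib.parse import urlsplit


class HostBudget:
    """Caps how many URLs are crawled per host and spaces out requests."""

    def __init__(self, max_urls: int = 5000, min_delay: float = 2.0):
        self.max_urls = max_urls      # hypothetical per-host URL cap
        self.min_delay = min_delay    # hypothetical seconds between requests
        self.seen = defaultdict(int)  # host -> URLs already fetched
        self.last_hit = {}            # host -> timestamp of the last request

    def allow(self, url: str, now: float) -> bool:
        """Return True (and record the request) if `url` may be fetched at `now`."""
        host = urlsplit(url).hostname or ""
        if self.seen[host] >= self.max_urls:
            return False  # budget used up: probably a loop, CGI app or git repo
        last = self.last_hit.get(host)
        if last is not None and now - last < self.min_delay:
            return False  # too soon for this host, try again later
        self.seen[host] += 1
        self.last_hit[host] = now
        return True
```

In a crawl loop, `now` would typically be `time.monotonic()`, and URLs that get rejected for being too soon go back into the queue. A hard cap will not catch every loop, but it bounds the damage a single host can do to the database.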

And there are a lot of trash URLs. The Gemtext specification states that:

All the following examples are valid link lines:

=> gemini://example.org/
=> gemini://example.org/ An example link
=> gemini://example.org/foo Another example link at the same host
=> foo/bar/baz.txt  A relative link
=> gopher://example.org:70/1 A gopher link

…which means that a lot of capsules do things like => wikipedia.org, which is technically a relative link, expecting you to handle it gracefully. I mean, that’s okay. There aren’t that many of them. Still annoying though.
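To make the problem concrete, here is a small Python sketch of parsing a link line and resolving its target. The function names are hypothetical and this is not the Rust code used for the crawl. Note how `wikipedia.org` ends up as a path on the current capsule, because without a scheme it is just a relative reference.

```python
from urllib.parse import urljoin, urlsplit


def parse_link_line(line: str):
    """Return (target, label) for a gemtext link line, or None for other lines."""
    if not line.startswith("=>"):
        return None
    fields = line[2:].split(maxsplit=1)
    if not fields:
        return None  # "=>" with no URL after it
    target = fields[0]
    label = fields[1].strip() if len(fields) > 1 else ""
    return target, label


def resolve(base_url: str, target: str) -> str:
    """Resolve a link target against the URL of the page it appeared on."""
    if urlsplit(target).scheme:
        return target  # already absolute (gemini://, gopher://, mailto:, ...)
    # urljoin only resolves relative references for schemes it knows about,
    # and gemini is not one of them, so borrow "http" for the join and swap back.
    base = urlsplit(base_url)
    joined = urljoin(base._replace(scheme="http").geturl(), target)
    return urlsplit(joined)._replace(scheme=base.scheme).geturl()
```

With this, `resolve("gemini://example.org/dir/page.gmi", "wikipedia.org")` gives `gemini://example.org/dir/wikipedia.org`, which is a valid URL that almost certainly does not exist. Whether a crawler should guess that the author meant an external site is a judgment call.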

There are also search engines and archiving websites that can really fatten up your crawling if you’re not careful.

Finally, you have the dots. Ahhh, the dots. gemini://foo.com/bar/../bar/../bar/./bar/../bar. Here’s where the normalization client comes in.
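A normalization pass for this could look roughly like the Python sketch below. It is an illustration under a few assumptions, not the actual normalizer: it also collapses empty segments (`//`), drops fragments and strips the default port, which is stricter than what the URI standard asks for.

```python
from urllib.parse import urlsplit, urlunsplit


def remove_dot_segments(path: str) -> str:
    """Collapse '.', '..' and empty segments; always returns an absolute path."""
    keep_slash = path.endswith(("/", "/.", "/.."))
    segments = []
    for segment in path.split("/"):
        if segment in ("", "."):
            continue  # nothing to add for "//" or "/./"
        if segment == "..":
            if segments:
                segments.pop()  # go up one level, but never above the root
            continue
        segments.append(segment)
    result = "/" + "/".join(segments)
    if keep_slash and segments:
        result += "/"
    return result


def normalize_url(url: str) -> str:
    """Lowercase the host, drop the default port and fragment, clean the path."""
    parts = urlsplit(url)
    netloc = parts.netloc.lower().removesuffix(":1965")
    path = remove_dot_segments(parts.path)
    return urlunsplit((parts.scheme, netloc, path, parts.query, ""))
```

The example from above, `gemini://foo.com/bar/../bar/../bar/./bar/../bar`, comes out as `gemini://foo.com/bar/bar`, so all the dotted variants of a URL collapse into a single database row.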

Summarizing the capsules

Once I had a neat, small database of around 10GB (before pruning some remaining trash), I configured a local llama.cpp instance with Qwen3.5 8B to summarize the content of those capsules and their subdomains. The parts I was most interested in were the description and category, but I also wanted to know each capsule’s language and more detailed categories, so I settled on this format in the end:

{
    "hostname": ...,
    "language": ..., # two letter code
    "category": hosting|personal|bbs|forum|others, # didn't want to complicate myself too much
    "subcategories": [ ... ],
    "description": ...
}
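Small models do not always stick to the requested format, so it helps to check each reply before storing it. The following Python sketch shows one way to do that. It is an assumption about how such a check could look (the function name `parse_summary` is invented), not a description of the pipeline used here.

```python
import json

ALLOWED_CATEGORIES = {"hosting", "personal", "bbs", "forum", "others"}
REQUIRED_FIELDS = ("hostname", "language", "category", "subcategories", "description")


def parse_summary(raw: str) -> dict:
    """Parse one model reply and check it against the format above."""
    record = json.loads(raw)  # raises ValueError on invalid JSON
    if not isinstance(record, dict):
        raise ValueError("reply is not a JSON object")
    for key in REQUIRED_FIELDS:
        if key not in record:
            raise ValueError(f"missing field: {key}")
    language = record["language"]
    if not (isinstance(language, str) and len(language) == 2 and language.isalpha()):
        raise ValueError(f"language is not a two-letter code: {language!r}")
    category = record["category"]
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    if not isinstance(record["subcategories"], list):
        raise ValueError("subcategories must be a list")
    record["language"] = language.lower()
    return record
```

A reply that fails the check can be retried or stored with an unknown language, which is presumably close to how error cases end up in the statistics further down.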

I think you can learn a lot of useful information about a capsule from that. Then I searched the database for subdomains and expanded the JSON database (not the SQLite one) so that each entry looks like this:

{
    "hostname": ...,
    "links": [
        {
            # the websites that hostname points to
        },
        ...
    ],
    "size": ..., # the internal number of URLs
    "error": ..., # error message or null if it was ok
    "redirect_to": ..., # redirect address or null if it was ok
    "language": ...,
    "category": ...,
    "subcategories": ...,
    "description": ...,
    "subdomains": [
        # same as its parent except this one doesn't have further subdomains
        # for simplicity
    ]
}

For simplicity I considered subdomains to be anything that has a dot before the TLD.
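One possible reading of that rule, sketched in Python: treat the last two labels as the parent and anything with more labels as a subdomain. This is an interpretation, not the code that was actually used, and `split_host` is a made-up name.

```python
def split_host(hostname: str) -> tuple[str, bool]:
    """Return (parent, is_subdomain) using a naive last-two-labels rule."""
    labels = hostname.lower().rstrip(".").split(".")
    if len(labels) <= 2:
        return ".".join(labels), False
    # Anything with more than two labels is filed under its last two labels.
    return ".".join(labels[-2:]), True
```

The price of the simplicity shows up with multi-part suffixes: `split_host("apple2.org.za")` returns `("org.za", True)`. A stricter version would consult the Public Suffix List instead of counting dots.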

Curating the information

All in all, the descriptions seemed reasonable, although I’m sure there are some weird ones around. A lot of the tags were very similar (say, blog/blogs, or site/personal webpage), so I spent some time curating them. After all, categories don’t make much sense if they only have 1 member.

If you see your capsule and want to change something about it, please make a pull request and I’ll accept it with no issues (if you want to make an issue on the repo, that’s ok too).

Presenting the information in a visual way

Which was half the point of doing all this to begin with.

I picked a force graph for the representation since it looked like a reasonable way to show the size and outgoing links of capsules at different levels of detail. I suck at front-end so I’m sorry if it doesn’t work on phones.
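Force-graph tools generally want a list of nodes and a list of source/target links. The Python sketch below shows how the JSON entries described earlier could be turned into that shape. The field names `hostname`, `links`, `size` and `category` come from the format above, but the assumption that every entry in `links` is an object with its own `hostname` key is mine, as is the function name.

```python
def build_graph(capsules: list[dict]) -> dict:
    """Turn capsule records into a {"nodes": [...], "links": [...]} structure."""
    known = {capsule["hostname"] for capsule in capsules}
    nodes = [
        {"id": c["hostname"], "size": c.get("size", 0), "category": c.get("category")}
        for c in capsules
    ]
    links = set()
    for c in capsules:
        for target in c.get("links", []):
            # Assumption: each entry in "links" is an object with a "hostname" key.
            host = target["hostname"]
            # Skip self-links and links to hosts that are not in the map.
            if host in known and host != c["hostname"]:
                links.add((c["hostname"], host))
    return {
        "nodes": nodes,
        "links": [{"source": s, "target": t} for s, t in sorted(links)],
    }
```

Deduplicating links through a set keeps the graph light: a capsule that links to another one a thousand times still contributes a single edge.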

You can see the map here.

Some statistics

Basic stats

The database post-pruning ended up being 6.7GB. You can download it here. The JSON ended up being 34MB. I’m sure it can be compressed in some way. I’m sorry if you’re low on data.

Some of the websites with the most URLs

| Hostname | Size (URLs) | Short description |
| --- | --- | --- |
| yesterweb.org | 817645 | Legacy capsule hosting |
| taz.de | 396572 | German newspaper |
| uploadedlobster.com | 177068 | Rob’s personal blog. Has lots of recordings. |
| jsreed5.org | 142163 | Mostly a replica of the OEIS |
| techrights.org | 140173 | Open source advocacy group |
| hellomouse.net | 128470 | Capsule hosting |
| gemi.dev | 106069 | Mailing lists, search engine and archive |
| apple2.org.za | 102432 | Gopher hole for mirrors.apple2.org.za |
| snork.ca | 65101 | Blog with a mirror of textfiles.com |
| mediocregopher.com | 62164 | Personal blog |
| tuxmachines.org | 62132 | Open source advocacy |
| geminispace.org | 49706 | Probably one of the biggest BBS on Gemini |
| cthulhu28.space | 40986 | Directory and personal blog |
| freeshell.de | 36016 | Personal blog |
| spam.works | 22448 | Tons of mirrors of text files |
| pollux.casa | 20585 | Hosting for Gemini capsules |

Resource types

I ignored common ones such as .gmi. It makes sense that .txt is common as well since it’s frequently used as a replacement for .gmi. Other than that, pictures and zip files seem to have the highest number of resources.

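| Extension | Count | Percent |
| --- | --- | --- |
| txt | 264230 | 42.33% |
| jpg | 114116 | 18.28% |
| jpeg | 84290 | 13.50% |
| shtml | 59649 | 9.56% |
| png | 57768 | 9.25% |
| zip | 14936 | 2.39% |
| pdf | 12582 | 2.02% |
| gif | 2729 | 0.44% |
| html | 2262 | 0.36% |
| mp3 | 2192 | 0.35% |
| tar.gz | 1760 | 0.28% |
| ogg | 1451 | 0.23% |
| wav | 1234 | 0.20% |
| xml | 1083 | 0.17% |
| webp | 1037 | 0.17% |
| tar | 829 | 0.13% |
| svg | 738 | 0.12% |
| mp4 | 364 | 0.06% |
| gz | 328 | 0.05% |
| json | 328 | 0.05% |
| csv | 80 | 0.01% |
| webm | 60 | 0.01% |
| mov | 42 | 0.01% |
| appimage | 38 | 0.01% |
| rar | 25 | 0.00% |
| ico | 21 | 0.00% |
| bmp | 21 | 0.00% |
| avi | 5 | 0.00% |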
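For reference, a table like this can be produced with a few lines of Python. This is a sketch of the general approach rather than the query that generated the numbers above. The ignore list only contains `.gmi` because that is the only extension named in the text, and the special handling of `tar.gz` is an assumption about how compound extensions were treated.

```python
from collections import Counter
from urllib.parse import urlsplit

COMPOUND = ("tar.gz",)  # extensions that span two dots
IGNORED = {"gmi"}       # hypothetical ignore list; the text only names .gmi


def extension_of(url: str) -> str:
    """Return the lowercase file extension of a URL path, or '' if there is none."""
    name = urlsplit(url).path.rsplit("/", 1)[-1].lower()
    for compound in COMPOUND:
        if name.endswith("." + compound):
            return compound
    return name.rsplit(".", 1)[1] if "." in name else ""


def extension_table(urls):
    """Return (extension, count, percent) rows, most common first."""
    counts = Counter(
        ext for ext in map(extension_of, urls) if ext and ext not in IGNORED
    )
    total = sum(counts.values())
    return [(ext, n, round(100 * n / total, 2)) for ext, n in counts.most_common()]
```

The percentages are relative to the URLs that made it into the table, not to the whole crawl, which matches how the numbers above add up.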

Languages

It’s clear that the Geminisphere is predominantly English. Other than that, Western languages seem to be the most common.

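| Language | Capsules | Share of total |
| --- | --- | --- |
| en | 3694 | 63.99% |
| unknown (errors, redirections, etc.) | 1734 | 30.04% |
| es | 96 | 1.66% |
| fr | 71 | 1.23% |
| de | 43 | 0.74% |
| ru | 30 | 0.51% |
| pt | 15 | 0.25% |
| it | 12 | 0.20% |
| pl | 12 | 0.20% |
| ja | 10 | 0.17% |
| cs | 4 | 0.06% |
| fi | 4 | 0.06% |
| hu | 4 | 0.06% |
| zh | 4 | 0.06% |
| ca | 3 | 0.05% |
| el | 3 | 0.05% |
| eo | 3 | 0.05% |
| he | 3 | 0.05% |
| sv | 3 | 0.05% |
| eu | 2 | 0.03% |
| ga | 2 | 0.03% |
| ko | 2 | 0.03% |
| nl | 2 | 0.03% |
| no | 2 | 0.03% |
| sl | 2 | 0.03% |
| tr | 2 | 0.03% |
| uk | 2 | 0.03% |
| bs | 1 | 0.01% |
| fa | 1 | 0.01% |
| gl | 1 | 0.01% |
| hy | 1 | 0.01% |
| la | 1 | 0.01% |
| lt | 1 | 0.01% |
| sk | 1 | 0.01% |
| sr | 1 | 0.01% |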

For comparison, these are the language statistics from gemini.bortzmeyer.org/software/lupa/stats.gmi as of April 2026:

| Language | URLs | Share of total |
| --- | --- | --- |
| en | 514827 | 62.24% |
| unknown | 305044 | 35.04% |
| de | 10289 | 1.18% |
| es | 2832 | 0.32% |
| fr | 2581 | 0.29% |
| ru | 538 | 0.06% |
| sv | 115 | 0.01% |
| sgs | 98 | 0.01% |

Bortzmeyer’s stats as of April 2026

Note that my statistics are based on the hostname root instead of URLs, while Bortzmeyer’s are based on header tags. Even so, the proportions are similar.

External protocols

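| Scheme/Protocol | Count |
| --- | --- |
| gopher | 176031 |
| finger | 83254 |
| telnet | 19234 |
| gwit | 18389 |
| nex | 17430 |
| spartan | 13832 |
| ftp | 4000 |
| git | 3278 |
| scroll | 3239 |
| irc | 2986 |
| news | 2746 |
| guppy | 1858 |
| xmpp | 1644 |
| moz-extension | 1445 |
| ninep | 1342 |
| chrome-extension | 952 |
| misfin | 782 |
| snews | 767 |
| ssh | 667 |
| scorpion | 597 |
| text | 589 |
| mailto | 492 |
| file | 331 |
| nntp | 307 |
| ttps | 256 |
| ipfs | 163 |
| titan | 85 |
| gophers | 72 |
| ircs | 63 |
| ttp | 21 |
| applewebdata | 21 |
| sftp | 20 |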

I omitted those with fewer than 20 occurrences since they were not very common. Gopher still seems to be the most popular alternative to Gemini. This makes sense, since one of the main requirements of Gemini was to provide a more modern alternative to Gopher. Finger and Telnet are also there.
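Counting schemes is mostly a matter of splitting each link and tallying the part before the colon. The Python sketch below illustrates the idea. The function name is invented, and ignoring `gemini`, `http` and `https` is an assumption based on those schemes being absent from the table.

```python
from collections import Counter
from urllib.parse import urlsplit


def scheme_counts(urls, ignored=("gemini", "http", "https"), minimum=20):
    """Return (scheme, count) pairs with at least `minimum` hits, most common first."""
    counts = Counter()
    for url in urls:
        try:
            scheme = urlsplit(url).scheme
        except ValueError:
            continue  # unparseable link, e.g. a broken IPv6 literal
        if scheme and scheme not in ignored:
            counts[scheme] += 1
    return [(s, n) for s, n in counts.most_common() if n >= minimum]
```

Since the scheme is taken literally, typos such as `ttps` and `ttp` (almost certainly mangled `https` and `http`) show up as protocols of their own, which is exactly what happens in the table.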

Some people left file:// with local paths, which indicates that Gemini users are also humans.

Lots of mail addresses too. I don’t know what’s with the Mozilla and Chrome extensions.

Wrap-up

Well, that was fun.

I’m bad with wrap-ups. Hopefully this helped increase the visibility of Gemini. Having alternative networks like this one is very important at a time when countries are even planning to set up barriers to simply using a computer or installing your own operating system.

If you haven’t tried Gemini, I suggest having a look at the specification and planning for a weekend project. I might never build a web browser, but I would like to think this helped me understand a bit more of how all this works.

Of course, I know not everyone is able or willing. For those people, I recommend this short introduction to Gemini, which serves as an introduction much better than I could.