Building a map of the GeminiNet

Introduction
Methodology
Some statistics
Wrap-up

Introduction

The small web is nice. It’s really nice. It reminds me of that era when corporations were too focused on traditional media such as television or radio and left us alone on the internet where we roamed forums, chatted on IRC or made dirty webpages with HTML and CSS we learnt through weird txt files on the internet.

Gemini is also very nice. Webpages feel fresh and discussion forums have high conversation quality. You can discuss stuff without trolls (mostly), without well organized public opinion management groups. Without ads because corporations have little to no incentive to spend money in a protocol where you can’t put flashy videos and client-side scripts to track consumption habits (well, they’re not part of the protocol specification anyways).

However, I feel that unlike other decentralized services such as Mastodon or Lemmy which you can browse through your usual browser, Gemini has not received a lot of love. Despite a lot of effort by the community and capsules (the Gemini version of webpages) that exist to track the development, it’s not super intuitive and I wanted to see how it was shaped. That’s why I attempted to build a map of Gemini capsules and how they’re interconnected with each other.

Methodology

The crawling

First of all I want to apologize to capsule owners. If someone made a lot of requests to your address, it was possibly me. Hope I didn’t bring down your server or anything.

I think it’s important to clarify that this is a partial crawl. It’s not supposed to be a complete image. I started crawling at gemini://geminiprotocol.net and went through the known capsule lists hosted by communities, but I’m sure a lot of capsules banned me before I could continue, and others simply are not listed there.

Something very very nice about Gemini is that the specification is really small. So small, in fact, that anyone with a bit of time can make a rudimentary client and connect to all the available capsules. This is something you absolutely cannot do with the modern net in a semipractical way since it’s packed with edge cases, security concerns, headers and modes that you have to handle gracefully.
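To give an idea of how little is needed, here is a rough Python sketch of a one-shot client. It is not the Rust client described in this post, and the function names are made up for illustration. It assumes the basics of the protocol: a TLS connection on port 1965, a request that is just the URL followed by CRLF, and a response header of the form status, space, meta. Certificate verification is switched off because many capsules use self-signed certificates; a real client would pin certificates on first use instead.

```python
import socket
import ssl
from urllib.parse import urlsplit


def build_request(url: str) -> bytes:
    """A Gemini request is the absolute URL followed by CRLF."""
    return url.encode("utf-8") + b"\r\n"


def parse_header(header_line: bytes) -> tuple[int, str]:
    """Split '<STATUS> <META>' into (status, meta)."""
    text = header_line.decode("utf-8").rstrip("\r\n")
    status, _, meta = text.partition(" ")
    return int(status), meta


def fetch(url: str, timeout: float = 10.0) -> tuple[int, str, bytes]:
    """Fetch one URL and return (status, meta, body). No redirects, no retries."""
    parts = urlsplit(url)
    host, port = parts.hostname, parts.port or 1965
    ctx = ssl.create_default_context()
    # Assumption for this sketch: skip verification (self-signed certs are common).
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            tls.sendall(build_request(url))
            data = b""
            # The server closes the connection when the response is complete.
            while chunk := tls.recv(4096):
                data += chunk
    header, _, body = data.partition(b"\r\n")
    status, meta = parse_header(header)
    return status, meta, body
```

Calling fetch("gemini://geminiprotocol.net/") should return a status in the 20s plus the page body if the capsule is reachable. Everything else (redirects, input prompts, client certificates) is left out on purpose.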

The crawler I made to crawl (…) the Geminisphere was basically a stripped-down version of the client I made for myself. It’s a simple Rust client that supports most of the specification (I would like to think), so it generally works.

In fact, I made 3 versions of that stripped-down client:

* The first version was for crawling the capsules and writing the information into a SQLite database.
* The second version was for normalizing the database.
* The third version was for fetching the root of each hostname.

I probably didn’t need the third. Oh well.

There were some edge cases with crawling I didn’t expect. The Gemininet is estimated to have around 1.042.214 URLs among 4.723 (!) capsules according to gemini.bortzmeyer.org/software/lupa/stats.gmi. It’s really small! However, the problem is precisely that it’s small. A lot of those servers are hosted on a Raspberry Pi or an old laptop, and you can’t make a lot of requests to them.

Going easy on them is hard when a lot of websites have loops and CGI URLs. They have text-based applications and games, so I had to revise the database from time to time to prune URLs and hostnames. Another pain in the ass was git repositories, since they tend to be quite big. At one point the database reached 45GB, with 80% of that being trash URLs.
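One way to keep both problems in check is a per-host budget: a minimum delay between requests and a cap on how many URLs a single host may contribute. The Python sketch below is only an illustration of that idea, not what the crawler in this post did (the pruning here was done by hand afterwards). The class name, the 1 second delay and the 5000 URL cap are all made-up values.

```python
import time
from collections import defaultdict
from urllib.parse import urlsplit


class HostBudget:
    """Tracks, per hostname, when it was last hit and how many URLs were fetched."""

    def __init__(self, min_delay=1.0, max_urls=5000, clock=time.monotonic):
        self.min_delay = min_delay  # seconds between two requests to one host
        self.max_urls = max_urls    # give up on a host after this many URLs
        self.clock = clock          # injectable so it can be tested
        self.last_hit = {}
        self.seen = defaultdict(int)

    def wait_time(self, url):
        """Seconds to wait before fetching url, or None if the host is over budget."""
        host = urlsplit(url).hostname
        if self.seen[host] >= self.max_urls:
            return None
        last = self.last_hit.get(host)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (self.clock() - last))

    def record(self, url):
        """Call right after a request to url has been made."""
        host = urlsplit(url).hostname
        self.seen[host] += 1
        self.last_hit[host] = self.clock()
```

A crawl loop would ask wait_time() before each request, sleep for that long, skip the URL when it gets None, and call record() afterwards. A cap like this would stop an endless CGI loop or a huge git mirror from eating the whole database, at the cost of truncating legitimately large capsules.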

And there are a lot of trash URLs. The Gemtext specification says:

All the following examples are valid link lines:

=> gemini://example.org/
=> gemini://example.org/ An example link
=> gemini://example.org/foo Another example link at the same host
=> foo/bar/baz.txt  A relative link
=> gopher://example.org:70/1 A gopher link

…which means that a lot of websites do things like => wikipedia.org, which is technically a relative link, expecting you to handle it gracefully. I mean, that’s okay. There aren’t that many of them. Still annoying though.
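To make that concrete, here is a small Python sketch of parsing a link line and resolving it against the page it came from. The function names are made up, and this is not the code used for the crawl. One assumption worth flagging: Python's urljoin does not resolve relative references for schemes it does not know, so the sketch temporarily swaps gemini for http.

```python
from urllib.parse import urljoin, urlsplit


def parse_link_line(line):
    """Return (target, label) for a '=>' line, or None for anything else."""
    if not line.startswith("=>"):
        return None
    fields = line[2:].strip().split(maxsplit=1)
    if not fields:
        return None
    target = fields[0]
    label = fields[1] if len(fields) > 1 else ""
    return target, label


def resolve(base, target):
    """Resolve target against a gemini:// base URL."""
    if urlsplit(target).scheme:
        return target  # already absolute (gemini, gopher, mailto, ...)
    # urljoin only resolves relative references for schemes it knows,
    # so borrow "http" for the join and swap "gemini" back in afterwards.
    joined = urljoin("http" + base[len("gemini"):], target)
    return "gemini" + joined[len("http"):]
```

With a page at gemini://example.org/links/index.gmi, the line "=> wikipedia.org" resolves to gemini://example.org/links/wikipedia.org, which is a perfectly valid URL that almost certainly does not exist.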

There are also search engines and archiving websites that can really fatten up your crawling if you’re not careful.

Finally, you have the dots. Ahhh, the dots. gemini://foo.com/bar/../bar/../bar/./bar/../bar. Here’s where the normalization client comes in.
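The dot cleanup itself is the standard "remove dot segments" step of URL resolution. A minimal Python sketch of it follows; it is not the actual normalization client, and the function names are made up. On top of the dots it also lowercases the host, drops the default port 1965 and the fragment, and turns an empty path into "/". Those extra steps are my assumptions about what a reasonable normalizer would do.

```python
from urllib.parse import urlsplit


def remove_dot_segments(path):
    """Collapse '.' and '..' segments in an absolute path."""
    segments = path.split("/")
    out = []
    for seg in segments:
        if seg == ".":
            continue
        if seg == "..":
            if len(out) > 1:  # never pop the leading empty segment
                out.pop()
            continue
        out.append(seg)
    if segments[-1] in (".", ".."):
        out.append("")  # '/a/b/..' becomes '/a/', not '/a'
    return "/".join(out)


def normalize(url):
    """Return a canonical form of url (IPv6 literals are not handled here)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.port and parts.port != 1965:
        host += f":{parts.port}"
    path = remove_dot_segments(parts.path) or "/"
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.scheme}://{host}{path}{query}"
```

Running normalize() on the example above gives gemini://foo.com/bar/bar, so all those variants collapse into a single database row.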

Summarizing the capsules

Once I had a neat small database of around 10GB pre-pruning some trash parts, I configured a local llama.cpp instance with Qwen3.5 8B for summarizing the content of those capsules and capsule subdomains. The parts I was most interested in were the description and categories, but I also wanted to know each capsule’s language and more detailed categories, so I set this format in the end:

{
    "hostname": ...,
    "language": ..., # two letter code
    "category": hosting|personal|bbs|forum|others # didn't want to complicate myself too much
    "subcategories": [ ... ],
    "description": ...
}

I think you can learn a lot of useful information about a capsule from that. Then I searched the database for subdomains and expanded the JSON (not the SQLite database) so that each entry looks like this:

    "hostname": ...
    "links": [
        {
            # the websites that hostname points to
        }
        ...
    ],
    "size": ... # the internal number of URLs
    "error": ... # error message or null if it was ok
    "redirect_to": ... # redirect address or null if it was ok
    "language": ...
    "category": ...
    "subcategories": ...
    "description": ...
    "subdomains": [
        # same as its parent except this one doesn't have further subdomains
        # for simplicity
    ]

For simplicity I considered subdomains to be anything that has a dot before the TLD.
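Read literally, that rule could be sketched in Python as below. This is my interpretation, not the code that built the JSON: I am assuming that "a dot before the TLD" means a hostname with more than two labels, and that its parent is the last two labels. The function names are made up.

```python
def parent_of(hostname):
    """Return the parent domain of a subdomain, or None for a top-level capsule."""
    labels = hostname.lower().rstrip(".").split(".")
    if len(labels) <= 2:
        return None  # e.g. example.org: nothing in front of the domain
    return ".".join(labels[-2:])


def group_by_parent(hostnames):
    """Map each parent domain to the list of its subdomains."""
    groups = {}
    for hostname in hostnames:
        parent = parent_of(hostname)
        if parent is None:
            groups.setdefault(hostname.lower(), [])
        else:
            groups.setdefault(parent, []).append(hostname.lower())
    return groups
```

The obvious weakness is multi-part suffixes: under this rule something like example.co.uk would be filed as a subdomain of co.uk. Doing it properly needs a public suffix list, which is exactly the complication the simple rule avoids.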

Curating the information

All in all the descriptions seemed reasonable, although I’m sure there are some weird ones around. A lot of the tags were very similar (say, blog/blogs, or site/personal webpage), so I spent some time curating them. Categories don’t make much sense if there’s going to be only 1 member in them after all.

If you see your capsule and want to change something about it, please make a pull request and I’ll accept it with no issues (if you want to make an issue on the repo, that’s ok too).

Presenting the information in a visual way

Which was half the point of doing all this to begin with.

I picked a force graph for the representation since it looked like a reasonable way of understanding the size and outgoing links of capsules, as well as having different levels of detail. I suck at front-end so I’m sorry if it doesn’t work on phone.

You can see the map here.

Some statistics

Basic stats

The database post-pruning ended up being 6.7GB. You can download it here. The JSON ended up being 34MB. I’m sure it can be compressed in some way. I’m sorry if you’re low on data.

Some of the websites with the most URLs

Hostname	Size	Short description
yesterweb.org	817645	Legacy capsule hosting
taz.de	396572	German newspaper
uploadedlobster.com	177068	Rob’s personal blog. Has lots of recordings.
jsreed5.org	142163	Mostly a replication of OEIS
techrights.org	140173	Open source advocacy group
hellomouse.net	128470	Capsule hosting
gemi.dev	106069	Mailing lists, search engine and archive
apple2.org.za	102432	Gopher hole for mirrors.apple2.org.za
snork.ca	65101	Blog with a mirror of textfiles.com
mediocregopher.com	62164	Personal blog
tuxmachines.org	62132	Open source advocacy
geminispace.org	49706	Probably one of the biggest BBS on Gemini
cthulhu28.space	40986	Directory and personal blog
freeshell.de	36016	Personal blog
spam.works	22448	Tons of mirrors of text files
pollux.casa	20585	Hosting for Gemini capsules

Resource types

I ignored common ones such as .gmi. It makes sense that .txt is common as well since it’s frequently used as a replacement for .gmi. Other than that, pictures and zip files seem to have the highest number of resources.

Extension	Count	Percent
txt	264230	42.33%
jpg	114116	18.28%
jpeg	84290	13.50%
shtml	59649	9.56%
png	57768	9.25%
zip	14936	2.39%
pdf	12582	2.02%
gif	2729	0.44%
html	2262	0.36%
mp3	2192	0.35%
tar.gz	1760	0.28%
ogg	1451	0.23%
wav	1234	0.20%
xml	1083	0.17%
webp	1037	0.17%
tar	829	0.13%
svg	738	0.12%
mp4	364	0.06%
gz	328	0.05%
json	328	0.05%
csv	80	0.01%
webm	60	0.01%
mov	42	0.01%
appimage	38	0.01%
rar	25	0.00%
ico	21	0.00%
bmp	21	0.00%
avi	5	0.00%
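A tally like the one above can be produced with a few lines of Python. This is a sketch, not the query actually used for these numbers; the function names and the special-casing of tar.gz are assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit

# Compound extensions that should be counted as a whole (assumed list).
COMPOUND = ("tar.gz",)


def extension_of(url):
    """Return the lowercase file extension of a URL's path, or None."""
    name = urlsplit(url).path.rsplit("/", 1)[-1].lower()
    for ext in COMPOUND:
        if name.endswith("." + ext):
            return ext
    if "." not in name.strip("."):
        return None  # directories, extensionless files, dotfiles
    return name.rsplit(".", 1)[-1]


def tally(urls, ignore=("gmi",)):
    """Return (extension, count, percent) rows, most common first."""
    counts = Counter(
        ext for ext in map(extension_of, urls) if ext and ext not in ignore
    )
    total = sum(counts.values())
    return [(ext, n, round(100 * n / total, 2)) for ext, n in counts.most_common()]
```

Note that the percentages are relative to the URLs that were counted, not to all URLs in the database, which is also how the table reads.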

Languages

It’s clear that the Geminisphere is predominantly English. Other than that, Western languages seem to be the most common.

Language	Capsule Number	Percentage over total
en	3694	63.99%
unknown	1734	30.04% (errors, redirections, etc.)
es	96	1.66%
fr	71	1.23%
de	43	0.74%
ru	30	0.51%
pt	15	0.25%
it	12	0.20%
pl	12	0.20%
ja	10	0.17%
cs	4	0.06%
fi	4	0.06%
hu	4	0.06%
zh	4	0.06%
ca	3	0.05%
el	3	0.05%
eo	3	0.05%
he	3	0.05%
sv	3	0.05%
eu	2	0.03%
ga	2	0.03%
ko	2	0.03%
nl	2	0.03%
no	2	0.03%
sl	2	0.03%
tr	2	0.03%
uk	2	0.03%
bs	1	0.01%
fa	1	0.01%
gl	1	0.01%
hy	1	0.01%
la	1	0.01%
lt	1	0.01%
sk	1	0.01%
sr	1	0.01%

For comparison, these are the language statistics from gemini.bortzmeyer.org/software/lupa/stats.gmi as of April 2026:

Language	URL Number	Percentage over total
en	514827	62.24%
unknown	305044	35.04%
de	10289	1.18%
es	2832	0.32%
fr	2581	0.29%
ru	538	0.06%
sv	115	0.01%
sgs	98	0.01%

Note that my statistics are based on the hostname root instead of URLs, while Bortzmeyer’s are based on header tags. However, they are similar.

External protocols

Scheme/Protocol	Count
gopher	176031
finger	83254
telnet	19234
gwit	18389
nex	17430
spartan	13832
ftp	4000
git	3278
scroll	3239
irc	2986
news	2746
guppy	1858
xmpp	1644
moz-extension	1445
ninep	1342
chrome-extension	952
misfin	782
snews	767
ssh	667
scorpion	597
text	589
mailto	492
file	331
nntp	307
ttps	256
ipfs	163
titan	85
gophers	72
ircs	63
ttp	21
applewebdata	21
sftp	20

I omitted those under 20 since they were not very common. It seems that Gopher is still the most popular alternative to Gemini. This makes sense since one of the main goals of Gemini was to provide a more modern alternative to Gopher. Finger and Telnet are also there.

Some people left file:// with local paths, which indicates that Gemini users are also humans.

Lots of mail addresses too. I don’t know what’s with the Mozilla and Chrome extensions.

Wrap-up

Well, that was fun.

I’m bad with wrap-ups. Hopefully this helped increase the visibility of Gemini. Having alternative networks like this one is very important at a time when countries are even planning to put up barriers to simply using a computer or installing your own operating system.

If you haven’t tried Gemini, I suggest having a look at the specification and planning for a weekend project. I might never build a web browser, but I would like to think this helped me understand a bit more of how all this works.

Of course, I know not everyone is able or willing. For those people, I recommend this short introduction to Gemini, which explains it in a much better way than I could.