Mastodon Usage — Counting Toots With Kafka, DuckDB, and Seaborn 🐘🦆📊

Exploratory data analysis of user activity, server popularity, and language usage on Mastodon

Published in

Better Programming

9 min readFeb 23, 2023

Mastodon is a decentralized social networking platform. Users are members of a specific Mastodon instance, and servers are capable of joining other servers to form a federated social network.
I wanted to start exploring Mastodon usage, and perform exploratory data analysis of user activity, server popularity and language usage. I used distributed stream processing tools to collect data from multiple instances to get a glimpse into what’s happening in the fediverse.

This article covers the tools for data collection and processing (Apache Kafka stream processing). If this doesn’t interest you, you can jump straight to the data analysis (DuckDB and Seaborn). For the enthusiastic, you can run the code.

Tools used

Mastodon.py — Python library for interacting with the Mastodon API
Apache Kafka — distributed event streaming platform
DuckDB — in-process SQL OLAP database and the HTTPFS DuckDB extension for reading remote/writing remote files of object storage using the S3 API
MinIO — S3 compatible server
Seaborn — visualization library

Data Collection — The Mastodon Listener

If you’re not interested in the data collection… jump straight to the data analysis.

There is a large collection of Mastodon servers with various subjects, topics, and interesting communities. Accessing a public stream is generally possible without authenticating, so no account is required to work out what’s happening on each server.

Not every message is sent to every server in the decentralized Mastodon network. Generally, public toots from instance A will only be sent to instance B if a user from B follows that user from A.

I wrote a Python application, mastodonlisten, to listen for public posts from a given server. I could collect toots from popular and niche instances by running multiple listeners. Each Python listener collects public toots from that server and publishes them to a private Kafka broker. This code illustrates how multiple Mastodon listeners can be run in the background:

python mastodonlisten.py --baseURL https://mastodon.social --enableKafka &

python mastodonlisten.py --baseURL https://universeodon.com --enableKafka &

python mastodonlisten.py --baseURL https://hachyderm.io --enableKafka &

Kafka Connect

I’ve now got multiple Mastodon listeners feeding public posts from multiple servers into a single Kafka topic. My next task is to understand what’s happening with all this activity from the decentralised network.

I decided to incrementally dump the “toots” into Parquet files on an S3 object store. Parquet is a columnar storage that is optimised for analytical querying. I used Kafka Connect to stream data between my Kafka topic and landed the data into S3 using the S3SinkConnector.

That sounds like a lot of work, but the TL;DR is that I can instruct Kafka Connect to do everything for me with a bit of configuration. To consume the Mastodon topic from Kafka and create a new parquet file on S3 for every 1,000 records, I can use this configuration:

{
    "name": "mastodon-sink-s3",
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "mastodon-topic",
    "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
    "flush.size": "1000",
    "s3.bucket.name": "mastodon",
    "aws.access.key.id": "minio",
    "aws.secret.access.key": "minio123",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "store.url": "http://minio:9000"
}

To check this is working correctly, I can see new files are being regularly created by looking in the MinIO web-based object browser.

Data analysis

Now that we have collected a week of Mastodon activity, let’s look at some data. These steps are detailed in the notebook.

Observations in this section are based on naive interpretation on a few days worth of data. Please don’t rely on any of this analysis, but feel free to use these techniques yourself to explore and learn.

First, here’s some quick statistics from the data collected over ten days (February 3 to February 12):

Number of Mastodon toots seen 1,622,149
Number of unique Mastodon users 142,877
Number of unique Mastodon instances 8,309
Number of languages seen 131
Shortest toot is 0 characters, average toot length is 151 characters, and longest toot is 68,991 characters (If you’re curious, the longest toot was a silly comment created by repeating the same emoji 68,930 times)
Total of all toots 245,245,677 characters
To hold 1.6 million toots in DuckDB’s memory takes just 745.5MB (which is tiny!)
Time it takes to calculate the above statistics in a single SQL query is 0.7 seconds (wow — fast!)

DuckDB’s Python client can be used directly in Jupyter notebook. The first step is to import the relevant libraries. DuckDB Python package can run queries directly on Pandas. With a few SqlMagic settings, it’s possible to configure the notebook to directly output data to Pandas.

%load_ext sql
%sql duckdb:///:memory:
%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

Plus, we can use the HTTPFS DuckDB extension for reading remote/writing remote files of object storage using the S3 API.

%%sql
INSTALL httpfs;
LOAD httpfs;

Establish S3 endpoint

Here we’re using a local MinIO as an Open Source, Amazon S3-compatible server (and no, you shouldn’t share your secret_access_key). Set the S3 endpoint settings like this:

%%sql
set s3_endpoint='localhost:9000';
set s3_access_key_id='minio';
set s3_secret_access_key='minio123';
set s3_use_ssl=false;
set s3_region='us-east-1';
set s3_url_style='path';

And I can now query the parquet files directly from s3 with this command:

%%sql
select *
from read_parquet('s3://mastodon/topics/mastodon-topic/partition=0/*');

Reading parquet from s3 without leaving the notebook

This is pretty cool. We can read the parquet data directly, which is sitting in our S3 bucket.

DuckDB SQL To Process Mastodon Activity

Before moving on, I had a bit of data cleanup, which I could do within DuckDB, loading remote parquet files (from s3).

Map the ISO 639–1 (two-letter) language code (zh, cy, en) to a language description (Chinese, Welsh, English). We can create alanguage lookup table and load languages from language.csv.
Calculate created_timestamp from the created_at integer. The created_at timestamp is calculated as the number of seconds from EPOC (1/1/1970)
Determine the originating instance with a regular expression to strip the URL
Create the table mastodon_toot as a join of mastodon_toot_raw to language

CREATE TABLE language(lang_iso VARCHAR PRIMARY KEY, language_name VARCHAR);

insert into language
select *
from read_csv('./language.csv', AUTO_DETECT=TRUE, header=True);

create table mastodon_toot_raw as
select m_id
, created_at, ('EPOCH'::TIMESTAMP + INTERVAL (created_at::INT) seconds)::TIMESTAMPTZ  as created_tz
, app
, url
, regexp_replace(regexp_replace(url, '^http[s]://', ''), '/.*$', '') as from_instance
, base_url
, language
, favourites
, username
, bot
, tags
, characters
, mastodon_text
from read_parquet('s3://mastodon/topics/mastodon-topic/partition=0/*');
create table mastodon_toot as
select mr.*, ln.language_name
from mastodon_toot_raw mr 
left outer join language ln on (mr.language = ln.lang_iso);

Being able to do this cleanup and transformation in SQL and execute it in 0.8 seconds is like magic to me.

Very fast processing — less then a second

Daily Mastodon Usage

We can query the mastodon_toot table directly to see the number of toots, users each day by counting and grouping the activity by the day. We can use the mode aggregate function to find the most frequent “bot” and “not-bot” users to find the most active Mastodon users.

%%sql
select strftime(created_tz, '%Y/%m/%d %a') as "Created day"
, count(*) as "Num toots"
, count(distinct(username)) as "Num users"
, count(distinct(from_instance)) as "Num urls"
, mode(case when bot='False' then username end) as "Most freq non-bot"
, mode(case when bot='True' then username end) as "Most freq bot"
, mode(base_url) as "Most freq host"
from mastodon_toot
group by 1
order by 1;

️️️The first few days were sporadic as I played with the data collection. Once everything was set up, I was generally seeing the following:

200,000 toots a day from 50,000 users
mastodon.social was the most popular host
news organisations are the biggest generator of content (and they don’t always set the “bot” attribute)

The Mastodon App Landscape

Which clients are used to accessing the Mastodon instance. We take the query the mastodon_toot table, excluding "bots," and load query results into the mastodon_app_df Panda dataframe. Here’s how to do that:

%%sql
mastodon_app_df << 
    select *
    from mastodon_toot
    where app is not null 
    and app <> ''
    and bot='False';

Seaborn is a visualization library for statistical graphics in Python, built on the top of matplotlib. It also works really well with Panda data structures.

We can use seaborn.countplot to show the counts of Mastodon app usage observations in each categorical bin using bars. Note we are limiting this to the ten highest occurrences.

sns.countplot(data=mastodon_app_df, y="app", order=mastodon_app_df.app.value_counts().iloc[:10].index)

Web, Ivory, and Moa are popular ways of tooting

The Mastodon application landscape is rapidly changing. Web usage is the preferred client, with mobile apps like Ivory, Moa, Tusky, and the Mastodon app.

The Mastodon API attempts to report an application for the client used to post the toot. Generally, this attribute does not federate and is, therefore, undefined for remote toots.

Time of Day Mastodon Usage

Let’s see when Mastodon is used throughout the day and night. I want to get a raw hourly count of toots each hour of each day. We can load the results of this query into the mastodon_usage_df dataframe.

%%sql
mastodon_usage_df << 
    select strftime(created_tz, '%Y/%m/%d %a') as created_day
    , date_part('hour', created_tz) as created_hour
    , count(*) as num
    from mastodon_toot
    group by 1,2 
    order by 1,2;

sns.lineplot(data=mastodon_usage_df, x="created_hour", y="num", hue="created_day").set_xticks(range(24))

It was interesting to see that the daily activity follows a similar usage pattern. Here’s what that looks like:

The lowest activity was seen at 3:00 p.m. in Australia (12:00 p.m. in China, 8:00 p.m. in California, and 4:00 a.m. in London)
The highest activity was seen at 2:00 a.m. in Australia (11:00 p.m. in China, 7:00 a.m. in California, and 3:00 p.m. in London)

Language Usage

The server or the client can specify the language of the toot, so it’s not always an accurate indicator of the language within the toot. So consider this a wildly inaccurate investigation of language tags.

%%sql
mastodon_usage_df << 
    select *
    from mastodon_toot;

sns.countplot(data=mastodon_usage_df, y="language_name", order=mastodon_usage_df.language_name.value_counts().iloc[:20].index)

English is prominent language, followed by Japanese

Toot Length by Language Usage

I was also curious about the length of toots over different languages.

%%sql
mastodon_lang_df << 
    select *
    from mastodon_toot
    where language not in ('unknown');

sns.boxplot(data=mastodon_lang_df, x="characters", y="language_name", whis=100, orient="h", order=mastodon_lang_df.language_name.value_counts().iloc[:20].index)

What’s interesting to see is the typical Chinese, Japanese, and Korean toots are shorter than English, whereas Galicia and Finish messages are longer. A possible explanation is logographic languages (like Mandarin) may be able to convey more with fewer characters.

Closing Thoughts

The rise of Mastodon is something I’ve been really interested in. The open-sharing nature has helped with the rapid adoption by communities and new users (myself included).

It’s been great to explore the fediverse with powerful open-source distributed stream processing tools. Performing exploratory data analysis in a Jupyter notebook with DuckDB is like pressing a turbo button. Reading parquet from S3 without leaving the notebook is neat, and DuckDB’s ability to run queries directly on Pandas data without ever importing or copying any data is really snappy.

I’m going to conclude with my two favorite statistics: DuckDB memory used to hold 1.6 million toots is just 745.5MB, and to process my results in 0.7 seconds is like a superpower 🪄