Prior to starting a historical data migration, ensure you do the following:
- Create a project on our US or EU Cloud.
- Sign up for a paid product analytics plan on the billing page (historical imports are free, but this unlocks the necessary features).
- Set the `historical_migration` option to `true` when capturing events in the migration.
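With the Python library, that setup looks like the following minimal sketch. The API key and host are placeholders you replace with your own values:

```python
from posthog import Posthog

# historical_migration=True sends events through the historical import
# pipeline rather than treating them as live events
posthog = Posthog(
    '<ph_client_api_key>',
    host='https://us.i.posthog.com',
    historical_migration=True
)
```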
⚠️ Experimental: The following migration guide is not comprehensive and generates data that is less accurate than the data captured by PostHog. Contributions to improve it are welcome; feel free to open a PR here.
Compared to other tools, Plausible makes it easy to get your data out, but that data is difficult to convert to PostHog's schema. This is because Plausible exports aggregated data rather than the individual events that make up the aggregation (which is what PostHog expects).
This means we can convert the data, but the result won't be 100% accurate. For example, Plausible data tells you there were a certain number of visitors, pageviews, and visits, but it doesn't tell you which user did which event at what time. We need to break this data down to generate individual events that produce an equivalent aggregation.
This guide aims to convert aggregate data into events where possible and capture them in PostHog.
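To make that concrete, here is a hand-rolled sketch of expanding one aggregate row into individual events whose totals still match. The row values, visitor IDs, and the 3/2 split are all invented for illustration, since the export doesn't say who viewed what:

```python
# One aggregated row from a Plausible export (values invented for illustration)
row = {'date': '2024-10-01', 'visitors': 2, 'pageviews': 5}

# One possible breakdown into individual events. The IDs and the 3/2 split
# are made up, because the export doesn't record who viewed what or when.
events = (
    [{'event': '$pageview', 'distinct_id': 'visitor-a'}] * 3
    + [{'event': '$pageview', 'distinct_id': 'visitor-b'}] * 2
)

# Aggregating the generated events reproduces the original row's totals
assert len(events) == row['pageviews']
assert len({e['distinct_id'] for e in events}) == row['visitors']
```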
Terminology differences: Although pageviews, custom events, and many properties are the same between Plausible and PostHog, there are two major terminology differences:
- Visitors in Plausible are persons in PostHog.
- Visits in Plausible are sessions in PostHog.
1. Export data from Plausible
To export data from Plausible, go to your site settings and click on the Imports & Exports tab. Once there, click Prepare download and download the collection of `csv` files once it's ready.
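Before converting anything, it's worth sanity-checking the exports. A small sketch, assuming the file names used in the script below; the conversion relies on the `date`, `visitors`, and `pageviews` columns of the visitors export and the `date`, `hostname`, `page`, and `pageviews` columns of the pages export:

```python
import csv

# Print each export's header row to confirm the expected columns exist
for filename in [
    'plausible-export/imported_visitors_20241001_20241008.csv',
    'plausible-export/imported_pages_20241001_20241008.csv',
]:
    with open(filename) as f:
        print(filename, '->', csv.DictReader(f).fieldnames)
```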
2. Convert Plausible data and capture it in PostHog
🛠️ Work in progress: The following script converts and captures visitors and pageviews, but it is not comprehensive. For example, it doesn't include:
- Person or event properties like UTMs or channels
- Session data (generating `$session_id` and correct timing for session duration); see the sketch after this list for one possible approach
- Bounce rate
- Custom events
- Entry or exit pages
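As a sketch of what one contribution could look like, you could approximate session data by generating a `$session_id` per visitor per day with the same `uuid7` helper the script already uses. This rests on an assumption the export can't confirm (that all of a visitor's pageviews in a day belong to one session), so treat it as a starting point rather than an accurate reconstruction:

```python
from uuid_extensions import uuid7

# Hypothetical helper: one session ID per generated visitor per day.
# Assumes all of that visitor's pageviews belong to a single session.
def generate_session_id(timestamp):
    ns_since_epoch = int(timestamp.timestamp() * 1e9)
    return str(uuid7(ns=ns_since_epoch))

# You would store it alongside the distinct ID when building each person,
# then pass it in each captured event's properties, e.g.:
# properties={'$session_id': person['session_id'], ...}
```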
Once you have your exported data, you can run the following script to convert it to PostHog's schema and capture pageviews. This script randomly generates `$distinct_id` values for each visitor and randomly assigns them pageviews. It requires the names and locations of your exported data as well as your Plausible site timezone, which you can find in your site settings.
```python
import csv
from datetime import datetime
from backports.zoneinfo import ZoneInfo
import random
from uuid_extensions import uuid7
import numpy as np
from posthog import Posthog

# Replace with your export file names and location
visitors_filename = 'plausible-export/imported_visitors_20241001_20241008.csv'
pages_filename = 'plausible-export/imported_pages_20241001_20241008.csv'
# Replace with your site timezone
utc = 'US/Pacific'

posthog = Posthog(
    '<ph_client_api_key>',
    host='https://us.i.posthog.com',
    debug=True,
    historical_migration=True
)

def generate_distinct_id(date):
    ns_since_epoch = int(date.timestamp() * 1e9)
    return str(uuid7(ns=ns_since_epoch))

def convert_date_to_timestamp(date):
    date_obj = datetime.strptime(date, '%Y-%m-%d')
    local_datetime = date_obj.replace(
        hour=0,
        minute=random.randint(0, 59),
        second=random.randint(0, 59),
        tzinfo=ZoneInfo(utc)
    )
    utc_datetime = local_datetime.astimezone(ZoneInfo("UTC"))
    return utc_datetime

def distribute_pageviews(visitors, total_pageviews):
    # Ensure each visitor has at least 1 pageview
    min_pageviews = np.ones(visitors, dtype=int)
    remaining_pageviews = total_pageviews - visitors

    if remaining_pageviews > 0:
        # Distribute remaining pageviews
        weights = np.random.dirichlet(np.ones(visitors)) * remaining_pageviews
        additional_pageviews = np.round(weights).astype(int)
        pageviews_per_visitor = min_pageviews + additional_pageviews
    else:
        # If total_pageviews <= visitors, each visitor gets 1 pageview
        pageviews_per_visitor = min_pageviews

    # Adjust for any rounding discrepancies
    difference = total_pageviews - pageviews_per_visitor.sum()
    for _ in range(abs(int(difference))):
        if difference > 0:
            pageviews_per_visitor[np.argmin(pageviews_per_visitor)] += 1
        else:
            pageviews_per_visitor[np.argmax(pageviews_per_visitor)] -= 1

    return pageviews_per_visitor.tolist()

with open(visitors_filename, 'r') as visitors_file:
    visitors_reader = csv.DictReader(visitors_file)
    for visitor_row in visitors_reader:
        date = visitor_row['date']
        # Create persons with distinct IDs
        persons = []
        visitors = int(visitor_row['visitors'])
        pageviews_distribution = distribute_pageviews(visitors, int(visitor_row['pageviews']))
        for i in range(visitors):
            timestamp = convert_date_to_timestamp(date)
            persons.append({
                'distinct_id': generate_distinct_id(timestamp),
                'pageviews': pageviews_distribution[i]
            })

        # Generate pageviews from pages and persons
        with open(pages_filename, 'r') as pages_file:
            pages_reader = csv.DictReader(pages_file)
            counter = 0
            for page_row in pages_reader:
                if page_row['date'] != date:
                    continue
                for i in range(int(page_row['pageviews'])):
                    person = random.choice(persons)
                    posthog.capture(
                        distinct_id=person['distinct_id'],
                        event='$pageview',
                        properties={
                            '$current_url': f"https://{page_row['hostname']}{page_row['page']}",
                            '$host': page_row['hostname'],
                            '$pathname': page_row['page']
                        },
                        timestamp=convert_date_to_timestamp(date)
                    )
                    if person['pageviews'] == 1:
                        persons.remove(person)
                    else:
                        person['pageviews'] -= 1
                    counter += 1
```
Once you've run the script successfully, you should see data in PostHog. You can then continue your PostHog setup by installing the snippet or a library.