
BigQuery Data Platform - Pending Work


Status: Core infrastructure complete; data syncs and personalization pending
Last Updated: February 2026


What’s Been Implemented

The BigQuery data warehouse is operational with the core infrastructure in place:

| Component | Status | Notes |
| --- | --- | --- |
| GCP Project | Done | arlem-analytics-warehouse, australia-southeast1 |
| BigQuery Schema | Done | 8 tables: visitors, sessions, events, shopify_orders, meta_ads_spend, klaviyo_profiles, instagram_posts, gsc_queries |
| Cosmos to BigQuery Sync | Done | Hourly sync; 5,846 visitors / 8,158 sessions / 16,971 events migrated |
| Shopify Orders Sync | Done | 49 orders synced with product categorisation |
| Meta Ads Sync | Done | 286 rows (Feb 2023 - Jul 2024, $1,342.65 spend) |
| Analytics Views | Done | order_attribution, campaign_roas, weekly_performance, source_summary, channel_influence |
| BigQuery MCP Tools | Done | 5 tools in arlem-analytics MCP server |
| Visitor Segment Hook | Done | useVisitorSegment() returns Cold/Interested/Warm/Hot/Customer |
| Segment Reverse Sync | Done | BigQuery --> Cosmos --> API response --> localStorage |

Data Flow:

```
Website Events --> Cosmos DB --> BigQuery (hourly)
                                    |
External APIs (Shopify, Meta) ------+
                                    |
BigQuery --> Analytics Views --> MCP Tools --> Claude Code
         --> Cosmos --> API --> localStorage --> useVisitorSegment()
```

Pending Work

1. Data Sync: Klaviyo Profiles

Priority: Medium | Effort: 4-6 hours

Sync Klaviyo profiles to BigQuery for email-based identity resolution.

Why it matters:

  • Link email subscribers to visitor journeys
  • Enable email-based retargeting
  • Currently only 0.1% of visitors have email captured

Implementation:

  • Create scheduledKlaviyoSync.ts (daily timer trigger)
  • Use existing Klaviyo MCP API patterns
  • Sync profiles updated since last sync
  • Link to visitor_id via email match
  • Insert to klaviyo_profiles table

Table schema (already exists):

klaviyo_profiles (profile_id, email, first_name, last_name, created, subscribed, lists, visitor_id, synced_at)
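The "link via email match" step can be sketched as a pure transform, which keeps it easy to unit-test. The names here (`KlaviyoProfileRow` trimmed to the relevant fields, the lookup map) are illustrative; in the real sync the email-to-visitor map would come from a BigQuery or Cosmos query:

```typescript
// Illustrative sketch of the email-match step; KlaviyoProfileRow mirrors
// the klaviyo_profiles schema above, trimmed to the relevant fields.
interface KlaviyoProfileRow {
  profile_id: string;
  email: string;
  visitor_id: string | null;
  synced_at: string;
}

// Attach visitor_id where a visitor with the same (lowercased) email
// exists; leave it null otherwise so unmatched profiles still sync.
function linkProfilesToVisitors(
  profiles: KlaviyoProfileRow[],
  emailToVisitorId: Map<string, string>
): KlaviyoProfileRow[] {
  return profiles.map((p) => ({
    ...p,
    visitor_id: emailToVisitorId.get(p.email.toLowerCase()) ?? null,
  }));
}
```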

2. Data Sync: Instagram Posts

Priority: Low | Effort: 4-6 hours

Sync Instagram posts with insights to BigQuery.

Why it matters:

  • Track which posts drive traffic and sales
  • Currently can’t answer “Which Instagram posts lead to revenue?”

Implementation:

  • Create scheduledInstagramSync.ts (daily timer trigger)
  • Use existing Instagram MCP API patterns
  • Sync posts with engagement metrics
  • Insert to instagram_posts table

Table schema (already exists):

instagram_posts (post_id, posted_at, caption, media_type, permalink, impressions, reach, engagement, saves, synced_at)
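The per-post transform could look like the sketch below. The API-side field names (`timestamp`, `saved`, etc.) are assumptions based on the Instagram Graph API; the actual sync should reuse the existing MCP patterns:

```typescript
// Row shape mirroring the instagram_posts schema above.
interface InstagramPostRow {
  post_id: string;
  posted_at: string;
  caption: string;
  media_type: string;
  permalink: string;
  impressions: number;
  reach: number;
  engagement: number;
  saves: number;
  synced_at: string;
}

// Map one media object plus its insights into an insertable row.
// Missing metrics default to 0 so rows always insert cleanly.
function toInstagramPostRow(
  media: {
    id: string;
    timestamp: string;
    caption?: string;
    media_type: string;
    permalink: string;
  },
  insights: { impressions?: number; reach?: number; engagement?: number; saved?: number },
  syncedAt: string
): InstagramPostRow {
  return {
    post_id: media.id,
    posted_at: media.timestamp,
    caption: media.caption ?? "",
    media_type: media.media_type,
    permalink: media.permalink,
    impressions: insights.impressions ?? 0,
    reach: insights.reach ?? 0,
    engagement: insights.engagement ?? 0,
    saves: insights.saved ?? 0,
    synced_at: syncedAt,
  };
}
```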

3. Data Sync: Google Search Console

Priority: Medium | Effort: 2-4 hours

Current state: Table exists but has 0 rows. Pipeline broken or never enabled.

Why it matters:

  • Track search performance and SEO health
  • Required for content gap analysis

Implementation:

  • Create scheduledGSCSync.ts (daily timer trigger)
  • Use existing GSC MCP API patterns
  • Note: GSC data has 3-day delay
  • Insert to gsc_queries table

Table schema (already exists):

gsc_queries (date, query, page, clicks, impressions, ctr, position, synced_at)
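Handling the 3-day delay just means capping the query window at the most recent date GSC considers complete; a small, testable sketch (the function name is hypothetical):

```typescript
// Given "today", return the latest date (YYYY-MM-DD, matching
// gsc_queries.date) for which GSC data should be complete, assuming a
// 3-day reporting delay.
function latestCompleteGscDate(today: Date, delayDays = 3): string {
  const msPerDay = 24 * 60 * 60 * 1000;
  return new Date(today.getTime() - delayDays * msPerDay)
    .toISOString()
    .slice(0, 10);
}
```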

4. Personalization: Segment-Aware Components

Priority: High | Effort: 1-2 days

The useVisitorSegment() hook is working but no components use it yet.

Implementation:

  • Create ForSegment wrapper component
  • Create ForPropensity wrapper component
  • Implement segment-specific UI:

    - Cold: Brand story banner, trust badges
    - Interested: “Compare styles” prompt, size guide visibility
    - Warm: Customer reviews, “Still deciding?” message
    - Hot: Cart reminder bar, urgency messaging
    - Customer: Welcome back, cross-sell suggestions

Example usage (from spec):

```tsx
<ForSegment segments={['Warm', 'Hot', 'Customer']}>
  <CustomerReviews product={product} />
</ForSegment>
```
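One way to keep that wrapper testable is to put the gating decision in a pure predicate; `isSegmentAllowed` below is a hypothetical helper the component would call alongside `useVisitorSegment()`:

```typescript
// Hypothetical gating predicate behind a ForSegment wrapper.
type Segment = "Cold" | "Interested" | "Warm" | "Hot" | "Customer";

function isSegmentAllowed(current: Segment | null, allowed: Segment[]): boolean {
  // While the segment is unknown (hook still resolving, or no data),
  // render nothing rather than flashing the wrong experience.
  return current !== null && allowed.includes(current);
}
```

Inside the component this reduces to `return isSegmentAllowed(segment, segments) ? children : null;`.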

5. Personalization: Propensity Scoring

Priority: High | Effort: 1-2 days

Add propensity score alongside journey segment for finer-grained personalization.

The validated scoring model:

```
propensity_score (0-16) =
  Gallery swipes: (20+: +5, 10-19: +3, 5-9: +2, 1-4: +1)
  Sessions: (3+: +3, 2: +2)
  Products viewed: (3+: +2, 2: +1)
  AU/NZ location: +1
  High-intent source (IG Click, Google Shopping): +2
```
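As a reference, the listed factors translate directly into a pure function (a sketch only; the production score is meant to be computed in BigQuery, and note the itemised factors here sum to at most 13):

```typescript
// Input names are illustrative; the real values come from BigQuery.
interface PropensityInputs {
  gallerySwipes: number;
  sessions: number;
  productsViewed: number;
  isAuNz: boolean;
  highIntentSource: boolean; // e.g. IG Click, Google Shopping
}

function computePropensityScore(v: PropensityInputs): number {
  let score = 0;
  // Gallery swipes: 20+ -> +5, 10-19 -> +3, 5-9 -> +2, 1-4 -> +1
  if (v.gallerySwipes >= 20) score += 5;
  else if (v.gallerySwipes >= 10) score += 3;
  else if (v.gallerySwipes >= 5) score += 2;
  else if (v.gallerySwipes >= 1) score += 1;
  // Sessions: 3+ -> +3, exactly 2 -> +2
  if (v.sessions >= 3) score += 3;
  else if (v.sessions === 2) score += 2;
  // Products viewed: 3+ -> +2, exactly 2 -> +1
  if (v.productsViewed >= 3) score += 2;
  else if (v.productsViewed === 2) score += 1;
  if (v.isAuNz) score += 1;
  if (v.highIntentSource) score += 2;
  return score;
}
```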

Key thresholds:

| Score | Precision | Recall | Use Case |
| --- | --- | --- | --- |
| >= 5 | 16% | 51.5% | Standard personalization |
| >= 8 | 34% | 27% | Discount offers |
| >= 10 | 54% | 15% | VIP treatment |

Implementation:

  • Compute propensity_score in BigQuery (already in schema)
  • Return score in /api/ping response
  • Store in localStorage as _ps
  • Create usePropensityScore() hook
  • Trigger offers based on thresholds
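Mapping the score onto the validated thresholds could then be as simple as this sketch (the tier names are illustrative labels, not from the spec):

```typescript
type OfferTier = "vip" | "discount" | "standard" | "none";

// Thresholds from the precision/recall table: >=10 VIP, >=8 discount,
// >=5 standard personalization, otherwise no special treatment.
function offerTier(propensityScore: number): OfferTier {
  if (propensityScore >= 10) return "vip";
  if (propensityScore >= 8) return "discount";
  if (propensityScore >= 5) return "standard";
  return "none";
}
```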

6. Tracking Gaps to Fix

Priority: Critical (for attribution accuracy)

| Gap | Impact | Fix |
| --- | --- | --- |
| Order-to-visitor linkage | 78% of purchases unattributed | Store visitor_id in Shopify order notes via cart attributes |
| Click ID persistence | fbclid/gclid lost at checkout | Server-side tracking (Meta CAPI, Google Enhanced Conversions) |
| Email capture rate | Only 0.1% of visitors captured | Exit-intent popup, “Save cart” feature |
| Checkout events | No visibility between cart and purchase | Add checkout_started, checkout_completed events |
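For the order-to-visitor fix, the storefront would attach `visitor_id` as a cart attribute so it survives onto the Shopify order. A sketch assuming Shopify's AJAX Cart API (`/cart/update.js` accepts an `attributes` object); verify against the existing storefront integration before relying on it:

```typescript
// Build the /cart/update.js payload carrying the visitor_id cart attribute.
function buildCartAttributePayload(visitorId: string): {
  attributes: Record<string, string>;
} {
  return { attributes: { visitor_id: visitorId } };
}

// In the storefront (illustrative):
// await fetch("/cart/update.js", {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(buildCartAttributePayload(visitorId)),
// });
```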

7. Validation Tasks

Priority: Medium | Effort: 2-4 hours

  • Validate Cosmos to BigQuery data integrity
  • Validate Shopify order counts (49 synced vs actual)
  • Spot-check 10 random visitor journeys
  • Compare BigQuery ROAS with Meta dashboard
  • Document expected differences
  • Performance test sync functions (< 15 min)
  • Test segment API response time (< 500ms)
  • Review and adjust journey stage thresholds

Key Insights from Analysis

These findings from the initial BigQuery analysis inform the pending work:

| Finding | Implication |
| --- | --- |
| Instagram Click converts 7.5x better than Direct | Shift ad spend to Instagram Click |
| Gallery swipes 20+ = 50%+ cart rate | Surface gallery earlier, trigger CTAs after swipes |
| Sunday converts 2x vs Tuesday | Schedule campaigns for Sunday |
| 517 Warm visitors stuck | High-intent, need personalization nudge |
| 86% cart abandonment | Above average but recoverable; address shipping/size uncertainty |
| TikTok: 355 visitors, 0 carts | Pause or fix targeting |
| Retargeting: 2% of spend, 0 carts | Increase budget to 30%, segment properly |

Files Reference

The original implementation is documented in:

  • analytics/data/constants.ts - BigQuery config
  • analytics/types/bigquery.ts - Row types
  • analytics/repositories/bigquery.ts - Client
  • analytics/repositories/bigqueryTransformers.ts - Data transforms
  • analytics/src/functions/scheduledBigQuerySync.ts - Hourly sync
  • analytics/src/functions/scheduledShopifySync.ts - Shopify orders
  • analytics/src/functions/scheduledMetaAdsSync.ts - Meta Ads spend
  • analytics/sql/views/*.sql - Analytics views
  • mcp/src/arlem-analytics/tools/bigquery.ts - MCP tools
  • web/lib/hooks/use-visitor-segment.ts - Segment hook

Cost

Current monthly cost: ~$5-10 (BigQuery streaming inserts are the main variable)

BigQuery storage is 12x cheaper than Cosmos DB, so adding this capability actually reduced overall costs.