Mini-difi vs. brute-force linking tests
This demonstration takes the output from midway through a Sorcha run with the demoset of 100k TNOs before linking is applied.
[ ]:
import pandas as pd
import numpy as np
from astropy.coordinates import SkyCoord
import astropy.units as u
Inputs
[2]:
df = pd.read_csv("midSorcha_TNOs.csv").drop(columns=["Unnamed: 0"]).reset_index()
print(f"Number of objects: {len(df['ssObjectId'].unique())}")
print(f"Number of observations: {len(df)}")
Number of objects: 994
Number of observations: 62354
The “brute force” miniDifi implementation.
This should be a (hopefully) more pedagogically easier to follow implementation of the mock linker algorithm.
The results should match the fast (but harder to follow) implementation in Sorcha.
[3]:
#
# Compute if the current night has observations that will qualify as traclkets
#
def compute_nights(df, night_start_utc_days = 17.0 / 24.0):
# this computes the integer night (== the MJD of the evening the observations started)
# for the observations in df.
tshifted = df["midPointTai"] - night_start_utc_days
return tshifted.astype(int)
def has_tracklet(df, minlen_arcsec=1, maxdt_minutes=90, min_observations=2):
# this computes whether a given night has /any/ observations close enough to be
# made into a tracklet.
#
# WARNING: the input dataframe must be for _a single_ object.
if "ssObjectId" in df.columns:
assert len(df["ssObjectId"].unique()) == 1, "This function must be run on a single object"
maxdt = maxdt_minutes / 60. / 24.
minlen = minlen_arcsec / 3600.
detections = []
if len(df)< min_observations:
return False
if min_observations == 1: # for edge case where one observation is needed for a tracklet
return True
for i in range(len(df)):
for j in range(i+1, len(df)):
t0, t1 = df["midPointTai"].iloc[i], df["midPointTai"].iloc[j]
ra0, ra1 = df["ra"].iloc[i], df["ra"].iloc[j]
dec0, dec1 = df["decl"].iloc[i], df["decl"].iloc[j]
# is this pair too spread out in time?
dt = t1 - t0
assert dt > 0, (dt, t1, t0)
if dt > maxdt:
continue
coord1 = SkyCoord(ra=ra0*u.degree, dec=dec0*u.degree, frame='icrs')
coord2 = SkyCoord(ra=ra1*u.degree, dec=dec1*u.degree, frame='icrs')
angular_distance = coord1.separation(coord2)
dist = angular_distance.deg
# is this pair too close in distance?
if dist < minlen:
continue
# this combo will make a tracklet; return true!
##print(f'dt = {dt*24*60:.2f} min, sep = {dist*3600:.2f}"')
detections.append(i)
detections.append(j)
if len(set(detections)) >= min_observations:
return True
return False
def nights_with_tracklets(df):
# returns a list (as ndarray) of nights (MJD) with tracklets
if "ssObjectId" in df.columns:
assert len(df["ssObjectId"].unique()) == 1, "This function must be run on a single object"
# Filter the DataFrame to only keep rows where the 'night' value appears 2 or more times
df_filtered = df[df.groupby('night')['night'].transform('count') >= 1].reset_index()
# Now, for each night compute if it has a tracklet by calling has_tracklet()
nightHasTracklet = df_filtered.groupby("night").apply(has_tracklet, include_groups=False)
nightsWithTracklets = nightHasTracklet[nightHasTracklet]
return nightsWithTracklets.index.values
def enumerateOpportunities(nights, window=14, nlink=3):
#
# Enumerate all discovery opportunities given the list of nights with tracklets.
#
# Note: window is inclusive of the most recent night, and exclusive of the -window night.
#
# For example, if window=2, that means we'd search for tracklets tonight and yesteday,
# but not day before yesterday. I.e., if tonight is the MJD of current night, the condition
# to include a night into the tracklet suite is:
#
# tonight - window < tracklet_night and tracklet_night <= tonight
#
# (i.e., it's \lt on the first comparison, not \le).
# we'll be collecting unique opportunities here
opportunities = set()
if len(nights) == 0:
return opportunities
# go from the first to last night, and for each find the
# tracklets in the trailing window
for t1 in range(nights.min(), nights.max()+window+1):
t0 = t1 - window # trailing window
tracklets = tuple(nights[(t0 < nights) & (nights <= t1)])
# not enough tracklets to link? continue...
if len(tracklets) < nlink:
continue
opportunities.add(tracklets)
return opportunities
def linkObject(df):
#
# find and return the list of all discovery opportunities for a given
# object.
#
# WARNING: the df must be for a single object (!!).
#
nightsWithTracklets = nights_with_tracklets(df)
opps = enumerateOpportunities(nightsWithTracklets)
return opps
def linkObservations(df):
#
# The main driver. Given a list of observations of multiple objects,
# analyze whether each object can be linked and return the number of
# linking opportunities.
#
df["night"] = compute_nights(df)
res = df.groupby("ssObjectId").apply(
lambda df: len(linkObject(df)),
include_groups=False)
return res
Tests
Run the brute-force linker. It returns a series with the number of discovery opportunities for each object:
[4]:
%%time
bf = linkObservations(df)
bf
CPU times: user 11.4 s, sys: 36.9 ms, total: 11.5 s
Wall time: 11.5 s
[4]:
ssObjectId
STC001TFa 9
STC001TGa 2
STC001THa 17
STC001TIa 2
STC001TJa 2
..
STC0029va 4
STC0029wa 7
STC0029xa 1
STC0029ya 14
STC0029za 14
Length: 994, dtype: int64
Now run the Sorcha miniDifi implementation.
Note: if you’re paying attention to runtime, run this cell at least twice as the first time you’re running there will be lots of overhead with module loading and JIT compilation.
[5]:
%%time
import sorcha.modules.PPMiniDifi as md
df_all = pd.read_csv("midSorcha_TNOs.csv").drop(columns=["Unnamed: 0"]).reset_index()
# Convert to ndarray
nameLen = df_all["ssObjectId"].str.len().max()
obsv = np.asarray(
df_all.to_records(
index=False,
column_dtypes=dict(_name=f"a{nameLen}", diaSourceId="u8", midPointTai="f8", ra="f8", decl="f8"),
)
)
# run minidifi
minidifi = md.linkObservations(obsv, seed=42, p=1, objectId="ssObjectId", maxdt_minutes=90, minlen_arcsec=1, min_observations=2, window=14, nlink=3, night_start_utc_days=17.0 / 24.0)
CPU times: user 1.1 s, sys: 190 ms, total: 1.29 s
Wall time: 1.47 s
Comparison
If the two implementations match, there should be no assertions and an the dataframe of differences will be empty in the end.
[6]:
bf_res = pd.DataFrame({"bf_discoveryChances": bf})
md_res = pd.DataFrame(minidifi).set_index("ssObjectId")[["discoveryChances"]]
# assert all objects are there in both dataframes
assert np.all(bf_res.index.values == md_res.index.values)
print(f"Total number of objects: {len(bf_res)}")
# join the two dataframes
res = bf_res.join(md_res)
res["equal"] = res["bf_discoveryChances"] == res["discoveryChances"]
print(f"Total number of equal objects: {np.count_nonzero(res['equal'])}")
# find where they differ
print("Differences between miniDifi and brute force (note: empty is good!):")
res[~res["equal"]]
Total number of objects: 994
Total number of equal objects: 994
Differences between miniDifi and brute force (note: empty is good!):
[6]:
| bf_discoveryChances | discoveryChances | equal | |
|---|---|---|---|
| ssObjectId |