Recursive Goose: February 2013

I’m deleting the gigabytes of uncompressed visualizations I produced during the Observing Dark Worlds competition on Kaggle.com to help to understand the problem, and compressing the few images I’ll keep.

The competition consisted in finding the center of dark matter halos from detecting how they affected the ellipticities of galaxies between us and the halos. Observed galaxies have random ellipticities when no halo is present and this additive noise complicates the problem.

Here is a typical picture, and one of the simplest visualizations, showing the amount of tangential ellipticity for every position in sky 58:

Other images show how well a particular model fits for each pixel of a high resolution image. This means that the optimizer has to evaluate the model multiple times for each pixel and the tangential and cross-tangential ellipticities of all galaxies also have to be re-computed for each pixel (720,000,000 evaluations for a 1200 x 1200 image and a sky with 500 galaxies).

This was quite slow initially, with most of the time spent evaluating the trigonometric functions used to compute the two ellipticity components. Changing the number representation from double to float gave a small improvement, but it was still slow. There are two related functions involved:

Cos2Atan

This seemed like a good excuse to try out Mathematica’s MiniMaxApproximation function in the NumericalMath`Approximations` package.

After some experiments, I ended up with a formula approximating these functions within about 0.001 by using a 4th degree polynomial and the following code:

double Cos2ArcTanApprox(double x)
{
return 0.9989279587792748 + x * (0.06351399783942091 + x * (- 2.6025942588287068 + x * (1.9881034919631744 - x * 0.44697660682523027)));
}

void SinCos2ArcTan(double xHalo, double yHalo, GalaxyInfo& info, double *pCos, double *pSin)
{
double dx = info.x - xHalo;
double dy = info.y - yHalo;

    double adx = fabs(dx);
    double ady = fabs(dy);
    double c;
    if (adx > ady)
    {
        c = Cos2ArcTanApprox(ady / adx);
    }
    else
    {
        c = -Cos2ArcTanApprox(adx / ady);
    }

    double s = sqrt(1 - c * c);
    if (dx * dy < 0)
        s = -s;

*pCos = c;
*pSin = s;
}

This method was twice as fast compared to calling sin(), cos() and atan(), and together with multithreading allowed to get fresh visualizations for all skies of the ideas I explored in a reasonable amount of time.

To find the positions of the halos, I tried a number of variants of the following method:

Use a model where the halo adds to the tangential ellipticity of a galaxy in the following measure, depending on their mutual distance d:

The model has two parameters A and B (halo intensity and scale) and I also got good results by using a single parameter, i.e. by using a formula inferred from the training data that roughly linked the values of A and B.
Optionally model strong lensing effects, where the tangential ellipticity of a galaxy very close to a halo may be strongly negative rather than strongly positive.
Find the best values for the (x, y, A, B) parameters for a halo by starting with fixed reasonable values and optimizing with the Nelder-Mead method.
The function to maximize is the likelihood of the parameters, assuming that halos have an additive effects on tangential ellipticities and that when there are no halos the ellipticities have a random gaussian distribution with zero mean and standard deviation of 0.21 (it can be measured in the single-halo training skies).
Both tangential and cross-tangential ellipticities are used.
To avoid local minima close to isolated galaxies with strong ellipticity, start the process multiple times from each of the intersections of a 10 x 10 grid.
After a halo position, intensity and scale are found, subtract its effects from the ellipticities of all galaxies and iterate to find more halos.

The next picture shows the likelihood function for sky 107. A strong halo is visible in the left image. After halo subtraction and re-scaling, a second weaker halo becomes visible (right image):

I tried different models for the decay of the halo effect, but multiple functions gave similar results and noise made it hard to compare them, so I just chose the simplest formula that gave good results (the one shown above). In any case, for the single halo skies the exact function didn’t seem to matter much, provided that it was not too different from the function above.

When subtracting the halos, the ability to correctly identify more halos was affected by how close the formula came to the actual (unknown) model. To avoid having to consider a wide variety of formulas, I created a model that can approximate many curves: take the logarithm of both the ellipticities and distances and find the 3rd degree polynomial that best interpolates the transformed tangential ellipticities for all galaxies.

When looking for (x, y), the simple model with just two parameters (A and B) was used, otherwise with a polynomial the optimizer ended up overfitting. The polynomial fit was only done after fixing the halo coordinates. With four model parameters instead of two, it could approximate more closely a variety of possible models, hopefully improving the quality of the data after the halo subtraction.

The position found for the first halo is slightly skewed by the presence of other halos, but attempts to simultaneously optimize the positions of all the halos resulted in overfitting in my experiments.

Trying an elliptical simmetry for the halo effects also resulted in overfitting, although it might have been a good starting point to detect cases where there are two halos close to each other.

Ending on a lighter note, here are two more visualization that go in a mildly more artistic direction.
The first one shows the contribution of each galaxy to the tangential ellipticity (each circle is centered on a galaxy):

The visualization for a failed model inspired my six year old daughter to contribute Smile :

Recursive Goose

Monday, February 4, 2013

Views from the Dark Worlds