[Info-ingres] stats on tables with more than 1 million rows

douglas.inkster at gmail.com
Fri Jan 27 21:56:46 UTC 2017


Yeah - I guess that's a perennial problem when sampling is used (which it is under the covers, given the implicit sampling you're talking about). This is actually a chronic issue in query optimization: extrapolating from a sample to estimate the number of unique values in the whole table. We use a statistical heuristic that I dug out of a paper somewhere to do this (the jackknife estimation technique, I'm sure you'll be glad to know). But it is just a heuristic, and it isn't always very accurate.
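For the curious, here is a minimal sketch in Python of a first-order jackknife estimator in that spirit. The exact variant optimizedb uses isn't shown in this thread, so the formula and names below are illustrative assumptions, not the actual implementation:

    from collections import Counter

    def jackknife_distinct_estimate(sample, table_rows):
        """Estimate the number of distinct values in a table of
        table_rows rows from a uniform row sample.

        This is the "unsmoothed first-order jackknife" form described
        by Haas et al. (VLDB 1995); Ingres's variant may differ.
        """
        n = len(sample)
        if n == 0:
            return 0
        counts = Counter(sample)
        d = len(counts)                                  # distinct values in the sample
        f1 = sum(1 for c in counts.values() if c == 1)   # values seen exactly once
        # Scale d up according to how many singletons the sample holds:
        # many once-seen values suggest many more distinct values unseen.
        denom = 1.0 - ((table_rows - n) / table_rows) * (f1 / n)
        return d / denom

The denominator shrinks as f1 approaches n, so for high-cardinality, low-skew columns the estimate becomes very sensitive to the particular sample drawn - which is exactly where it "isn't always very accurate".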

In a case like this, where the sample clearly contains all unique values, it would be pretty trivial to infer that the column is in all likelihood distinct across the whole table. I can easily do something like that.
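Concretely, a check along these lines (hypothetical helper, just to make the idea explicit) would do:

    def sample_is_all_unique(sample):
        # If every sampled value is distinct, the safer inference is that
        # the column is unique (or very nearly so) across the whole table,
        # rather than trusting an extrapolated per-value count like 1.6.
        return len(set(sample)) == len(sample)

Worth noting that the jackknife form sketched above already degenerates correctly in this case: with d == n and f1 == n it returns n / (n / table_rows) == table_rows, i.e. a fully distinct column.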

Doug.

On Friday, January 27, 2017 at 8:32:08 AM UTC-5, Martin Bowes wrote:
> Hi All,
> 
> It appears that optimizedb automatically switches to sampling once a table breaks 1 million rows, which is cool.
> 
> But now I have a table with a unique key which the stats insist is non-unique, with an average count per value of 1.6.
> 
> Is this a problem?
> 
> Martin Bowes



