[Info-ingres] stats on tables with more than 1 million rows
Karl and Betty Schendel
schendel at kbcomputer.com
Sat Jan 28 05:15:19 UTC 2017
The problem is, if you've sampled, you might simply have managed to pick up only unique values. I can
easily imagine a table with values like 1, 1, 2, 2, 3, 3, 4, 4 where the sample catches only one of each
pair, so the sample looks unique even though the true count per value is 2.
I don't know if there is a good answer, other than not sampling.
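A quick way to see the effect (a standalone Python sketch, not anything optimizedb actually does; the
2,000,000-row column and 5% sample rate are made up purely for illustration):

    import random

    # Column where every value appears exactly twice (2,000,000 rows in all).
    column = [v for v in range(1_000_000) for _ in range(2)]

    # Take a 5% sample, roughly the way a sampling-based stats collector might.
    sample = random.sample(column, k=len(column) // 20)

    distinct_in_sample = len(set(sample))
    rows_per_value = len(sample) / distinct_in_sample

    # At 5%, almost every sampled value shows up just once, so the sample
    # looks nearly unique even though the true count per value is 2.
    print(distinct_in_sample, round(rows_per_value, 3))

Martin's case is the mirror image: his sample really is unique, but extrapolating the sample's distinct
count back up to the full table is presumably where the 1.6 comes from.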
> On Jan 27, 2017, at 11:56 PM, douglas.inkster at gmail.com wrote:
>
> Yeah - I guess that's a forever problem when sampling is used (which it is under the covers, given the implicit sampling you're talking about). This is actually a chronic issue in query optimization - trying to extrapolate from a sample to figure out the number of unique values in the whole table. We use a statistical heuristic that I dug out of a paper somewhere to do this (the jackknife estimation technique, I'm sure you'll be glad to know). But it is just that - a heuristic - and isn't always very accurate.
>
> In a case like this where the sample clearly contains all unique values, I suppose it would be pretty trivial to intuit that the column is in all likelihood distinct for the whole table. I can easily do something like that.
>
> Doug.
>
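For reference, one textbook form of the first-order jackknife estimate of the number of distinct values
is D = d + f1 * (n - 1) / n, where d is the number of distinct values in the sample, n the sample size,
and f1 the number of values seen exactly once; whether optimizedb uses exactly this form I can't say.
A self-contained Python sketch on the same toy column as above:

    import random
    from collections import Counter

    # Toy column where every value appears twice, sampled at 5%.
    column = [v for v in range(1_000_000) for _ in range(2)]
    sample = random.sample(column, k=len(column) // 20)

    counts = Counter(sample)                          # value -> occurrences in the sample
    d = len(counts)                                   # distinct values seen
    f1 = sum(1 for c in counts.values() if c == 1)    # values seen exactly once
    n = len(sample)

    # First-order jackknife: add back an estimate of the values the sample
    # missed, driven by how many singletons it saw.  It is only a heuristic;
    # here the true distinct count is 1,000,000 and the estimate falls well
    # short, which is Doug's "isn't always very accurate" point.
    d_hat = d + f1 * (n - 1) / n
    print(round(d_hat))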
> On Friday, January 27, 2017 at 8:32:08 AM UTC-5, Martin Bowes wrote:
>> Hi All,
>>
>> It appears that optimizedb automatically switches to sampling once a table breaks 1 million rows, which is cool.
>>
>> But now I have a table with a unique key which the stats insist is non-unique, with an average count per value of 1.6.
>>
>> Is this a problem?
>>
>> Martin Bowes
>>
>
> _______________________________________________
> Info-ingres mailing list
> Info-ingres at lists.planetingres.org
> http://lists.planetingres.org/mailman/listinfo/info-ingres