[Info-ingres] stats on tables with more than 1million rows

Wed Feb 1 08:13:46 UTC 2017

Cool, thanks for the tip, I'll change to -zns.

Marty

-----Original Message-----
From: douglas.inkster at gmail.com [mailto:douglas.inkster at gmail.com] 
Sent: 31 January 2017 19:59
To: info-ingres at lists.planetingres.org
Subject: Re: [Info-ingres] stats on tables with more than 1million rows

On Tuesday, January 31, 2017 at 10:12:04 AM UTC-5, Laframboise, André (BAC/LAC) wrote:
> Yep, according to my old 2006 ref manual. At that time, the default was 500,000 rows.
> 
> -zns
> 
> Disables the default behavior of creating histograms from a maximum 500,000 row sample. Using this parameter assures that all rows are read from a table during the histogram building process.
> 
> 
> 
> André Laframboise
> 
> Conseiller Principal Base de Données, Direction générale de 
> l'Innovation et du Dirigeant principal de l'information Bibliothèque 
> et Archives Canada / Gouvernement du Canada 
> andre.laframboise at canada.ca / Tél. : 613-298-1346
> 
> Senior Database Advisor, Innovation and Chief Information Officer 
> Branch Library and Archives Canada / Government of Canada 
> andre.laframboise at canada.ca / Tel: 613-298-1346
> 
> -----Original Message-----
> From: info-ingres-bounces at lists.planetingres.org 
> [mailto:info-ingres-bounces at lists.planetingres.org] On Behalf Of Karl 
> and Betty Schendel
> Sent: Tuesday, January 31, 2017 10:07 AM
> To: info-ingres at lists.planetingres.org
> Subject: Re: [Info-ingres] stats on tables with more than 1million 
> rows
> 
> -zns turns off sampling, if I am remembering it rightly.
> 
> I don't know how convenient it is for optimizedb to look at the config file, but it's an idea worth looking into.
> 
> Karl
> 
> 
> > On Jan 31, 2017, at 2:52 AM, Martin Bowes <martin.bowes at ndph.ox.ac.uk> wrote:
> > 
> > That sounds like a good idea.
> > 
> > Could we also get a configuration option to set the sampling threshold rather than have it fixed at 1million rows?
> > 
> > Also the optimizedb command has no clear way to turn off sampling, I just set -zs100 and hoped for the best. A clearer command switch or a better documented one would be useful.
> > 
> > Marty
> > 
> > -----Original Message-----
> > From: Karl and Betty Schendel [mailto:schendel at kbcomputer.com]
> > Sent: 30 January 2017 16:29
> > To: Ingres and related product discussion forum
> > Subject: Re: [Info-ingres] stats on tables with more than 1million 
> > rows
> > 
> > If the column is constrained in some way to be unique, then sampling certainly should not assume otherwise (unless the sample gets non-unique values!).  I don't think optimizedb or create stats makes this inference at present, and it should.  That is something we can take a look at.
> > 
> > Karl
> > 
> > 
> >> On Jan 30, 2017, at 3:41 AM, Martin Bowes <martin.bowes at ndph.ox.ac.uk> wrote:
> >> 
> >> Yes it's a single column unique primary key. So I would expect a count 1.
> >> 
> >> Marty
> >> 
> >> -----Original Message-----
> >> From: Paul White [mailto:paul.white at shift7solutions.com.au]
> >> Sent: 28 January 2017 06:17
> >> To: 'Ingres and related product discussion forum'
> >> Subject: Re: [Info-ingres] stats on tables with more than 1million 
> >> rows
> >> 
> >> Hi Marty and Karl,
> >> 
> >> If there is just a single column as unique primary key, the count needs to be 1, yes?
> >> 
> >> Paul
> >> 
> >> -----Original Message-----
> >> From: info-ingres-bounces at lists.planetingres.org
> >> [mailto:info-ingres-bounces at lists.planetingres.org] On Behalf Of 
> >> Karl and Betty Schendel
> >> Sent: Saturday, 28 January 2017 3:15 PM
> >> To: Ingres and related product discussion forum
> >> Subject: Re: [Info-ingres] stats on tables with more than 1million 
> >> rows
> >> 
> >> Problem is, if you've sampled, you might have simply managed to get unique values.  I can easily imagine a table with values like 1, 1, 2, 2, 3, 3, 4, 4, getting only one of the two values when sampling, thus the sample would look unique but the true count per value is 2.
> >> 
> >> I don't know if there is a good answer, other than not sampling.
> >> 
> >> _______________________________________________
> >> Info-ingres mailing list
> >> Info-ingres at lists.planetingres.org
> >> http://lists.planetingres.org/mailman/listinfo/info-ingres
> >> _______________________________________________
> >> Info-ingres mailing list
> >> Info-ingres at lists.planetingres.org
> >> http://lists.planetingres.org/mailman/listinfo/info-ingres
> > 
> > _______________________________________________
> > Info-ingres mailing list
> > Info-ingres at lists.planetingres.org
> > http://lists.planetingres.org/mailman/listinfo/info-ingres
> 
> _______________________________________________
> Info-ingres mailing list
> Info-ingres at lists.planetingres.org
> http://lists.planetingres.org/mailman/listinfo/info-ingres

Back to Karl's comments - the Ingres catalog is pretty difficult when it comes to determining column uniqueness. It's actually easier to do in the "create statistics" code than in optimizedb. His example of a table  with (1, 1, 2, 2, 3, 3, ...) is valid, however, given the manner in which we do sampling, I'd say it is overwhelmingly likely that at least one pair of the same values will show up, even in a small sample. So my original suggestion still has merit, I'd say, in particular since it is so easy to implement. I guess the 2 of us should take it offline for further discussion. I will say that it has always been a bit embarrassing that we ignore uniqueness when computing selectivity estimates.
_______________________________________________
Info-ingres mailing list
Info-ingres at lists.planetingres.org
http://lists.planetingres.org/mailman/listinfo/info-ingres