Posts Tagged ‘Queries’

Optimizing differences

Friday, March 19th, 2010

I would like to use this installment of my blog to show some differences between VFP and MYSQL on how problems need to be solved. On the top it seems to me that the MYSQL optimizer needs a quite different strategy and some more help then the “rushmore” one in VFP. There is probably some room for improvements left in both approaches but I usually stop optimizing when I get under 2 seconds on getting results.

The task was to roll back the AR aging to a prior date. Basically a standard business problem. Accounting comes, looks at last months aging, makes some adjustments and then wants to rerun the aging, or the bank wants it as of a particular date or things just went nuts and everyone was busy with … whatevever …. and did not get to run the aging at the proper time. So to handle this a query was devised to back out all invoices and applications as to a certain date.

In the late 80’s the customer got and SBT accounting system and that was then modified and modernized over the decades so the  invoices, credits and payments are  stored in the invoice master file. fields if interest: Invamt, Paidamt, Invdte, Balance, custno, salesmn etc. Cash and credit applications are in a file called cc_apply in VFP the problem was solved with a query that ran into a little under 2 seconds and looked like this

SELECT a.custno, ;
a.invno AS invno, ;
a.invdte AS invdte, ;
a.balance + NVL((SELECT SUM(f.apply + f.disc) ;
FROM “accounts receivables!cc_apply” f;
WHERE f.invno = a.invno AND TTOD(f.appdate) > ld_date), 0) ;
- NVL((SELECT SUM(f.apply) ;
FROM “accounts receivables!cc_apply” f;
WHERE f.ccati = a.invno AND TTOD(f.appdate) > ld_date), 0) AS balance, ;
a.paidamt – NVL((SELECT SUM(f.apply + f.disc) ;
FROM “accounts receivables!cc_apply” f;
WHERE f.invno = a.invno AND TTOD(f.appdate) > ld_date), 0) AS paidamt, ;
a.invamt – NVL((SELECT SUM(f.apply) ;
FROM “accounts receivables!cc_apply” f;
WHERE f.ccati = a.invno AND TTOD(f.appdate) > ld_date), 0) AS invamt ;
FROM “accounts receivables!invoice master” a, customer\customer cu, ;
(SELECT DISTINCT e.invno as invno FROM (SELECT d.invno as invno ;
FROM “accounts receivables!invoice master” d ;
WHERE d.invdte <= ld_date AND d.balance # 0 AND !d.arstat $ “Vv” ;
UNION SELECT d.invno as invno ;
FROM “accounts receivables!invoice master” d ,;
“accounts receivables!cc_apply” b ;
WHERE d.invno = b.invno AND d.invdte <= ld_date AND d.balance = 0 AND !d.arstat $ “Vv” ; AND TTOD(b.appdate) > ld_date ;
AND (b.apply # 0 OR b.disc # 0) ;
UNION SELECT d.invno as invno ;
FROM “accounts receivables!invoice master” d ,;
“accounts receivables!cc_apply” b ;
WHERE d.invno = b.ccati AND d.invdte <= ld_date AND d.balance = 0 AND !d.arstat $ “Vv” ; AND TTOD(b.appdate) > ld_date ;
AND (b.apply # 0 OR b.disc # 0)) e) c ;
WHERE a.invno = c.invno AND a.custno = cu.custno AND !a.arstat $ “Vv” AND &lc_filter;
ORDER BY a.custno, a.invdte, a.invno ;
INTO CURSOR c_arorec NOFILTER READWRITE

&lc_filter was used to filter out some other parameters like sales rep, state, customer etc.

The query was ported into MYSQL, NVL() was replaced with COALESCE()  and the statement that took about 2 seconds under VFP took a little over 50 in MYSQL. So a different approach was needed. MYSQL, even with hints seemed to have problems using the right indexes or come up with a reasonable plan as the EXPLAIN statement showed. Even though VFP (Visual foxpro) does not have a feature to show you exactly what it is doing, on a proper written query it will do most the selecting inside the indexes if possible as that is cheaper. The bottle neck in VFP is usually the network. As the processing gets done at the workstation you want to cut he total # of bytes sent to the workstation for processing down.

In  MYSQL its disk reads. I could not convince the optimizer to do things in the sequence I wanted on that query so I had no choice than the break that query and spoon feed it to the optimizer. From past trials it appears to me that the way to go with MYSQL is to cut down rowcount and disk read/writes. So here is the way I got MYSQL to solve the above problem in under 2 seconds.

For starters to cut the row count down as early as possible I decided to employ some level of denormalisation and put the custno, salesman, state,invdte, dtepaid into the cc_apply file. I keep that in sync in the INSERT and update triggers and with foreign keys. This puts a little overhead on the INSERT and UPDATE statement but pays of big time at the time of data SELECTION. Furthermore, as we were going to do this in multiple SQL statement we wrap those statement into a stored procedure to handle our parameters and selections like

CREATE DEFINER=`whoever`@`%` PROCEDURE `arorec`(v_date DATE,
v_state varchar(2), v_salesmn varchar(2), v_custno varchar(6), v_sum TINYINT)
READS SQL DATA
DETERMINISTIC
BEGIN
SET @v_date = v_date;
SET @v_custno = v_custno;
SET @v_state = v_state;
SET @v_salesmn = v_salesmn;
SET @v_sum = v_sum; — summary or not
/*Note need to use tempres and tempcr as you cannot use temp table(tempsel) 2x in a union*/
DROP TABLE IF EXISTS tempsel;
DROP TABLE IF EXISTS tempres;
DROP TABLE IF EXISTS tempcr;

SO for starters we needed all the invoices that were created before the cut of date and that had moneys applied to them after the cut of date. As the resulting set is definitly way below the available ram on the target server we will put this resutl set into a temporary table in memory with

IF @v_custno IS null AND @v_state IS NULL AND @v_salesmn IS NULL THEN
/* GET EVERYONE*/
PREPARE stmt1 FROM ”
CREATE TEMPORARY TABLE tempsel ENGINE = MEMORY SELECT invno, Sum(apply) as apply, SUM(disc) AS disc
FROM cc_apply WHERE invdte <= ? AND appdate > ? GROUP BY invno” ;
EXECUTE stmt1 USING @v_date, @v_date;
ELSEIF @v_custno IS NOT NULL AND @v_salesmn IS NULL THEN
/* State is irrelevant as each customer has one address customere only*/
…….The selections for other criteria

Next we had to join this with the invoice and CM header records in the invoice master table plus add all the invoices that were created the cutoff date, are still open and did not have anything applied to them. Normally we could do that in a UNION, but MYSQL does not let you use a UNION on a TEMPORARY TABLE so we use another temporary table like this

PREPARE stmt1 FROM ”
CREATE TEMPORARY TABLE tempres ENGINE = MEMORY
SELECT ma.custno, ma.ticket AS refno, ma.dtepaid, ma.artype, ma.ponum AS checkno, 0000000000.00 AS future,
ma.invno, ma.balance + te.apply – te.disc AS BALANCE, ma.entered, ma.invdte, ma.invamt, ma.paidamt, getpd(@v_date, ma.invdte) AS gp
FROM `invoice master` ma INNER JOIN tempsel te ON ma.invno = te.invno
WHERE ma.invdte <= ? AND artype IN (‘ ‘, ‘I’, ‘C’, ‘W’)”;
EXECUTE stmt1 USING @v_date;
DEALLOCATE PREPARE stmt1;
ALTER TABLE tempres ADD INDEX custno(custno);
/* UNION of invoices still open without anything applied and futures*/

IF @v_custno IS null AND @v_state IS NULL AND @v_salesmn IS NULL THEN
Prepare stmt1 FROM ”
INSERT INTO tempres
SELECT ma.custno, ma.ticket AS refno, ma.dtepaid, ma.artype, ma.ponum AS checkno, 0000000000.00 AS future,
ma.invno, ma.balance, ma.entered, ma.invdte, ma.invamt, ma.paidamt, getpd(@v_date, ma.invdte) AS gp
FROM `invoice master` ma LEFt JOIN tempsel t2 ON ma.invno = t2.invno
WHERE t2.invno IS NULL and ma.open = 1 and ma.invdte <= ? AND artype IN (‘ ‘, ‘I’, ‘C’, ‘W’) UNION SELECT ma.custno, ma.ticket AS refno, ma.dtepaid, ma.artype, ma.ponum AS checkno, ma.invamt AS future, ma.invno, ma.balance + COALESCE(cc.apply, 0) – COALESCE(cc.disc, 0) AS BALANCE, ma.entered, ma.invdte, ma.invamt, ma.paidamt, getpd(@v_date, ma.invdte) AS gp FROM `invoice master` ma LEFT JOIN cc_apply cc ON ma.invno = cc.invno WHERE ma.invdte > ? AND ma.entered <= ? AND artype IN (‘ ‘, ‘I’, ‘C’, ‘W’)”;
EXECUTE stmt1 USING @v_date, @v_date, @v_date;
ELSEIF @v_custno IS NOT NULL AND @v_salesmn IS NULL THEN
……. The other statements for selection by salesmn, state, etc

We then do the same for the payments and end up with a temporary table “TEMPRES” that holds all the records needed for our aging.

In VFP we then used “SET RELATION TO” to get access to some other relevant fields for our reporting but it is much more efficient to have the MYSQL server send all the data in one result set so we are doing this in the final SELECT statement that looks something like this

SELECT te.*, cu.email, cu.statml, cu.statfmt, cu.company, cu.phone, cu.faxno, cu.contact, cu.address1,
cu.address2, cu.city, cu.state AS state, cu.zip, cu.pterms, cu.lastpay AS lastpay, cu.limit,
cu.lpymt, cu.ldate AS ldate, cu.lsale, cu.inactive AS inactive,
IF(te.gp=’C',te.invamt – te.paidamt,0.00) AS current,
IF(te.gp=’1′,te.invamt – te.paidamt,0.00) AS pd1,
IF(te.gp=’2′,te.invamt – te.paidamt,0.00) AS pd2,
IF(te.gp=’3′,te.invamt – te.paidamt,0.00) AS pd3,
IF(te.gp=’3′,te.invamt – te.paidamt,0.00) AS pd4,
IF(te.gp=’O',te.invamt – te.paidamt,0.00) AS pdover
FROM tempres te INNER JOIN customer cu ON te.custno = cu.custno
ORDER by te.custno ASC, te.invdte ASC, te.invno ASC;

The getpd() is a stored function that determines based on parameters stored in another table if we want the output in 30/60/90/120, 7/14/28/35. based on the end of the week, end of month or any date in between to allow full flexibility in the aging output.

For those interested.  On a table with about 500,000 invoices records, 300,000 records in the application table, on my development server on my laptop the whole SP takes 1.4 seconds in the worst case scenario. Granted the complete stored procedure with formattin as shown above is over 300 lines long with many statements as compared to one straight forward SQL statement in VFP. But the end result is on par with VFP with the added benefit to provide improved results over slow connections over the internet

Keeping save – Injection attacks etc.

Sunday, March 7th, 2010

I had someone point out to me that some of the sample code I have used might be vulnerable to an insertion attack. So I feel it is time to address this subject.

First when using examples I try to just focus on the subject of the current blog and often strip a lot of things that are irrelevant to the point I am trying to make out. But now to the subject of keeping your code save.

If you write and SQL statement looking something like

lc_sql = “SELECT * from customer WHERE custno = ‘” + lc_custno + “‘”

Someone could enter “‘; TRUNCATE TABLE customer;’” resulting in  a SQL statement looking like

SELECT * from customer WHERE custno = ”; TRUNCATE TABLE customer; ”

Which could be quite disasterous. So to avoid this we could use parameterized queries which I cannot as I am using a seperate function to wrap around the SQLEXEC() function to do all kinds of things like connection maintenance, Error handling, Logging and even in some cases sending an email to the IT when certain conditions arise. So the above mentioned problem needs to be addressed when creating the sql statement.

The first line of protection is to properly escape any string I send. The table below is copied from the MYSQL manual showing what needs escaping

\0 An ASCII NUL (0x00) character.
\' A single quote (“'”) character.
\" A double quote (“"”) character.
\b A backspace character.
\n A newline (linefeed) character.
\r A carriage return character.
\t A tab character.
\Z ASCII 26 (Control-Z). See note following the table.
\\ A backslash (“\”) character.
\% A “%” character. See note following the table.
\_ A “_” character. See note following the table.

So I wrote a vfp function called addslashes()

FUNCTION Addslashes

LPARAMETERS lc_str, ll_notrim

lc_str = STRTRAN(lc_str,”\”, “\\”)

lc_str = STRTRAN(lc_str,”‘”, “\’”)

lc_str = STRTRAN(lc_str,’”‘, ‘\”‘)

lc_str = STRTRAN(lc_str,’%', ‘\%’)

lc_str = STRTRAN(lc_str,’_', ‘\_’)

lc_str = STRTRAN(lc_str,CHR(0), ‘\0′)

lc_str = STRTRAN(lc_str,CHR(10), ‘\n’)

lc_str = STRTRAN(lc_str,CHR(13), ‘\r’)

lc_str = STRTRAN(lc_str,CHR(26), ‘\z’)

lc_str = STRTRAN(lc_str,CHR(9), ‘\t’)

lc_str = STRTRAN(lc_str,CHR(8), ‘\b’)

IF !ll_notrim THEN

      lc_str = RTRIM(lc_Str)

ENDIF

RETURN lc_str

I could Have used a FOR … NEXT Loop with a CASE statement and loop through the string but found that that was slower than calling the STRTRAN() function multiple times. So now we run every string through this routine and the above SQL statement becomes

lc_sql = “SELECT * FROM customer WHERE custno = ‘” + ADDSLASHES(lc_custno) + “‘”

And the above attempt to mess with the system will produce an sql now looking like

SELECT * from customer WHERE custno = ‘\’; TRUNCATE TABLE customer; \”

Which would just return an empty record set.

This will also allow me now to insert and retrieve string like “John’s Barber Shop” or multi line entries. Furthermore I made it a rule to emulate strong typing when it comes to SQL statements. Simply to to prevent a “Garbage in”/”Garbage out” Scenario. So Each textget has its validation routine and is bound to a local cursor. VFP allows you to change the type of a variable by simple assigning some different type of data to it but fields in a cursor are of a fixed type. This also allows us to use functions to build the SQL statements based on the underlying cursor. So every string and Memofield gets treated with ADDSLASHES() each numeric with STR(), Each date field WITH DTOS() to make a MYSQL valid date string etc.

At this point some might say it looks like a lot of work and some of this is replicating what VFP does through remote views. So why use SQL Pass through and not remote views? The single aswer is performance. This way I can control what to send when and how and avoid data being sent unnecessarily or at inoportune times.

Converting and optimizing SQL Statements

Saturday, March 6th, 2010

Now it became time to conveert some SQL statements. Even though many SELECT SQL statements will run unchanged in MYSQL it might be wise to check and see if they are performing well. There are some differences in how VFP and MYSQL optimize queries and a query that just performs great in VFP might be very slow in MYSQL. And then there are some cases where we have to replace things that are possible in VFP but not supported in MYSQL.

First lets assume that our tables all have lots of records. After all it does not pay to spend time on optimizing on tables with a few dozend records. I also will not go into VFP optimization here as this is a MYSQL blog.

Lets assume for our first case we might have a table with invoice detail records and a table with items that are on sale. So now we want to get maybe 2 queries. One to show a total of all the items that were on sale and the second a total of all the items not on sale. I will leave date selections etc out of this as I just want to focus in on some performance issues

In VFP we might have had something like

SELECT <fieldlist> FROM detail det WHERE INDEXSEEK(det.item, .f., “promotions”, “item”)

SELECT <fieldlist> FROM detail det WHERE .NOT. INDEXSEEK(det.item, .f., “promotions”, “item”)

respectively depending which set we want. In VFP it was quite fine to use and INDEXSEEK in the Where clause. I sometimes used it in place of a join as it seemed to work faster and in previous version of VFP was a way to get around limits in the number of JOINS and the complexity of a query vfp supports.

In MYSQL now you could replace the INDEXSEEK with an IN or EXISTS(SELECT….) Subquery  so the query rewritten will look something like

SELECT <fieldlist> FROM detail det WHERE EXISTS (SELECT * FROM promotions p WHERE p.item = det.item)

Or the “NOT EXISTS” version for the oposite result. You could also use IN but in some cases that might even be more inefficient as IN retrieves the whole matching subset and EXISTS stops at the first occurance.

If we run the above query through “EXPLAIN” to see what MYSQL is doing we will see that on ther “SELECT <fieldlist> FROM detail det” it will perform a table scan and  then the item  index of the promotions table. So having a large “details” table that has also an “item” index this is not very efficient. In MYSQL the way to improve on that is to rewrite the query as a join. like

SELECT <field list> FROM detail det JOIN promotions p ON det.item = p.item

In this case MYSQL will use both item indexes and only retrieve the items matching. It gets a little bit more complicated with the case where we want to select items that have no promotions but we can do a join there too with

SELECT <field list> FROM detail det LEFT JOIN promotions p ON det.item = p.item WHER p.item = NULL

This one again will make the selection on the index level and depending on the amount of records you are dealing with result in great performance gains.

Lets now look at another scenario. Lets assume we have a invoice master table with  as “balance” field and we want to filter out as part of a query all records that are open so we are looking for “BALANCE <> 0″ In vfp we created an index on balance  and that improved the statement

SELECT <fieldlist> FROM master WHERE balance <> 0

drastically. Now if we do the same in mysql and we check with

EXPLAIN SELECT <fieldlist> FROM master WHERE balance <> 0

We will find that MYSQL is doing a table scan. MYSQL seems to have a problem with “<> <keyfield>”

You can get a slight improvement by

SELECT <fieldlist> FROM master WHERE balance < UNION SELECT <fieldlist> FROM master WHERE balance > 0

As it seems MYSQL has an easier time using the index with just the > and the < and a UNION but you are running into unnecessary overhead there too. What I have found most efficient so far is to add a field to the table

unpaid BIT

And then create an index on this field and set it in the BEFORE UPDATE and BEFORE INSERT triggers with

IF NEW.balance = 0 THEN SET NEW.unpaid = 0; ELSE SET NEW.unpaid = 1; END IF;

THE sql statement now looks like

SELECT <fieldlist> FROM master WHERE unpaid = 1

I used the BIT type field as it nicely translates into a VFP logical field with .f. for 0 and .t. FOR 1. But you probably can achieve the same by using an TINYINT or CHR(1)  or BOOLEAN field.

The performance improvement was drastic.

Beyond this it is always good to remember to

1.) make sure you have indexes on the fields in the “WHERE” clause and they are being used. USE “EXPLAIN ….” in the MYSQL query editor to find out.

2.) keep the row count to a minimum especially if you are using subqueries. So it often makes sense to first cut the number of records down with a subquery. I will go into detail on this in a seperate blog

As always comments/suggestions are welcome as this is all work in progress