Push it down: Lack of optimization of UNION and GROUP BY VIEWS in MySQL
I just got an e-mail from a desperate MySQL user who stumbled across a discussion on the forums that I had some time ago with MySQL's Oleksandr Byelkin. The user hit the same problem I had (and have) when using VIEWs: Whenever MySQL has to use ALGORITHM=TEMPTABLE it's virtually impossible to use the VIEW on all but very small tables.
This issue affects mainly VIEWs that make use of a UNION or GROUP BY. The problem is that the MySQL optimizer currently doesn't push down WHERE conditions from outer queries into subqueries and thus also VIEWs (at least not in these cases, I don't know if there are other situations where a pushdown occurs). This means that before MySQL can start to solve the final outer query, it has to build a temporary table using all the rows of the underlying base table. You won't notice this on tables with just a few rows, but it becomes really bad on a table with around 100 million rows, especially if the actual outer query would only need to get a tiny subset of those.
The user asked me for help on this. I couldn't give him some really good advice, but from what I heard, the MySQL team should be working hard on further optimization of subqueries and VIEWs. I'll certainly take advantage of the MySQL Users Conference starting Monday and try to get some latest update on the plans and progress.
I understand that the algorithms to push down conditions into VIEWs are certainly non-trivial for all but the most basic cases (it starts to get tough when you use functions in the GROUP BY, for example). I know that other RDBMSs do optimize with pushdowns, but I currently don't know how their algorithms perform on more complicated queries (that's unfortunately the ones I usually hit).
The user asked me, how I currently work around the problem and what I would suggest to him. These are the things I mentioned:
- In one case I just added a hard-coded restriction right into the
VIEWto only process a small amount of all available rows. We have a table with approximately 100 million rows of time series of meteorological data. 90% of the applications only query the latest hours. So I just built a VIEW that queries only the last 12 hours (and builds it's temporary table from that), the application can then further restrict the range it needs in the outer query. That's reasonably quick. Of course the other 10% of the applications just can't use this VIEW for now (but I can remove the restriction when MySQL does a better optimization). It depends on your application if you can find a similar restriction - you could also use multiple views and your application switches dynamically between them. - Build sort of a materialized view yourself. Just create a real table that holds the result set from the
VIEW(rebuild this table every now and then), that way you can even make use of keys. This seems to be a good approach if your base table is strictly incremental: You can just add new rows to the materializedVIEWand don't have to work through all the old rows again (you could use a simpleINSERT INTO ... SELECTthat just processes rows that have been inserted into the base table since the last execution). It becomes a problem though, if you really have to process all the data again, as this could use too much time on huge tables. - Use a simple stored procedure instead of a VIEW and allow to control the range of data you want to select through the parameters. Of course you don't get the flexibility of a VIEW, as you can't easily use the result set from a stored procedure in further queries.
- Build a more sophisticated stored procedure with prepared statements or add an additional layer on the application level that build the final query dynamically. The application layer is a bad idea if you use different programming languages to access the same data, as you'd have to code this additional layer multiple times. The stored procedure seems like a good approach, but even with prepared statements you'll never achieve the flexibility of a real
VIEW. If you still did, this would actually mean that you implemented everything that should be in MySQL's query optimizer with nothing but pure SQL (I promise you my congratulations, but I'd also ask you why you didn't just add the code to MySQL itself).
You may find a lot of other strategies, but most of them will probably just be variations of the above. The best solution would still be that MySQL learns how to push conditions down into VIEWS that make use of ALGORITHM=TEMPTABLE as soon as possible.