Friday, April 13, 2007

pgsql COPY doesn't fire rules

I've got a very large database at $DAYJOB. Two and a half years worth of data takes up around 500GB. The largest table is around 80GB. I use postgresql, and with the right indexes, performance is very good. I've sort of become a specialist at database performance tuning since I've never had the benefit of working with anyone else who could avoid sequential scans on large tables.

So raw query or insert/update/delete performance isn't a problem. However, database management *is* a problem. On rare occasions, I need to vacuum a table, or many tables. That's very hard to do when the table is so large (I stopped the vacuum after 2 hours). It's the same thing when I have to backup a table. pg_dump -t takes a very long time to scan a very large table.

So I decided that I needed to partition the larger tables. Perhaps into per-month tables. My program to load data into the database uses COPY because that's the fastest way to load data into a table. Unfortunately, COPY doesn't fire rules or triggers, so the ON INSERT DO INSTEAD rules didn't fire and the data was still going into the base (parent) tables.

Just yesterday I was scanning through the pgsql-general list and saw Tom Lane say that COPY doesn't fire rules. It didn't strike me as significant at the time since I was thinking of something else. Tonight though, as I was working on it again and, again, finding that rows were going into the base table (or the whole transaction failing, since I decided to add a constraint on the parent so that no rows could insert into it at all), I finally put it together and knew why COPY wasn't doing what I wanted.

It was a simple thing to replace the COPY code with insert statements. Inserts will run more slowly now, (although I might get some performance gain by upgrading to 8.2 and using the new multi-row insert syntax) but the database will be much more maintainable. It'll probably run faster too since most queries will be against the most recent months, so indexes and whole tables for the most recent months will be more likely to fit in the OS cache/buffers and postgres' shared memory.

In any case, I'll keep the old non-partitioned database and the new partitioned database, inserting data into both. When I prove that the new partitioned database is stable and faster, I'll retire the old non-partitioned database.

My last concern is aesthetic. With new rules being added every month (and 2 or more years of data being loaded) a \d on a base table will yield a very ugly list of ON INSERT DO INSTEAD RULES. There's nothing to be done about that though, unless maybe I move the parent tables into a schema and replace them in the public schema with views :-).

No comments: