Sunday, August 21, 2005

PCTFREE on Pg

I've done a quick hack that reserves some space in each page so the same page can be reused when a tuple is updated. It's similar to Oracle's PCTFREE. heap_update() can use the reserved free space on the same page, but heap_insert() cannot.
I reserved 1024 bytes in each page, and it improves the pgbench score by 10% or more on my box.
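Roughly, the idea looks like the sketch below. The names PCTFREE_RESERVE and page_has_room_for are made up for illustration; this is not the actual patch.

#include <stdbool.h>
#include <stddef.h>

#define PCTFREE_RESERVE 1024    /* bytes held back in every page */

/*
 * Hypothetical helper: may a tuple of tuplen bytes go onto a page that
 * currently has freespace bytes free?  heap_update() is allowed to dip
 * into the reserve, heap_insert() is not, so updated tuples tend to
 * stay on their original page.
 */
static bool
page_has_room_for(size_t freespace, size_t tuplen, bool is_update)
{
    if (is_update)
        return freespace >= tuplen;                   /* may use the reserve */
    return freespace >= tuplen + PCTFREE_RESERVE;     /* must keep it free */
}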

-------------- normal -------------------
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 10
number of clients: 16
number of transactions per client: 1000
number of transactions actually processed: 16000/16000
tps = 60.434280 (including connections establishing)
tps = 60.461362 (excluding connections establishing)
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 10
number of clients: 16
number of transactions per client: 1000
number of transactions actually processed: 16000/16000
tps = 57.350695 (including connections establishing)
tps = 57.371082 (excluding connections establishing)
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 10
number of clients: 16
number of transactions per client: 1000
number of transactions actually processed: 16000/16000
tps = 55.928602 (including connections establishing)
tps = 55.951602 (excluding connections establishing)

-------------- pctfree -------------------
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 10
number of clients: 16
number of transactions per client: 1000
number of transactions actually processed: 16000/16000
tps = 67.552621 (including connections establishing)
tps = 67.586533 (excluding connections establishing)
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 10
number of clients: 16
number of transactions per client: 1000
number of transactions actually processed: 16000/16000
tps = 69.260082 (including connections establishing)
tps = 69.290771 (excluding connections establishing)
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 10
number of clients: 16
number of transactions per client: 1000
number of transactions actually processed: 16000/16000
tps = 69.466603 (including connections establishing)
tps = 69.497294 (excluding connections establishing)

Sunday, August 14, 2005

CRC64 and MMX

I've done an experimental implementation of the CRC64 routines using the MMX 64-bit registers.
It makes COMP_CRC64() 10~30% faster than the macro written in plain C on my Pentium III laptop.
But on my Pentium 4 box, the MMX code is 2~3 times slower than the plain C code.

What is happening inside the processor?

#define COMP_CRC64_MMX(crc, data, len) \
do { \
    uint64        __crc0 = (crc).crc0; \
    unsigned char *__data = (unsigned char *) (data); \
    uint32        __len = (len); \
    while (__len-- > 0) \
    { \
        __asm__ __volatile__ ( \
            "movq (%1),%%mm0;"      /* __crc0 -> %%mm0 */ \
            "movq %%mm0,%%mm4;"     /* __crc0 -> %%mm4 */ \
            "movq (%2),%%mm1;"      /* *__data -> %%mm1 */ \
            "movl %3,%%eax;"        /* __crc64_const_vals */ \
            "movq (%%eax),%%mm5;"   /* load '56' */ \
            "movq 8(%%eax),%%mm6;"  /* load '0xff' */ \
            "movq 16(%%eax),%%mm7;" /* load '8' */ \
            "psrlq %%mm5,%%mm0;"    /* __crc0(%%mm0) >> 56 */ \
            "pxor %%mm1,%%mm0;"     /* __crc0(%%mm0) ^ *data */ \
            "pand %%mm6,%%mm0;"     /* __crc0(%%mm0) & 0xff */ \
            "mov %4,%%ebx;"         /* crc_table */ \
            "movd %%mm0,%%eax;"     /* move __tab_index to the register */ \
            "imul $8,%%eax;"        /* 8 bytes per table entry */ \
            "addl %%eax,%%ebx;" \
            "movq (%%ebx),%%mm0;"   /* crc_table[__tab_index] -> %%mm0 */ \
            "psllq %%mm7,%%mm4;"    /* %%mm4 << 8 */ \
            "pxor %%mm4,%%mm0;"     /* crc_table[__tab_index] ^ (__crc0 << 8) */ \
            "movq %%mm0,%0;" \
            "emms;" \
            : "+g"(__crc0) \
            : "r"(&__crc0), "r"(__data), "r"(__crc64_const_vals), "r"(crc_table) \
            : "%eax", "%ebx" \
        ); \
        __data++; \
    } /* while() */ \
    (crc).crc0 = __crc0; \
} while (0)
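For comparison, the per-byte step that the MMX sequence implements is roughly the plain-C loop below. This is a sketch reconstructed from the comments above, not copied from pg_crc.h.

while (__len-- > 0)
{
    int __tab_index = (int) (((__crc0 >> 56) ^ *__data++) & 0xff);

    /* crc_table[__tab_index] ^ (__crc0 << 8), exactly as in the asm */
    __crc0 = crc_table[__tab_index] ^ (__crc0 << 8);
}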

Sunday, May 29, 2005

Documents for OSS RDBMS evaluation

Interesting documents have been published by the Northeast Asia OSS Promotion Forum.

http://www.ipa.go.jp/software/open/forum/NEAforum.html

I hope these documents can be useful for the OSS RDBMS guys.

Are they interesting to you?

Monday, May 02, 2005

'IN PROGRESS' transactions at the end of recovery

If there are 'IN PROGRESS' transactions at the end of recovery, what can the recovering node do? When such a transaction is found, the recovering RM can't determine whether it will eventually be committed or aborted. The RM can't even tell whether the transaction has ended or is still continuing.

So the recovering RM must ask another node what to do with 'IN PROGRESS' transactions.

When asking, the recovering RM must identify the transaction by the identifier of the transaction's originator node and the transaction id on that originator node.
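As a sketch, such an identifier could be as simple as the struct below (the names are hypothetical, just for illustration):

#include <stdint.h>

/* Hypothetical: globally unique id used when asking another node about a transaction. */
typedef struct GlobalXactId
{
    uint32_t origin_node_id;   /* identifier of the originator node */
    uint32_t origin_xid;       /* transaction id on the originator node */
} GlobalXactId;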

BTW, how does the recovering RM know that recovery is complete?

Transaction isolation on recovery

Can recovery be realized just by replaying the WS log sequentially? I'm not sure about that.

Suppose there is a WS log sequence like the one below.

(1) XID=101, INSERT INTO t1 VALUES ( 1, 'Name 1' );
(2) XID=102, INSERT INTO t1 VALUES ( 2, 'Name 2' );
(3) XID=102, ABORT;
(4) XID=101, COMMIT;

(2) and (3) must be ignored when the RM recovers. But if the recovery session replays this sequence in a single transaction,

(1) XID=501, INSERT INTO t1 VALUES ( 1, 'Name 1' );
(2) XID=501, INSERT INTO t1 VALUES ( 2, 'Name 2' );
(3) XID=501, ABORT;
(4) XID=501, COMMIT;

the operations above insert no records at all: the ABORT at (3) rolls back the whole recovery transaction, including the insert from (1), and the COMMIT at (4) no longer has anything to commit.

So (1),(4) and (2),(3) must be isolated from each other.

But how? I don't think opening many transaction sessions during recovery is a smart way to do it.

Sunday, May 01, 2005

Some thoughts on recovery

1.) Recovery Scenarios

I think there are two types of recovery scenario.

- Adding a new node to the cluster.
- Adding a crashed node back to the cluster.

The difference is that a crashed node still has the old snapshot from when it crashed, so it can be recovered just by replaying the WS log.

However, if the older WS log is gone because the log files were reused, and no node still has the older WS data required to recover the crashed node, then the crashed node must be recovered like a new node.

2.) Boundary between the snapshot and WS log

When we recover by replaying the WS log, the boundary between the snapshot of the database being recovered and the WS log is very important.

This is because the WS log is a logical log; it must not be applied twice.
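One way I can imagine marking that boundary, assuming each WS carries the originator node id and originator xid as noted under "3.) WS data" below: keep a per-origin watermark of the last transaction already contained in the snapshot, and skip any WS at or below it during replay. A rough sketch with made-up names, ignoring xid wraparound:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-origin watermark: everything up to last_applied_xid
 * from this origin is already contained in the snapshot. */
typedef struct OriginWatermark
{
    uint32_t origin_node_id;
    uint32_t last_applied_xid;
} OriginWatermark;

/* Skip a WS the snapshot already contains, so it is never applied twice. */
static bool
ws_already_applied(const OriginWatermark *wm, uint32_t ws_origin_xid)
{
    return ws_origin_xid <= wm->last_applied_xid;
}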

3.) WS data

I think each piece of WS data should carry the originator (source) node identifier and the transaction id on the originator node.

BTW, according to my understanding, each WS is generated and sent per issued DML command. Right?

4.) Replaying WS data

I'm not sure that all of the WS data can be replayed in a single recovery transaction.

At least, the RM needs to recognize whether a WS belongs to a committed transaction or an aborted one.

I think WS data must not be replayed without isolating the transactions, because a WS is represented as DMLs; they are logical log records.

If the RM replays the WS log in a single transaction, the DMLs belonging to aborted transactions must be ignored.
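A rough sketch of that filtering, with made-up names: first check whether the originator transaction committed anywhere in the log, then replay only those DMLs inside the one recovery transaction. execute_dml() stands in for whatever actually applies a DML on the recovering node.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical WS record layout, for illustration only. */
typedef enum { WS_DML, WS_COMMIT, WS_ABORT } WsType;
typedef struct { WsType type; uint32_t origin_xid; const char *sql; } WsRecord;

extern void execute_dml(const char *sql);   /* applies one DML on this node */

/* Did this originator transaction commit somewhere in the WS log? */
bool
xid_committed(const WsRecord *log, size_t n, uint32_t xid)
{
    for (size_t i = 0; i < n; i++)
        if (log[i].type == WS_COMMIT && log[i].origin_xid == xid)
            return true;
    return false;
}

/* Replay, inside a single recovery transaction, only the DMLs that
 * belong to committed originator transactions; all others are ignored. */
void
replay_ws_log(const WsRecord *log, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (log[i].type == WS_DML && xid_committed(log, n, log[i].origin_xid))
            execute_dml(log[i].sql);
}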