How much do you know about how the specific conventions work under the hood? From what I remember, cdecl is pretty simple: the caller pushes args on the stack in reverse order and also does the "cleanup" for that pushing after the call returns (usually an add esp, N), while the callee handles creating its own stack frame and preserving ebp/esp. fastcall, afaik, is similar except the first two arguments go through ecx/edx for speed, and the callee pops any remaining stack arguments itself (ret N). Anyway, I think it really depends on context as far as size goes; fastcall should be faster in a general sense unless for some reason the arguments were already available on the stack, which is highly unlikely (but it's a trick used in some 4k stuff I've seen around).
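To make the difference concrete, here's a rough sketch in C of how the two conventions get selected per function. The CC_CDECL / CC_FASTCALL macros and the add_* functions are names I made up for illustration. The attribute and keyword spellings are the GCC/Clang and MSVC ones for 32-bit x86, and the macros deliberately expand to nothing everywhere else so the code still builds off x86. The comments describe what a typical 32-bit compiler does, not guaranteed output.

```c
/* Hypothetical helper macros: pick a calling convention on 32-bit x86,
   expand to nothing on any other target so the code stays portable. */
#if defined(__GNUC__) && defined(__i386__)
#  define CC_CDECL    __attribute__((cdecl))
#  define CC_FASTCALL __attribute__((fastcall))
#elif defined(_MSC_VER) && defined(_M_IX86)
#  define CC_CDECL    __cdecl
#  define CC_FASTCALL __fastcall
#else
#  define CC_CDECL
#  define CC_FASTCALL
#endif

/* cdecl: the caller pushes b, then a, calls, and afterwards removes the
   8 bytes of arguments itself (typically "add esp, 8"). */
int CC_CDECL add_cdecl(int a, int b)
{
    return a + b;
}

/* fastcall: a arrives in ecx and b in edx, so nothing is pushed for this
   call. Arguments beyond the first two would go on the stack and be
   popped by the callee. */
int CC_FASTCALL add_fastcall(int a, int b)
{
    return a + b;
}
```

Both functions compute the same thing; only the argument passing at each call site differs, which is where the couple of bytes per call come from (or don't).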
Honestly, though, I think at 64k it's a bit overkill to worry about such things, as it DOES (in my opinion) make the code a bit less readable for perhaps 2-6 bytes gained per function body (not to mention you're also locking the code to x86 by explicitly setting any conventions). At 4k in asm, however, I often drop conventions altogether and define them per-function, for instance using the FPU stack for arguments where possible for simple routines. The only time it really matters at that level is interop with non-asm code, in which case cdecl is quite straightforward and easy to target.