www.virtualdub.org
VirtualDub is a video capture/processing utility for 32-bit Windows platforms (95/98/ME/NT4/2000/XP), licensed under the GNU General Public License (GPL). It lacks the editing power of a general-purpose editor such as Adobe Premiere, but is streamlined for fast linear operations over video. It has batch-processing capabilities for processing large numbers of files and can be extended with third-party video filters. VirtualDub is mainly geared toward processing AVI files, although it can read (not write) MPEG-1 and also handle sets of BMP images.
I basically started VirtualDub in college to do some quick capture-and-encoding that I wanted done; I released it on the web and others found it useful, so I've been tinkering around with its code ever since.
Not only does naked disable the frame pointer omission (FPO) optimization and prevent inlining, but it also doesn't stop the compiler from using spill space if it needs to -- which means you basically have to set up a stack frame anyway.
I've been trying for some time to get YV12 support working perfectly, but at this point it looks like a wash. The problem is that different drivers and applications are inconsistent about how they treat or format odd-width and odd-height YV12 images. Some do that and have unused space between the Cr and Cb planes (weird). Now, if people had sense, they would have handled this the way that MPEG and JPEG do, and simply require that the bitmap always be padded to the nearest even boundaries and that the extra pixels be ignored on decoding. Unfortunately, no one seems to have bothered to ever define the YV12 format properly in this regard, and thus we have massive confusion.
I dropped the older KB entries, but they're basically redundant with the change log in VirtualDub.
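To make the even-padding idea from the YV12 discussion above concrete, here is a minimal sketch in C of how plane sizes and offsets could be computed if the bitmap were always padded to even dimensions, the way MPEG and JPEG do. The type `Yv12Layout` and the function `yv12_layout_padded` are hypothetical names for illustration, not part of VirtualDub or any driver API. The sketch also assumes tightly packed rows (pitch equal to the padded width) and the usual YV12 plane order of Y, then Cr, then Cb; real drivers may align rows differently, which is exactly the inconsistency described above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout descriptor; not a VirtualDub or Windows type. */
typedef struct {
    int padded_w, padded_h;   /* luma size rounded up to even dimensions */
    int chroma_w, chroma_h;   /* size of each chroma plane (half the padded luma) */
    size_t y_offset;          /* byte offset of the Y plane */
    size_t cr_offset;         /* byte offset of the Cr (V) plane */
    size_t cb_offset;         /* byte offset of the Cb (U) plane */
    size_t total_bytes;       /* size of the whole padded image */
} Yv12Layout;

/* Assumption: rows are tightly packed, so pitch == padded width. */
Yv12Layout yv12_layout_padded(int width, int height)
{
    Yv12Layout l;
    size_t y_size, c_size;

    assert(width > 0 && height > 0);

    /* Round odd dimensions up to the next even value. */
    l.padded_w = (width + 1) & ~1;
    l.padded_h = (height + 1) & ~1;

    /* 4:2:0 subsampling: chroma is halved in both directions. */
    l.chroma_w = l.padded_w / 2;
    l.chroma_h = l.padded_h / 2;

    y_size = (size_t)l.padded_w * (size_t)l.padded_h;
    c_size = (size_t)l.chroma_w * (size_t)l.chroma_h;

    /* Planes follow each other with no gap between Cr and Cb. */
    l.y_offset = 0;
    l.cr_offset = y_size;
    l.cb_offset = y_size + c_size;
    l.total_bytes = y_size + 2 * c_size;
    return l;
}
```

With this convention a 5x3 image is stored as 6x4, and a decoder simply ignores the extra column and row. The point of the sketch is only that the layout becomes unambiguous once the padding rule is fixed.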
I put together a new Athlon 64-based system a few days ago, installed the preview of Windows XP for 64-bit Extended Systems and the prerelease AMD64 compiler, and hacked up the VirtualDub source code a bit. It plays MPEG-1 files, but nearly all of the assembly optimizations are disabled and none of the video filters work, so it's still far behind the 32-bit version. Still, it's neat to be able to experiment with 64-bit code.
The incremental improvements in the compiler simply aren't worth putting up with the braindead, butt-slow IDE. .NET Framework, which currently doesn't work under WOW32. VC6, with the pre-release VC8 compiler from the Windows Server 2003 DDK. This is a bit clumsy since the VC6 debugger doesn't understand VC7+ debug info, and certainly can't debug a 64-bit app, so I have to use the beta AMD64 WinDbg instead, but at least I have the AMD64 build in the same project file as the 32-bit build. Having a configuration called "Win32 Release AMD64" is a bit weird, however.
There are two major bottlenecks to getting VirtualDub running smoothly on AMD64: the compiler doesn't support inline assembly, and the OS doesn't support MMX for 64-bit tasks. The VC6 processor pack was quite bad and tended to generate about two move instructions for every ALU op; this is better in .NET 2003, but it still isn't able to resolve binary ops of the form A+A correctly, which I use a lot.
But there is an even worse problem -- note that the compiler moved MMX ops below the emms instruction. The problem is that the global optimizer doesn't see emms instructions as a barrier and freely flows ops around it, leading to incorrect code. Believe it or not, using __asm emms doesn't work either -- the only workaround I know of is to use volatile to hammer down the flow before the emms, and that's ridiculous. Never mind that the intrinsics version is also quite unreadable.
But wait -- we're in the era of Pentium 4 and Athlon 64 CPUs.
We don't need this emms junk, because we should be using SSE2!
.NET 2003:
    push      ebp
    mov       ebp,esp
    pxor      xmm0,xmm0
    movdqa    xmm1,xmm0
    movd      xmm0,dword ptr [ebp+8]
    punpcklbw xmm0,xmm1
    pshuflw   xmm1,xmm0,0FFh
    pmullw    xmm0,xmm1
    psrlw     xmm0,8
    movdqa    xmm1,xmm0
    packuswb  xmm1,xmm0
    and       esp,0FFFFFFF0h
    movd      eax,xmm1
    mov       esp,ebp
    pop       ebp
    ret
What I want:
    pxor      xmm1,xmm1
    movd      xmm0,dword ptr [esp+4]
    punpcklbw xmm0,xmm1
    pshuflw   xmm1,xmm0,0FFh
    pmullw    xmm0,xmm1
    psrlw     xmm0,8
    packuswb  xmm0,xmm0
    movd      eax,xmm0
    ret
The code is at least correct this time, but it is still full of unnecessary data movement, which consumes decode and execution bandwidth. Now for the real kicker: those extraneous moves hurt on a Pentium 4, because on a P4, a register-to-register MMX/SSE/SSE2 move has a latency of 6 clocks. If you have extra shifter or ALU bandwidth you can attack this by replacing movdqa with pshufd or pxor+por, but you can't do this when the compiler is generating code from intrinsics. And before you say that performance doesn't matter so much, remember that the purpose of those intrinsics is so that you can optimize hotspots using CPU-specific optimizations.
This all only pertains to the Microsoft Visual C++ compiler, and as it turns out, the Intel C/C++ Compiler generates much better MMX and SSE2 code. As it stands right now, though, I still have to use Visual C++, and that means I'm still going to have to hand-roll a lot of assembly code for performance. And with AMD64, that means I'm going to have to duplicate and reflow a lot of it.
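For readers who want to see what the intrinsics side of this looks like, here is a minimal sketch in C that maps one-to-one onto the "What I want" listing above. The source routine itself is not shown in this post, so this is a reconstruction from the assembly, and the function names (`scale_by_alpha_sse2`, `scale_by_alpha_scalar`, `check_scale_by_alpha`) are hypothetical. The routine multiplies each 8-bit channel of a 32-bit pixel by the pixel's top byte and keeps the high byte of each product. The intrinsics are the standard SSE2 ones from `<emmintrin.h>`; on a target without SSE2 the sketch falls back to the scalar version.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar reference: multiply every 8-bit channel by the pixel's top byte
   and keep the high byte of each 16-bit product. */
uint32_t scale_by_alpha_scalar(uint32_t px)
{
    uint32_t a = px >> 24;
    uint32_t out = 0;
    int shift;
    for (shift = 0; shift < 32; shift += 8) {
        uint32_t c = (px >> shift) & 0xFF;
        out |= ((c * a) >> 8) << shift;
    }
    return out;
}

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>

/* One intrinsic per instruction in the "What I want" listing. */
uint32_t scale_by_alpha_sse2(uint32_t px)
{
    __m128i zero  = _mm_setzero_si128();              /* pxor      */
    __m128i px8   = _mm_cvtsi32_si128((int)px);       /* movd      */
    __m128i px16  = _mm_unpacklo_epi8(px8, zero);     /* punpcklbw */
    __m128i alpha = _mm_shufflelo_epi16(px16, 0xFF);  /* pshuflw   */
    __m128i prod  = _mm_mullo_epi16(px16, alpha);     /* pmullw    */
    __m128i hi    = _mm_srli_epi16(prod, 8);          /* psrlw     */
    __m128i out   = _mm_packus_epi16(hi, hi);         /* packuswb  */
    return (uint32_t)_mm_cvtsi128_si32(out);          /* movd      */
}
#else
/* Assumption: no SSE2 on this target, so reuse the scalar path. */
uint32_t scale_by_alpha_sse2(uint32_t px)
{
    return scale_by_alpha_scalar(px);
}
#endif

/* Usage: both versions should agree on every input tried here. */
void check_scale_by_alpha(void)
{
    uint32_t i;
    for (i = 0; i < 4096; ++i) {
        uint32_t px = i * 2654435761u;  /* cheap spread over 32 bits */
        assert(scale_by_alpha_sse2(px) == scale_by_alpha_scalar(px));
    }
}
```

Whether a given compiler turns this into the tight eight-instruction sequence or into the version with extra register moves is exactly the code generation question discussed above; the sketch only pins down the computation, not the output of any particular compiler.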