Berkeley CSUA MOTD:1998:November:28 Saturday <Friday, Sunday>
1998/11/28-30 [Computer/HW/Memory] UID:15036 Activity:high
11/28   Let's talk about virtual memory:
        \_ If you're dealing with virtual memory which far exceeds
           physical memory, you've already lost.
                \_ This has pretty much been my experience.  --PeterM
               \_ First of all, the guy is talking about number
                  crunching, not image processing.  It is likely that
                  he's going to be addressing all of the memory he's
                  crunching with.  And second, no one said it was
                   impossible--it just is painfully slow.  Disk is something
                   like five or six orders of magnitude slower than RAM.
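                   A back-of-envelope check on that gap, using illustrative
                   (assumed) 1998-era latencies rather than measured ones:

```python
# Rough RAM-vs-disk latency gap.  Both latencies below are assumptions
# for illustration (roughly 1998-era hardware), not measurements.
import math

dram_access_s = 100e-9   # ~100ns per uncached DRAM access
disk_seek_s = 10e-3      # ~10ms per random disk access (seek + rotation)

ratio = disk_seek_s / dram_access_s
print(round(math.log10(ratio)))  # about 5 orders of magnitude
```

                   The exact figure depends on the access pattern (sequential
                   read-ahead shrinks it, pure random access widens it), but
                   "painfully slow" holds either way.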
                  \_ uhh. "number crunching" applications usually
                     exhibit greater locality than almost any other app,
                     if optimized properly. -nick
                     \_ which will help not at all if it's using more
                        than physical RAM.
                        \_ What part of "locality" don't you understand,
                           twink?  Nick knows what he's talking about.
                           \_ Whether you can get away with crunching >>
                              than RAM depends strongly on the precise
                              task.  A blocked matrix multiply might
                              perform well, especially if you somehow
                              anticipate the data needed next and stream
                              it into RAM beforehand so you don't have
                              to eat the large latency while it is paged
                              in.....  Having a RAID array (expensive)
                              also helps.  A non-sparse matrix-vector
                              multiply, however, requires 1 memory
                              reference for every 2 flops (not counting
                              the memory reference for the vector
                              element or result).  Assuming .5 flop per
                              tick on a 400MHz P-II, we'd need floats
                              from the matrix at 100MHz, or 400MB/sec.
                              SDRAM might sustain that, but if the
                              matrix were much larger than memory....
                              Performance would drop 10x at least.
                              --PeterM
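                               The 400MB/sec figure follows from the post's
                               own assumptions (0.5 flops per tick, one
                               matrix load per two flops) plus 4-byte
                               floats, which is an assumption of this sketch:

```python
# Reconstruct the bandwidth estimate from the stated assumptions.
clock_hz = 400e6         # 400MHz Pentium II
flops_per_tick = 0.5     # sustained rate assumed in the post
flops_per_load = 2       # mul + add per matrix element loaded
bytes_per_float = 4      # single precision (an assumption of this sketch)

loads_per_sec = clock_hz * flops_per_tick / flops_per_load
bytes_per_sec = loads_per_sec * bytes_per_float
print(loads_per_sec, bytes_per_sec)  # 100e6 loads/sec, i.e. 400MB/sec
```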
                               \_ I don't think ILP is heavily influenced by
                                  how much a process' virtual memory
                                  compares to the physical memory.  Virtual
                                  memory pages are usually on the order of
                                  64kb.  Compare that to, say, a Cray vector
                                  register file which is a 32x32 64bit
                                  matrix.  That's 8kb and it takes several
                                  clock cycles for a number crunching
                                  program to process the data in the 64kb
                                  page anyway.  But this guy is talking
                                  about a Pentium and a K6 to do number
                                  crunching.  I don't think he's going to
                                  benefit from that kind of ILP and even if
                                  he did he would still benefit from the
                                  spatial and temporal locality of the
                                  program.
                                  \_ ILP ==> "Instruction Level Parallelism"
                                     I don't understand why you mention it
                                     here.  --PeterM
                                     \_ You mentioned vector ops which is a
                                        form of instruction level parallelism.
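        The blocking PeterM mentions can be sketched in plain Python (a toy
        illustration of the access pattern, not tuned code; the names are
        made up for the example).  Each B x B sub-block is reused many times
        while it is resident in fast memory, which is exactly the locality
        being claimed for number-crunching code:

```python
# Toy blocked matrix multiply: same arithmetic as the naive triple loop,
# but iterating over B-sized sub-blocks so each block of A, X, and C is
# reused while it sits in fast memory (cache, or RAM when paging).
def matmul_blocked(A, X, n, B):
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, B):            # block row of C
        for kk in range(0, n, B):        # block col of A / block row of X
            for jj in range(0, n, B):    # block col of C
                for i in range(ii, min(ii + B, n)):
                    for k in range(kk, min(kk + B, n)):
                        a = A[i][k]      # reused across the whole j-block
                        for j in range(jj, min(jj + B, n)):
                            C[i][j] += a * X[k][j]
    return C
```

        The arithmetic is identical to the naive triple loop; only the
        traversal order changes, so each loaded block is used on the order
        of B times instead of once before being evicted.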