
serious NCO 3.9.5 problem

Developers
2008-12-03
2013-10-17
  • Davide DelVento (was javacorner)

    Folks,
    as I reported previously (see https://sourceforge.net/forum/message.php?msg_id=5532064 ) I was able to install NCO 3.9.5 on bluefire.
    Unfortunately, yesterday I discovered that something is broken: processing some WRF output as follows gives unreliable results! I'm not sure whether this is a compilation issue or a software bug.

    /usr/local/apps/nco-3.9.5/bin/ncea -v U,V,W,PH,T,MU,QVAPOR wrfinput_d01_mem1 wrfinput_d01_mem3 -o new_nco_output.nc
    /usr/local/apps/nco-3.9.2/bin/ncea -v U,V,W,PH,T,MU,QVAPOR wrfinput_d01_mem1 wrfinput_d01_mem3 -o old_nco_output.nc

    and then

    ncdump old_nco_output.nc > old_nco_output.ncdump
    ncdump new_nco_output.nc > new_nco_output.ncdump

    and then

    /usr/local/bin/diff -wu new_nco_output.ncdump old_nco_output.ncdump | less
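
A full-text diff of two dumps also flags header lines that are expected to differ (the dataset name on the first line and, typically, the history global attribute that NCO appends). The following is a minimal sketch, not part of the original report, that compares only the data sections. It assumes the dumps are standard ncdump CDL text in which the values follow a line beginning with `data:`; the helper names `data_section` and `same_data` are hypothetical.

```shell
# data_section FILE: print a CDL dump from its "data:" line to the end,
# dropping the header (dataset name, dimensions, attributes such as history).
data_section() {
    sed -n '/^data:/,$p' "$1"
}

# same_data A B: succeed when the data sections of two dumps are identical.
same_data() {
    [ "$(data_section "$1" | cksum)" = "$(data_section "$2" | cksum)" ]
}

# Usage with the dumps produced above:
#   same_data old_nco_output.ncdump new_nco_output.ncdump && echo same || echo differ
```

If `same_data` fails, the `diff -wu` output shows where the values diverge.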

    The numbers are COMPLETELY different! And even worse, the results are different every time I run the 3.9.5 ncea.
    Suspecting a race condition, I tried using only one thread (default is 4):

    /usr/local/apps/nco-3.9.5/bin/ncea -t1 -v U,V,W,PH,T,MU,QVAPOR wrfinput_d01_mem1 wrfinput_d01_mem3 -o new_nco_output.nc

    This way, the results are fine. But please note that the 3.9.2 data are obtained with 4 threads, as you will see in the ncdump file.
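
One way to put a number on "different every time" is to repeat the same command several times and count how many distinct results come out. The sketch below is an illustration under assumptions, not something run in this thread: it expects ncdump text files as input, looks only at the part from the `data:` line onward, and uses a hypothetical helper name `distinct_results` and hypothetical file names `run1.ncdump` etc.

```shell
# distinct_results FILE...: print how many different data sections occur
# among the given CDL dumps; 1 means every run produced the same numbers.
distinct_results() {
    for f in "$@"; do
        sed -n '/^data:/,$p' "$f" | cksum
    done | sort -u | wc -l | tr -d ' '
}

# Usage (hypothetical output names), repeating the multi-threaded run three times:
#   for i in 1 2 3; do
#       ncea -t 4 -v U,V,W,PH,T,MU,QVAPOR wrfinput_d01_mem1 wrfinput_d01_mem3 -o run$i.nc
#       ncdump run$i.nc > run$i.ncdump
#   done
#   distinct_results run1.ncdump run2.ncdump run3.ncdump
```

A count above 1 for identical inputs and options points to nondeterminism such as the suspected race; repeating the loop with `-t 1` gives the single-threaded baseline.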

    You can find wrfinput_d01_mem1 wrfinput_d01_mem3 in /blhome/ddvento/nco_problem on bluefire.

    Regards,
    Davide Del Vento, Consulting Services Software Engineer
    NCAR Computational & Information Services Laboratory
    http://www.cisl.ucar.edu/hss/csg/
    office: Mesa Lab, Room 42B
    phone:  1233

     
    • Charlie Zender

      Charlie Zender - 2008-12-03

      Davide,

      I looked at this quickly and verified that the input files are netCDF classic format.
      Yours is the first report of NCO threading problems with netCDF3 files.
      This is troubling. I have marked this as TODO nco956.

      We know there are threading problems with netCDF4 files _unless_
      the underlying HDF library is built with the --enable-threadsafe option (which nobody
      does).
      To solve/workaround this (TODO nco939), we made NCO 3.9.6-beta turn off threading when built with netCDF4.
      Sounds like the same solution will work in your case (it automatically sets threads to 1).

      What is baffling is why NCO 3.9.2 works and 3.9.5 fails.
      Can't think of any source code changes that would cause this.
      Were they built with the same compiler?

      We will figure this out.
      In the meantime, I suggest you revert to 3.9.2.

      P.S. Figuring this out would go quicker if bluefire had an up-to-date GSL installation.
      NCO 3.9.6 depends on GSL and I can't test fixes on bluefire until GSL works.
      Currently it fails to build for reasons shown in

      ptmp/zender/gsl-1.11/gsl*.foo

      Thanks,
      Charlie

       
      • Davide DelVento (was javacorner)

        Charlie,

        > We know there are threading problems with netCDF4 files _unless_
        > the underlying HDF library is built with the --enable-threadsafe option (which nobody
        > does).
        Surely we didn't, but I can recompile it. Do you have additional info about it? If I understand it correctly, there might be other applications suffering from this problem!

        > What is baffling is why NCO 3.9.2 works and 3.9.5 fails.
        > Can't think of any source code changes that would cause this.
        > Were they built with the same compiler?
        Yes, but not exactly with the same version, because we always use the latest stable compiler available (which of course changes). And probably at that time we didn't have netCDF4 installed (if this is the case, I wonder what "nco_openmp_thread_number = 4" means for NCO 3.9.2)

        Thanks,

        Davide Del Vento, Consulting Services Software Engineer
        NCAR Computational & Information Services Laboratory
        http://www.cisl.ucar.edu/hss/csg/
        office: Mesa Lab, Room 42B
        phone: 1233

         
        • Charlie Zender

          Charlie Zender - 2008-12-03

          > Surely we didn't, but I can recompile it.
          The problem with building HDF with --enable-threadsafe is that, I believe,
          either the HDF fortran library or the netCDF4 fortran library will not build,
          at least in a straightforward way. Ask netCDF4/HDF people for more info.
          Bottom line, we just turn off threading support in NCO built with netCDF4 to keep things simple.
          Only ncwa and ncap2 benefit significantly from threading, so the
          overall NCO performance loss is not a big deal.

          > Were they built with the same compiler? 
          > Yes, but not exactly with the same version, because we always use the latest stable
          > compiler available (which of course changes). And probably at that time we didn't have
          > netCDF4 installed (if this is the case, I wonder what "nco_openmp_thread_number = 4"
          > means for NCO 3.9.2)
          It means the executable ran with 4 threads, as it should have.
          Your current report is the first in which NCO with libnetcdf3 has shown threading
          problems.

          P.S. Another option you have is to try building 3.9.5 with threading disabled, by undefining _OPENMP at configure time:

          CPPFLAGS='-U_OPENMP' ./configure --prefix ....

           
          • Davide DelVento (was javacorner)

            Ok, thanks.
            I recompiled it with CPPFLAGS='-U_OPENMP' and of course it works.
            Bye,
            ;Dav

             
    • Charlie Zender

      Charlie Zender - 2009-01-14

      OK, we are trying to track down this problem now.
      It is our highest priority.
      I can reproduce it on bluefire.

      Henry, I've copied the wrfinput* input files to /data/zender/tmp
      on both esmf.ess.uci.edu (AIX) and dust.ess.uci.edu (LINUX).
      Please see if you can reproduce the problems above on either/both
      architectures using (first) the head code branch.
      If you can reproduce, then go back a few versions, to 3.9.1, say,
      and see if it works and if it does then please bracket the offending patch.
      Problem should show up if you change -t 1 to -t 4 in the following:

      ncea -t 1 -v U,V,W,PH,T,MU,QVAPOR -p ${DATA}/tmp wrfinput_d01_mem1 wrfinput_d01_mem3 -o ~/new_nco_output.nc
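
Bracketing the offending change between a known-good and a known-bad version amounts to a binary search. The following is a minimal sketch of that search, not the procedure actually used here. It assumes an ordered list of versions (oldest first) whose first entry is good and last entry is bad, and a check command, supplied by the user, that exits 0 for a good version; the names `first_bad` and `is_good` are hypothetical.

```shell
# first_bad CHECK V1 V2 ... VN
# Print the earliest bad version, assuming V1 is good, VN is bad, and
# "CHECK <version>" exits 0 when that version gives correct results.
first_bad() {
    check=$1; shift
    lo=1     # index of a version known to be good
    hi=$#    # index of a version known to be bad
    while [ $((hi - lo)) -gt 1 ]; do
        mid=$(( (lo + hi) / 2 ))
        eval "v=\${$mid}"
        if "$check" "$v"; then lo=$mid; else hi=$mid; fi
    done
    eval "v=\${$hi}"
    printf '%s\n' "$v"
}

# Usage (is_good is a user-written function that builds the given version,
# runs the ncea command above with -t 4, and compares against the -t 1 result):
#   first_bad is_good 3.9.2 3.9.3 3.9.4 3.9.5
```

The same search applies to a list of individual patches or revisions instead of release numbers.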

      Thanks,
      Charlie

       
    • Charlie Zender

      Charlie Zender - 2009-01-22

      OK, Henry identified and fixed the problem.
      The fix is a (one-line!) patch to the file nco/src/nco/nco_msa.c
      distributed with NCO version 3.9.5 (no other versions are affected).
      The following should be added at line 560:
      ************************************************
      560c560
      <
      ---
      >   var_in->nc_id=in_id;

       
