Folks,
as I reported previously (see https://sourceforge.net/forum/message.php?msg_id=5532064 ) I was able to install NCO 3.9.5 on bluefire.
Unfortunately, yesterday I discovered that something is broken: processing some WRF output as follows gives unreliable results! I'm not sure if this is a compilation issue or a software bug.
/usr/local/apps/nco-3.9.5/bin/ncea -v U,V,W,PH,T,MU,QVAPOR wrfinput_d01_mem1 wrfinput_d01_mem3 -o new_nco_output.nc
/usr/local/apps/nco-3.9.2/bin/ncea -v U,V,W,PH,T,MU,QVAPOR wrfinput_d01_mem1 wrfinput_d01_mem3 -o old_nco_output.nc
and then
ncdump old_nco_output.nc > old_nco_output.ncdump
ncdump new_nco_output.nc > new_nco_output.ncdump
and then
/usr/local/bin/diff -wu new_nco_output.ncdump old_nco_output.ncdump | less
The numbers are COMPLETELY different! And even worse, the results are different every time I run the 3.9.5 ncea.
Suspecting a race condition, I tried using only one thread (default is 4):
/usr/local/apps/nco-3.9.5/bin/ncea -t1 -v U,V,W,PH,T,MU,QVAPOR wrfinput_d01_mem1 wrfinput_d01_mem3 -o new_nco_output.nc
This way, the results are fine. But please note that the 3.9.2 data are obtained with 4 threads, as you will see in the ncdump file.
You can find wrfinput_d01_mem1 wrfinput_d01_mem3 in /blhome/ddvento/nco_problem on bluefire.
Regards,
Davide Del Vento, Consulting Services Software Engineer
NCAR Computational & Information Services Laboratory
http://www.cisl.ucar.edu/hss/csg/
office: Mesa Lab, Room 42B
phone: 1233
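The reproducibility check described in the post above (run the same ncea command repeatedly, then compare the dumps) can be sketched as a small shell helper. This is only an illustration: `count_distinct_outputs` and `run_ncea_and_dump` are hypothetical names, not part of NCO, and the sketch assumes `cksum`, `sort`, and `sed` are available. Only the `data:` section of the ncdump output is compared, so that global attributes in the header do not mask or mimic a difference in the numbers.

```shell
#!/bin/sh
# Run a command several times and print how many distinct outputs it produced.
# usage: count_distinct_outputs RUNS command [args...]
count_distinct_outputs() {
    runs=$1
    shift
    i=0
    while [ "$i" -lt "$runs" ]; do
        # cksum reduces each run's stdout to one "checksum size" line
        "$@" | cksum
        i=$((i + 1))
    done | sort -u | wc -l | tr -d ' '
}

# Hypothetical wrapper around the commands from the post: run the 3.9.5 ncea
# (extra arguments such as -t1 are passed through), then print only the data
# section of the resulting file.
run_ncea_and_dump() {
    rm -f new_nco_output.nc
    /usr/local/apps/nco-3.9.5/bin/ncea "$@" -v U,V,W,PH,T,MU,QVAPOR \
        wrfinput_d01_mem1 wrfinput_d01_mem3 -o new_nco_output.nc &&
        ncdump new_nco_output.nc | sed -n '/^data:/,$p'
}

# Example (only meaningful where ncea, ncdump and the input files exist):
# count_distinct_outputs 5 run_ncea_and_dump        # default thread count
# count_distinct_outputs 5 run_ncea_and_dump -t1    # single thread
```

A count greater than 1 for the default run, together with a count of 1 for the `-t1` run, would be consistent with the race condition suspected above.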
Davide,
I looked at this quickly and verified that the input files are netCDF classic format.
Yours is the first report of NCO threading problems with netCDF3 files.
This is troubling. I have marked this as TODO nco956.
We know there are threading problems with netCDF4 files _unless_
the underlying HDF library is built with the --enable-threadsafe option (which nobody
does).
To work around this (TODO nco939), we made NCO 3.9.6-beta turn off threading when built with netCDF4.
Sounds like the same solution will work in your case (it automatically sets threads to 1).
What is baffling is why NCO 3.9.2 works and 3.9.5 fails.
Can't think of any source code changes that would cause this.
Were they built with the same compiler?
We will figure this out.
In the meantime, I suggest you revert to 3.9.2.
P.S. Figuring this out would go quicker if bluefire had an up-to-date GSL installation.
NCO 3.9.6 depends on GSL and I can't test fixes on bluefire until GSL works.
Currently it fails to build for reasons shown in
ptmp/zender/gsl-1.11/gsl*.foo
Thanks,
Charlie
Charlie,
> We know there are threading problems with netCDF4 files _unless_
> the underlying HDF library is built with the --enable-threadsafe option (which nobody
> does).
Surely we didn't, but I can recompile it. Do you have additional info about it? If I understand it correctly, there might be other applications suffering from this problem!
> What is baffling is why NCO 3.9.2 works and 3.9.5 fails.
> Can't think of any source code changes that would cause this.
> Were they built with the same compiler?
Yes, but not exactly with the same version, because we always use the latest stable compiler available (which of course changes). And probably at that time we didn't have netCDF4 installed (if this is the case, I wonder what "nco_openmp_thread_number = 4" means for NCO 3.9.2)
Thanks,
Davide Del Vento, Consulting Services Software Engineer
NCAR Computational & Information Services Laboratory
http://www.cisl.ucar.edu/hss/csg/
office: Mesa Lab, Room 42B
phone: 1233
> Surely we didn't, but I can recompile it.
The problem with building HDF with --enable-threadsafe is that, I believe,
either the HDF fortran library or the netCDF4 fortran library will not build,
at least in a straightforward way. Ask netCDF4/HDF people for more info.
Bottom line, we just turn off threading support in NCO built with netCDF4 to keep things simple. Only ncwa and ncap2 benefit significantly from threading, so the
overall NCO performance loss is not a big deal.
> Were they built with the same compiler?
> Yes, but not exactly with the same version, because we always use the latest stable compiler available (which of course changes). And probably at that time we didn't have netCDF4 installed (if this is the case, I wonder what "nco_openmp_thread_number = 4" means for NCO 3.9.2)
It means the executable ran with 4 threads, as it should have.
Your current report is the first in which NCO with libnetcdf3 has shown threading
problems.
P.S. Another option you have is to try building 3.9.5 with threading disabled, using
CPPFLAGS='-U_OPENMP' ./configure --prefix ....
Ok, thanks.
I recompiled it with CPPFLAGS='-U_OPENMP' and of course it works.
Bye,
;Dav
OK, we are trying to track down this problem now.
It is our highest priority.
I can reproduce it on bluefire.
Henry, I've copied the wrfinput* input files to /data/zender/tmp
on both esmf.ess.uci.edu (AIX) and dust.ess.uci.edu (LINUX).
Please see if you can reproduce the problems above on either/both
architectures using (first) the head code branch.
If you can reproduce, then go back a few versions, to 3.9.1, say,
and see if it works and if it does then please bracket the offending patch.
Problem should show up if you change -t 1 to -t 4 in the following:
ncea -t 1 -v U,V,W,PH,T,MU,QVAPOR -p ${DATA}/tmp wrfinput_d01_mem1 wrfinput_d01_mem3 -o ~/new_nco_output.nc
Thanks,
Charlie
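The "bracket the offending patch" step requested above amounts to a binary search between a known-good and a known-bad version. A minimal sketch follows, assuming the candidates are ordered oldest to newest, the first is good, the last is bad, and every candidate after the first bad one is also bad. `first_bad_version` and the `is_good` callback are hypothetical names for illustration; in practice the callback would build that version and run the `-t 4` ncea test.

```shell
#!/bin/sh
# usage: first_bad_version IS_GOOD_CMD v1 v2 ... vN
# IS_GOOD_CMD is called with one version and must return 0 if it works.
# Assumes v1 is good and vN is bad; prints the first bad version.
first_bad_version() {
    is_good=$1
    shift
    lo=1      # index of a version known to be good
    hi=$#     # index of a version known to be bad
    while [ $((hi - lo)) -gt 1 ]; do
        mid=$(( (lo + hi) / 2 ))
        eval "candidate=\${$mid}"
        if "$is_good" "$candidate"; then
            lo=$mid
        else
            hi=$mid
        fi
    done
    eval "echo \"\${$hi}\""
}

# Stand-in test for demonstration only: pretends that 3.9.5 is the one
# failing release. A real test would build and run the given version.
demo_is_good() {
    case $1 in
        3.9.5) return 1 ;;
        *) return 0 ;;
    esac
}

# Example:
# first_bad_version demo_is_good 3.9.1 3.9.2 3.9.3 3.9.4 3.9.5
```

The same search works over individual patches instead of releases, as long as they can be listed in order and tested one at a time.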
OK, Henry identified and fixed the problem.
The fix is a (one-line!) patch to the file nco/src/nco/nco_msa.c
distributed with NCO version 3.9.5 (no other versions are affected).
The following should be added at line 560:
************************************************
560c560
<
---
> var_in->nc_id=in_id;
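For readers applying the fix by hand, the diff above says that line 560 of nco_msa.c (a blank line) is replaced by the new statement. A cautious way to script that is sketched below; `replace_blank_line` is a hypothetical helper, not part of NCO, and it refuses to touch the file unless the target line really is blank, which guards against applying the change to a different revision of the file. The indentation of the inserted statement is not shown in the diff, so it is left unindented here.

```shell
#!/bin/sh
# Replace line LINENO of FILE with TEXT, but only if that line is blank.
# Returns non-zero (and leaves FILE untouched) otherwise.
# usage: replace_blank_line FILE LINENO TEXT
replace_blank_line() {
    file=$1
    lineno=$2
    text=$3
    tmp=$(mktemp) || return 1
    if awk -v n="$lineno" -v t="$text" '
        NR == n && $0 ~ /^[ \t]*$/ { print t; changed = 1; next }
        { print }
        END { exit(changed ? 0 : 1) }
    ' "$file" > "$tmp"; then
        # Copy back with cat so the original file keeps its permissions
        cat "$tmp" > "$file"
        status=$?
    else
        status=1
    fi
    rm -f "$tmp"
    return "$status"
}

# Example, run from the directory that contains the nco source tree:
# replace_blank_line nco/src/nco/nco_msa.c 560 'var_in->nc_id=in_id;'
```

After the edit, NCO 3.9.5 needs to be rebuilt for the change to take effect.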