5/25 |
2002/6/11 [Computer/SW/Unix] UID:25062 Activity:moderate
6/10  A while ago there was a thread about linux clients and solaris nfs.
      Today I found out that the default nfs packet size on 7.3 (redhat)
      is 32k, not the old linux 8k. This causes clients to pound on our
      nfs server. Dunno if it helps.
      \_ from mount_nfs(1M):
         rsize=n   Set the read buffer size to n bytes. The default value
         is 32768 when using Version 3 of the NFS protocol. The default
         can be negotiated down if the server prefers a smaller transfer
         size. When using Version 2, the default value is 8192.
      \_ After the tip from akopps, I found these relevant bugids:
         http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=64921
         http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=64984
         http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=65069
         http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=65410
         http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=65707
         http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=65772
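The client-side workaround that comes up throughout the bug reports below is to cap the transfer size at the old 8k default, or to force NFS version 2. A sketch of what the /etc/fstab entries might look like; the server name `servername` and export path `/home` are hypothetical, and the exact option set you want depends on your environment:

```
# Hypothetical /etc/fstab entry: cap rsize/wsize at the old 8k default
# so an NFSv3 client doesn't flood the server with 32k UDP transfers.
servername:/home  /home  nfs  rsize=8192,wsize=8192,hard,intr  0 0

# Alternative: fall back to NFS version 2 entirely.
servername:/home  /home  nfs  nfsvers=2,hard,intr  0 0
```

The same `rsize=8192,wsize=8192` (or `nfsvers=2`) options can equally be passed to `mount -o` or put in an autofs map, as several commenters below do.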
bugzilla.redhat.com/bugzilla/show_bug.cgi?id=64921

It will hang. Actual Results: Mountpoint hangs. Expected Results: Mountpoints should not hang unless the server is indefinitely broken. Additional info: Using nfsvers=2

------- Additional Comment #1 From William R. Fulmer on 2002-05-16 11:28 -------
I don't know about reads, but I can vouch for NFS v3 having problems. I can do large reads without problems, but any large writes over a network with any kind of real latency and the NFS client goes berserk. It seems to get into a state where it does endless retries, and the delay between retries decreases as time progresses, sending a flood of nfs packets over the net. I was going across two local routers and managed to take down six or seven of our production subnets. Strangely enough, trying to reproduce the event across a local switch (with no latency) didn't work.

------- Johnson on 2002-05-16 15:26 -------
We recommend that you use UDP for now. NFS/TCP was just recently enabled at all in the upstream kernel source tree, and it is functional for some uses but clearly not for yours.

------- Detillieux on 2002-05-17 12:59 -------
It does appear to be using UDP, and not TCP, on my network as well, but the NFS version did seem to be 3. If NFS 3 support is still not mature, I'm not sure why it would have been made the new default. This option is useful for hosts that can run multiple NFS servers. I think it is for running multiple instances of an nfs server on the same box, but I'm not sure. The option that specifies the protocol version is not in the man page (or I missed it).

------- Detillieux on 2002-05-17 17:34 -------
Well, nfsvers=2 and vers=2 appear to accomplish exactly the same thing, as far as the kernel is concerned. In either case, the option simply appears as "v2" when you look at /proc/mounts, so I'd just as soon stick to the documented options (even if the docs are out of date).
That would apply to mountprog=n and nfsprog=n, but mountvers=n and nfsvers=n are meant for specifying protocol versions. The mountd and nfsd daemons support multiple protocol versions, for backward compatibility, regardless of how many program instances you may be running.

If I don't change autofs to use nfsv2, the box gets in sorry shape very quickly: lots of "nfs: task XXXXX can't get a request slot" errors, and the X session trying to use the NFS/NIS homedir locks up hard. I am a bit confused because I thought that nfs 3 was TCP only. In summary, NFS 3 works well for me as long as I specify tcp in my mount entries. Regards, Joe

------- Additional Comment #9 From Ben LaHaise on 2002-05-28 17:46 -------
*** Bug 64984 has been marked as a duplicate of this bug. ***

Unless you want to use NFS version 3, nothing more needs to be done to fix this problem. I have tried a succession of kernels and options in a quest for decent performance and still haven't arrived at a satisfactory solution. I'd rather have them be a little slow than panic every so often.

When trying to write to the NFS mounted directory it hangs. I have tried the following settings on the client side:
-fstype=nfs,hard,intr,nodev,nosuid,quota,rsize=8192,wsize=8192 servername:/home/&
-fstype=nfs,hard,intr,nodev,nosuid,quota,nfsvers=2,rsize=8192,wsize=8192 servername:/home/&
but it did not solve the problem. Thanks, Venkat

------- Additional Comment #22 From Jason Corley on 2002-09-13 15:37 -------
I also am seeing this problem. The problems I was seeing from mozilla were a combo of both nfs problems and a mozilla bug.
I use this along with 'soft' even though I know all the docs say don't do 'soft'. As soon as I removed the 'timeo=300' (but not 'soft') my performance was normal again. I don't understand why timeo should make a difference, since according to the man page it is only for when the server is not responding.
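For what it's worth, timeo does shape behaviour beyond the "server down" case: per the nfs(5) man page of that era, timeo is in tenths of a second, and for NFS over UDP the timeout is doubled after each retransmission (capped at 60 seconds) until retrans retries are exhausted. A rough sketch of that backoff schedule; the function name and the exact doubling rule are my own illustration, not the kernel's actual code:

```python
def udp_retrans_schedule(timeo_tenths, retrans, cap_tenths=600):
    """Approximate per-try timeouts (in tenths of a second) for an
    NFS-over-UDP mount: start at timeo, double each retry, cap at
    60 seconds.  Illustrative only -- not the kernel algorithm."""
    schedule = []
    t = timeo_tenths
    for _ in range(retrans + 1):
        schedule.append(min(t, cap_tenths))
        t *= 2
    return schedule

# Defaults (timeo=7, retrans=3): waits of 0.7s, 1.4s, 2.8s, 5.6s
# before a minor-timeout cycle completes.
print(udp_retrans_schedule(7, 3))    # [7, 14, 28, 56]

# With timeo=300 every wait starts at 30 seconds, so each retry
# cycle takes minutes -- which changes how a soft mount behaves
# under even brief packet loss.
print(udp_retrans_schedule(300, 3))  # [300, 600, 600, 600]
```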
bugzilla.redhat.com/bugzilla/show_bug.cgi?id=64984

Mount user files via nfs from a Network Appliance file server. During this time, other commands on the machine such as ps aux, dmesg, and top will hang. Let me know if there is additional information that I can provide.

Can you check this info on your box, and see if your case matches? After investigation, it turns out that somehow my 3c905C is experiencing A LOT of collisions, on a switched network. With TCP, this just reduces throughput, but with UDP (which NFS uses), it generated so many retries and resends between my box and the server that I was flooding the NFS server. My network analyst also had me try forcing the hardware into 10baseT/half duplex mode, just in case the collisions were the result of autonegotiation failures. The odd thing is that this is a switched network, meaning that I should never see ethernet collisions on my interface, yet ifconfig reports exactly that. Also, the network guy says that he is seeing FAR more collisions on his end than I see on my end. Phil: can you poke around and see if this is the same problem you are having?

I got lots of messages like this:
eth0: vortex_error(): status 0xe081
eth0: vortex_error(): status 0xe481
(I think -- I'm reciting from memory. Also, the ratio of 0xe081 to 0xe481 errors is around 5:1.)

Doing the nfs writes of a 2-3 MB mail file appears to kill the response on our main nfs server (a Network Appliance) for other users as well as myself. We ran a capture on that switched port using an Etherpeek sniffer for several minutes. It shows a lot of fragment errors between my machine and the file server: the message "An IP datagram has been fragmented by the host application or a router, and one of the fragments is missing". The Cisco switch is not showing errors on the port my machine is connected to, but there are collisions on the port that the file server is connected to. Let me know if there is more information that I can provide.
Phil Kaslo

------- Additional Comment #4 From Need Real Name on 2002-05-15 19:23 -------
I did run "modprobe 3c59x debug=7"; Only I wish you'd set the priority much higher (to whatever the highest level is).

Does the problem go away if you set rsize=1024 and wsize=1024 on the client? Is there any packet loss when pinging the solaris server with ping -s 8300 or ping -s 4300? This sounds like two separate bugs and should probably be entered separately -- it's easier to mark bugs as duplicates than to deal with two threads of information in a single bug. For the person with the 3com, double-check the duplex of the link and try the above. You may need to force full duplex on the 3com driver with full_duplex=1 and try again.

The cisco switch indicates the port that it is connected to is at auto-full duplex, and auto-100 speed. I again tried a copy of a 2 MB mail file, and an edit of it using mail, d, and q. It still takes about 80 seconds to write it out, during which commands hang, and afterwards dmesg reports:
nfs: server sinagua not responding, still trying
nfs: server sinagua ok
I can't keep running these tests in a production environment here, because of the effect it has on our main nfs file server, and on other users. Phil

------- Additional Comment #10 From Ben LaHaise on 2002-05-16 19:09 -------
What about the results of the pings?

I start to see retransmits climb at 16384, but everything is still usable. BTW, I have 0 packet loss (over 1000 iterations) when I'm not trying to access the NFS server (I didn't try it while accessing the NFS server). Phil

------- Additional Comment #14 From Need Real Name on 2002-05-16 19:59 -------
FYI, the NFS server I referred to in my case is also a Network Appliance, and I see the same symptoms. Not sure whether the collisions are actually relevant or not, but I would guess that ethernet has to be playing a role somewhere or Phil and I would not be getting the same errors from 3c59x.
Also, I'm on 10bT/half whereas Phil's on 100bT/full, which may explain why I see actual collisions and he doesn't. Like Phil, though, I can't test this much, since I'm in a production environment. I then reasoned that my card (3c905B) might have gone bad and replaced it with an Intel EtherExpress Pro card. Also, I've tried tuning the NFS options (I had previously used the default settings, which always worked fine) by setting rsize and wsize to 8192, and increasing the timeo value. The increase of the timeo value (I've now got it set to 20, which is 2 seconds) has done the most to make my performance acceptable, but I can still trigger the problem by using NFS to write or read a large file.

------- Detillieux on 2002-05-17 12:53 -------
Just wondering if this problem is related to that in bug #64921, NFS version 3 hangs? Neil and Trond can both give a good set of recommendations on which patches to use, and these problems only appear to be reported on red hat's kernels, not stock ones patched with neil's patches. I'll try the fixes mentioned throughout this thread -- esp. Phil's, since he seems to have a similar configuration to me.

Filesystems which are mounted via 'mount' must have '-o rsize=8192,wsize=8192' specified as arguments to the mount command. Either of these two solutions will work independently for me, and I am currently using one alone as it's easier to maintain. Phil

------- Additional Comment #22 From Ben LaHaise on 2002-05-21 21:40 -------
A fix is being made to the kernel. The default rw size will be set to 8KB, but can still be configured to 32KB by the user.

I had to change the *maximum* size (NFS_MAX_FILE_IO_BUFFER_SIZE) to fix the problem, diminishing it from 32768 to 8192. My incredibly uninformed guess is that although my system has a default size of 4096, it somehow negotiates the larger 32768 transfer size with the NFS server regardless of its 4096 default, and then the whole thing hangs.
I had to limit the maximum size to inhibit this behavior (or explicitly specify 8192 at mount time as the transfer size). I will do a recheck to see what changed during the upgrade, but I guess it's clear the hw is not the culprit, and maybe not even RH's own kernels (I was not using them before the upgrade and I'm not now, yet the problem still popped up after the upgrade). Never do critical changes late at night, right after an upgrade.
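The collision counts discussed in this thread can be read programmatically from /proc/net/dev, where each interface row carries 16 counters (8 receive, then 8 transmit, with 'colls' the sixth transmit field). A small sketch of such a check; the sample text is fabricated to resemble the real file:

```python
def tx_collisions(proc_net_dev_text):
    """Map interface name -> tx collision count from /proc/net/dev.

    After the two header lines, each row is "iface:" followed by 16
    counters: 8 rx fields, then 8 tx fields (bytes, packets, errs,
    drop, fifo, colls, carrier, compressed); colls is index 13.
    """
    result = {}
    for line in proc_net_dev_text.splitlines()[2:]:
        if ":" not in line:
            continue
        iface, counters = line.split(":", 1)
        fields = counters.split()
        if len(fields) >= 14:
            result[iface.strip()] = int(fields[13])
    return result

# Fabricated sample resembling /proc/net/dev output:
SAMPLE = """Inter-|   Receive                  |  Transmit
 face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
  eth0: 1000 10 0 0 0 0 0 0 2000 20 0 0 0 137 0 0
"""
print(tx_collisions(SAMPLE))  # {'eth0': 137}
```

A nonzero, growing colls count on a full-duplex switched port is exactly the anomaly the commenters above were chasing.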
bugzilla.redhat.com/bugzilla/show_bug.cgi?id=65069
Reporter: (Need Real Name)  Assigned To: (Ben LaHaise)  Status: CLOSED  Resolution: DUPLICATE

Opened by (Need Real Name) on 2002-05-16 18:36
When I was copying 300k of data from local disk to a NetApp:
# ls -l /usr/local/bin
-rwxr-xr-x 1 root root 300969 May 16 14:27 rstlistend
lrwxrwxrwx 1 root root 25 May 16 14:27 rstterm -> /usr/local/bin/rs
# cp -af rst* /tmp/
it crashed the NetApp. My home dir is mounted as
filerdude:/vol/vol0/home0/hjl /home/hjl nfs rw,v3,rsize=32768,wsize=32768,hard,udp,lock,addr=filerdude 0 0
The Linux NFS client machine was transmitting about 10MB/s over a 100Mb interface to the NetApp when it happened.

I am experiencing similar problems, except our NetApps are just brought to their knees. If I force the mount to be NFS version 2, the problem goes away. The write process will hang, then after a while the NFS server will start to fail. There is no way to kill the write process on the client, because it is in a "D" state. But after about 10 min., the client will start to write if I start tcpdump on it; if I stop tcpdump, the write process will hang again in a couple of seconds; I resume tcpdump, writes resume, and I can kill it.

One only needs to force nfs v2 or drop the w/rsize to 8k to solve it. But by default it is setting the buffers to 32K, which is too high a default. This throttles the netapp in some yet unknown way (it is unresponsive to all other requests, and the client that is doing a sustained write goes up to 100% utilization of its network interface). All nfs servers I have tested this against (Solaris, IRIX) seem to be affected to some extent. We have put out a notice to our Stanford users warning of this issue, as it looks just like a DoS attack. The large rsize & wsize of 32K gets dropped for lack of buffer space moving from Gig to Fast ethernet. This is a real pain, but can the root cause be that GigE is not using flow control properly?
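Since the problematic 32k default shows up directly in /proc/mounts (as in the rsize=32768,wsize=32768 line quoted above), auditing a client for it is easy to script. A sketch, assuming the standard /proc/mounts layout of "device mountpoint fstype options dump pass"; the sample data is invented:

```python
def big_nfs_mounts(proc_mounts_text, limit=8192):
    """Return (mountpoint, rsize, wsize) for NFS mounts whose read
    or write buffer size exceeds `limit` bytes."""
    flagged = []
    for line in proc_mounts_text.splitlines():
        parts = line.split()
        if len(parts) < 4 or not parts[2].startswith("nfs"):
            continue  # not an NFS mount line
        # Options are comma-separated; some are bare flags (rw, v3),
        # some are key=value pairs (rsize=32768).
        opts = dict(
            opt.split("=", 1) if "=" in opt else (opt, "")
            for opt in parts[3].split(",")
        )
        rsize = int(opts.get("rsize", 0) or 0)
        wsize = int(opts.get("wsize", 0) or 0)
        if rsize > limit or wsize > limit:
            flagged.append((parts[1], rsize, wsize))
    return flagged

SAMPLE = (
    "filerdude:/vol/vol0/home0/hjl /home/hjl nfs "
    "rw,v3,rsize=32768,wsize=32768,hard,udp 0 0\n"
    "/dev/hda1 / ext3 rw 0 0\n"
)
print(big_nfs_mounts(SAMPLE))  # [('/home/hjl', 32768, 32768)]
```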
bugzilla.redhat.com/bugzilla/show_bug.cgi?id=65410

For large transfers, the linux server locks up if the mount uses nfsvers=3. If I force the mount to nfsvers=2, the machine does not lock up.
Version-Release number of selected component (if applicable):
How reproducible: Always
Steps to Reproduce: 1.

I need to know what really broke/changed and what the correct workaround is (I've got NT-based servers running NFS that don't do NFS V2 well). This is EXACTLY the kind of situation my upper management uses for "Linux/Red Hat isn't ready for prime time" arguments.

Setting "nfsvers=2" in /etc/fstab fixes the problem when copying a large file. However, periodically (usually after some period of inactivity on the Redhat system), typing a simple command (like 'ls' while your current directory is an NFS mounted filesystem) will hang the system in the same way, resulting in a series of these messages:
nfs server <hostname> not responding, timed out
This is occurring on a brand new Dell Precision 530MT.

If I first COPY the file from the NFS directory to a local one on the windows box, then word opens the file just fine. The NFS client on the windows box is the standard microsoft Unix tools add-on.

Once per minute, a cron job wakes up, walks through the directory, copies the files locally (I had to do this extra step because of the same problem as above with word), then processes them. Two problems happen:
A) Files get placed in the NFS directory, then mysteriously disappear. I know this because of the logs kept on the various servers. I can see that the files were placed onto the NFS server, and I can also see gaps in the serial numbers on the processing machine's logs. This seems to imply it is an issue with the NFS server, not the client.
B) The NFS directory structure does not update properly. The processing machine (Perl script) reads in the directory listing, then one at a time, copies a file locally, processes it, and deletes it from the NFS server.
I get emails telling me that various files cannot be found. What is happening is that the file gets processed and deleted. A minute later, the cron job starts again, looks through the NFS directory, and gets some of the SAME FILENAMES again. When it goes to process these, they have been deleted a minute prior, so there is an error. I verified in the logs that the "bad" files did indeed get processed one minute prior. It appears to be a problem with how the NFS directory structure is managed. Locally, the directory is fine, but to a client, it is screwed up.

These problems are holding me back from installing Linux at several client sites. I cannot afford to have an NFS solution that is not 100%. I will have to go with a Novell or (ick) ms solution if this is not resolved. In any event, this bug is stale, since it has not been updated with the effects of using newer kernel errata (-5 or newer).
bugzilla.redhat.com/bugzilla/show_bug.cgi?id=65707

Version-Release number of selected component (if applicable):
How reproducible: Always
Steps to Reproduce: The fastest way to trigger the bug seems to be by untarring a large file on the mounted NFS filesystem.
Actual Results: All attempts to access the NFS filesystem hang completely. The following is logged in /var/log/messages:
kernel: nfs: task nnnn can't get a request slot
I don't get any "not responding, still trying" as far as I can see, however.
Expected Results: The tar file should have been untarred.

Note that the solaris server has "nfssrv:nfs_portmon=1" set in /etc/system, which disallows NFS client connections from ports above 1024. Just mentioning this because you might be experiencing a similar problem.

Here at Brookhaven National Laboratory, we are experiencing exactly the same problem. The problem seems to only show up after the client has been up for a while. It seems to hang when doing a flush operation on the NFS client side. I can mitigate the hang to a simple I/O error for the app by mounting soft,intr, but this only helps to the point that I don't need to reboot the client. I found this bug (a similar incidence, anyway) in Sunsolve as bugid 4764852.

Anyhow, the bug also suggests that it may be a problem with the NIC driver. There is nothing to indicate there is anything wrong with the driver for my NIC (3com 3C905), and people with other NICs have complained about the same problem. We had this happen occasionally (once a month or so), but since upgrading to the latest kernel, it is a showstopper. Configuration: The server I am talking to is a Sun8 box. When this happens, it fills the network pipe 100% with retransmissions from the server to the client. I want to point out this is not a real resolution and someone @ RedHat should look at this.
bugzilla.redhat.com/bugzilla/show_bug.cgi?id=65772

Note that if I attempt to do the same task in a directory mounted on a linux box, all is well.
Version-Release number of selected component (if applicable):
How reproducible: Always
Steps to Reproduce: 1.
Actual Results: 'ar' successfully creates its temporary file in the directory, yet when it tries to close(), it hangs. I can magically un-hang the process by ssh-ing to the box, and the process completes successfully.
Additional info: Sometimes this works, though, but I can't figure out why! Sometimes it doesn't work with a single file, sometimes it does.

It sounds like the driver is missing a wakeup, which in turn causes NFS traffic to be delayed. Here's what dmesg says:
3c59x: Donald Becker and others.