Saturday, December 14, 2013

Solaris 10 networking problem : Many input errors and CRC

I have a network problem with all of my Solaris servers (mainly SB6000 blades and X4640s). On the switch (Foundry FGS-24G) ports where the blades and X4640s are connected, I see many input errors and CRC errors:

telnet@xxxxxxx#sh int eth 0/1/1 
GigabitEthernet0/1/1 is up, line protocol is up 
Hardware is GigabitEthernet, address is 0012.f2e3.6080 (bia 0012.f2e3.6080) 
Configured speed auto, actual 1Gbit, configured duplex fdx, actual fdx 
Configured mdi mode AUTO, actual MDIX 
Member of 4 L2 VLANs, port is tagged, port state is FORWARDING 
BPDU guard is Disabled, ROOT protect is Disabled 
Link Error Dampening is Disabled 
STP configured to ON, priority is level0, mac-learning is enabled 
Flow Control is config enabled, oper enabled, negotiation disabled 
mirror disabled, monitor disabled 
Not member of any active trunks 
Not member of any configured trunks 
Port name is xxxxxxxxxxxx 
Inter-Packet Gap (IPG) is 96 bit times 
IP MTU 1500 bytes 
300 second input rate: 2688 bits/sec, 4 packets/sec, 0.00% utilization 
300 second output rate: 16648 bits/sec, 9 packets/sec, 0.00% utilization 
5895095 packets input, 604268339 bytes, 0 no buffer 
Received 199925 broadcasts, 0 multicasts, 5695170 unicasts 
1057 input errors, 691 CRC, 0 frame, 0 ignored 
0 runts, 0 giants 
19744774 packets output, 6964507122 bytes, 0 underruns 
Transmitted 2048306 broadcasts, 6046781 multicasts, 11649687 unicasts 
0 output errors, 0 collisions 
Relay Agent Information option: Disabled 

When I 'snoop' the traffic I see many TCP retransmissions on the Solaris side.

I use two different NICs (and so two different drivers), igb and e1000g, and I see the problem on both of them. The TCP retransmissions add a lot of latency (around 900 ms), and this is a big problem because these servers are web servers.

You can check for any problems in /var/adm/messages, and look at the per-interface error counters on the server:
# netstat -i

First verify that the error and CRC counts are increasing and not just something left over from the initial connection (your description doesn't make that clear). Clear your stats and look again.
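If clearing the counters is not convenient, another way to check is to capture the `sh int` output twice, a few minutes apart, and compare the error fields. A minimal Python sketch, assuming the counter line looks like the one in the output posted above; the function names are made up for illustration:

```python
import re

# Matches the counter line as printed in the switch output above, e.g.
# "1057 input errors, 691 CRC, 0 frame, 0 ignored"
ERROR_LINE = re.compile(r"(\d+) input errors, (\d+) CRC")

def parse_errors(show_int_output):
    """Return (input_errors, crc) from 'sh int' text; raise ValueError if absent."""
    m = ERROR_LINE.search(show_int_output)
    if m is None:
        raise ValueError("no 'input errors' line found")
    return int(m.group(1)), int(m.group(2))

def error_delta(before, after):
    """Growth in (input_errors, crc) between two captures of the same port."""
    b_in, b_crc = parse_errors(before)
    a_in, a_crc = parse_errors(after)
    return a_in - b_in, a_crc - b_crc
```

If both numbers stay at zero while the port is carrying traffic, the errors are probably left over from link-up and not an ongoing problem.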

Usually, high error and CRC counts are due to a duplex mismatch, and they are usually seen on the end that is running half duplex (when the other end is full duplex). That isn't the case here, since you are showing the end that is in fdx. A duplex mismatch usually gives you high runt counts too, and your output shows 0 runts. However, verify that the servers are running at 1Gbps and fdx (use dladm).
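With many servers, checking each link by eye gets tedious. Here is a rough Python sketch that flags any link that is up but not at 1000 Mbps full duplex. It assumes one line per device shaped like `e1000g0 link: up speed: 1000 Mbps duplex: full`, which is how I remember Solaris 10's `dladm show-dev` printing it; adjust the pattern if your output differs. The function name is hypothetical.

```python
import re

# Assumed line shape (not taken from this thread):
# "e1000g0         link: up        speed: 1000  Mbps       duplex: full"
DEV_LINE = re.compile(
    r"^(\S+)\s+link:\s*(\S+)\s+speed:\s*(\d+)\s+Mbps\s+duplex:\s*(\S+)"
)

def not_gig_full(dladm_output):
    """Return [(name, speed_mbps, duplex)] for links that are up but not 1000/full."""
    bad = []
    for line in dladm_output.splitlines():
        m = DEV_LINE.match(line.strip())
        if m is None:
            continue  # skip headers or lines in an unexpected format
        name, link, speed, duplex = m.group(1), m.group(2), int(m.group(3)), m.group(4)
        if link == "up" and (speed != 1000 or duplex != "full"):
            bad.append((name, speed, duplex))
    return bad
```

An empty list means every link that is up negotiated 1Gbps fdx, which would rule out a speed or duplex mismatch on the server end.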

Netstat on the servers would show if you have errors on that end as well. 

These kinds of problems are usually physical (as above, or a bad cable), but it seems less likely that bad cables would affect all servers equally.

Much more arcane: we found one switch type with a bug in the Ethernet chipset. When speed and duplex were hard-coded, the Rx wire pair was reversed, and that seemed to disturb the electrical line balance. We proved it by building a special cable with the Rx pair reversed on one end, and it worked. The workaround was to autonegotiate speed and duplex. This only occurred with certain device models connected (our expensive Ethernet test set), while other devices seemed to be more forgiving.

