My service provider has had a latency issue for over three years.
The problem occurs at the first hop, which comes right after the Layer 2 Service Selection Gateway (SSG) device. The ISP never admits to an error on its side, so I started running various tests to determine the exact point where the problem originates.
The problem may occur between the backbone provider's RedBack and the ISP router, so I tested on two RedBack devices: one is a Huawei (the old one) and the other is the new Ericsson SSR.
The problem still occurs on both backbone devices. This is already good evidence, but I don’t want to leave the ISP any room for excuses.
I checked by forwarding traffic to other ISPs' routers over the same Layer 2 network, which contains the same xDSL, SSG and switches. The results show no problem with any of them.
Also, if anything went wrong with the shared L2 traffic, many users would be affected regardless of their ISP.
Anyway, I also see this problem affecting many users in the same city and at the airport (DLM).
Can we go deeper?
Let us figure it out.
The latency issues sometimes come with connection loss. First-hop latency climbs to 60, 75, 120 and voila, no response.
It is not only Layer 3 traffic that goes unanswered; PPP packets get no response either. PPPoE packets do not reach the ISP router, which means L2 traffic also has a problem. In the meantime, the system reports a problem with the SSG connection, if I am not mistaken.
Does the connection issue happen on the RX port or the TX port?
To find out, we can try measuring the latency in each direction separately.
So I wrote a tool called uping. It runs on both the client and the server side.
The client sends the current time (3 bytes) and some dummy data (3 bytes); the server replaces the dummy data with the arrival time and sends the packet back to the client. The client then compares the timestamps to determine the latency in each direction.
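To make the packet layout concrete, here is a rough sketch of that exchange in Python. This is not the actual uping source, only an illustration of the idea. The function names are made up, and I am assuming the 3-byte timestamps are milliseconds truncated to 24 bits, sent in big-endian order, and that the client and server clocks are reasonably in sync (for example via NTP). If the clocks are off, the absolute numbers are skewed, but a change between normal time and problem time is still visible.

```python
import time

MOD = 1 << 24  # a 3-byte timestamp wraps around at 2**24


def now_ms():
    """Wall-clock time in milliseconds, truncated to 24 bits (assumed unit)."""
    return int(time.time() * 1000) % MOD


def build_request(sent_ms):
    """Client side: 3-byte send time followed by 3 bytes of dummy data."""
    return sent_ms.to_bytes(3, "big") + b"\x00\x00\x00"


def build_reply(request, arrived_ms):
    """Server side: keep the send time, overwrite the dummy bytes with the arrival time."""
    return request[:3] + arrived_ms.to_bytes(3, "big")


def split_latency(reply, received_ms):
    """Client side: return (outgoing, incoming) latency in ms.

    The modulo keeps the result correct when the 24-bit counter wraps
    between two of the timestamps.
    """
    sent_ms = int.from_bytes(reply[:3], "big")
    arrived_ms = int.from_bytes(reply[3:6], "big")
    outgoing = (arrived_ms - sent_ms) % MOD
    incoming = (received_ms - arrived_ms) % MOD
    return outgoing, incoming


# Simulated exchange with fixed timestamps instead of a real socket:
request = build_request(1000)        # client sends at t=1000
reply = build_reply(request, 1012)   # server receives at t=1012
print(split_latency(reply, 1030))    # client receives at t=1030 -> (12, 18)
```

In a real run the three timestamps would come from now_ms() on each side and the 6-byte payload would travel over the network. The point of the split is that a rise in the first number points at the outgoing (TX) direction and a rise in the second at the incoming (RX) direction.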
You can see the latency when the connection has no problem. Incoming and outgoing latency may differ; this is normal due to routing plans.
Now I just have to run uping again when there is a problem, and I can get an even better idea by comparing the normal-time results with the problem-time results.