Matt Wells
7bd2344f41
increase regular page download timeout from 30
seconds to 60 seconds to accommodate some slower websites.
2016-05-18 10:05:02 -07:00
Matt Wells
d7a6a0a1ff
fix gap.com redirects that require us
setting multiple cookies in the spider request.
2016-02-09 13:38:59 -08:00
Matt Wells
9247e15210
make spider use HTTP/1.1 not 1.0 since
some sites have been found to return
406 Not Acceptable periodically because of it.
2016-02-05 10:14:09 -08:00
Zak Betz
fc3ba95226
Fix host selection for downloading when nospider directives are present.
It was always choosing the first host sequentially with spidering
enabled. Now it looks at the other hosts in the shard selected.
2015-11-29 21:36:19 -07:00
Zak Betz
9ff387a898
More fixes to prevent spider traffic from hitting hosts with nospider
directive.
Bug fix for msg20 lookups always being directed away from noquery hosts.
2015-11-13 15:03:02 -07:00
Matt
37cc4f2ba8
Merge branch 'diffbot-testing' into testing
2015-11-09 11:13:42 -07:00
Matt Wells
776b94396e
a new ban msg for http status 503
2015-10-22 13:23:02 -07:00
Matt
39214a9dc6
Merge branch 'diffbot-testing' into testing
2015-10-02 19:26:15 -06:00
Matt Wells
93943a0cab
some pages legitimately have no outlinks, no need to think
they were banned.
2015-09-24 14:01:23 -07:00
Matt
100888d691
fix file/dir creation permissions bugs
2015-09-21 12:44:41 -06:00
Matt
74cde33a3a
just use the user's umask val for all file/dir creation
2015-09-21 11:33:38 -06:00
Matt
ce7b06fc4d
all files made are now group writable.
if you don't like that then you can make
a special group and set the directory just
group writable for that group using chmod g+s <dir>.
2015-09-21 11:19:34 -06:00
Matt
8299197cca
comments in <script> tags are a
convolution. deal with all four types and
their precedence issues. all of this is
to find the proper end of the </script> and
not a </script> or <script> that is being
printed out in the javascript in the <script> tag.
2015-08-28 16:31:22 -06:00
Matt
1badb8cd07
fix up hammer queue table print out on
sockets pages. make crawl delay link to the
robots.txt.
2015-08-25 11:07:54 -06:00
Matt
ea2c2d7190
show read buf of http sockets as well as the send buf
in the tool tip.
2015-08-25 10:53:16 -06:00
Matt Wells
7fcc2ab4e1
in the sockets table page,
show url download requests that are queued up to prevent
hammering an ip. also show the first 500 bytes of the send buf
in the http server sockets table.
2015-08-25 09:34:45 -07:00
Matt
a821d8bc41
Merge branch 'ia' into ia-zak
2015-05-05 23:46:16 -07:00
Matt Wells
d6bb5e98f8
more ban detect fixes
2015-05-04 14:40:52 -07:00
Matt
86800a0656
if a root/seed url has no outlinks, assume it was banned.
2015-05-04 14:23:28 -07:00
Matt
a3672701f6
Merge branch 'diffbot-testing' into ia
2015-05-03 12:08:25 -07:00
Matt Wells
31e54df0c8
fix core from ::read returning 0.
do not do ban checks when autobackoff and autoproxies are disabled.
2015-05-03 10:17:36 -07:00
Matt Wells
821c6cb424
fix core
2015-05-03 00:33:47 -07:00
Matt Wells
2421bf3d1d
ia checkpoint
2015-05-02 23:51:19 +00:00
Matt Wells
6d39bb5df8
added hack to log controls to avoid sending
msg13 (download requests) to host #0 until we fix
the streaming bug better.
2015-05-02 10:32:13 -07:00
Matt Wells
dccc1667ec
added logdebugmsg13 to find out why urls are getting stuck
on host 0 in msg13 handler.
turn crawldelay backoff logic off by default until we fix
case of mis-detection on captchas and maybe some other things.
fix core when loading twitchy table on startup.
2015-05-01 13:19:45 -07:00
Matt
16bf1cf063
make auto back off a parm. i could see you'd want to
disable that if the ban/throttle detection is wrong.
2015-04-30 17:32:51 -07:00
Matt
656d89d98d
spider proxy simplifications. remove global parms and just make
coll specific for now.
2015-04-30 17:18:57 -07:00
Matt Wells
20912541af
remove global spider proxy parms for simplicity
2015-04-30 17:06:35 -07:00
Matt Wells
3051310d16
fix core from bulk job ip ban detection in msg13
2015-04-30 16:59:02 -07:00
Matt
a15c9fd4c6
more fixes for auto proxies
2015-04-30 16:52:46 -07:00
Matt
7f1ac7460f
fixes for auto backoff
2015-04-30 16:34:11 -07:00
Matt Wells
1825f6bd27
retry download if was in the twitchy table
at start of download, and not using proxies at all.
2015-04-30 16:06:13 -07:00
Matt
0970975a57
tested auto proxy use and auto spider (non-proxy) backoff to
3 second crawldelay successfully on the stamps site.
2015-04-30 15:31:09 -07:00
Matt
75c05ef9a9
twitchy updates
2015-04-30 14:18:23 -07:00
Matt
66db73f494
if we can't use any proxies and we detected a url as banned
then just use a crawldelay of 3 seconds.
2015-04-30 14:16:17 -07:00
Matt
6d8bb19962
checkpoint for auto proxy logic
2015-04-30 13:28:57 -07:00
Matt
bdff012152
checkpoint for auto-proxy logic.
2015-04-29 18:15:54 -07:00
Matt Wells
0656cc4c72
fix a core on seraph host #6
2015-04-22 15:46:35 -07:00
Matt
95e3a760e9
proxy fixes
2015-03-05 11:10:40 -08:00
Matt
db4fcb30f8
limit downloaded doc size to something
under the MAX_DGRAMS limit so msg13 won't core
trying to send the reply back.
2015-02-16 09:43:39 -07:00
Matt
6fc83566e2
more fixes
2015-02-02 14:06:38 -08:00
Matt
c15bd53e52
added support for supplying basic proxy authorization
to spider proxies. username:password@1.2.3.4:80
2015-02-02 13:23:38 -08:00
mwells
87285ba3cd
use gbmemcpy not memcpy so we can get profiler working again
since memcpy can't be interrupted and backtrace() called.
2015-01-13 12:25:42 -07:00
Matt
c03ba31ec2
try to reduce log spam
2015-01-05 11:03:49 -08:00
Matt
6c5ca9162c
quick fix for internal ip bug
2014-12-16 13:39:09 -08:00
Matt
329f004e74
compiler updates
2014-12-10 12:09:04 -08:00
Matt Wells
8e315504a2
fix empty rdbcache bug of not enough buf mem.
2014-11-27 13:17:00 -08:00
Matt
4e8a42e024
text replacements for bad int32_t substitutions
2014-11-17 18:24:38 -08:00
Matt
931a1c4bc6
good checkpoint. quite a few fixes.
2014-11-17 18:13:36 -08:00
Matt
4a0554c76f
more 64bit fixes
2014-11-14 17:30:32 -08:00