a4e95899bd
makefile updates for building pkgs
Matt
2015-03-10 21:43:37 -07:00
1603201b3e
Merge branch 'testing'
Matt
2015-03-10 21:02:08 -07:00
8e72d6e4cc
fix a couple of critical xml parsing bugs. improves parsing of rss feeds and of xml in general. fixed qa tests to ignore collection list when doing diff.
Matt
2015-03-10 19:13:21 -07:00
cdb1c28705
fix page parser core
Matt
2015-03-10 16:49:45 -07:00
052f2b2009
fix for bug #2092 commented out to avoid qa testing for now.
Matt
2015-03-10 15:45:33 -07:00
c6fd5571d2
if "links":[ is specified in diffbot reply then crawlbot will parse out those links as if they were on the page.
Matt
2015-03-10 14:36:44 -07:00
ef66e1e69f
speed up overflow check for firstip with a small cache. reduce log spam from ssl msgs.
Matt
2015-03-10 13:24:03 -07:00
7cf549bf2a
fix spider request overflow/dropping algo.
Matt
2015-03-10 13:07:00 -07:00
3efb5bd310
fix oopsie
Matt Wells
2015-03-09 21:25:23 -07:00
e346a14a47
added logic to retry diffbot reply on connection reset, connection timed out or gateway timed out (http status 504) msgs. added logic to detect truncated json (missing final }) and not print it. also, at index time, we set a diffbot missing curly error to g_errno so the whole url can be retried later.
Matt Wells
2015-03-09 20:54:34 -07:00
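
The retry and truncation checks described in commit e346a14a47 could be sketched roughly as follows. This is a minimal illustration, not the project's code: `FetchError`, `shouldRetryReply` and `isTruncatedJson` are hypothetical names, and the actual change reports the missing-curly condition through g_errno rather than a return value.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical error categories, for illustration only.
enum class FetchError { None, ConnectionReset, ConnectionTimedOut, Other };

// True if the reply should be retried: connection reset, connection
// timed out, or an HTTP 504 gateway timeout.
bool shouldRetryReply(FetchError err, int httpStatus) {
    if (err == FetchError::ConnectionReset) return true;
    if (err == FetchError::ConnectionTimedOut) return true;
    return httpStatus == 504;
}

// True if the last non-whitespace character is not '}', meaning the
// JSON object looks cut off. An empty or all-whitespace reply is
// also treated as truncated (an assumption of this sketch).
bool isTruncatedJson(const std::string &reply) {
    std::size_t end = reply.size();
    while (end > 0 &&
           std::isspace(static_cast<unsigned char>(reply[end - 1]))) {
        --end;
    }
    return end == 0 || reply[end - 1] != '}';
}
```

A caller would check `isTruncatedJson()` before printing or indexing a reply and, if it returns true, flag the url for a later retry.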
b4419340f7
Merge branch 'testing'
Matt Wells
2015-03-08 21:00:28 -07:00
eccb969e5b
put in some fixes to deal with doledb tree that seems to have m_data[i] and m_data[j] pointing to the same thing. wtf? anyway, deal with that. it should fix the tree or something automatically at startup?
Matt Wells
2015-03-08 20:36:13 -07:00
5e651997e7
fixed langid based query stop words.
Matt
2015-03-08 15:44:23 -07:00
2413a9b9b1
query stop words now based on selected langid.
Matt
2015-03-08 15:16:24 -07:00
a25c358817
fix core dump caused by some sort of corruption when deleting json objs associated with a particular url.
Matt Wells
2015-03-08 12:18:47 -07:00
5d0b283f6f
Merge branch 'testing'
Matt
2015-03-08 09:29:27 -07:00
e368fa4fdc
fix so when shutting down merges are suspended. disable threads earlier too.
Matt Wells
2015-03-08 09:21:57 -07:00
236c4988aa
ignore late msg20 reply if we already destroyed the state. a temp fix of a core for now.
Matt Wells
2015-03-08 09:11:13 -07:00
1d0fd3d7d2
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2015-03-08 08:56:32 -07:00
79879976fa
try to fix a couple cores. one when parsing bad json. the other in reclaiming doledb tree mem.
Matt Wells
2015-03-08 08:56:10 -07:00
0f863e689b
Merge branch 'testing'
Matt
2015-03-07 20:21:01 -08:00
1cac2e282e
fix core from adding a lot of sites to the sitelist.
mwells
2015-03-07 20:57:17 -07:00
2f61f3b48c
Merge branch 'testing'
Matt
2015-03-07 16:29:37 -08:00
0752b85911
fix for infinite loop hang.
Matt Wells
2015-03-07 16:28:53 -08:00
521bcf01a2
Merge branch 'testing'
Matt
2015-03-07 15:52:41 -08:00
b593027ff6
try to fix default query lang getting reset to nothing in the search controls.
Matt
2015-03-07 15:39:04 -08:00
5fa01e285d
don't show add url and widgets tabs in serps
Matt
2015-03-07 15:32:35 -08:00
2839c38dac
warc injection fixes
Matt
2015-03-07 15:01:47 -08:00
605a384449
Merge branch 'testing'
Matt
2015-03-07 14:57:48 -08:00
91af815558
fix missing dir/adv/addurl tabs
Matt
2015-03-07 14:57:22 -08:00
f887019b28
Merge branch 'testing'
Matt Wells
2015-03-07 11:11:48 -08:00
c6a59d0810
fix rdbcache corruption bugs for winnerlistcache.
Matt Wells
2015-03-07 11:09:06 -08:00
213e430c31
Merge branch 'testing'
Matt
2015-03-06 21:34:21 -08:00
102f2c1ea0
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt
2015-03-06 21:33:56 -08:00
5be4dc3c28
fix bug in overflow logic
Matt
2015-03-06 21:33:40 -08:00
9cd39bb486
Merge branch 'testing'
Matt
2015-03-06 20:37:56 -08:00
7d135219ed
fix dmozparse compiler error
Matt
2015-03-06 20:37:30 -08:00
ed3c5dbc81
fix core from making a url too long after appending -diffbotxyz4000000000 to it
Matt Wells
2015-03-06 19:17:16 -08:00
501382384d
typo fix
Matt Wells
2015-03-06 09:47:40 -08:00
1f8719b0cd
Merge branch 'testing'
Matt
2015-03-06 09:40:44 -08:00
e0dd5720ac
try to fix fdisset for writes on udpserver when sendto() would block...
Matt Wells
2015-03-06 09:30:36 -08:00
4723d9eefa
Merge branch 'testing'
Matt Wells
2015-03-05 20:34:32 -08:00
4506b93ed3
fixed missing sites in sitelinks.txt
Matt Wells
2015-03-05 20:32:01 -08:00
56d65a7c55
adjust dump tagdb cmdline cmd to start at a specified site to aid us in fixing sitelinks.txt missing some sites bug.
Matt Wells
2015-03-05 19:27:36 -08:00
ccd85ada31
do not give proxy for diffbot to use just yet. need to fix https CONNECT support to download the whole page first and then send that back. need to tell diffbot phantomjs to not worry about certificates then.
Matt
2015-03-05 15:32:57 -08:00
367c4d5783
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt
2015-03-05 11:10:54 -08:00
95e3a760e9
proxy fixes
Matt
2015-03-05 11:10:40 -08:00
ad2e1f11fa
Merge branch 'testing'
Matt
2015-03-05 09:18:28 -08:00
56aaad3c9f
do not double devance spider priority for a coll
Matt
2015-03-05 09:11:36 -08:00
6bb9da7532
Merge branch 'testing'
Matt
2015-03-05 08:54:03 -08:00
dfd6d8b2cf
fix critical spider bug that was deleting pages because of bogus SpiderReply::m_langId values!
Matt
2015-03-05 08:49:39 -08:00
7ae5518c1d
if no spider launched for a collection, decrement priority and try again. if we hit -1 priority then advance to next coll in the linked list. reset priority to max at start of each collection's round.
Matt
2015-03-05 08:15:50 -08:00
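
The per-collection priority loop described in commit 7ae5518c1d might look something like the sketch below. `Coll`, `launchOne` and the `tryLaunch` callback are hypothetical stand-ins, and a `std::vector` replaces the linked list of collections for brevity.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for one collection in the spider's list.
struct Coll {
    std::string name;
};

// For each collection in turn, tries priorities from maxPriority
// down to 0. tryLaunch(coll, priority) is a hypothetical callback
// that returns true if a spider was launched. Returns the index of
// the collection that launched a spider, or -1 if none did.
int launchOne(const std::vector<Coll> &colls, int maxPriority,
              const std::function<bool(const Coll &, int)> &tryLaunch) {
    for (std::size_t i = 0; i < colls.size(); ++i) {
        // reset priority to max at the start of each collection's round
        int priority = maxPriority;
        while (priority >= 0) {
            if (tryLaunch(colls[i], priority)) return static_cast<int>(i);
            --priority;  // nothing launched: decrement and try again
        }
        // priority hit -1: advance to the next collection
    }
    return -1;
}
```

With two collections and `maxPriority` of 2, a callback that only succeeds for the second collection at priority 1 is called with priorities 2, 1, 0 for the first collection and then 2, 1 for the second.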
e8e5f9e005
qa test fixes
Matt
2015-03-05 07:45:28 -08:00
459c65100c
fix querying disabled bug.
Matt Wells
2015-03-05 07:07:30 -08:00
0eafc68a13
debug msg helper
Matt
2015-03-04 12:45:06 -08:00
707497ae1d
fix new cgi parm names
Matt Wells
2015-03-04 12:06:54 -08:00
e768054bca
typo
Matt Wells
2015-03-04 10:50:45 -08:00
38caa517f2
add switches to disable injections or querying from the master controls, for all collections.
Matt Wells
2015-03-04 10:49:37 -08:00
3fc9abe222
do not show any non-public page if allow cloud users is false and no permission. return 403.
Matt Wells
2015-03-02 08:13:38 -08:00
93b505e7bb
fix isCollAdmin() function to return false if not using coll passwords. they'll have to be master admin.
Matt Wells
2015-03-02 07:47:05 -08:00
051a8f0ad6
don't attempt merge in quickpoll. just return; do not core.
Matt Wells
2015-03-02 07:26:38 -08:00
a97f1fcad0
add skippedshards and totalshards to search results in xml/json so you know if you are missing results from a dead shard or not. all hosts must be dead in shard for shard to be dead.
mwells
2015-02-27 08:17:32 -07:00
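
The dead-shard rule in commit a97f1fcad0 (a shard is dead only when all of its hosts are dead) can be illustrated with a short sketch. `Shard`, `ShardCounts`, `isShardDead` and `countShards` are hypothetical names; only the output fields skippedshards and totalshards come from the commit message.

```cpp
#include <vector>

// Hypothetical model: one liveness flag per host in the shard.
struct Shard {
    std::vector<bool> hostAlive;
};

// Mirrors the skippedshards/totalshards fields in the xml/json output.
struct ShardCounts {
    int skippedShards;
    int totalShards;
};

// A shard is dead only if every host in it is dead. A shard with no
// hosts is treated as dead here (an assumption of this sketch).
bool isShardDead(const Shard &shard) {
    for (bool alive : shard.hostAlive) {
        if (alive) return false;
    }
    return true;
}

// Counts how many shards a query would skip because they are dead.
ShardCounts countShards(const std::vector<Shard> &shards) {
    ShardCounts counts{0, static_cast<int>(shards.size())};
    for (const Shard &shard : shards) {
        if (isShardDead(shard)) ++counts.skippedShards;
    }
    return counts;
}
```

A client seeing a nonzero skipped count knows the result set may be missing documents from the skipped shards.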
f8db6288ae
ignore dead shards when doing queries so they remain fast.
mwells
2015-02-27 08:02:19 -07:00
f5383d98db
if a shard is dead skip it when searching.
Matt
2015-02-27 07:28:41 -07:00
bb5e0c9c63
another MAX_DGRAMS fix.
Matt
2015-02-27 06:30:17 -07:00
ee672bb3a3
say 'Injected url already indexed' and not 'Injection abandoned' for clarity's sake.
Matt Wells
2015-02-26 16:04:51 -07:00
064d022d6f
call mkdir on 'gb install' cmd.
Matt
2015-02-25 19:49:36 -07:00
d4f67285ce
keep spiders maxed out all the time.
mwells
2015-02-25 18:40:50 -07:00
5ae3ae60d3
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
mwells
2015-02-25 11:40:00 -07:00
4355665fa9
fix ssl handshake hanging forever. time it out now. spider performance fix for when a firstip has massive # of urls and can only have one spidered in the future.
mwells
2015-02-25 11:39:13 -07:00
3f753f8114
sometimes tree data is NULL for some reason... fix core.
Matt
2015-02-24 20:34:41 -07:00
a5e66da609
fix crawls where the url crawl pattern and page process pattern are supplied, but not the url process pattern. such was the dribbble crawl.
Matt Wells
2015-02-24 14:27:38 -07:00
f41877c2e3
compute MIN hop count over all spider requests for the same url in Spider.cpp.
Matt Wells
2015-02-24 13:41:18 -07:00
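
As a rough illustration of commit f41877c2e3, the minimum hop count per url could be computed as below. `HopRequest` and `minHopCounts` are hypothetical names, and this sketch does not reflect how Spider.cpp actually stores or scans spider requests.

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified spider request: just a url and a hop count.
struct HopRequest {
    std::string url;
    int hopCount;
};

// Returns, for each url, the minimum hop count seen across all
// requests for that url.
std::unordered_map<std::string, int>
minHopCounts(const std::vector<HopRequest> &requests) {
    std::unordered_map<std::string, int> best;
    for (const HopRequest &req : requests) {
        auto it = best.find(req.url);
        if (it == best.end()) {
            best.emplace(req.url, req.hopCount);
        } else {
            it->second = std::min(it->second, req.hopCount);
        }
    }
    return best;
}
```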
9436649531
Merge branch 'testing'
Matt
2015-02-22 13:15:57 -07:00
692c2932e8
fixed bug of gb not saving
mwells
2015-02-22 13:11:20 -07:00
4e485b6649
increase dolebuf cache time from 2 to 5 mins for better performance. cache empty dolebufs if winner tree list was not from cache, so in case we have a huge spiderdb scan list of urls we aren't spidering we can cache it, e.g. for twitter.com. do not call strstr in getUrlFilterNum2() for .css? or /print/ since it was taking way too much cpu time.
mwells
2015-02-21 15:17:28 -07:00
a57de3289c
Merge branch 'diffbot' into diffbot-testing
Matt Wells
2015-02-21 09:43:33 -08:00
b80a70a6fd
fix for https urls through proxies using newly updated tcp/loop code.
Matt Wells
2015-02-21 09:25:54 -08:00
cc98589da3
Merge branch 'diffbot-testing'
Matt Wells
2015-02-20 08:18:30 -07:00
0c8e3b8e62
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
mwells
2015-02-20 08:13:20 -07:00
ada18e648b
try to fix core in reclaiming doledb mem
mwells
2015-02-20 08:11:49 -07:00
856823e862
fix qa test some.
Matt
2015-02-19 20:18:30 -07:00
ac4bc2842f
Merge branch 'diffbot' into diffbot-testing
Matt Wells
2015-02-19 18:09:43 -08:00
480072274d
emergency proxy fixes
Matt Wells
2015-02-19 12:49:42 -08:00
f15e7fbaf4
show total link overflows in spiderdb in the page of stats.
Matt
2015-02-18 19:18:38 -07:00
860ff24227
do not cache winner list if the # of requests from the IP is less than about 25k.
Matt
2015-02-18 19:14:06 -07:00
dbaff2dfb8
insert collection name for spider status.
Matt
2015-02-18 15:18:39 -07:00
18f8eddadd
fix core from shutting down the server.
Matt
2015-02-18 09:41:40 -07:00
c894146e29
redhat cores in xmldoc.o if compiled with -O3 so use -O2
Matt
2015-02-18 08:32:07 -07:00
913102c48c
Merge branch 'testing'
Matt
2015-02-18 07:26:01 -07:00
e7d0687f4c
Merge branch 'diffbot-matt' into diffbot-testing
mwells
2015-02-17 22:50:09 -07:00
cce2308bcf
Merge branch 'diffbot-matt' into diffbot-testing
Matt Wells
2015-02-17 20:16:49 -08:00
ce8c97e8e2
fix log spam
Matt
2015-02-17 20:55:36 -07:00
ef99aabf4d
try to fix qainject1 core in qa.cpp
Matt
2015-02-17 20:17:59 -07:00
dce8d9f930
fix qa bug of not resetting s_i. fix tcpserver.cpp bug of destroying a streaming socket after what is really not the final write.
Matt
2015-02-17 20:10:13 -07:00
d14cb2d5b0
fix debug log msgs.
Matt
2015-02-17 19:15:43 -07:00
2488c1a338
added proper write callback registration into TcpServer.cpp so we only register write callbacks when a non-blocking write does not write all the bytes requested of it, or when a connection does not complete. also fixed up the sslHandshake() function which calls SSL_connect().
Matt
2015-02-16 14:48:39 -07:00
db4fcb30f8
limit downloaded doc size to something under the MAX_DGRAMS limit so msg13 won't core trying to send the reply back.
Matt
2015-02-16 09:43:39 -07:00
c9b4dc66a8
show ignored query words in xml and json. show prettier in html.
Matt
2015-02-13 17:34:31 -08:00