d5560c3e77final fix for new delete column
mwells
2015-03-17 21:47:31 -0600
c9b14b1b89fix 'delete' checkbox in url filters. fix reading in of xml conf files that have </> tags.
Matt
2015-03-17 21:20:27 -0600
dfc069aaa1do away with filtered/banned spider priorities. add checkbox to signify force deletes to remove urls from index if in the index, or not allow them in.
mwells
2015-03-17 20:27:23 -0600
dea534827elangidbits init bug leftover from searchinput reset memset fix i think.
Matt
2015-03-17 15:04:31 -0600
f830eb43f7Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt
2015-03-17 14:31:33 -0600
a54471849bsitemap.xml support for harvesting loc urls. parse xml docs as pure xml again but set nodeid to TAG_LINK etc. so Linkdb.cpp can get links again. added isparentsitemap url filter to prioritize urls from sitemaps. added isrssext to url filters to prioritize new possible rss feed urls. added numinlinks to url filters to prioritize popular urls for spidering. use those filters in default web filter set. fix filters that delete urls from the index using the 'DELETE' priority. they weren't getting deleted.
Matt
2015-03-17 14:26:16 -0600
29b2707ad7fix searchinput::clear() bug. final fix for fhtqt memleak bug.
Matt Wells
2015-03-15 07:48:29 -0700
3b39b1d37afix facet mem leak from QueryTerm::m_facetHashTable and safebuf when doing federated queries over a token.
Matt Wells
2015-03-15 07:18:32 -0700
427fae7135fix log spam
Matt Wells
2015-03-12 22:31:40 -0700
e71af6d26crecompute active list every 3 secs. otherwise it seems buggy and drops collections it shouldn't.
Matt Wells
2015-03-12 22:22:19 -0700
83be5d7d46fix links parser so it harvests outlinks from rss feeds' <link> tags. it was doing this before, now it is doing it again.
Matt
2015-03-12 17:35:47 -0700
7879537ab6fix json search results formatting.
Matt Wells
2015-03-12 14:25:37 -0700
5d2a9d6d8cMerge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2015-03-12 14:00:03 -0700
89b61d95a8fix fix
Matt Wells
2015-03-12 13:59:42 -0700
3c2b082540gbfacetstr: is case-sensitive.
Matt Wells
2015-03-12 13:54:11 -0700
a8dfa56098Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2015-03-12 13:19:16 -0700
485c600c7cfix for changing maxtocrawl/process/rounds. fix for thinking a crawl is done when it is just taking a while to populate doledb from the waiting tree for that SpiderColl. we just call populateDoledbFromWaiting in doneSleepingWrapper every 50ms. it loops over every coll so it could be more efficient.
Matt Wells
2015-03-12 13:15:52 -0700
a4e95899bdmakefile updates for building pkgs
Matt
2015-03-10 21:43:37 -0700
1603201b3eMerge branch 'testing'
Matt
2015-03-10 21:02:08 -0700
8e72d6e4ccfix a couple critical xml parsing bugs. fixes parsing of rss feeds better and xml in general. fixed qa tests to ignore collection list when doing diff.
Matt
2015-03-10 19:13:21 -0700
cdb1c28705fix page parser core
Matt
2015-03-10 16:49:45 -0700
052f2b2009fix for bug #2092 commented out to avoid qa testing for now.
Matt
2015-03-10 15:45:33 -0700
c6fd5571d2if "links":[ is specified in diffbot reply then crawlbot will parse out those links as if they were on the page.
Matt
2015-03-10 14:36:44 -0700
ef66e1e69fspeed up overflow check for firstip with a little cache thingy. reduce log spam from ssl msgs.
Matt
2015-03-10 13:24:03 -0700
7cf549bf2afix spider request overflow/dropping algo.
Matt
2015-03-10 13:07:00 -0700
3efb5bd310fix oopsie
Matt Wells
2015-03-09 21:25:23 -0700
e346a14a47added logic to retry diffbot reply on connection reset, connection timed out or gateway timed out (http status 504) msgs. added logic to detect truncated json (missing final }) and not print it. also, at index time, we set a diffbot missing curly error to g_errno so the whole url can be retried later.
Matt Wells
2015-03-09 20:54:34 -0700
b4419340f7Merge branch 'testing'
Matt Wells
2015-03-08 21:00:28 -0700
eccb969e5bput in some fixes to deal with doledb tree that seems to have m_data[i] and m_data[j] pointing to the same thing. wtf? anyway, deal with that. it should fix the tree or something automatically at startup?
Matt Wells
2015-03-08 20:36:13 -0700
5e651997e7fixed langid based query stop words.
Matt
2015-03-08 15:44:23 -0700
2413a9b9b1query stop words now based on selected langid.
Matt
2015-03-08 15:16:24 -0700
a25c358817fix core dump, caused by corruption of some sort, when deleting json objs associated with a particular url.
Matt Wells
2015-03-08 12:18:47 -0700
5d0b283f6fMerge branch 'testing'
Matt
2015-03-08 09:29:27 -0700
e368fa4fdcfix so when shutting down merges are suspended. disable threads earlier too.
Matt Wells
2015-03-08 09:21:57 -0700
236c4988aaignore late msg20 reply if we already destroyed the state. a temp fix of a core for now.
Matt Wells
2015-03-08 09:11:13 -0700
1d0fd3d7d2Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2015-03-08 08:56:32 -0700
79879976fatry to fix a couple cores. one when parsing bad json. the other in reclaiming doledb tree mem.
Matt Wells
2015-03-08 08:56:10 -0700
0f863e689bMerge branch 'testing'
Matt
2015-03-07 20:21:01 -0800
1cac2e282efix core from adding a lot of sites to the sitelist.
mwells
2015-03-07 20:57:17 -0700
2f61f3b48cMerge branch 'testing'
Matt
2015-03-07 16:29:37 -0800
0752b85911fix for infinite loop hang.
Matt Wells
2015-03-07 16:28:53 -0800
521bcf01a2Merge branch 'testing'
Matt
2015-03-07 15:52:41 -0800
b593027ff6try to fix default query lang getting reset to nothing in the search controls.
Matt
2015-03-07 15:39:04 -0800
5fa01e285ddon't show add url and widgets tabs in serps
Matt
2015-03-07 15:32:35 -0800
2839c38dacwarc injection fixes
Matt
2015-03-07 15:01:47 -0800
605a384449Merge branch 'testing'
Matt
2015-03-07 14:57:48 -0800
91af815558fix missing dir/adv/addurl tabs
Matt
2015-03-07 14:57:22 -0800
f887019b28Merge branch 'testing'
Matt Wells
2015-03-07 11:11:48 -0800
c6a59d0810fix rdbcache corruption bugs for winnerlistcache.
Matt Wells
2015-03-07 11:09:06 -0800
213e430c31Merge branch 'testing'
Matt
2015-03-06 21:34:21 -0800
102f2c1ea0Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt
2015-03-06 21:33:56 -0800
5be4dc3c28fix bug in overflow logic
Matt
2015-03-06 21:33:40 -0800
9cd39bb486Merge branch 'testing'
Matt
2015-03-06 20:37:56 -0800
7d135219edfix dmozparse compiler error
Matt
2015-03-06 20:37:30 -0800
ed3c5dbc81fix core from making a url too long after appending -diffbotxyz4000000000 to it
Matt Wells
2015-03-06 19:17:16 -0800
501382384dtypo fix
Matt Wells
2015-03-06 09:47:40 -0800
1f8719b0cdMerge branch 'testing'
Matt
2015-03-06 09:40:44 -0800
e0dd5720actry to fix fdisset for writes on udpserver when sendto() would block...
Matt Wells
2015-03-06 09:30:36 -0800
4723d9eefaMerge branch 'testing'
Matt Wells
2015-03-05 20:34:32 -0800
4506b93ed3fixed missing sites in sitelinks.txt
Matt Wells
2015-03-05 20:32:01 -0800
56d65a7c55adjust dump tagdb cmdline cmd to start at a specified site to aid us in fixing sitelinks.txt missing some sites bug.
Matt Wells
2015-03-05 19:27:36 -0800
ccd85ada31do not give proxy for diffbot to use just yet. need to fix https CONNECT support to download the whole page first and then send that back. need to tell diffbot phantomjs to not worry about certificates then.
Matt
2015-03-05 15:32:57 -0800
367c4d5783Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt
2015-03-05 11:10:54 -0800
95e3a760e9proxy fixes
Matt
2015-03-05 11:10:40 -0800
ad2e1f11faMerge branch 'testing'
Matt
2015-03-05 09:18:28 -0800
56aaad3c9fdo not double advance spider priority for a coll
Matt
2015-03-05 09:11:36 -0800
6bb9da7532Merge branch 'testing'
Matt
2015-03-05 08:54:03 -0800
dfd6d8b2cffix critical spider bug that was deleting pages because of bogus SpiderReply::m_langId values!
Matt
2015-03-05 08:49:39 -0800
7ae5518c1dif no spider launched for a collection, decrement priority and try again. if we hit -1 priority then advance to next coll in the linked list. reset priority to max at start of each collection's round.
Matt
2015-03-05 08:15:50 -0800
e8e5f9e005qa test fixes
Matt
2015-03-05 07:45:28 -0800
459c65100cfix querying disabled bug.
Matt Wells
2015-03-05 07:07:30 -0800
0eafc68a13debug msg helper
Matt
2015-03-04 12:45:06 -0800
707497ae1dfix new cgi parm names
Matt Wells
2015-03-04 12:06:54 -0800
e768054bcatypo
Matt Wells
2015-03-04 10:50:45 -0800
38caa517f2add switches to disable injections or querying from the master controls, for all collections.
Matt Wells
2015-03-04 10:49:37 -0800
3fc9abe222do not show any non-public page if allow cloud users is false and no permission. return 403.
Matt Wells
2015-03-02 08:13:38 -0800
93b505e7bbfix isCollAdmin() function to return false if not using coll passwords. they'll have to be master admin.
Matt Wells
2015-03-02 07:47:05 -0800
051a8f0ad6don't attempt merge in quickpoll. just return, do not core.
Matt Wells
2015-03-02 07:26:38 -0800
a97f1fcad0add skippedshards and totalshards to search results in xml/json so you know if you are missing results from a dead shard or not. all hosts must be dead in shard for shard to be dead.
mwells
2015-02-27 08:17:32 -0700
f8db6288aeignore dead shards when doing queries so they remain fast.
mwells
2015-02-27 08:02:19 -0700
f5383d98dbif a shard is dead skip it when searching.
Matt
2015-02-27 07:28:41 -0700
bb5e0c9c63another MAX_DGRAMS fix.
Matt
2015-02-27 06:30:17 -0700
ee672bb3a3say 'Injected url already indexed' and not 'Injection abandoned' for clarity's sake.
Matt Wells
2015-02-26 16:04:51 -0700
064d022d6fcall mkdir on 'gb install' cmd.
Matt
2015-02-25 19:49:36 -0700
d4f67285cekeep spiders maxed out all the time.
mwells
2015-02-25 18:40:50 -0700
5ae3ae60d3Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
mwells
2015-02-25 11:40:00 -0700
4355665fa9fix ssl handshake hanging forever. time it out now. spider performance fix for when a firstip has massive # of urls and can only have one spidered in the future.
mwells
2015-02-25 11:39:13 -0700
3f753f8114sometimes tree data is NULL for some reason... fix core.
Matt
2015-02-24 20:34:41 -0700
a5e66da609fix crawls where the url crawl pattern and page process pattern are supplied, but not the url process pattern. such was the dribbble crawl.
Matt Wells
2015-02-24 14:27:38 -0700
f41877c2e3compute MIN hop count over all spider requests for the same url in Spider.cpp.
Matt Wells
2015-02-24 13:41:18 -0700
9436649531Merge branch 'testing'
Matt
2015-02-22 13:15:57 -0700
692c2932e8fixed bug of gb not saving
mwells
2015-02-22 13:11:20 -0700
4e485b6649increase dolebuf cache time from 2 to 5 mins for better performance. cache empty dolebufs if winner tree list was not from cache, so in case we have a huge spiderdb scan list of urls we aren't spidering we can cache it, like twitter.com e.g. do not call strstr in getUrlFilterNum2() for .css? or /print/ since it was taking way too much cpu time.
mwells
2015-02-21 15:17:28 -0700
a57de3289cMerge branch 'diffbot' into diffbot-testing
Matt Wells
2015-02-21 09:43:33 -0800